diff --git a/specs/newton/cheesecake-promote-backend.rst b/specs/newton/cheesecake-promote-backend.rst
deleted file mode 100644
index 9307e9eb..00000000
--- a/specs/newton/cheesecake-promote-backend.rst
+++ /dev/null
@@ -1,161 +0,0 @@
-..
-  This work is licensed under a Creative Commons Attribution 3.0 Unported
-  License.
-
-  http://creativecommons.org/licenses/by/3.0/legalcode
-
-=====================================================
-Promote Secondary Backend After Failover (Cheesecake)
-=====================================================
-
-https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion
-
-Problem description
-===================
-
-After failing backend A over to backend B, there is not a mechanism in
-Cinder to promote backend B to the master backend in order to then replicate
-to a backend C.
-
-Current Workflow
-----------------
-1. Setup Replication
-2. Failure Occurs
-3. Fail over
-4. Promote Secondary Backend
-   a. Freeze backend to prevent manage operations
-   b. Stop cinder volume service
-   c. Update cinder.conf to have backend A replaced with B and B with C
-   d. *Hack db to set backend to no longer be in 'failed-over' state*
-      * This is the step this spec is concerned with
-      * Example:
-        ::
-
-          update services set disabled=0,
-                              disabled_reason=NULL,
-                              replication_status='enabled',
-                              active_backend_id=NULL
-                              where id=3;
-   e. Start cinder volume service
-   f. Unfreeze backend
-
-Use Cases
-=========
-There was a fire in my data center and my primary backend (A) was destroyed.
-Luckily, I was replicating that backend to backend (B). After failing over
-to backend B and repairing the data center, we installed backend C to be a
-new replication target for B.
-
-Proposed change
-===============
-
-Add the following commands to reset the active backend for a host.
-::
-
-    cinder reset-active-backend <host>
-    PUT /os-services/reset_active_backend {"host": <host>}
-
-Equivalent to:
-::
-
-    update services set disabled=0,
-                        disabled_reason=NULL,
-                        replication_status='disabled',
-                        active_backend_id=NULL
-                        where host='<host>';
-
-The following to re-enable replication from backend B -> C.
-::
-
-    cinder replication-enable <host>
-    PUT /os-services/replication_enable {"host": <host>}
-
-Equivalent to:
-::
-
-    update services set replication_status='enabled'
-                        where host='<host>';
-
-Alternatives
-------------
-
-Instead of 'admin' APIs, we add cinder-manage commands.
-::
-
-    cinder-manage reset-active-backend <host>
-    cinder-manage replication-enable <host>
-
-Data model impact
------------------
-
-None
-
-REST API impact
----------------
-
-None
-
-Security impact
----------------
-
-None
-
-Notifications impact
---------------------
-
-None
-
-Other end user impact
----------------------
-
-None
-
-Performance Impact
-------------------
-
-None
-
-Other deployer impact
----------------------
-
-None
-
-Developer impact
-----------------
-
-None
-
-Implementation
-==============
-
-Assignee(s)
------------
-
-Primary assignee:
-  ?
-
-Work Items
-----------
-
-* Implement cinder reset-active-backend API
-* Implement cinder replication-enable API
-* Document post-fail-over recovery in Admin guide
-
-Dependencies
-============
-None
-
-Testing
-=======
-
-None
-
-Documentation Impact
-====================
-
-Documentation in the Admin guide for how to perform a backend promotion.
-
-References
-==========
-
-None
diff --git a/specs/queens/cheesecake-promote-backend.rst b/specs/queens/cheesecake-promote-backend.rst
new file mode 100644
index 00000000..aba4402e
--- /dev/null
+++ b/specs/queens/cheesecake-promote-backend.rst
@@ -0,0 +1,235 @@
+..
+  This work is licensed under a Creative Commons Attribution 3.0 Unported
+  License.
+
+  http://creativecommons.org/licenses/by/3.0/legalcode
+
+========================================
+Promote Replication Target (Cheesecake)
+========================================
+
+https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion
+
+Problem description
+===================
+
+After failing backend A over to backend B, there is no mechanism in Cinder to
+promote backend B to be the master backend so that it can then replicate to a
+backend C. We also lack management commands to help rebuild after a disaster
+if states get out of sync.
+
+Current Workflow
+----------------
+1. Setup Replication
+2. Failure Occurs
+3. Fail over
+4. Promote Secondary Backend
+   a. Freeze backend to prevent manage operations
+   b. Stop cinder volume service
+   c. Update cinder.conf to have backend A replaced with B and B with C
+   d. *Hack db to set backend to no longer be in 'failed-over' state*
+      * This is the step this spec is concerned with
+      * Example:
+        ::
+
+          update services set disabled=0,
+                              disabled_reason=NULL,
+                              replication_status='enabled',
+                              active_backend_id=NULL
+                              where id=3;
+   e. Start cinder volume service
+   f. Unfreeze backend
+
+Use Cases
+=========
+There was a fire in my data center and my primary backend (A) was destroyed.
+Luckily, I was replicating that backend to backend (B). After failing over
+to backend B and repairing the data center, we installed backend C to be a
+new replication target for B.
+
+There is also the case where, for whatever reason, the replication state gets
+out of sync with reality, and the status and active backend id need to be
+adjusted manually by the cloud admin while recovering from a disaster.
+
+Proposed change
+===============
+
+To handle the case where the Cinder DB just needs to be synchronized with the
+real world, we will add the following cinder-manage command to reset the
+active backend for a host. Similar to reset-state for volumes, this will only
+do DB operations. The assumption is that cinder.conf has already been updated
+and that the volume service is stopped, in a disabled state, and frozen. The
+manage command will verify that the service is stopped, disabled, and frozen.
+
+::
+
+    cinder-manage reset-active-backend replication_status=<replication-status> \
+        active_backend_id=<active-backend-id> <backend-host>
+
+Equivalent to:
+::
+
+    update services set replication_status='<replication-status>',
+                        active_backend_id='<active-backend-id>'
+                        where host='<backend-host>';
+
+The default for `replication_status` will be disabled and the default
+`active_backend_id` will be None. The target for this could also be a cluster
+in A-A deployments.
+
+Note: It will be up to the admin to re-enable the service.
+
+This removes the need for an admin to manually run DB commands, but it does
+*NOT* allow for doing this in an "online" recovery. The volume service must
+be offline for this to work safely.
+
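+As a rough, illustrative sketch only (this is not the actual implementation;
+helpers such as `get_service_by_host()` and `save_service()` are hypothetical
+stand-ins for whatever DB accessors the command ends up using), the new
+command boils down to a guarded DB update along these lines:
+
+::
+
+    # Illustrative sketch; helper names are hypothetical placeholders.
+    def reset_active_backend(host, replication_status='disabled',
+                             active_backend_id=None):
+        # Look up the volume service (or cluster) row for this backend host.
+        service = get_service_by_host(host)
+
+        # Refuse to touch a service that is still running, enabled or
+        # unfrozen; the reset is only safe while the backend is offline.
+        if service.is_up or not service.disabled or not service.frozen:
+            raise SystemExit("service must be stopped, disabled and frozen")
+
+        # The actual "promotion": clear the failed-over markers in the DB.
+        service.replication_status = replication_status
+        service.active_backend_id = active_backend_id
+        save_service(service)
+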
+Making this change will require drivers to make adjustments to their
+replication states upon initialization. As it stands, we do not have a clear
+definition of what is allowed to change in cinder.conf and how much the
+driver should support changes to replication targets.
+
+There is an implied contract that when `__init__` or `do_setup` is called the
+driver *should* make its current replication state match what it is given in
+the config. This is problematic today because there is no mechanism for the
+driver to update DB entries like `replication_extended_status` or
+`replication_driver_data`. To close that gap, and to allow for this new
+officially supported way of modifying replication status while offline, we
+will introduce a new driver API method
+::
+
+    update_host_replication(replication_status, active_backend_id, volumes, groups)
+
+This method will be called immediately following `do_setup` and will return a
+model update for the service as well as a list of updates for the volumes,
+and for the groups if the driver supports replication groups. The `groups`
+parameter will default to None if not provided.
+
+The drivers will need to take appropriate steps to get the system into the
+desired state based on the current DB state (which was, in theory, modified
+before startup by the new cinder-manage command) and the current cinder.conf.
+
+If a driver does not implement this method, things will continue to work as
+they do today, and the admin may have to take more drastic measures to
+recover after performing a failover. The goal here is not to break any
+existing functionality, but to add enough infrastructure for drivers to
+behave better.
+
+When we do this we may need some way of fencing to prevent multiple A-A
+driver instances from doing the setup at the same time, as that will likely
+be problematic for some backends and could risk data loss if more than one
+instance is modifying replication states at once. As a simple solution we
+could use a new replication_status such as `updating` for the service, and
+only allow a driver to call `update_host_replication` if the status is not
+set to that. This status is also useful for an admin to see what is happening
+if the update via the driver takes a noticeable amount of time.
+
+Doing it this way should also allow us to later make "online" updates, where
+we can use the same driver hook to modify replication states. This spec and
+the initial implementation do not aim to cover that scenario.
+
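+To make this more concrete, here is a rough, purely illustrative sketch of a
+driver-side implementation and one possible return shape. The backend calls
+(`_promote_on_backend()`, `_replicate_to_new_target()`) are hypothetical
+stand-ins for whatever the backend actually needs to do:
+
+::
+
+    # Illustrative sketch only; the _promote_on_backend() and
+    # _replicate_to_new_target() helpers are hypothetical.
+    def update_host_replication(self, replication_status, active_backend_id,
+                                volumes, groups=None):
+        # Bring the backend in line with what the DB and cinder.conf now say,
+        # e.g. promote the former replication target to be the master.
+        self._promote_on_backend(active_backend_id)
+
+        volume_updates = []
+        for volume in volumes:
+            # Start replicating each volume to the newly configured target.
+            self._replicate_to_new_target(volume)
+            volume_updates.append(
+                {'volume_id': volume['id'],
+                 'updates': {'replication_status': replication_status}})
+
+        # Only populated if the driver supports replication groups.
+        group_updates = []
+
+        service_update = {'replication_status': replication_status,
+                          'active_backend_id': active_backend_id}
+        return service_update, volume_updates, group_updates
+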
+Alternatives
+------------
+
+We could add admin APIs to Cinder. Those APIs could do the DB updates and
+ping the drivers. The downside is that this requires the API and volume
+services to be online, which may be problematic when you are picking up the
+pieces after a disaster.
+
+Later on we can look into doing "online" promotions where the volume service
+does not need to be offline. Similar code in the drivers would be required,
+but the complexity increases rapidly when trying to support this.
+
+There was also discussion about adding new admin APIs which would modify DB
+state that tracks replication info. The downside to this is that we would
+move into a scenario where the running config and state do not match the
+configured state.
+
+Following on that path of tracking replication state in the DB, we could go
+to the extreme and move all of the replication configuration to be done via
+APIs. We could then track state and provide drivers with diffs as the state
+changes. In the longer term that addresses the runtime vs. config state
+disparity, but it would be a significant change in workflow and deployments,
+not to mention that it would require fairly major changes to drivers
+implementing replication.
+
+Data model impact
+-----------------
+
+A new replication status value (e.g. `updating`) for the service/cluster will
+be added.
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Volume service startup will probably take a performance hit, depending on the
+backend and how many replicated volumes need to be modified and updated.
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+Driver maintainers will potentially need to implement this new functionality,
+and be aware of the implications of how and when replication configuration
+and status can be adjusted.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  jbernard
+
+Work Items
+----------
+
+* Implement the cinder-manage reset-active-backend command.
+* Implement volume manager changes to allow `update_host_replication` to be
+  used at startup by drivers.
+* Open a bug against each backend that supports replication and needs an
+  update as a result of this change.
+
+Dependencies
+============
+None
+
+Testing
+=======
+
+None
+
+Documentation Impact
+====================
+
+Documentation in the Admin guide on how to perform a backend promotion, and
+an update to the devref for driver developers explaining what is expected of
+drivers implementing replication.
+
+References
+==========
+
+None