Spec to introduce a backup_status field for volumes and a split-up of the
backup status away from the volume_status.

This is intended to allow for less serialization or blocking of volume actions such as re-attachments, resizing, ... by backup tasks.

Co-Author: Niklas Schwarz <niklas.schwarz@inovex.de>
Change-Id: I43a9fb73150e3738554459a459d63b8891418ad0
2022-12-28 15:09:06 +01:00 · 2022-12-28 15:09:06 +01:00 · d09f77bc62
parent 52cf890584
commit d09f77bc62
1 changed files with 353 additions and 0 deletions
--- a/specs/2023.2/dedicated-volume-backup-status-field.rst
+++ b/specs/2023.2/dedicated-volume-backup-status-field.rst
@@ -0,0 +1,353 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+Dedicated backup status field for volumes
+==========================================
+
+This spec proposes a new `backup_status` field for volumes. It removes the
+blocking or serialization that active backup tasks currently impose on other
+volume tasks such as re-attachments.
+
+
+
+Problem description
+===================
+
+Currently all cinder tasks check the `volume_status` field for a suitable
+volume status before they are launched. While a task is active, its status is
+held and updated via the same field, and errors thrown by a task are also
+communicated back via this field. In essence, this single field acts as a
+locking or synchronisation mechanism that allows only one task to act on a
+volume at any one time.
+
+While this is helpful for coordinating tasks affecting the volume itself,
+applying the same logic to backups is actually not required or helpful:
+
+* Actions on the volume itself (such as `resize` or `attaching`) and backups do
+  not technically relate to each other. The backup driver and block device driver
+  act independently, and backups are read off a dedicated volume snapshot without
+  affecting the volume itself.
+
+* A backup might take quite a long time to finish and blocks any other task
+  on the volume in the meantime. If we assume an 8 TB volume is being
+  backed up in full, then even at a sustained 1 GB/s the backup still takes
+  8000 s, i.e. ~2.2 hrs, to complete. Decoupling this from a state
+  machine perspective (since it already is decoupled for most drivers /
+  implementations, which work via snapshots) seems to be quite beneficial.
+
+
+
+Use Cases
+=========
+
+There are two sides to the use-cases of decoupling backup tasks from other tasks
+on a volume:
+
+1. For cloud operators - Currently the re-attachment of a volume, as it
+   happens during an instance live-migration, is blocked by a concurrently
+   running volume backup. To make matters worse, the potentially long-running
+   backup (task) could also have been triggered by the user and then block
+   administrative actions such as the live-migration of all instances to other
+   hosts in order to take a hypervisor down for maintenance.
+
+   Given sufficiently large volumes or slow backup transfer rates, this could
+   allow users to "lock out" administrative tasks indefinitely.
+
+2. For cloud users - Urgent operational tasks are currently blocked by an active
+   backup task. A running backup blocks the user's ability to quickly resize a
+   volume that is running low on available space or to attach it to another
+   instance. This issue is worsened if the backup task was triggered by the
+   cloud provider or some automatic scheduling.
+
+
+
+Proposed change
+===============
+
+For volumes there shall be a new field `backup_status` to hold the
+backup-related values currently stored in `volume_status`:
+
+* 'backing-up'
+* 'error_backing-up'
+* 'restoring-backup'
+* 'error_restoring'
+
+Those shall then be removed from the list of (valid) values of `volume_status`
+and become the valid values of `backup_status`.
+
+There will be some changes to the conditional checks that gate the start of
+certain tasks, but most of these checks depend on volume status values that
+remain in the `volume_status` field anyway.
+
+
+Alternatives
+------------
+
+There was a discussion around a spec [1] moving all task
+status to a new field. This ended up being far too complex and not really
+suitable for the described use-cases of decoupling volume backups.
+
+
+Data model impact
+-----------------
+
+* An additional field `backup_status` would have to be added to the volume table,
+  together with a change to the list of valid values it might hold.
+
+* This will be introduced via a schema change to the database first,
+  to add the new field.
+
+* This change is then followed by an online update/upgrade that moves the
+  backup-related status values out of the volume status and into the newly
+  dedicated field.
+
+* The valid values for volume_status would then also have to be reduced, as the
+  statuses `backing-up`, `restoring-backup`, `error_backing-up` and
+  `error_restoring` are only to be stored in the backup status. See
+  https://opendev.org/openstack/cinder/src/commit/5c23c9fbe41baef22a71eac4406fd9db269d1271/cinder/objects/fields.py#L168-L190
+
+* The method `conditional_update` needs to support different versions for the
+  volume data model during the update process of the database. In addition all
+  calls to the method in the cinder project have to be updated to use the new
+  field.
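+
+The compare-and-swap idea behind such a conditional update can be sketched as
+follows. This is only an illustration using Python's `sqlite3` module. It is not
+cinder's `conditional_update` implementation, and the table layout, the
+`start_backup` helper and the set of accepted volume states are assumptions.
+
+.. code-block:: python
+
+   import sqlite3
+
+
+   def start_backup(conn, volume_id):
+       """Set backup_status only if the volume is in a state that allows it."""
+       cur = conn.execute(
+           "UPDATE volumes SET backup_status = 'backing-up' "
+           "WHERE id = ? AND status IN ('available', 'in-use') "
+           "AND backup_status IS NULL",
+           (volume_id,),
+       )
+       conn.commit()
+       # Exactly one row changes if the expected state matched, otherwise none.
+       return cur.rowcount == 1
+
+
+   # Hypothetical, heavily reduced volume table for the example.
+   conn = sqlite3.connect(":memory:")
+   conn.execute(
+       "CREATE TABLE volumes (id TEXT PRIMARY KEY, status TEXT, backup_status TEXT)"
+   )
+   conn.execute("INSERT INTO volumes VALUES ('vol-1', 'in-use', NULL)")
+
+   first = start_backup(conn, "vol-1")   # accepted, no backup was running
+   second = start_backup(conn, "vol-1")  # rejected, a backup is already running
+
+Because the check and the write happen in a single statement, a second backup
+request cannot slip in between them, while the volume `status` itself stays
+untouched and free for other tasks.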
+
+
+REST API impact
+---------------
+
+Due to the introduction of a new field and the resulting split of the status
+field for a volume, a new API microversion is required. To serve older
+microversions, an API layer that translates between the two data models is to
+be introduced.
+
+In general the translation layer will translate from
+(status, backup_status) -> status and status -> (status, backup_status) to
+maintain compatibility with older clients expecting all states to be in just one
+status field.
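+
+A minimal sketch of such a translation, purely for illustration: the helper
+names `merge_status` and `split_status` are hypothetical and not part of
+cinder. Two assumptions go beyond this spec: a set `backup_status` takes
+precedence when merging, and the pre-backup volume state is known to the caller
+when splitting (passed here as `previous_status`).
+
+.. code-block:: python
+
+   # Backup related values proposed to move into backup_status.
+   BACKUP_STATES = frozenset(
+       {"backing-up", "error_backing-up", "restoring-backup", "error_restoring"}
+   )
+
+
+   def merge_status(status, backup_status=None):
+       """(status, backup_status) -> status, for older microversions."""
+       # Assumption: a backup state hides the volume state for old clients.
+       return backup_status if backup_status in BACKUP_STATES else status
+
+
+   def split_status(legacy_status, previous_status):
+       """status -> (status, backup_status), for the new data model."""
+       if legacy_status in BACKUP_STATES:
+           # Assumption: the volume state from before the backup is restored.
+           return previous_status, legacy_status
+       return legacy_status, None
+
+For example, `merge_status("in-use", "backing-up")` yields the single legacy
+value `backing-up`, and `split_status("backing-up", "in-use")` yields the pair
+`("in-use", "backing-up")`.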
+
+The additional `backup_status` field would need to be returned to the user, and
+the status of a volume would only allow the reduced set of values. In addition,
+endpoints have to be created to update the backup_status (by the admin only,
+like the current update status endpoints).
+
+**NOTE**: The list of endpoints to be changed is based on the current proposed
+change and is subject to change.
+
+* Show volume
+
+* Update volume
+
+* List volumes
+
+  * show additional field for the backup_status
+
+  * additional filter for the backup_status
+
+* List detailed volumes
+
+  * show additional field for the backup_status
+
+  * additional filter for the backup_status
+
+* Set volume status
+
+* Unset volume status
+
+Additional endpoints (only available with the use of the new microversion):
+
+* Set volume backup_status
+
+* Unset volume backup_status
+
+
+Security impact
+---------------
+
+None
+
+
+Notifications impact
+--------------------
+
+None
+
+
+Other end user impact
+---------------------
+
+
+Performance Impact
+------------------
+
+While not a performance impact per se, having the backup state-machine
+decoupled from the volume_status will reduce the serialization of tasks
+happening for a volume.
+
+
+Other deployer impact
+---------------------
+
+
+Developer impact
+----------------
+
+The status of a volume can only be set to a value from the reduced list.
+All backup-related values have to be set in the backup_status field instead.
+In addition, a backup_status can be added, updated or removed, but only with
+values from the list of valid backup_status values.
+
+As a consequence, all methods that allow concurrent interaction with a volume
+have to be checked and changed to use the backup_status field instead of
+the status field to indicate a running backup process.
+This change should be communicated to other developer teams that rely on the
+cinder API to check the status, so that they either use an old microversion or
+update to use both the status and the backup_status.
+
+Because the status is currently used as a locking mechanism to prevent actions
+to start if an invalid status is reached, the method calls in the api have to
+be updated to also include a check for the backup_status if necessary. Some of
+the work here is currently done by the conditional_update method which needs to
+receive support for the updated volume model. This versioning is only needed if
+an old database model is received. The API always sends the new model even if
+an older API version is used due to the translation layer. This translation
+layer guarantees backwards compatibility with older API versions and translates
+(status, backup_status) -> status and status -> (status, backup_status).
+
+To be able to perform an online migration of the database during an OpenStack
+upgrade, a method is necessary that remaps status -> (status, backup_status)
+and saves the result in the database. This method should only be called once
+the schema update is done. It allows older OpenStack versions to perform as
+before and set or check their status as needed. These methods should be
+removed in later releases.
+
+
+Upgrade impact
+--------------
+
+None
+
+
+Implementation
+==============
+
+The following existing restrictions on which actions / tasks can happen
+on a volume have to be maintained with the changes implemented:
+
+* Reject volume deletion while volume is currently being backed up
+* Reject concurrent backups of a volume if one is already in progress
+* Reject volume or "block storage" migration if a backup is currently running
+* Return a volume status of `restoring` when there is a backup being restored to
+  this particular volume.
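+
+The three rejection rules above could be expressed as a single guard. This is
+a sketch only: the action labels and the `is_allowed` helper are hypothetical
+and do not correspond to existing cinder identifiers.
+
+.. code-block:: python
+
+   # Actions that must still be rejected while a backup is running.
+   BLOCKED_DURING_BACKUP = frozenset({"delete", "backup", "migrate"})
+
+
+   def is_allowed(action, backup_status):
+       """Return False only for actions that conflict with a running backup."""
+       if backup_status == "backing-up" and action in BLOCKED_DURING_BACKUP:
+           return False
+       return True
+
+With this guard, `is_allowed("delete", "backing-up")` is rejected, whereas an
+action outside the blocked set, such as a resize or re-attachment, is no longer
+held back by the backup.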
+
+
+
+
+
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Christian Rohmann (IRC: crohmann)
+
+Other contributors:
+
+
+Work Items
+----------
+
+* Split up the status field into the above-mentioned status and backup_status
+  fields
+
+  * Data model in Python
+
+  * SQL data model and constraints
+
+  * additional method(s) for the model to allow an online migration of the database
+
+* Introduce validation methods for the two new fields to guarantee that no
+  wrong status is set to the fields
+
+* Add versioning to the conditional_update method for `old` database models
+
+* Update method calls to conditional_update to use status and backup_status
+
+* Introduce the API layer as a translator to serve older microversions
+
+* Change the API-endpoints including query parameters
+
+* Documentation
+
+  * Breaking changes
+
+  * API Documentation
+
+    * Models
+
+    * Endpoints
+
+  * Upgrade guide/scripts for older database models
+
+    * online migration
+
+    * schema updates
+
+
+Dependencies
+============
+
+Dependencies on other developer teams have to be communicated so that they
+either keep using the old microversion to avoid breaking changes or switch to
+the new split-up fields. This change should especially be communicated to the
+nova team, which regularly checks the status of an attached volume.
+
+
+Testing
+=======
+
+* Since only a reduced set of states is handled via the `volume_status` field
+  and the `backup_status` field is newly introduced, some functional tests
+  have to be adapted to use the new data model.
+
+* Further tests have to be added to ensure the translation layer for older API
+  microversions works as expected. E.g. the backup status is presented via
+  `volume_status` for an older microversion and via `backup_status` for the
+  new version.
+
+* Because the `conditional_update` method needs to support versioning in this
+  release, tests should be written to verify that the versioning happens
+  correctly.
+
+
+Documentation Impact
+====================
+
+* Add a release note explaining the motivation and effect of the change
+
+* Document the state-machines for the volume itself and backup and
+  restore tasks.
+
+* Document the translation layer for the older microversions and how the
+  translation behaves.
+
+
+
+References
+==========
+
+[1] Previously proposed spec to add a task status: https://review.opendev.org/c/openstack/cinder-specs/+/818551
+
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - 2023.2
+     - Introduced