Merge "Spec for cross-cell resize"
This commit is contained in:
commit
4d4be39199
|
@ -0,0 +1,772 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.

http://creativecommons.org/licenses/by/3.0/legalcode

=================
Cross-cell resize
=================

https://blueprints.launchpad.net/nova/+spec/cross-cell-resize

Expand resize (cold migration) support across multiple cells.

Problem description
===================

Multi-cell support was added to the controller services (API, conductor,
scheduler) in the Pike release. However, server move operations, like resize,
are restricted to the cell in which the instance currently lives. Since
it is common for deployments to shard cells by hardware type, and therefore
to isolate flavors to that hardware, the inability to resize across cells is
problematic when a deployment wants to move workloads off the old hardware
(flavors) and onto new hardware.

Use Cases
---------

As a large multi-cell deployment which shards cells by hardware generation,
I want to decommission old hardware in older cells and have new and existing
servers move to newer cells running newer hardware using newer flavors without
users having to destroy and recreate their workloads.

As a user, I want my servers to retain their IPs, volumes and UUID
while being migrated to another cell.

Proposed change
===============

This is a complicated change, as the proof of concept patch [1]_ shows.
As such, I will break down this section into sub-sections to cover the various
aspects of what needs to be implemented and why.

Keep in mind that at a high level, this is mostly a large data migration from
one cell to another.

This spec attempts to provide a high level design based on prototypes using
both a shelve-based approach and, after initial spec review [2]_, an approach
modeled closer to the traditional resize flow. This version of the spec focuses
on the latter approach (without directly calling shelve methods). Unforeseen
issues will arise during implementation, so the spec avoids getting into
low-level implementation details and instead focuses on the general steps
needed and known issues. Open questions are called out as necessary.

Why resize?
-----------

We are doing resize because cells can be sharded by flavors and resize is the
only non-admin way (by default) for users to opt into migrating from one cell
with an old flavor (gen1) to a new flavor (gen2) in a new cell. This makes it
easier for admins/operators to drain old cells with old hardware.

Terms
-----

Common terms used throughout the spec.

* Source cell: this is the cell in which the instance "lives" when the resize
  is initiated.

* Target cell: this is the cell to which the instance moves during a
  cross-cell resize.

* Resized instance: an instance with status ``VERIFY_RESIZE``.

* Super conductor: in a `split-MQ`_ deployment, the super conductor runs
  at the "top" and has access to the API database and thus can communicate
  with the cells over RPC and directly with the cell databases.

.. _split-MQ: https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#multiple-cells

Assumptions
-----------

* There is no SSH access between compute hosts in different cells.

* The image service (glance), persistent volume storage (cinder) and tenant
  networks (neutron) span cells.

Goals
-----

* Minimal changes to the overall resize flow as seen from both an external
  (API user, notification consumer) and internal (nova developer) perspective.

* Maintain the ability to easily roll back to the source cell in case the
  resize fails.

Policy rule
-----------

A new policy rule ``compute:servers:resize:cross_cell`` will be added. It will
default to ``!`` which means no users are allowed. This is both backward
compatible and flexible so that operators can determine which users in their
cloud are allowed to perform a cross-cell resize. For example, it probably
makes sense for operators to allow only system-level admins or test engineers
to perform a cross-cell resize initially.

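A minimal sketch of how the rule might be defined, following the
``oslo.policy`` ``DocumentedRuleDefault`` pattern used by nova's
policy-in-code (the module placement and operation list are illustrative)::

    from oslo_policy import policy

    cross_cell_resize_policies = [
        policy.DocumentedRuleDefault(
            name='compute:servers:resize:cross_cell',
            # '!' means no one is allowed by default; operators can
            # override it, e.g. to 'rule:admin_api', to enable the flow.
            check_str='!',
            description='Resize a server across cells.',
            operations=[{
                'method': 'POST',
                'path': '/servers/{server_id}/action (resize)',
            }]),
    ]
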
Resize flow
-----------

This describes the flow of a resize up until the point that the server
goes to ``VERIFY_RESIZE`` status.

API
~~~

The API will check if the user is allowed, by the new policy rule, to perform
a cross-cell resize *and* if the ``nova-compute`` service on the source host
is new enough to support the cross-cell resize flow. If so, the API will
modify the RequestSpec to tell the scheduler to not restrict hosts to the
source cell, but the source cell will be "preferred" by default.

There are two major reasons why we perform this check in the API:

1. The `2.56 microversion`_ allows users with the admin role to specify a
   target host during a cold migration. Currently, the API validates that the
   `target host exists`_ which will only work for hosts in the same cell in
   which the instance lives (because the request context is targeted to that
   cell). If the request is allowed to perform a cross-cell resize then we
   will adjust the target host check to allow for other cells as well.

2. Currently, the resize/migrate API actions are synchronous until conductor
   RPC casts to ``prep_resize()`` on the selected target host. This could be
   problematic during a cross-cell resize if the conductor needs to validate
   potential target hosts since the REST API response could time out. Until
   the `2.34 microversion`_, the live migrate API had the same problem.
   If the request is allowed to perform a cross-cell resize then we will RPC
   cast from API to conductor.

.. _2.56 microversion: https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id51
.. _target host exists: https://github.com/openstack/nova/blob/c295e395d/nova/compute/api.py#L3570
.. _2.34 microversion: https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id31

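As a rough illustration, the API-side gate might look like the following
sketch (the helper and constant names, like ``MIN_COMPUTE_CROSS_CELL_RESIZE``,
are illustrative rather than settled interfaces)::

    def _allow_cross_cell_resize(context, instance):
        """Determine if this resize may select hosts in other cells."""
        # context.can() with fatal=False returns a boolean rather than
        # raising PolicyNotAuthorized.
        allowed = context.can(
            'compute:servers:resize:cross_cell', fatal=False,
            target={'project_id': instance.project_id})
        if not allowed:
            return False
        # Also require the source compute service to be new enough to
        # run the new flow; the minimum version constant is hypothetical.
        source_service = objects.Service.get_by_compute_host(
            context, instance.host)
        return source_service.version >= MIN_COMPUTE_CROSS_CELL_RESIZE
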
Scheduler
~~~~~~~~~

A new ``CrossCellWeigher`` will be introduced which will prefer hosts from the
source cell by default. A configurable multiplier will be added to control the
weight in case an operator wants to prefer cross-cell migrations. This weigher
will be a noop for all non-cross-cell move operations.

Note that once the scheduler picks a primary selected host, all alternate
hosts come from the `same cell`_.

.. _same cell: https://github.com/openstack/nova/blob/c295e395d/nova/scheduler/filter_scheduler.py#L399

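A minimal sketch of such a weigher, following the existing
``BaseHostWeigher`` pattern (the option name and the way the source cell is
recorded on the RequestSpec are illustrative)::

    from nova.scheduler import weights

    class CrossCellWeigher(weights.BaseHostWeigher):
        """Prefers hosts in the instance's current (source) cell."""

        def weight_multiplier(self):
            # New configurable multiplier; a positive value (the
            # default) prefers staying in the source cell.
            return CONF.filter_scheduler.cross_cell_move_weight_multiplier

        def _weigh_object(self, host_state, request_spec):
            # Noop (weight 0) for anything that is not a cross-cell move.
            if not getattr(request_spec, 'cross_cell_move', False):
                return 0.0
            # Hosts in the source cell weigh higher by default.
            if host_state.cell_uuid == request_spec.source_cell_uuid:
                return 1.0
            return 0.0
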
(Super)Conductor
~~~~~~~~~~~~~~~~

The role of conductor will be to synchronously orchestrate the resize between
the two cells. Given the assumption that computes in different cells do not
have SSH access to each other, the traditional resize flow of transferring
disks over SSH will not work.

The ``MigrationTask`` will check the selected destinations from the scheduler
to see if they are in another cell and if so, call off to a new set of
conductor tasks to orchestrate the cross-cell resize. Conductor will set
``Migration.cross_cell_move=True`` which will be used in the API to control
confirm/revert logic.

A new ``CrossCellMigrationTask`` will orchestrate the following sub-tasks which
are meant to mimic the traditional resize flow and will leverage new compute
service methods.

**Target DB Setup**

Before we can perform any checks on the destination host, we have to first
populate the target cell database with the instance and its related data, e.g.
block device mappings, network info cache, instance actions, etc.

In order to hide the target cell instance from the API when listing servers,
the instance in the target cell will be created with a ``hidden=True`` field
which will be used to filter out these types of instances from the API.
Remember that at this point, the instance mapping in the API points at the
source cell, so ``GET /servers/{server_id}`` would still only show details
about the instance in the source cell. We use the new ``hidden`` field to
prevent leaking out the wrong instance to ``GET /servers/detail``. We may also
do this for the related ``migrations`` table record to avoid returning multiple
instances of the same migration record to ``GET /os-migrations``
(coincidentally the ``migrations`` table already has an unused ``hidden``
column).

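A minimal sketch of how the cell DB query layer might apply that filter,
assuming a SQLAlchemy query against the existing ``Instance`` model (the
helper name is illustrative)::

    from sqlalchemy import false, null, or_

    def _exclude_hidden_instances(query, filters):
        # Unless the caller explicitly asks for hidden instances,
        # filter them out; NULL is treated as not hidden for rows
        # created before the schema migration ran.
        if not filters.pop('hidden', False):
            query = query.filter(or_(
                models.Instance.hidden == false(),
                models.Instance.hidden == null()))
        return query
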
**Prep Resize at Dest**

Conductor will make a synchronous RPC call (using ``long_rpc_timeout``) to a
new method ``prep_snapshot_based_resize_at_dest`` on the dest compute service
which will:

* Call ``ResourceTracker.resize_claim()`` on the potential dest host in the
  target cell to claim resources prior to starting the resize. Note that
  VCPU, MEMORY_MB and DISK_GB resources will actually be claimed (allocated)
  via placement during scheduling, but we need to make the ``resize_claim()``
  for NUMA/PCI resources which are not yet modeled in placement, and in order
  to create the ``MigrationContext`` record.

* Verify the selected target host to ensure ports and volumes will work.
  This validation will include creating port bindings on the target host
  and ensuring volume attachments can be connected to the host.

If either of these steps fails, the target host will be rejected. At that
point, the conductor task will loop through alternate hosts looking for one
that works. If the migration fails at this point (runs out of hosts), then the
migration status changes to ``error`` and the instance status goes back to
its previous state (either ``ACTIVE`` or ``ERROR``).

Conductor will then copy the ``instance.migration_context`` from the target DB
to the source DB. This is necessary for the API to route
``network-vif-plugged`` events later when spawning the guest in the target
cell.

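A rough sketch of the new dest compute method (the helper names it calls are
illustrative)::

    def prep_snapshot_based_resize_at_dest(self, ctxt, instance, flavor,
                                           migration, limits):
        try:
            # Claim NUMA/PCI resources not yet modeled in placement and
            # create the MigrationContext; VCPU/MEMORY_MB/DISK_GB were
            # already allocated via placement during scheduling.
            with self.rt.resize_claim(ctxt, instance, flavor,
                                      self._nodename, migration,
                                      limits=limits):
                # Validate networking and storage against this host by
                # creating (inactive) port bindings and new volume
                # attachments for the target host.
                self._validate_ports_and_volumes_at_dest(ctxt, instance)
        except Exception:
            # Any failure rejects this host; conductor will try the
            # alternate hosts returned by the scheduler.
            raise exception.MigrationPreCheckError(
                reason='Failed to prepare for resize at destination.')
        # Conductor copies this to the source cell DB for event routing.
        return instance.migration_context
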
**Prep Resize at Source**

Conductor will make a synchronous RPC call (using ``long_rpc_timeout``) to a
new method ``prep_snapshot_based_resize_at_source`` on the source compute
service which will behave very similarly to how shelve works, but also
coincides with how the ``resize_instance`` method works during a traditional
resize:

* Power off the instance.

* For non-volume-backed instances, create and upload a snapshot image of the
  root disk. Like shelve, this snapshot image will be used temporarily during
  the resize and upon successful completion will be deleted. The old/new
  image_ref will be stored in the migration_context.

* Destroy the guest on the hypervisor but retain disks, i.e. call
  ``self.driver.destroy(..., destroy_disks=False)``. This is necessary to
  disconnect volumes and unplug VIFs from the source host, and is actually
  very similar to the ``migrate_disk_and_power_off`` method called on the
  source host during a normal resize. Note that we do not free up tracked
  resources on the source host at this point nor change the instance host/node
  values in the database in case we revert or need to recover from a failed
  migration.

* Delete old volume attachments and update the BlockDeviceMapping records
  with new placeholder volume attachments which will be used on the dest host.

* Open question: at this point we may want to activate port bindings for the
  dest host, but that may not be necessary (that is not done as part of
  ``resize_instance`` on the source host during traditional resize today).
  If the ports are bound to the dest host and the migration fails, trying to
  recover the instance in the source cell via rebuild may not work (see
  `bug 1659062`_) so maybe port binding should be delayed, or we have to be
  careful about rolling those back to the source host.

.. _bug 1659062: https://bugs.launchpad.net/nova/+bug/1659062

If the migration fails at this point, any snapshot image created should be
deleted. Recovering the guest on the source host should be as simple as
hard rebooting the server (which is allowed for servers in ``ERROR`` status).

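A rough sketch of the new source compute method, mirroring the steps above
(helper names are illustrative)::

    def prep_snapshot_based_resize_at_source(self, ctxt, instance,
                                             migration, snapshot_id=None):
        self._power_off_instance(ctxt, instance)
        if snapshot_id is not None:
            # Non-volume-backed: upload the root disk to glance; the
            # image is temporary and deleted once the resize completes.
            self.driver.snapshot(
                ctxt, instance, snapshot_id,
                update_task_state=lambda *args, **kwargs: None)
        # Destroy the guest but retain its disks so a failed migration
        # can be recovered by hard rebooting in the source cell. Tracked
        # resources and instance.host/node are deliberately untouched.
        network_info = self.network_api.get_instance_nw_info(ctxt, instance)
        block_device_info = self._get_instance_block_device_info(
            ctxt, instance)
        self.driver.destroy(ctxt, instance, network_info,
                            block_device_info=block_device_info,
                            destroy_disks=False)
        # Swap the old volume attachments for placeholder attachments
        # that the dest host will use later.
        self._terminate_volume_connections(ctxt, instance)
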
**Finish Resize at Dest**

At this point we are going to switch over to the dest host in the target cell
so we need to make sure any DB updates required from the source cell to the
target cell are made, for example, task_state, power_state, availability_zone
values, instance action events, etc.

Conductor will make a synchronous RPC call (using ``long_rpc_timeout``) to a
new method ``finish_snapshot_based_resize_at_dest`` on the dest compute
service which will behave very similarly to how unshelve works, but also
coincides with how the ``finish_resize`` method works during a traditional
resize:

* Apply the migration context and update the instance record for the new
  flavor and host/node information.

* Update port bindings / PCI mappings for the dest host.

* Prepare block devices (attach volumes).

* Spawn the guest on the hypervisor which will connect volumes and plug VIFs.
  The new flavor will be used and if a snapshot image was previously created
  for a non-volume-backed instance, that image will be used for the root disk.
  At this point, the virt driver should wait for the ``network-vif-plugged``
  event to be routed from the API before continuing.

* Delete the temporary snapshot image (if one was created). This is similar to
  how unshelve works where the shelved snapshot image is deleted. At this point
  deleting the snapshot image is OK since the guest is spawned on the dest host
  and in the event of a revert or recovery needed on the source, the source
  disk is still on the source host.

* Mark the instance as resized.

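A rough sketch of that method (the exact signature and helpers are
illustrative)::

    def finish_snapshot_based_resize_at_dest(self, ctxt, instance,
                                             migration, image_meta,
                                             snapshot_id=None):
        # Apply the MigrationContext so the instance record reflects
        # the new flavor and this host/node.
        instance.apply_migration_context()
        instance.host = self.host
        instance.node = self._nodename
        # Activate port bindings / update PCI mappings for this host.
        self.network_api.migrate_instance_finish(ctxt, instance, migration)
        # Attach volumes, then spawn; the virt driver waits for the
        # network-vif-plugged event routed from the API.
        bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
            ctxt, instance.uuid)
        block_device_info = self._prep_block_device(ctxt, instance, bdms)
        network_info = self.network_api.get_instance_nw_info(ctxt, instance)
        self.driver.spawn(ctxt, instance, image_meta, injected_files=[],
                          admin_password=None, allocations={},
                          network_info=network_info,
                          block_device_info=block_device_info)
        if snapshot_id is not None:
            # The temporary snapshot is no longer needed; the source
            # disk is retained on the source host for revert/recovery.
            self.image_api.delete(ctxt, snapshot_id)
        # Mark the instance as resized.
        instance.vm_state = vm_states.RESIZED
        instance.task_state = None
        instance.save()
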
Back in conductor, we need to:

* Mark the target cell instance record as ``hidden=False`` so it will show
  up when listing servers. Note that because of how the `API filters`_
  duplicate instance records, even if the user is listing servers at this
  exact moment only one copy of the instance will be returned.

* Update the instance mapping to point at the target cell. This is so that
  the confirm/revert actions will be performed on the resized instance in the
  target cell rather than the destroyed guest in the source cell.
  Note that we could do this before finishing the resize on the dest host, but
  it makes sense to defer this until the instance is successfully resized
  on the dest host because if that fails, we want to be able to rebuild in the
  source cell to recover the instance.

* Mark the source cell instance record as ``hidden=True`` to hide it from the
  user when listing servers.

.. _API filters: https://github.com/openstack/nova/blob/c295e395d/nova/compute/api.py#L2684

Confirm flow
------------

When confirming a resized server, if the ``Migration.cross_cell_move`` value
is True, the API will:

* RPC call to the source compute to destroy the guest (including disks)
  similar to the ``driver.confirm_migration`` method and drop the move claim
  (free up tracked resource usage for the source node).

* Delete migration-based resource allocations against the source compute node
  resource provider (this can happen in the source compute or the API).

* Delete the instance and its related records from the source cell database.

* Update the ``Migration.status`` to ``confirmed`` in the target cell DB.

* Drop the migration context on the instance in the target cell DB.

* Change the instance vm_state to ``ACTIVE`` or ``STOPPED`` based on its
  current power_state in the target cell DB (the user may have manually
  powered on the guest to verify it before confirming the resize), as
  sketched below.

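A minimal sketch of that final status fix-up, mirroring what
``confirm_resize`` does for a same-cell resize today::

    # Runs against the instance record in the target cell DB.
    instance.drop_migration_context()
    if instance.power_state == power_state.RUNNING:
        instance.vm_state = vm_states.ACTIVE
    else:
        # The user may have left the guest powered off after verifying.
        instance.vm_state = vm_states.STOPPED
    instance.task_state = None
    instance.save()
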
Revert flow
-----------

Similar to the confirm flow, a cross-cell revert resize will be identified
via the ``Migration.cross_cell_move`` field in the API. If True, the API will
RPC cast to a new conductor method ``revert_cross_cell_resize`` which will
execute a new ``CrossCellRevertResizeTask``. That task will:

* Update the instance and its related records in the source cell database
  based on the contents of the target cell database. This is especially
  important for things like:

  * BDMs, because volumes can be attached to/detached from a resized server.
  * The ``REVERT_RESIZE`` instance action record created by the API in the
    target cell. That is needed to track events during the revert in the
    source cell compute.

  Thankfully the API does not allow attaching/detaching ports or changing
  server tags on a resized server so we do not need to copy those back across
  to the source cell database.

* Update the instance mapping to point at the source cell. This needs to
  happen before spawning in the source cell so that the
  ``network-vif-plugged`` event from neutron is routed properly.

* Mark the target cell DB instance as ``hidden=True`` to hide it from the API
  while listing servers as we revert.

* RPC call the dest compute to terminate the instance (destroy the guest,
  disconnect volumes and ports, free up tracked resources).

* Destroy the instance and its related records from the target cell database.

* Update the ``Migration.status`` to ``reverted`` in the source cell DB.

* RPC call the source compute to revert the migration context, apply the old
  flavor and original image, attach volumes and update port bindings, power on
  the guest (like in ``driver.finish_revert_migration``) and swap source node
  allocations held by the migration record in placement to the instance
  record.

Note that an alternative to keeping the source disk during resize is to
use the snapshot image during revert and just spawn from that (rather than
power on from the retained disk). However, that means needing to potentially
download the snapshot image back to the source host and ensure the snapshot
image is cleaned up for both confirm and revert rather than just at the end
of the resize. It would also complicate the ability to recover the guest
on the source host by simply hard rebooting it in case the resize fails.

Limitations
-----------

1. The `_poll_unconfirmed_resizes`_ periodic task, which can be configured to
   automatically confirm pending resizes on the target host, will not support
   cross-cell resizes because doing so would require an up-call to the API to
   confirm the resize and clean up the source cell database. Orchestrating
   automatic cross-cell resize confirmation could be a new periodic task
   written in the conductor service as a future enhancement.

.. _\_poll_unconfirmed_resizes: https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7082

Known issues
------------

1. Rather than conductor making synchronous RPC calls during the resize with
   the ``long_rpc_timeout`` configuration option, a new option could be added
   specifically for cross-cell (snapshot-based) resize operations. Given a
   snapshot of a large disk could take a long time to upload (or download), it
   might be better to add new options for controlling those timeouts. For the
   initial version of this feature we will re-use ``long_rpc_timeout`` and we
   can add more granular options in the future if necessary; the sketch below
   shows how the existing option would be applied.

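For illustration, the conductor-side calls would be prepared with the
standard ``oslo.messaging`` client arguments nova already uses for
long-running calls (the RPC API version shown is illustrative)::

    cctxt = self.router.client(ctxt).prepare(
        server=host, version='5.0',
        call_monitor_timeout=CONF.rpc_response_timeout,
        timeout=CONF.long_rpc_timeout)
    migration_context = cctxt.call(
        ctxt, 'prep_snapshot_based_resize_at_dest',
        instance=instance, flavor=flavor, migration=migration,
        limits=limits)
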
2. One semantic difference in the API will be the events recorded under the
   instance actions during a resize, since the events are created via
   the ``wrap_instance_event`` decorator on the compute methods, and when
   using new methods with new names there will be new events compared to a
   normal resize. This could be mitigated by passing a specific name to
   the decorator rather than just using the function name as it does today.
   Given there are no API guarantees about the events that show up under an
   action record, and these have always been internal details that leak out
   of the API, we will not try to overwrite the new function/event names,
   e.g. recording a ``compute_prep_resize`` event when calling the
   ``prep_snapshot_based_resize_at_dest`` method.

Edge cases
----------

1. If the user deletes a server in ``VERIFY_RESIZE`` status, the API confirms
   the resize to clean up the source host before deleting the server from the
   dest host [3]_. This code will need to take a cross-cell resize into
   account and clean up appropriately (clean up the source host and delete
   records from the source cell).

2. When `routing network events`_ in the API, if the instance has a migration
   context it will look up the migration record based on id rather than uuid,
   which may be wrong if the migration context was created in a different
   cell database where the id primary key on the migration record is
   different. It is not clear if this will be a problem but it can be dealt
   with in a few ways:

   * Store the migration.uuid on the migration context and look up the
     migration record using the uuid rather than the id.
   * When copying the migration context from the target cell DB to the source
     cell DB, update the ``MigrationContext.migration_id`` to match the
     ``Migration.id`` of the source cell migration record.

3. It is possible to attach/detach volumes to/from a resized server. Because
   of this, mirroring those block device mapping changes from the target cell
   DB to the source cell DB during revert adds complication, but it is
   manageable [4]_. The ability to do this to resized servers is not well
   known, and preserving volumes attached during the revert is arguably not
   officially supported, but because that is what works today we should try
   to support it for cross-cell resize.

.. _routing network events: https://github.com/openstack/nova/blob/c295e395d/nova/compute/api.py#L4883

Alternatives
------------

Lift and shift
~~~~~~~~~~~~~~

Users (or cloud operators) could force existing servers to be snapshotted,
destroyed and then re-created from the snapshot with a new flavor in a new
cell. It is assumed that deployments already have some kind of tooling like
this for moving resources across sites or regions. While normal resize is
already disruptive to running workloads, this alternative is especially
problematic if specific volumes and ports are attached, i.e. the IP(s) and
server UUID would change. In addition, it would require all multi-cell
deployments to orchestrate their own cross-cell migration tooling.

Shelve orchestration
~~~~~~~~~~~~~~~~~~~~

An alternative design to this spec is found in the PoC [1]_ and initial
version of this spec [2]_. That approach opted to try and re-use the existing
shelve and unshelve functions to:

* Snapshot and shelve offload out of the source cell.
* Unshelve from snapshot in the target cell.
* On revert, shelve offload from the target cell and then unshelve in the
  source cell.

The API, scheduler and database manipulation logic was similar *except* since
shelve was used, the instance was offloaded from the source cell, which could
complicate getting the server *back* to the original source host on revert
and require rescheduling to a different host in the source cell.

In addition, that approach resulted in new task states and notifications
related to shelve which would not be found in a normal resize, which could be
confusing, and complicated the logic in the shelve/unshelve code since it had
to deal with resize conditions.

Comparing what is proposed in this spec versus the shelve approach:

Pros:

- Arguably cleaner with new methods to control task states and notifications;
  no complicated dual-purpose logic in shelve to handle a resize, i.e. do not
  repeat the evacuate/rebuild debt.
- The source instance is mostly untouched which should make revert and
  recovery simpler.

Cons:

- Lots of new code, some of which is heavily duplicated with shelve/unshelve.

Long-term it should be better to try for a hybrid approach (what is in this
spec) to have new compute methods to control notifications and task states to
closer match a traditional resize flow, but mix in shelve/unshelve style
operations, e.g. snapshot, guest destroy/spawn.

Data model impact
-----------------

* A ``cross_cell_move`` boolean column, which defaults to False, will be added
  to the ``migrations`` cell DB table and related versioned object.

* A ``hidden`` boolean column, which defaults to False, will be added to the
  ``instances`` cell DB table and related versioned object.

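A minimal sketch of the corresponding cell DB schema migration, in the
sqlalchemy-migrate style nova's DB migrations use (the surrounding
boilerplate is illustrative)::

    from sqlalchemy import Boolean, Column, MetaData, Table

    def upgrade(migrate_engine):
        meta = MetaData(bind=migrate_engine)

        migrations = Table('migrations', meta, autoload=True)
        migrations.create_column(
            Column('cross_cell_move', Boolean, default=False))

        instances = Table('instances', meta, autoload=True)
        instances.create_column(Column('hidden', Boolean, default=False))
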
REST API impact
---------------

There will be no explicit request/response schema changes to the REST API.
Normal resize semantics, like maintaining the same task state transitions and
keeping the instance either ``ACTIVE`` or ``SHUTDOWN`` at the end, will remain
intact.

While the instance is resized and has records in both cells, the API will
have to take care to filter out duplicate instance and migration records when
listing those across cells (using the ``hidden`` field).

Security impact
---------------

As described in the `Policy rule`_ section, a new policy rule will be added
to control which users can perform a cross-cell resize.

Notifications impact
--------------------

Similar to task state transitions in the API, notifications should remain
the same as much as possible. For example, the *Prep Resize at Dest* phase
should emit the existing ``instance.resize_prep.start/end`` notifications.
The *Prep Resize at Source* phase should emit the existing
``instance.resize.start/end/error`` notifications.

The bigger impact will be to deployments that have a notification queue per
cell because the notifications will stop in one cell and start in another,
or be intermixed during the resize itself (prep at dest is in the target cell
while prep at source is in the source cell). It is not clear what impact this
could have on notification consumers like ceilometer though.

If desired, new versioned notifications (or fields on existing notifications)
could be added to denote that a cross-cell resize is being performed, either
as part of this blueprint or as a future enhancement.

Other end user impact
---------------------

As mentioned above, instance action events and versioned notification behavior
may be different.

Performance Impact
------------------

Clearly a cross-cell resize will perform worse than a normal resize
given the database coordination involved and the need to snapshot an
image-backed instance out of the source cell and download the snapshot image
in the target cell.

Also, deployments which enable this feature may need to scale out their
conductor workers, which will be doing a lot of the orchestration work
rather than the inter-compute coordination of a normal resize. Similarly,
``rpc_conn_pool_size`` may need to be increased because of the synchronous
RPC calls involved.

Other deployer impact
---------------------

Deployers will be able to control who can perform a cross-cell resize in
their cloud and also be able to tune parameters used during the resize,
like the RPC timeout.

Developer impact
----------------

A new ``can_connect_volume`` compute driver interface will be added with
the following signature::

    def can_connect_volume(self, context, connection_info, instance):

That will be used during the validation step to ensure volumes attached to
the instance can connect to the destination host in the target cell. The code
itself will be relatively minor and just involve parts of an existing volume
attach/detach operation for the driver.

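For example, a libvirt-style implementation might re-use the existing
host-side connect/disconnect plumbing non-destructively (a sketch under that
assumption, not the settled implementation)::

    def can_connect_volume(self, context, connection_info, instance):
        """Check if this host can connect the given volume."""
        try:
            # Set up the host-side connection (e.g. iSCSI login) to
            # prove connectivity...
            self._connect_volume(context, connection_info, instance)
        except Exception:
            return False
        else:
            return True
        finally:
            # ...then immediately tear the connection back down; a
            # failure here should not fail the check itself.
            try:
                self._disconnect_volume(context, connection_info, instance)
            except Exception:
                pass
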
Upgrade impact
--------------

There are three major upgrade considerations to support this feature.

* RPC: given the RPC interface changes to the compute and conductor services,
  those services will naturally need to be upgraded before a cross-cell resize
  can be performed.

* Cinder: because the validation relies on volume attachments, cinder
  will need to be running at least Queens level code with the
  `3.44 microversion`_ available.

* Neutron: because the validation relies on port bindings, neutron will
  need to be running at least Rocky level code with the
  ``Port Bindings Extended`` API extension enabled.

.. _3.44 microversion: https://docs.openstack.org/cinder/latest/contributor/api_microversion_history.html#id41

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Matt Riedemann <mriedem.os@gmail.com> (irc: mriedem)

Other contributors:
  None

Work Items
----------

At a high level this is the proposed series of changes that need to be made
in order, although realistically some of the control plane changes could be
made in any order as long as the cold migrate task change comes at the end.

* DB model changes (``migrations.cross_cell_move``, ``instances.hidden``).

* Various versioned object changes for tracking a cross-cell move in
  the RequestSpec, looking up a Migration by UUID, creating InstanceAction
  and InstanceActionEvent records from existing data, etc.

* Scheduler changes to select destination hosts from multiple cells during
  a cross-cell move and weigh them so the "source" cell is preferred by
  default.

* Possible changes to the ``MigrationContext`` object for new fields like
  ``old_image_ref``, ``new_image_ref``, ``old_flavor``, ``new_flavor``,
  ``old_vm_state`` (this will depend on implementation).

* nova-compute RPC interface changes for the prep/validate at dest, prep
  at source, and finish resize at dest operations.

* Adding new conductor tasks for orchestrating a cross-cell resize including
  reverting a resize.

* API plumbing changes to handle confirming/reverting a cross-cell resize.

* Add the new policy rule and make the existing resize flow use it to tell the
  scheduler whether or not target hosts can come from another cell, and if the
  target host is from another cell, to run the new cross-cell resize conductor
  task to orchestrate the resize rather than the traditional
  compute-orchestrated flow (where the source and target nova-compute services
  SSH and RPC between each other).

Dependencies
============

None

Testing
=======

The existing functional tests in the PoC change should give a good idea of
the types of wrinkles that need to be tested. Several obvious tests include:

* Resize both image-backed and volume-backed servers.

* Ensure allocations in the placement service, and resource reporting from
  the ``os-hypervisors`` API, are accurate at all points of the resize, i.e.
  while the server is in ``VERIFY_RESIZE`` status, and after it is confirmed
  or reverted.

* Ensure volume attachments and port bindings are managed properly, i.e. no
  resources are leaked.

* Tags, both on the server and associated with virtual devices (volumes and
  ports), survive across the resize to the target cell.

* Volumes attached/detached to/from a server in ``VERIFY_RESIZE`` status are
  managed properly in the case of resize confirm/revert.

* During a resize, resources which span cells, like the server and its
  related migration, are not listed with duplicates out of the API.

* Perform a resize with at-capacity computes, meaning that when we revert
  we can only fit the instance with the old flavor back onto the source host
  in the source cell.

* Ensure start/end events/notifications are aligned with a normal same-cell
  resize.

* Resize both an active and a stopped server and assert the original
  status is retained after confirming and reverting the resize.

* Delete a resized server and assert resources and DB records are properly
  cleaned up from both the source and target cell.

* Test a failure scenario where the server is recovered via rebuild in the
  source cell.

Unit tests will be added for the various units of change leading up to the
end of the series, where the functional tests cover the integrated flows.
Negative/error/rollback scenarios will also be covered with unit tests and
functional tests as appropriate.

Since there are no direct API changes, Tempest testing does not really fit
this change. However, something we should really have, and arguably should
have had since Pike, is a multi-cell CI job. It is unclear, though, how such
a job can be created given the need for it to either integrate with legacy
devstack-gate tooling or, if possible, new zuul v3 tooling.

Documentation Impact
====================

The compute admin `resize guide`_ will be updated to document cross-cell
resize in detail from an operations perspective, including troubleshooting
and fault recovery details.

The compute `configuration guide`_ will be updated for the new policy rule
and any configuration options added.

The compute `server concepts guide`_ may also need to be updated for any
user-facing changes to note, like the state transitions of a server during
a cross-cell resize.

.. _resize guide: https://docs.openstack.org/nova/latest/admin/configuration/resize.html
.. _configuration guide: https://docs.openstack.org/nova/latest/configuration/
.. _server concepts guide: https://developer.openstack.org/api-guide/compute/server_concepts.html

References
==========

.. [1] Proof of concept: https://review.openstack.org/#/c/603930/
.. [2] Shelve-based approach spec: https://review.openstack.org/#/c/616037/1/
.. [3] API delete confirm resize: https://github.com/openstack/nova/blob/c295e395d/nova/compute/api.py#L2069
.. [4] Mirror BDMs on revert: https://review.openstack.org/#/c/603930/20/nova/conductor/tasks/cross_cell_migrate.py@637

Stein PTG discussions:

* https://etherpad.openstack.org/p/nova-ptg-stein-cells
* https://etherpad.openstack.org/p/nova-ptg-stein

Mailing list discussions:

* http://lists.openstack.org/pipermail/openstack-dev/2018-August/thread.html#133693
* http://lists.openstack.org/pipermail/openstack-operators/2018-August/thread.html#15729

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced