NUMA-aware live migration
https://blueprints.launchpad.net/nova/+spec/numa-aware-live-migration
When an instance with NUMA characteristics is live-migrated, those characteristics are not recalculated on the destination compute host. In the CPU pinning case, using the source host's pin mappings on the destination can lead to multiple instances being pinned to the same pCPUs. In the case of hugepage-backed instances, which are NUMA-localized, an instance needs to have its NUMA mapping recalculated on the destination compute host during a live migration.
Problem description
Note
In the following paragraphs, the term NUMA is, strictly speaking, incorrectly used to signify any guest characteristic that is expressed in the InstanceNUMATopology object, for example CPU pinning and hugepages. CPU pinning can be achieved without a guest NUMA topology, but the two concepts are unfortunately tightly coupled in Nova, and instance pinning is not possible without an instance NUMA topology. For this reason, NUMA is used as a catch-all term.
Note
This spec concentrates on the libvirt driver. Any higher-level code (compute manager, conductor) will be kept as driver-agnostic as possible.
The problem can best be described with three examples.
The first example is live migration with CPU pinning. An instance with the hw:cpu_policy=dedicated extra spec and pinned CPUs is live-migrated. Its pin mappings are naively copied over to the destination host. This creates two problems. First, its pinned pCPUs aren't properly claimed on the destination, so should a second instance with pinned CPUs land on the destination, both instances' vCPUs could be pinned to the same pCPUs. Second, any existing pin mappings on the destination are ignored: if another instance already exists on the destination, both instances' vCPUs could be pinned to the same pCPUs. In both cases, the dedicated CPU policy is violated, potentially leading to unpredictable performance degradation.
The second example is instances with hugepages. There are two hosts, each with two NUMA nodes and 8 1GB hugepages per node. Two identical instances, one per host, are booted. Their virtual NUMA topology is a single virtual NUMA node backed by 8 1GB pages. Each lands on its host's NUMA node 0, consuming all 8 of its pages. One instance is then live-migrated to the other host. Because the libvirt driver enforces strict NUMA affinity and does not regenerate the instance XML, both instances end up on the destination host's NUMA node 0, and the live-migrated instance fails to run.
The third example is an instance with a virtual NUMA topology but without hugepages. If an instance affined to its host's NUMA node 2 is live-migrated to a host with only two NUMA nodes, and thus without a NUMA node 2, it will fail to run.
The first two of these examples are known bugs [1][2].
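For reference, the guest characteristics in these examples come from flavor extra specs along the following lines (a minimal illustration; the values are arbitrary):

```python
# Illustrative flavor extra specs for the three examples above.
# These are standard Nova extra specs; the values are arbitrary.
pinned = {'hw:cpu_policy': 'dedicated'}   # example 1: dedicated (pinned) CPUs
hugepages = {'hw:mem_page_size': '1GB'}   # example 2: 1G hugepage-backed guest
explicit_numa = {'hw:numa_nodes': '2'}    # example 3: explicit virtual NUMA topology
```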
Use Cases
As a cloud administrator, I want to live migrate instances with CPU pinning without the pin mappings overlapping on the destination compute host.
As a cloud administrator, I want live migration of hugepage-backed instances to work and for the instances to successfully run on the destination compute host.
As a cloud administrator, I want live migration of instances with an explicit NUMA topology to work and for the instances to successfully run on the destination compute host.
Proposed change
There are five aspects to supporting NUMA live migration. First, the instance's NUMA characteristics need to be recalculated to fit on the new host. Second, the resources that the instance will consume on the new host need to be claimed. Third, information about the instance's new NUMA characteristics needs to be generated on the destination (an InstanceNUMATopology object is not enough, more on that later). Fourth, this information needs to be sent from the destination to the source, in order for the source to generate XML that allows the instance to run on the destination. Finally, the instance's resource claims need to "converge" to reflect the outcome of the live migration: if it succeeded, the usage on the source needs to be released; if it failed, the claim on the destination needs to be rolled back.
Resource claims
Let's address the resource claims aspect first. An effort has begun to support NUMA resource providers in placement [3] and to standardize CPU resource tracking [4]. However, placement can only track inventories and allocations of quantities of resources. It does not track which specific resources are used. Specificity is needed for NUMA live migration. Consider an instance that uses 4 dedicated CPUs in a future where the standard CPU resource tracking spec [5] has been implemented. During live migration, the scheduler claims those 4 CPUs in placement on the destination. However, we need to prevent other instances from using those specific CPUs. Therefore, in addition to claiming quantities of CPUs in placement, we need to claim specific CPUs on the compute host. The compute resource tracker already exists for exactly this purpose, and it will continue to be used to claim specific resources on the destination, even in a NUMA-enabled placement future.
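As a sketch of the distinction (values made up for illustration), placement records only quantities, while the resource tracker records the specific host CPUs:

```python
# Placement: quantities only. After the schedule-time claim, placement
# knows the destination has 4 fewer dedicated CPUs available...
placement_allocation = {'PCPU': 4}

# ...but only the destination's resource tracker can record *which*
# host CPUs are pinned, e.g. guest vCPU -> host pCPU:
pin_mapping = {0: 2, 1: 3, 2: 6, 3: 7}
```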
There is a time window between the scheduler picking a destination for the live migration and the actual live migration RPC conversation between the two compute hosts. Another instance could land on the destination during that window, using up NUMA resources that the scheduler thought were free. This race leads to the resource claim failing on the destination. This spec proposes to handle such a claim failure using the existing MigrationPreCheckError exception mechanism, causing the scheduler to pick a new host.
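A minimal sketch of what this could look like on the destination (the _claim_numa_resources and _get_migrate_data helpers are hypothetical; both exceptions are existing Nova ones):

```python
from nova import exception


def check_can_live_migrate_destination(self, ctxt, instance,
                                       src_compute_info, dst_compute_info,
                                       block_migration, disk_over_commit):
    try:
        # Hypothetical helper wrapping the resource tracker's move claim.
        claim = self._claim_numa_resources(ctxt, instance)
    except exception.ComputeResourcesUnavailable:
        # MigrationPreCheckError makes the conductor reschedule to an
        # alternate destination instead of failing the operation outright.
        raise exception.MigrationPreCheckError(
            reason='NUMA resource claim failed on the destination host')
    # Hypothetical helper building migrate_data from the claim.
    return self._get_migrate_data(ctxt, instance, claim)
```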
Fitting to the new host
An advantage of using the resource tracker is that it forces us to use a MoveClaim, thus giving us the instance's new NUMA topology for free (see Claim._test_numa_topology in nova/compute/claims.py).
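Roughly, the claim's topology test boils down to a call like the following (a simplified sketch of what Claim._test_numa_topology already does; numa_fit_instance_to_host is the existing helper in nova/virt/hardware.py):

```python
from nova.virt import hardware


def fit_on_destination(host_numa_topology, instance_numa_topology, limits):
    # Fit the requested guest topology onto the destination host's
    # topology, producing a new InstanceNUMATopology whose pinning is
    # valid on the destination.
    new_topology = hardware.numa_fit_instance_to_host(
        host_numa_topology, instance_numa_topology, limits=limits)
    if new_topology is None:
        # No fit: the claim fails, triggering the MigrationPreCheckError
        # handling described earlier.
        raise ValueError('instance NUMA topology does not fit on this host')
    return new_topology
```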
Generating the new NUMA information on the destination
However, having the new instance NUMA topology in the claim isn't enough for the source to generate the new XML. The simplest way to generate the new XML from the new instance NUMA topology would be to call the libvirt driver's _get_guest_numa_config method (which handily accepts an instance_numa_topology as an argument). However, this needs to be done on the destination, as it depends on the host NUMA topology. _get_guest_numa_config returns a tuple of LibvirtConfigObject instances. The information contained therein needs to somehow be sent to the source over the wire.
The naive way would be to send the objects directly, or perhaps to call to_xml and send the resulting blob of XML text. This would be unversioned, and there would be no schema. That could cause problems when, for example, a newer libvirt driver that has dropped support for a particular element or attribute talks to an older libvirt driver that still supports it.
Because of this, and sticking to the existing OpenStack best practice of sending oslo versioned objects over the wire, this spec proposes to encode the necessary NUMA-related information as Nova versioned objects. These new objects should be as virt driver independent as reasonably possible but, as the use case is still libvirt talking to libvirt, abstraction for the sake of abstraction is not appropriate either.
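A hedged sketch of what one of these objects could look like (the class and field names here are illustrative, not final):

```python
from nova.objects import base as obj_base
from nova.objects import fields


@obj_base.NovaObjectRegistry.register
class LibvirtLiveMigrateNUMAInfo(obj_base.NovaObject):
    # Version 1.0: Initial version
    VERSION = '1.0'

    fields = {
        # guest vCPU -> host pCPU pin mapping on the destination
        'cpu_pins': fields.DictOfIntegersField(),
        # host pCPUs to which the emulator threads are pinned
        'emulator_pins': fields.ListOfIntegersField(),
        # host NUMA node backing each guest NUMA cell, in cell order
        'cell_pins': fields.ListOfIntegersField(),
    }
```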
Sending the new NUMA Nova objects
Once the superconductor has chosen and/or validated the destination host, the relevant parts of the current live migration flow can be summarized by the following oversimplified pseudo sequence diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination() | | |
|-------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source() | |
| |<-----------------------------------| |
| | | |
| | migrate_data | |
| |----------------------------------->| |
| | | |
| | migrate_data | |
|<--------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data) | | |
|------------------------------------->| | |
| | | |
| | pre_live_migration(migrate_data) | |
| |----------------------------------->| |
| | | |
| | migrate_data | |
| |<-----------------------------------| |
| | | |
| | live_migration(migrate_data) | |
| |------------------------------------------------->|
| | | |
In the proposed new flow, the destination compute manager asks the libvirt driver to calculate the new LibvirtConfigGuest objects using the new instance NUMA topology obtained from the move claim. The compute manager converts those LibvirtConfigGuest objects to the new NUMA Nova objects and adds them as fields to the LibvirtLiveMigrateData migrate_data object. The latter eventually reaches the source libvirt driver, which uses it to generate the new XML. The proposed flow is summarized in the following diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination() | | |
|------------------------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source() | |
| |<----------------------------------| |
| | | |
| | migrate_data | |
| |---------------------------------->| |
| | | +-----------------------------------+ |
| | |-| Obtain new_instance_numa_topology | |
| | | | from claim | |
| | | +-----------------------------------+ |
| | | |
| | | _get_guest_numa_config(new_instance_numa_topology) |
| | | ---------------------------------------------------->|
| | | |
| | | LibvirtConfigGuest objects |
| | |<-----------------------------------------------------|
| | | |
| | | +----------------------------------+ |
| | |-| Build new NUMA Nova objects from | |
| | | | LibvirtConfigGuest objects | |
| | | | and add to migrate_data | |
| | | +----------------------------------+ |
| | | |
| migrate_data + new NUMA Nova objects | |
|<-------------------------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data + new NUMA Nova objects) | | |
|------------------------------------------------------->| | |
| | | |
| | pre_live_migration() | |
| |---------------------------------->| |
| |<----------------------------------| |
| | | |
| | live_migration(migrate_data + new NUMA Nova objects) |
| |----------------------------------------------------------------------------------------->|
| | | |
| | | +-----------------------------------+ |
| | | | generate NUMA XML for destination |-|
| | | +-----------------------------------+ |
| | | |
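On the destination, the new step could look roughly like this (a sketch; the dst_numa_info field and the _to_numa_info_object helper are hypothetical, while _get_guest_numa_config and claimed_numa_topology already exist):

```python
def _add_numa_info_to_migrate_data(self, migrate_data, claim, flavor,
                                   image_meta):
    # The move claim has already computed a destination-valid topology.
    new_topology = claim.claimed_numa_topology
    # Existing libvirt driver method; returns LibvirtConfigObject pieces
    # describing the guest's NUMA configuration.
    guest_numa_config = self.driver._get_guest_numa_config(
        new_topology, flavor, image_meta=image_meta)
    # Convert to the new versioned objects so they can cross RPC safely.
    migrate_data.dst_numa_info = self._to_numa_info_object(guest_numa_config)
    return migrate_data
```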
Claim convergence
The claim object is a context manager, so it can in theory clean itself up if any code within its context raises an unhandled exception. However, live migration involves RPC casts between the compute hosts, making it impractical to use the claim as a context manager. For that reason, if the live migration fails, drop_move_claim needs to be called manually during the rollback to drop the claim from the destination. Whether to do this on the source in rollback_live_migration or in rollback_live_migration_at_destination is left as an implementation detail.
Similarly, if the live migration succeeds, drop_move_claim needs to be called to drop the claim from the source, similar to how _confirm_resize does it in the compute manager. Whether to do this in post_live_migration on the source or in post_live_migration_at_destination is left as an implementation detail.
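A sketch of both convergence paths, assuming the source-side variants are chosen (rt is the compute node's existing ResourceTracker, and drop_move_claim is its existing method):

```python
def _rollback_live_migration_numa(self, context, instance, dest_node):
    # Live migration failed: release the resources that the move claim
    # reserved on the destination.
    self.rt.drop_move_claim(context, instance, dest_node,
                            instance_type=instance.flavor)


def _post_live_migration_numa(self, context, instance, source_node):
    # Live migration succeeded: release the resources still held on the
    # source, mirroring what _confirm_resize does for cold migrations.
    self.rt.drop_move_claim(context, instance, source_node,
                            instance_type=instance.flavor)
```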
Alternatives
Using move claims, and the new instance NUMA topology calculated as part of them, essentially dictates the rest of the implementation.
When the superconductor calls the scheduler's select_destinations method, that call eventually ends up calling numa_fit_instance_to_host (select_destinations -> _schedule -> _consume_selected_host -> consume_from_request -> _locked_consume_from_request -> numa_fit_instance_to_host). It would be conceivable to reuse that result. However, the claim would still calculate its own new instance NUMA topology.
Data model impact
New versioned objects are created to transmit cell, CPU, emulator thread, and hugepage nodeset mappings from the destination to the source. These objects are added to LibvirtLiveMigrateData.
REST API impact
None.
Security impact
None.
Notifications impact
None.
Other end user impact
None.
Performance Impact
None.
Other deployer impact
None.
Developer impact
None.
Upgrade impact
In the case of a mixed N/N+1 cloud, the possibilities for the exchange of information between the destination and the source are summarized in the following table. In it, no indicates that the new code is not present, old path indicates that the new code is present but chooses to execute the old code for backwards compatibility, and yes indicates that the new functionality is used.
| | Old dest | New dest |
|---|---|---|
| Old source | no | old path |
| New source | old path | yes |
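One plausible way for new code to choose a path is to check whether the new field was populated by its peer (a sketch; dst_numa_info is the hypothetical field name, while obj_attr_is_set is standard oslo.versionedobjects API):

```python
def _uses_new_numa_path(migrate_data):
    # True only if the peer runs new code and populated the
    # (hypothetical) dst_numa_info field; otherwise fall back to the
    # old behaviour for backwards compatibility.
    return migrate_data.obj_attr_is_set('dst_numa_info')
```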
Implementation
Assignee(s)
- Primary assignee: notartom
Work Items
- Fail live migration of instances with NUMA topology [6] until this spec is fully implemented.
- Add NUMA Nova objects
- Add claim context to live migration
- Calculate new NUMA topology on the destination and send it to the source
- Source updates instance XML according to new NUMA topology calculated by the destination
Dependencies
None.
Testing
The libvirt/qemu driver used in the gate does not currently support NUMA features (though work is in progress [7]). Therefore, testing NUMA-aware live migration in the upstream gate would require nested virt. In addition, the only assertable outcome of a NUMA live migration test (if it ever becomes possible) would be that the live migration succeeded. Examining the instance XML to assert things about its NUMA affinity or CPU pin mapping is explicitly out of tempest's scope. For these reasons, NUMA-aware live migration is best tested in third party CI [8] or other downstream test scenarios [9].
Documentation Impact
Current live migration documentation does not mention the NUMA limitations anywhere. Therefore, a release note explaining the new NUMA capabilities of live migration should be enough.
References
History
Release Name | Description |
---|---|
Rocky | Introduced |
Stein | Re-proposed with modifications pertaining to claims and the exchange of information between destination and source. |