When resizing an instance, the flavors returned may not meet the
image's minimum memory requirement: resize ignores the image's
minimum memory limit. The resize can then complete successfully, but
the instance fails to start because the memory is too small to run
the operating system.
Related-Bug: 2007968
Change-Id: I132e444eedc10b950a2fc9ed259cd6d9aa9bed65
Previously, live migrations completely ignored CPU power management.
This patch makes sure that we correctly:
* Power up the cores on the destination during pre_live_migration, as
we need them powered up before the instance starts on the
destination.
* If the live migration is successful, power down the vacated cores on
the source.
* In case of a rollback, power down the cores previously powered up on
pre_live_migration.
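As a rough illustration of the bookkeeping above, here is a minimal sketch; `CpuPowerManager` and its methods are hypothetical stand-ins for the libvirt driver's real CPU power management code:

```python
# A minimal sketch of the power-state bookkeeping described above;
# CpuPowerManager and its methods are hypothetical stand-ins for the
# libvirt driver's actual CPU power management code.
class CpuPowerManager:
    def __init__(self, online=None):
        self.online = set(online or ())

    def power_up(self, cores):
        # Destination: called during pre_live_migration, before the
        # instance starts running there.
        self.online |= set(cores)

    def power_down(self, cores):
        # Source: called after a successful migration to power down
        # the vacated cores; also used on rollback to undo power_up.
        self.online -= set(cores)

dest = CpuPowerManager()
dest.power_up({2, 3})          # pre_live_migration on the destination
src = CpuPowerManager({0, 1})
src.power_down({0, 1})         # migration succeeded: vacate source cores
```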
Closes-bug: 2056613
Change-Id: I787bd7807950370cd865f29b95989d489d4826d0
This addresses a TODO to remove the delete-attachment call from
refresh after the remove_volume_connection call.
The remove volume connection process itself deletes the attachment
when the delete_attachment flag is passed.
Bumps the RPC API version.
Change-Id: I03ec3ee3ee1eeb6563a1dd6876094a7f4423d860
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.
To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.
There is a risk this might leave orphaned compute nodes and
resource providers that need manual cleanup because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.
blueprint ironic-shards
Change-Id: I2bc77cbb77c2dd5584368563dc4250d71913906b
The destination looks at the source's mdev types and returns its own
mdevs of the same types. We also reserve them in an internal dict and
make sure we clean up this dict if the live migration aborts.
Partially-Implements: blueprint libvirt-mdev-live-migrate
Change-Id: I4a7e5292dd3df63943bd9f01803fa933e0466014
The RDP console was only used by the HyperV driver, so the API is
removed. As the API URL stays the same (it is shared with the other
console type APIs), the RDP console API will now return 400.
The related config options are cleaned up and the API reference is
moved to the obsolete section.
The RPC method is kept to avoid errors when an old controller is used
with a new compute; it can be removed in the next RPC version bump.
Change-Id: I8f5755009da4af0d12bda096d7a8e85fd41e1a8c
When a user asks for a hard reboot of a running instance while
nova-compute is unavailable (service stopped or host down), under
certain conditions the instance can stay in the rebooting_hard
task_state after nova-compute starts again. This patch fixes that.
Closes-Bug: #1999674
Change-Id: I170e390fe4e467898a8dc7df6a446f62941d49ff
Now that the source knows that both the computes support the right
libvirt version, it passes to the destination the list of mdevs it has
for the instance. With this change, we verify whether the types of
those mdevs are actually supported by the destination.
On the next change, we'll pass the destination mdevs back to the
source.
Partially-Implements: blueprint libvirt-mdev-live-migrate
Change-Id: Icb52fa5eb0adc0aa6106a90d87149456b39e79c2
We only need to verify that the BDM has an attachment ID and that it
is present in both the Nova and Cinder DBs.
For test coverage, tests are added for a BFV server with different
BDM source types.
Closes-Bug: 2048154
Closes-Bug: 2048184
Change-Id: Icffcbad27d99a800e3f285565c0b823f697e388c
In the previous patch we changed the ordering of operations during
post_live_migration() to minimize guest networking downtime by
activating destination host port bindings as soon as possible.
Review of that patch led to the realization that exceptions during
notification sending can prevent the port binding activation from
happening. Instead of handling that in a localized try/except, this
patch implements a general best_effort kwarg to our two notification
sending helpers to allow callers to indicate that any exceptions
during notification sending should not be fatal.
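The shape of such a kwarg can be sketched like this; the real helpers live in Nova's compute utilities, and `_emit` is a stand-in for the actual notification emission:

```python
# Illustrative sketch of the best_effort kwarg described above; the
# real notification helpers live in Nova's compute utilities and
# _emit is a stand-in for the actual emission code.
import logging

LOG = logging.getLogger(__name__)

def _emit(instance_uuid, action):
    raise RuntimeError("messaging backend down")

def notify_about_instance_action(instance_uuid, action, best_effort=False):
    try:
        _emit(instance_uuid, action)
    except Exception:
        if not best_effort:
            raise
        # The caller asked that notification failures not be fatal,
        # e.g. so port binding activation still happens afterwards.
        LOG.exception("Failed to send notification")

# With best_effort=True the caller can proceed despite the failure.
notify_about_instance_action("uuid-1", "live_migration", best_effort=True)
```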
Change-Id: I01a15d6fffe98816ae019e67dc72784299fedfd3
Previously, we first called Cinder to clean up source volume
connections, then Neutron to activate destination port bindings.
This means that, if Cinder or the storage backend were slow, the live
migrated instance would be without network connectivity during that
entire process.
This patch immediately activates the port bindings on the destination
(and thus gives the instance network connectivity). We just need to
get the old network_info first, in order to use it in notifications
and to stash it for the later call to
driver.post_live_migration_at_source().
This is a smaller and safer change than the parallelization attempt in
the subsequent patch, so it's done in its own patch because it might
be backportable, and would help with network downtime during live
migration.
To avoid any potential data leaks, we want to be certain to cleanup
volumes on the source. To that end we wrap the code that is being
moved before the source volume cleanup code in a try/finally block
in order to prevent any uncaught exception from blocking the cleanup.
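The ordering and the try/finally guarantee can be sketched as follows, with hypothetical helper names:

```python
# Minimal sketch of the ordering and try/finally described above;
# both helper names are hypothetical.
events = []

def activate_dest_port_bindings():
    events.append("ports_activated")
    raise RuntimeError("notification hiccup")  # simulate a failure

def cleanup_source_volumes():
    events.append("volumes_cleaned")

try:
    try:
        # Moved earlier to restore network connectivity sooner.
        activate_dest_port_bindings()
    finally:
        # Must always run, or source volume connections could leak.
        cleanup_source_volumes()
except RuntimeError:
    pass
```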
Change-Id: I700943723a32e732e3e3be825f3fd44a9f923a0b
This change adds the pre-commit config and tox targets to run
codespell both independently and via the pep8 target.
It also corrects all the remaining typos in the codebase as
detected by codespell.
Change-Id: Ic4fb5b3a5559bc3c43aca0a39edc0885da58eaa2
This bumps the version of flake8 and resolves some erroneous failures in
f-strings. A number of new E721 (do not compare types) errors are
picked up, all of which are addressed.
Change-Id: I7a1937b107ff3af8d1e5fe23fc32b120ef4697f7
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
By I0d889691de1af6875603a9f0f174590229e7be18 we broke rebuild for Yoga
or older computes.
By I9660d42937ad62d647afc6be965f166cc5631392 we broke rebuild for Zed
computes.
Fixing this by making the parameters optional.
Change-Id: I0ca04045f8ac742e2b50490cbe5efccaee45c5c0
Closes-Bug: #2040264
When we added the all_cells flag to this method we just hacked it
into place, leaving a big chunk of the method nested inside a
conditional. This refactors that chunk out into a helper, and also
corrects a naming error that was very confusing when reading the code
(a variable named "service" which was actually a list of services).
Change-Id: I41ff076864dce9ed826922f6609536ea4545a181
While debugging a field issue recently, we determined that computes
had been pointed at cell0 and created service and node records there.
This makes us warn during service list if we find compute services
in cell0 to tip off operators that they have a configuration problem.
Change-Id: Id95c0d02cc34348623b01997fcd1930628d48ccc
This mainly fixes typos in the tests and one typo in an exception
message.
Some additional items are added to the dict based on our usage of
vars in tests, but we could remove them later with minor test
updates. They are intentionally not fixed in this commit to limit
scope creep.
Change-Id: Iacfbb0a5dc8ffb0857219c8d7c7a7d6e188f5980
If we are starting up and we think we are a new host (i.e. no pre-
existing service record was found for us), we expect to have no
instances on our hypervisor. If that is not the case, it is likely
that we got pointed at a new fresh database or the wrong database
(i.e. a different cell from our own). In that case, we should abort
startup to avoid creating new service, compute node, and resource
provider records.
This is a sort of follow-on additional improvement related to
work done in blueprint stable-compute-uuid.
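The sanity check described above can be sketched like this; the function and exception names are illustrative, not Nova's actual ones:

```python
# Hedged sketch of the startup check described above; the function
# and exception names are illustrative, not Nova's actual ones.
class InvalidConfiguration(Exception):
    pass

def check_new_host_sanity(is_new_host, driver_instances):
    if is_new_host and driver_instances:
        # A brand-new service record plus pre-existing instances on
        # the hypervisor means we are probably pointed at a fresh or
        # wrong database; abort before creating any records.
        raise InvalidConfiguration(
            "Found %d instances but no service record; aborting "
            "startup" % len(driver_instances))

check_new_host_sanity(True, [])          # genuinely new host: fine
check_new_host_sanity(False, ["inst"])   # known host with instances: fine
```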
Change-Id: Id817c51c90485119270f3b7f3c52858f42100637
This change ensures we only try to clean up dangling BDMs if
Cinder is installed and reachable.
Closes-Bug: #2033752
Change-Id: I0ada59d8901f8620fd1f3dc20d6be303aa7dabca
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.
To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.
There is a risk this might leave orphaned compute nodes and
resource providers that need manual cleanup because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.
blueprint ironic-shards
Change-Id: I7cd9e5ab878cea05462cac24de581dca6d50b3c3
On reboot, check the instance's volume status on the Cinder side.
Verify that the volume exists and Cinder has an attachment ID;
otherwise delete the BDM data from the Nova DB, and vice versa.
Existing test cases are updated to use CinderFixture while rebooting,
as reboot calls get_all_attachments.
Implements: blueprint https://blueprints.launchpad.net/nova/+spec/cleanup-dangling-volume-attachments
Closes-Bug: 2019078
Change-Id: Ieb619d4bfe0a6472aefb118b58283d7ad8d24c29
This patch concerns the window when a VM is being unshelved: the
compute manager sets the task_state to spawning, claims resources for
the VM, and then calls driver.spawn(). At that point the instance is
in vm_state SHELVED_OFFLOADED, task_state spawning.
If a new update_available_resource periodic job starts now, it
collects all the instances assigned to the node to calculate resource
usage. However, the calculation assumed that a VM in
SHELVED_OFFLOADED state does not need a resource allocation on the
node (presumably being removed from the node as it is offloaded) and
deleted the resource claim.
As a result, we ended up with the VM spawned successfully but having
lost its resource claim on the node.
This patch changes what we do in vm_state SHELVED_OFFLOADED,
task_state spawning: we no longer delete the resource claim in this
state and keep tracking the resource in stats.
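The changed state handling can be sketched as follows; the constant values mirror Nova's vm_states/task_states, but this is an illustration only:

```python
# Sketch of the changed state handling described above; the constant
# values mirror Nova's vm_states/task_states, illustration only.
SHELVED_OFFLOADED = "shelved_offloaded"
SPAWNING = "spawning"

def keeps_resource_claim(vm_state, task_state):
    # Previously every SHELVED_OFFLOADED VM was assumed to hold no
    # resources on the node; now the spawning (unshelving) case
    # keeps its claim.
    if vm_state == SHELVED_OFFLOADED:
        return task_state == SPAWNING
    return True
```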
Change-Id: I8c9944810c09d501a6d3f60f095d9817b756872d
Closes-Bug: #2025480
The late anti-affinity check runs in the compute manager to avoid
parallel scheduling requests to invalidate the anti-affinity server
group policy. When the check fails the instance is re-scheduled.
However, this failure was counted as a real instance boot failure of
the compute host and could lead to de-prioritization of the compute
host in the scheduler via the BuildFailureWeigher. As the late
anti-affinity check does not indicate any fault of the compute host
itself, it should not count towards the build failure counter.
This patch adds new build results to handle this case.
Closes-Bug: #1996732
Change-Id: I2ba035c09ace20e9835d9d12a5c5bee17d616718
Signed-off-by: Yusuke Okada <okada.yusuke@fujitsu.com>
This migrates *existing* Instance records with incomplete compute_id
fields, where necessary, while updating node resources. This is done
separately from the earlier patch to do new instances just to
demonstrate in CI that we're still able to run in such a partially-
migrated state.
Related to blueprint compute-object-ids
Change-Id: Ie18342deeddb7f58564953e6b46ea0b0d7495595
This adds the compute_id field to the Instance object and adds
checks in save() and create() to make sure we no longer update node
without also updating compute_id.
Related to blueprint compute-object-ids
Change-Id: I0740a2e4e09a526da8565a18e6761b4dbdc4ec0b
This makes us store the compute_id of the destination node in the
Migration object. Since resize/cold-migration changes the node
affiliation of an instance *to* the destination node *from* the source
node, we need a positive record of the node id to be used. The
destination node sets this to its own node when creating the migration,
and it is used by the source node when the switchover happens.
Because the migration may be backleveled for an older node involved
in that process and thus saved or passed without this field, this
adds a compatibility routine that falls back to looking up the node
by host/nodename.
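The compatibility routine can be sketched like this; the field and helper names here are hypothetical:

```python
# Illustrative sketch of the compatibility routine described above;
# the field and helper names are hypothetical.
def get_dest_node_id(migration, lookup_by_host_nodename):
    if migration.get("dest_compute_id") is not None:
        return migration["dest_compute_id"]
    # The migration passed through an older node that dropped the
    # new field; fall back to the host/nodename lookup.
    return lookup_by_host_nodename(
        migration["dest_compute"], migration["dest_node"])

nodes = {("host2", "node2"): 42}
old_style = {"dest_compute_id": None,
             "dest_compute": "host2", "dest_node": "node2"}
```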
Related to blueprint compute-object-ids
Change-Id: I362a40403d1094be36412f5f7afba00da8af8301