
7800 Commits

Author SHA1 Message Date
zhong.zhou f3eb76e57b Validate flavor image min ram when resize volume-backed instance
When resizing an instance, the returned flavors may not meet the image's
minimum memory requirement. Resizing ignores the image's minimum memory
limit, so the resize may succeed while the instance then fails to start
because its memory is too small to run the system.

Related-Bug: 2007968
Change-Id: I132e444eedc10b950a2fc9ed259cd6d9aa9bed65
2024-04-18 10:53:04 +08:00
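The check described in the commit above amounts to comparing the target flavor's memory against the image's `min_ram` property. The following is a minimal sketch, assuming a simplified flavor object and an image metadata dict carrying Glance's `min_ram` (in MB); `Flavor`, `FlavorMemoryTooSmall`, and `validate_flavor_min_ram` are hypothetical names for illustration, not Nova's actual code.

```python
from dataclasses import dataclass


@dataclass
class Flavor:
    """Hypothetical, simplified flavor: only the fields this check needs."""
    name: str
    memory_mb: int


class FlavorMemoryTooSmall(Exception):
    """Hypothetical error raised when a flavor cannot satisfy the image."""


def validate_flavor_min_ram(flavor: Flavor, image_meta: dict) -> None:
    """Reject a resize target whose RAM is below the image's min_ram.

    Assumes image_meta holds a ``min_ram`` value in MB; a missing,
    empty, or zero value means "no minimum".
    """
    min_ram = int(image_meta.get("min_ram") or 0)
    if min_ram and flavor.memory_mb < min_ram:
        raise FlavorMemoryTooSmall(
            f"flavor {flavor.name} has {flavor.memory_mb} MB, "
            f"image requires at least {min_ram} MB"
        )
```

Running a check like this before accepting the resize would surface the problem as an up-front error instead of as an instance that fails to boot.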
Zuul 6bd99eb2ea Merge "Correctly reset instance task state in rebooting hard" 2024-03-20 13:34:22 +00:00
Sylvain Bauza a87c10afa7 Update compute rpc alias for caracal
This adds an alias for Caracal

Change-Id: I4a57cdac68cab4cda2a1928dd4346c9f2bca14c3
2024-03-14 16:59:45 +01:00
Zuul 45e5d213f8 Merge "Removed explicit call to delete attachment" 2024-03-14 10:47:58 +00:00
Artom Lifshitz c1ccc1a316 pwr mgmt: handle live migrations correctly
Previously, live migrations completely ignored CPU power management.
This patch makes sure that we correctly:

* Power up the cores on the destination during pre_live_migration, as
  we need them powered up before the instance starts on the
  destination.
* If the live migration is successful, power down the vacated cores on
  the source.
* In case of a rollback, power down the cores previously powered up on
  pre_live_migration.

Closes-bug: 2056613
Change-Id: I787bd7807950370cd865f29b95989d489d4826d0
2024-03-11 14:21:27 -04:00
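The three cases in the commit above map onto three hooks in the live migration lifecycle. A minimal sketch, assuming each host simply tracks a set of powered-up core IDs; `Host` and the three functions are hypothetical stand-ins, not Nova's libvirt driver code.

```python
class Host:
    """Hypothetical host model: tracks which dedicated CPU cores are powered up."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online: set[int] = set()

    def power_up(self, cores: set[int]) -> None:
        self.online |= cores

    def power_down(self, cores: set[int]) -> None:
        self.online -= cores


def pre_live_migration(dest: Host, dest_cores: set[int]) -> None:
    # The guest needs its cores online before it starts on the destination.
    dest.power_up(dest_cores)


def post_live_migration(source: Host, src_cores: set[int]) -> None:
    # Success: the cores vacated on the source can be powered down.
    source.power_down(src_cores)


def rollback_live_migration(dest: Host, dest_cores: set[int]) -> None:
    # Rollback: undo exactly what pre_live_migration powered up.
    dest.power_down(dest_cores)
```

Source and destination core sets are passed separately because the instance may be pinned to different cores on each host.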
Amit Uniyal dc6dac360c Removed explicit call to delete attachment
This resolves a TODO to remove the delete attachment call from refresh
after the remove_volume_connection call. The remove volume connection
process itself deletes the attachment when the delete_attachment flag is
passed.

Bumps RPC API version.

Change-Id: I03ec3ee3ee1eeb6563a1dd6876094a7f4423d860
2024-03-01 06:26:48 +00:00
John Garbutt 947bb5f641 Make compute node rebalance safer
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.

To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.

There is a risk this might leave orphaned compute nodes and
resource providers that need manual cleanup because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.

blueprint ironic-shards

Change-Id: I2bc77cbb77c2dd5584368563dc4250d71913906b
2024-02-25 13:25:27 -08:00
Zuul 9a9ab2128b Merge "Reserve mdevs to return to the source" 2024-02-23 15:46:08 +00:00
Zuul bdd7daffbb Merge "Check if destination can support the src mdev types" 2024-02-23 15:46:01 +00:00
Sylvain Bauza 2e1e12cd62 Reserve mdevs to return to the source
The destination looks up the source mdev types and returns its own
mdevs of the same types. We also reserve them in an internal dict and
make sure this dict is cleaned up if the live migration aborts.

Partially-Implements: blueprint libvirt-mdev-live-migrate
Change-Id: I4a7e5292dd3df63943bd9f01803fa933e0466014
2024-02-16 16:05:48 +01:00
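The reservation described in the commit above can be pictured as a per-instance dict on the destination. A minimal sketch, assuming the destination knows its free mdevs as a UUID-to-type mapping; `MdevReservations`, `reserve`, and `release` are hypothetical names, and the real driver's data structures may differ.

```python
class MdevReservations:
    """Hypothetical destination-side bookkeeping for mdevs claimed by in-flight live migrations."""

    def __init__(self, free_mdevs: dict[str, str]) -> None:
        # Maps mdev UUID -> mdev type for mdevs not used by any guest.
        self.free_mdevs = dict(free_mdevs)
        # Maps instance UUID -> list of mdev UUIDs reserved for it.
        self.reserved: dict[str, list[str]] = {}

    def reserve(self, instance_uuid: str, src_types: list[str]) -> list[str]:
        """Pick one local mdev per source mdev type and reserve it for the instance."""
        taken = {m for mdevs in self.reserved.values() for m in mdevs}
        chosen: list[str] = []
        for mdev_type in src_types:
            candidate = next(
                (m for m, t in self.free_mdevs.items()
                 if t == mdev_type and m not in taken and m not in chosen),
                None,
            )
            if candidate is None:
                # Nothing is recorded on failure, so no cleanup is needed.
                raise RuntimeError(f"no free mdev of type {mdev_type}")
            chosen.append(candidate)
        self.reserved[instance_uuid] = chosen
        return chosen

    def release(self, instance_uuid: str) -> None:
        """Drop the reservation, e.g. when the live migration aborts."""
        self.reserved.pop(instance_uuid, None)
```

Keeping reservations keyed by instance UUID makes the abort path a single dict removal, which matches the commit's goal of being able to clean up the dict if the migration aborts.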
Ghanshyam Mann 0c1e1ccf03 HyperV: Remove RDP console API
The RDP console was only used by the HyperV driver, so this removes
the API. Because the API URL stays the same (it is shared with the
other console type APIs), requesting an RDP console will return 400.

Clean up the related config options as well and move its API
reference to the obsolete section.

Keep the RPC method to avoid errors when an old controller is used
with a new compute. It can be removed in the next RPC version bump.

Change-Id: I8f5755009da4af0d12bda096d7a8e85fd41e1a8c
2024-02-13 12:24:38 -08:00
Pierre-Samuel Le Stang aa3e8fef7b Correctly reset instance task state in rebooting hard
When a user asks for a hard reboot of a running instance while nova-compute is
unavailable (service stopped or host down), under certain conditions the
instance may stay in the rebooting_hard task_state after nova-compute
starts again. This patch aims to fix that.

Closes-Bug: #1999674
Change-Id: I170e390fe4e467898a8dc7df6a446f62941d49ff
2024-02-06 10:04:01 +01:00
Sylvain Bauza 489aab934c Check if destination can support the src mdev types
Now that the source knows that both computes support the right
libvirt version, it passes to the destination the list of mdevs it has
for the instance. With this change, we verify whether the types of those
mdevs are actually supported by the destination.
On the next change, we'll pass the destination mdevs back to the
source.

Partially-Implements: blueprint libvirt-mdev-live-migrate
Change-Id: Icb52fa5eb0adc0aa6106a90d87149456b39e79c2
2024-02-05 15:10:55 +01:00
Amit Uniyal b5173b4192 Fixes: bfv vm reboot ends up in an error state.
We only need to verify that the BDM has an attachment ID and that it is present in both the Nova and Cinder DBs.

For test coverage, added tests for a BFV server covering different BDM source types.

Closes-Bug: 2048154
Closes-Bug: 2048184
Change-Id: Icffcbad27d99a800e3f285565c0b823f697e388c
2024-01-18 05:53:51 +00:00
Zuul 39f560d673 Merge "pre-commit: Bump linter versions" 2023-12-21 05:33:02 +00:00
Zuul 7c2e79f762 Merge "Allow best effort sending of notifications" 2023-12-20 23:29:44 +00:00
Stephen Finucane 7116d8e5f1 pre-commit: Bump linter versions
Change-Id: I6825266702a7a4626b0c80bebdcb83cbb43849ea
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
2023-12-20 18:33:33 +00:00
Zuul d1309c4745 Merge "Call Neutron immediately upon _post_live_migration() start" 2023-12-20 09:04:50 +00:00
Zuul 5e914c27a0 Merge "Bump hacking version" 2023-12-18 21:20:36 +00:00
Artom Lifshitz 06d25926a1 Allow best effort sending of notifications
In the previous patch we changed the ordering of operations during
post_live_migration() to minimize guest networking downtime by
activating destination host port bindings as soon as possible.

Review of that patch led to the realization that exceptions during
notification sending can prevent the port binding activation from
happening. Instead of handling that in a localized try/catch, this
patch implements a general best_effort kwarg to our two notification
sending helpers to allow callers to indicate that any exceptions
during notification sending should not be fatal.

Change-Id: I01a15d6fffe98816ae019e67dc72784299fedfd3
2023-12-17 08:37:11 -05:00
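The best_effort kwarg described in the commit above follows a common pattern: re-raise by default, log and swallow when the caller opts in. A minimal sketch, assuming the transport is passed in as a plain callable; `send_notification` and its `send` parameter are hypothetical and do not reproduce the signatures of Nova's notification helpers.

```python
import logging

LOG = logging.getLogger(__name__)


def send_notification(send, payload: dict, best_effort: bool = False) -> None:
    """Hypothetical notification helper with a best_effort switch.

    ``send`` stands in for whatever actually emits the notification.
    With best_effort=True, any exception is logged and swallowed so the
    caller (e.g. port binding activation) is not blocked by it.
    """
    try:
        send(payload)
    except Exception:
        if not best_effort:
            raise
        LOG.exception("Failed to send notification; continuing (best effort)")
```

Putting the switch in the helper, instead of wrapping individual call sites, lets each caller decide whether a notification failure should be fatal.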
Artom Lifshitz 26fbc9e8e7 Call Neutron immediately upon _post_live_migration() start
Previously, we first called Cinder to clean up source volume
connections, then to Neutron to activate destination port bindings.
This means that, if Cinder or the storage backend were slow, the live
migrated instance would be without network connectivity during that
entire process.

This patch immediately activates the port bindings on the destination
(and thus gives the instance network connectivity). We just need to
get the old network_info first, in order to use it in notifications
and to stash it for the later call to
driver.post_live_migration_at_source().

This is a smaller and safer change than the parallelization attempt in
the subsequent patch, so it's done in its own patch because it might
be backportable, and would help with network downtime during live
migration.

To avoid any potential data leaks, we want to be certain to clean up
volumes on the source. To that end we wrap the code that is being
moved before the source volume cleanup code in a try/finally block
in order to prevent any uncaught exception from blocking the cleanup.

Change-Id: I700943723a32e732e3e3be825f3fd44a9f923a0b
2023-12-16 21:03:57 -05:00
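The reordering in the commit above relies on `try`/`finally` so that moving network activation earlier cannot skip the volume cleanup. A minimal sketch, assuming each step is a callable; the function and parameter names are hypothetical placeholders for the real Neutron, notification, and Cinder calls.

```python
def post_live_migration(activate_port_bindings, send_notifications,
                        cleanup_source_volumes):
    """Hypothetical ordering sketch for the post-live-migration steps.

    Port bindings are activated first so the guest regains networking as
    soon as possible; the finally block guarantees the source volume
    cleanup still runs even if an earlier step raises.
    """
    try:
        activate_port_bindings()
        send_notifications()
    finally:
        cleanup_source_volumes()
```

Any exception from the earlier steps still propagates to the caller after the cleanup has run.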
Sean Mooney f4852f4c81 [codespell] fix final typos and enable ci
This change adds the pre-commit config and
tox targets to run codespell both independently
and via the pep8 target.

This change corrects all the final typos in the
codebase as detected by codespell.

Change-Id: Ic4fb5b3a5559bc3c43aca0a39edc0885da58eaa2
2023-12-15 12:32:42 +00:00
Stephen Finucane 3973fc393c Bump hacking version
This bumps the version of flake8 and resolves some erroneous failures in
f-strings. A number of new E721 (do not compare types) class errors are
picked up, which are all addressed.

Change-Id: I7a1937b107ff3af8d1e5fe23fc32b120ef4697f7
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
2023-12-14 10:54:26 +00:00
Zuul 17b7aa3926 Merge "[codespell] fix typos in tests" 2023-12-13 23:46:08 +00:00
Zuul 0d0600d62e Merge "Do not untrack resources of a server being unshelved" 2023-12-05 12:56:30 +00:00
Danylo Vodopianov eb8519d811 Packed virtqueue support was added.
1) Extended the flavor/image extra specs.
2) A new XML parameter for the QEMU command was added.
3) A new request filter was added for the scheduler.
4) Unit and functional tests were updated.
5) Requirements were updated (os-traits = 3.0.0).
6) A release note was added.

Nova spec: https://review.opendev.org/c/openstack/nova-specs/+/868377

Depends-On: https://review.opendev.org/c/openstack/os-traits/+/876069
Change-Id: I789eeae86947e9a3cbd7d5fcc58d2aabe3b8b84c
2023-11-29 16:06:33 +02:00
Sylvain Bauza ee9ed0f7c6 Fix rebuild compute RPC API exception for rolling-upgrades
By I0d889691de1af6875603a9f0f174590229e7be18 we broke rebuild for Yoga
or older computes.
By I9660d42937ad62d647afc6be965f166cc5631392 we broke rebuild for Zed
computes.

Fix this by making the parameters optional.

Change-Id: I0ca04045f8ac742e2b50490cbe5efccaee45c5c0
Closed-Bug: #2040264
2023-10-30 10:17:42 -07:00
Dan Smith 190ecc6b8b Clean up service_get_all()
When we added the all_cells flag to this we just kinda hacked it
into place, leaving a big chunk of the method nested inside a
conditional. This refactors out that chunk into a helper, and also
corrects a naming error that was very confusing when reading the code
(a variable named "service" which was a list of services).

Change-Id: I41ff076864dce9ed826922f6609536ea4545a181
2023-10-09 07:45:08 -07:00
Dan Smith 86889b9182 Warn if we find compute services in cell0
While debugging a field issue recently, we determined that computes
had been pointed at cell0 and created service and node records there.
This makes us warn during service list if we find compute services
in cell0 to tip off operators that they have a configuration problem.

Change-Id: Id95c0d02cc34348623b01997fcd1930628d48ccc
2023-10-09 07:03:48 -07:00
Sean Mooney 2232ca95f2 [codespell] fix typos in tests
This mainly fixes typos in the tests and
one typo in an exception message.
Some additional items are added to the dictionary based on
our usage of variable names in tests, but we could remove them later
with minor test updates. They are intentionally not
fixed in this commit to limit scope creep.

Change-Id: Iacfbb0a5dc8ffb0857219c8d7c7a7d6e188f5980
2023-10-03 11:08:55 +01:00
Dan Smith abbac59e33 Sanity check that new hosts have no instances
If we are starting up and we think we are a new host (i.e. no pre-
existing service record was found for us), we expect to have no
instances on our hypervisor. If that is not the case, it is likely
that we got pointed at a new fresh database or the wrong database
(i.e. a different cell from our own). In that case, we should abort
startup to avoid creating new service, compute node, and resource
provider records.

This is a sort of follow-on additional improvement related to
work done in blueprint stable-compute-uuid.

Change-Id: Id817c51c90485119270f3b7f3c52858f42100637
2023-10-02 12:35:00 -07:00
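The startup rule in the commit above is a single condition: no existing service record plus a non-empty hypervisor means something is misconfigured. A minimal sketch, assuming the service lookup returns `None` when nothing is found; `InvalidConfiguration` and `check_new_host_is_empty` are hypothetical names, not Nova's implementation.

```python
class InvalidConfiguration(Exception):
    """Hypothetical startup error for a compute pointed at the wrong database."""


def check_new_host_is_empty(existing_service, local_instances) -> None:
    """Abort startup if we look like a new host but already run instances.

    ``existing_service`` is None when no service record was found for this
    host; ``local_instances`` is whatever the hypervisor reports as running.
    """
    if existing_service is None and local_instances:
        raise InvalidConfiguration(
            "No service record found for this host, but the hypervisor has "
            f"{len(local_instances)} instance(s); refusing to start. Check "
            "that the database connection points at the correct cell."
        )
```

Raising before any records are written is what prevents the stray service, compute node, and resource provider records the commit describes.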
Zuul 11843f249c Merge "Fix pep8 errors with new hacking" 2023-09-22 17:34:40 +00:00
Zuul 4b0515514d Merge "Revert "Make compute node rebalance safter"" 2023-09-13 21:32:20 +00:00
Sylvain Bauza 36a5740e2a Revert "Make compute node rebalance safter"
This reverts commit 772f5a1ae4.

Change-Id: I20e78dfafe19fc1e7dc7344238c01cb585f744dc
2023-09-13 19:24:20 +02:00
Sylvain Bauza f502a23600 Update compute rpc alias for bobcat
This adds an alias for Bobcat

Change-Id: I244c8b17c6c0483e9c9e4131e1c94db34a439e77
2023-09-05 16:23:42 +02:00
Zuul eee5b39b8e Merge "Make compute node rebalance safter" 2023-09-02 08:26:43 +00:00
Sean Mooney 68b2131d81 only attempt to clean up dangling bdm if cinder is installed
This change ensures we only try to clean up dangling BDMs if
Cinder is installed and reachable.

Closes-Bug: #2033752
Change-Id: I0ada59d8901f8620fd1f3dc20d6be303aa7dabca
2023-09-01 17:00:40 +00:00
John Garbutt 772f5a1ae4 Make compute node rebalance safter
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.

To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.

There is a risk this might leave orphaned compute nodes and
resource providers that need manual cleanup because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.

blueprint ironic-shards

Change-Id: I7cd9e5ab878cea05462cac24de581dca6d50b3c3
2023-08-31 17:21:15 +00:00
Amit Uniyal 9d5935d007 Delete dangling bdms
On reboot, check the instance volume status on the Cinder side.
Verify that the volume exists and that Cinder has an attachment ID;
otherwise, delete its BDM data from the Nova DB, and vice versa.

Updated existing test cases to use CinderFixture while rebooting, as
reboot calls get_all_attachments.

Implements: blueprint https://blueprints.launchpad.net/nova/+spec/cleanup-dangling-volume-attachments
Closes-Bug: 2019078

Change-Id: Ieb619d4bfe0a6472aefb118b58283d7ad8d24c29
2023-08-31 14:19:58 +00:00
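The two-way reconciliation in the commit above can be expressed as set comparisons between Nova's BDMs and Cinder's attachments. A minimal sketch, assuming all inputs are volume BDMs and that Cinder's attachment listing already implies the volume exists; `find_dangling` and the dict keys are hypothetical simplifications of the real objects.

```python
from typing import Iterable


def find_dangling(nova_bdms: Iterable[dict], cinder_attachment_ids: set[str]):
    """Hypothetical reconciliation between Nova BDMs and Cinder attachments.

    Returns (bdms_to_delete, attachments_to_delete):
    - a BDM is dangling if it has no attachment_id or Cinder does not know it;
    - a Cinder attachment is dangling if no BDM references it.
    """
    bdms = list(nova_bdms)
    nova_ids = {b["attachment_id"] for b in bdms if b.get("attachment_id")}
    bdms_to_delete = [
        b for b in bdms
        if not b.get("attachment_id")
        or b["attachment_id"] not in cinder_attachment_ids
    ]
    attachments_to_delete = cinder_attachment_ids - nova_ids
    return bdms_to_delete, attachments_to_delete
```

Returning both lists instead of deleting in place keeps the comparison testable and leaves the actual deletion to the caller.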
Bence Romsics f1dc4ec39b Do not untrack resources of a server being unshelved
This patch concerns the time when a VM is being unshelved and the
compute manager has set the task_state to spawning, claimed resources
for the VM, and then called driver.spawn(). So the instance is in
vm_state SHELVED_OFFLOADED, task_state spawning.

If a new update_available_resource periodic job starts at this point,
it collects all the instances assigned to the node to calculate
resource usage. However, the calculation assumed that a VM in
SHELVED_OFFLOADED state does not need a resource allocation on the
node (presumably because it is being removed from the node as it is
offloaded) and deleted the resource claim.

Given all this we ended up with the VM spawned successfully but having
lost the resource claim on the node.

This patch changes what we do in vm_state SHELVED_OFFLOADED, task_state
spawning. We no longer delete the resource claim in this state and
keep tracking the resource in stats.

Change-Id: I8c9944810c09d501a6d3f60f095d9817b756872d
Closes-Bug: #2025480
2023-08-17 10:50:32 +02:00
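The behavior change in the commit above boils down to one exception in the "does this instance use resources here?" decision. A minimal sketch, assuming the state strings below match Nova's vm_state and task_state values; `should_track_instance` is a hypothetical helper that ignores every other state the real resource tracker considers.

```python
# Assumed state values; Nova defines these in its vm_states/task_states modules.
SHELVED_OFFLOADED = "shelved_offloaded"
SPAWNING = "spawning"


def should_track_instance(vm_state: str, task_state) -> bool:
    """Hypothetical predicate used when recomputing node resource usage.

    A SHELVED_OFFLOADED instance normally holds no resources on the node,
    except while it is being unshelved (task_state == spawning), when the
    claim made before driver.spawn() must be kept.
    """
    if vm_state == SHELVED_OFFLOADED:
        return task_state == SPAWNING
    return True
```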
Zuul bee1313240 Merge "Online migrate missing Instance.compute_id fields" 2023-07-28 18:36:58 +00:00
Zuul e5ee5e035c Merge "Add compute_id to Instance object" 2023-07-27 01:38:06 +00:00
Zuul 54de747c25 Merge "Add dest_compute_id to Migration object" 2023-07-26 02:04:49 +00:00
Zuul e02c5f0e7a Merge "Populate ComputeNode.service_id" 2023-07-14 22:41:39 +00:00
Zuul 1fe8c4becb Merge "Fix failed count for anti-affinity check" 2023-06-07 14:35:52 +00:00
Zuul fc8951efb9 Merge "Process unlimited exceptions raised by unplug_vifs" 2023-06-07 14:16:10 +00:00
Yusuke Okada 56d320a203 Fix failed count for anti-affinity check
The late anti-affinity check runs in the compute manager to prevent
parallel scheduling requests from invalidating the anti-affinity server
group policy. When the check fails, the instance is rescheduled.
However, this failure was counted as a real instance boot failure of the
compute host and could lead to de-prioritization of the compute host
in the scheduler via BuildFailureWeigher. As the late anti-affinity
check does not indicate any fault of the compute host itself, it
should not be counted towards the build failure counter.
This patch adds new build results to handle this case.

Closes-Bug: #1996732
Change-Id: I2ba035c09ace20e9835d9d12a5c5bee17d616718
Signed-off-by: Yusuke Okada <okada.yusuke@fujitsu.com>
2023-06-06 10:15:16 +02:00
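One way to picture the new build results from the commit above is a distinct outcome for policy-driven reschedules that the failure counter skips. A minimal sketch; `BuildResult`, its members, and `BuildStats` are hypothetical and only loosely modeled on the commit's description, not copied from Nova.

```python
import enum


class BuildResult(enum.Enum):
    """Hypothetical build outcomes, including a policy-driven reschedule."""
    ACTIVE = "active"
    FAILED = "failed"
    RESCHEDULED = "rescheduled"
    RESCHEDULED_BY_POLICY = "rescheduled_by_policy"


class BuildStats:
    """Hypothetical per-host counter that a failure-based weigher could read."""

    def __init__(self) -> None:
        self.failed_builds = 0

    def record(self, result: BuildResult) -> None:
        # A late anti-affinity reschedule says nothing about host health,
        # so only genuine failures and ordinary reschedules are counted.
        if result in (BuildResult.FAILED, BuildResult.RESCHEDULED):
            self.failed_builds += 1
```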
Dan Smith 84e7bed27e Online migrate missing Instance.compute_id fields
This migrates *existing* Instance records with incomplete compute_id
fields, where necessary, while updating node resources. This is done
separately from the earlier patch to do new instances just to
demonstrate in CI that we're still able to run in such a partially-
migrated state.

Related to blueprint compute-object-ids

Change-Id: Ie18342deeddb7f58564953e6b46ea0b0d7495595
2023-05-31 07:13:16 -07:00
Dan Smith 625fb569a7 Add compute_id to Instance object
This adds the compute_id field to the Instance object and adds
checks in save() and create() to make sure we no longer update node
without also updating compute_id.

Related to blueprint compute-object-ids

Change-Id: I0740a2e4e09a526da8565a18e6761b4dbdc4ec0b
2023-05-31 07:13:16 -07:00
Dan Smith 70516d4ff9 Add dest_compute_id to Migration object
This makes us store the compute_id of the destination node in the
Migration object. Since resize/cold-migration changes the node
affiliation of an instance *to* the destination node *from* the source
node, we need a positive record of the node id to be used. The
destination node set this to its own node when creating the migration,
and it is used by the source node when the switchover happens.

Because the migration may be backleveled for an older node involved
in that process and thus saved or passed without this field, this
adds a compatibility routine that falls back to looking up the node
by host/nodename.

Related to blueprint compute-object-ids

Change-Id: I362a40403d1094be36412f5f7afba00da8af8301
2023-05-31 07:13:16 -07:00
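The compatibility routine in the commit above prefers the stored ID and falls back to a lookup when an older node left the field unset. A minimal sketch, assuming node IDs can be found in a dict keyed by (host, nodename); `Migration`, its field names, and `resolve_dest_compute_id` are hypothetical simplifications, not Nova's objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Migration:
    """Hypothetical migration record; dest_compute_id is None when backleveled."""
    dest_compute: str   # destination host name
    dest_node: str      # destination nodename
    dest_compute_id: Optional[int] = None


def resolve_dest_compute_id(migration: Migration,
                            nodes_by_host_node: dict) -> int:
    """Prefer the stored node ID; otherwise fall back to a host/nodename lookup."""
    if migration.dest_compute_id is not None:
        return migration.dest_compute_id
    return nodes_by_host_node[(migration.dest_compute, migration.dest_node)]
```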