Commit Graph

911 Commits

Author SHA1 Message Date
Zuul 6bd99eb2ea Merge "Correctly reset instance task state in rebooting hard" 2024-03-20 13:34:22 +00:00
John Garbutt 947bb5f641 Make compute node rebalance safer
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.

To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.

There is a risk this might leave orphaned compute nodes and
resource providers that need manual clean up because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.

blueprint ironic-shards

Change-Id: I2bc77cbb77c2dd5584368563dc4250d71913906b
2024-02-25 13:25:27 -08:00
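
A minimal sketch of the guard described above, with injected callables standing in for nova's Ironic client and DB layer (all names here are illustrative, not nova's actual API):

  # Only delete the compute node record once Ironic confirms the backing
  # node is gone; otherwise leave cleanup to the operator.
  def delete_compute_node_safely(node_exists_in_ironic, destroy_record,
                                 node_uuid):
      if node_exists_in_ironic(node_uuid):
          # Still known to Ironic: deleting now could orphan the compute
          # node and resource provider records, so refuse.
          print(f"refusing to delete {node_uuid}: still present in Ironic")
          return False
      destroy_record(node_uuid)
      return True

  # The Ironic node is gone, so the record may be deleted.
  assert delete_compute_node_safely(lambda u: False, print, "node-1")
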
Zuul bdd7daffbb Merge "Check if destination can support the src mdev types" 2024-02-23 15:46:01 +00:00
Ghanshyam Mann 0c1e1ccf03 HyperV: Remove RDP console API
The RDP console was only for the HyperV driver, so this removes the
API. As the API URL stays the same (it is shared with the other
console type APIs), the RDP console API will now return 400.

Cleaning up the related config options as well as moving its
API ref to the obsolete section.

Keeping the RPC method to avoid errors when an old controller is used
with a new compute. It can be removed in the next RPC version bump.

Change-Id: I8f5755009da4af0d12bda096d7a8e85fd41e1a8c
2024-02-13 12:24:38 -08:00
Pierre-Samuel Le Stang aa3e8fef7b Correctly reset instance task state in rebooting hard
When a user asks for a hard reboot of a running instance while nova-compute
is unavailable (service stopped or host down), under certain conditions the
instance can stay in the rebooting_hard task_state after nova-compute starts
again. This patch aims to fix that.

Closes-Bug: #1999674
Change-Id: I170e390fe4e467898a8dc7df6a446f62941d49ff
2024-02-06 10:04:01 +01:00
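
A simplified sketch of the recovery, with a dict standing in for the instance record and made-up state names (nova's real _init_instance handling is more involved and may instead re-drive the reboot):

  REBOOTING_HARD = 'rebooting_hard'

  def init_instance(instance):
      # The service went down mid hard-reboot, so the reboot never ran;
      # clear the stale task_state so the instance is usable again.
      if instance['task_state'] == REBOOTING_HARD:
          instance['task_state'] = None
      return instance

  inst = {'vm_state': 'active', 'task_state': REBOOTING_HARD}
  assert init_instance(inst)['task_state'] is None
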
Sylvain Bauza 489aab934c Check if destination can support the src mdev types
Now that the source knows that both computes support the right
libvirt version, it passes to the destination the list of mdevs it has
for the instance. With this change, we verify whether the types of those
mdevs are actually supported by the destination.
On the next change, we'll pass the destination mdevs back to the
source.

Partially-Implements: blueprint libvirt-mdev-live-migrate
Change-Id: Icb52fa5eb0adc0aa6106a90d87149456b39e79c2
2024-02-05 15:10:55 +01:00
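
The destination-side check boils down to a set comparison; a minimal sketch with made-up mdev type names:

  def check_mdev_types_supported(source_mdev_types, dest_supported_types):
      missing = set(source_mdev_types) - set(dest_supported_types)
      if missing:
          raise RuntimeError(
              f"destination does not support mdev types: {sorted(missing)}")

  # Passes: every type used on the source is offered by the destination.
  check_mdev_types_supported({'nvidia-35'}, {'nvidia-35', 'nvidia-36'})
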
Zuul ef08dcb5f3 Merge "Reproducer test of bug #1999674" 2023-12-20 03:50:36 +00:00
Zuul 17b7aa3926 Merge "[codespell] fix typos in tests" 2023-12-13 23:46:08 +00:00
Sean Mooney 2232ca95f2 [codespell] fix typos in tests
This mainly fixes typos in the tests and
one typo in an exception message.
Some additional items are added to the ignore dict based on
our usage of variable names in tests, but we could remove them later
by doing minor test updates. They are intentionally not
fixed in this commit to limit scope creep.

Change-Id: Iacfbb0a5dc8ffb0857219c8d7c7a7d6e188f5980
2023-10-03 11:08:55 +01:00
Dan Smith abbac59e33 Sanity check that new hosts have no instances
If we are starting up and we think we are a new host (i.e. no pre-
existing service record was found for us), we expect to have no
instances on our hypervisor. If that is not the case, it is likely
that we got pointed at a new fresh database or the wrong database
(i.e. a different cell from our own). In that case, we should abort
startup to avoid creating new service, compute node, and resource
provider records.

This is a sort of follow-on additional improvement related to
work done in blueprint stable-compute-uuid.

Change-Id: Id817c51c90485119270f3b7f3c52858f42100637
2023-10-02 12:35:00 -07:00
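
A minimal sketch of the startup check, with injected stand-ins for the service record lookup and the virt driver's instance list:

  def assert_new_host_is_empty(existing_service, list_driver_instances):
      if existing_service is None and list_driver_instances():
          # We look brand new but the hypervisor has guests: probably a
          # fresh or wrong database. Abort before creating any records.
          raise SystemExit("no service record, but hypervisor reports "
                           "instances; refusing to start")

  # Fine: truly new host, nothing running on the hypervisor yet.
  assert_new_host_is_empty(None, lambda: [])
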
Zuul 11843f249c Merge "Fix pep8 errors with new hacking" 2023-09-22 17:34:40 +00:00
Sylvain Bauza 36a5740e2a Revert "Make compute node rebalance safer"
This reverts commit 772f5a1ae4.

Change-Id: I20e78dfafe19fc1e7dc7344238c01cb585f744dc
2023-09-13 19:24:20 +02:00
John Garbutt 772f5a1ae4 Make compute node rebalance safer
Many bugs around nova-compute rebalancing are focused around
problems when the compute node and placement resources are
deleted, and sometimes they never get re-created.

To limit this class of bugs, we add a check to ensure a compute
node is only ever deleted when it is known to have been deleted
in Ironic.

There is a risk this might leave orphaned compute nodes and
resource providers that need manual clean up because users
do not want to delete the node in Ironic, but are removing it
from nova management. But on balance, it seems safer to leave
these cases up to the operator to resolve manually, and collect
feedback on how to better help those users.

blueprint ironic-shards

Change-Id: I7cd9e5ab878cea05462cac24de581dca6d50b3c3
2023-08-31 17:21:15 +00:00
Zuul e5ee5e035c Merge "Add compute_id to Instance object" 2023-07-27 01:38:06 +00:00
Zuul 54de747c25 Merge "Add dest_compute_id to Migration object" 2023-07-26 02:04:49 +00:00
Zuul e02c5f0e7a Merge "Populate ComputeNode.service_id" 2023-07-14 22:41:39 +00:00
Zuul 1fe8c4becb Merge "Fix failed count for anti-affinity check" 2023-06-07 14:35:52 +00:00
Zuul fc8951efb9 Merge "Process unlimited exceptions raised by unplug_vifs" 2023-06-07 14:16:10 +00:00
Yusuke Okada 56d320a203 Fix failed count for anti-affinity check
The late anti-affinity check runs in the compute manager to prevent
parallel scheduling requests from violating the anti-affinity server
group policy. When the check fails, the instance is re-scheduled.
However, this failure is counted as a real instance boot failure of the
compute host and can lead to de-prioritization of the compute host
in the scheduler via the BuildFailureWeigher. As the late anti-affinity
check does not indicate any fault of the compute host itself, it
should not be counted towards the build failure counter.
This patch adds new build results to handle this case.

Closes-Bug: #1996732
Change-Id: I2ba035c09ace20e9835d9d12a5c5bee17d616718
Signed-off-by: Yusuke Okada <okada.yusuke@fujitsu.com>
2023-06-06 10:15:16 +02:00
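
The gist is a distinct build result for policy-driven rescheduling so that only genuine failures feed the counter behind the BuildFailureWeigher; a sketch with illustrative constant names:

  FAILED = 'failed'                      # genuine build failure: count it
  FAILED_BY_POLICY = 'failed_by_policy'  # late anti-affinity check: skip

  def update_failure_counter(result, failures):
      if result == FAILED:
          failures += 1
      return failures

  # A reschedule caused by the anti-affinity check leaves the count alone.
  assert update_failure_counter(FAILED_BY_POLICY, 0) == 0
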
Dan Smith 625fb569a7 Add compute_id to Instance object
This adds the compute_id field to the Instance object and adds
checks in save() and create() to make sure we no longer update node
without also updating compute_id.

Related to blueprint compute-object-ids

Change-Id: I0740a2e4e09a526da8565a18e6761b4dbdc4ec0b
2023-05-31 07:13:16 -07:00
Dan Smith 70516d4ff9 Add dest_compute_id to Migration object
This makes us store the compute_id of the destination node in the
Migration object. Since resize/cold-migration changes the node
affiliation of an instance *to* the destination node *from* the source
node, we need a positive record of the node id to be used. The
destination node sets this to its own node id when creating the migration,
and it is used by the source node when the switchover happens.

Because the migration may be backleveled for an older node involved
in that process and thus saved or passed without this field, this
adds a compatibility routine that falls back to looking up the node
by host/nodename.

Related to blueprint compute-object-ids

Change-Id: I362a40403d1094be36412f5f7afba00da8af8301
2023-05-31 07:13:16 -07:00
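
A minimal sketch of the compatibility routine, with a dict standing in for the Migration object and an injected name-based lookup:

  def get_dest_compute_id(migration, lookup_by_host_and_node):
      dest_id = migration.get('dest_compute_id')
      if dest_id is not None:
          return dest_id
      # Backleveled record from an older node: fall back to the loose
      # host/nodename lookup.
      return lookup_by_host_and_node(migration['dest_compute'],
                                     migration['dest_node'])

  old_style = {'dest_compute': 'host2', 'dest_node': 'node2'}
  assert get_dest_compute_id(old_style, lambda host, node: 42) == 42
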
Dan Smith afad847e4d Populate ComputeNode.service_id
The ComputeNode object already has a service_id field that we stopped
using a while ago. This moves us back to the point where we set it when
creating new ComputeNode records, and also migrates existing records
when they are loaded.

The resource tracker is created before we may have created the
service record, but is updated afterwards in the pre_start_hook().
So this adds a way for us to pass the service_ref to the resource
tracker during that hook so that it is present before the first time
we update all of our ComputeNode records. It also makes sure to pass
the Service through from the actual Service manager instead of looking
it up again to make sure we maintain the tight relationship and avoid
any name-based ambiguity.

Related to blueprint compute-object-ids

Change-Id: I5e060d674b6145c9797c2251a2822106fc6d4a71
2023-05-31 07:06:34 -07:00
Artom Lifshitz faa1e64e5b Fix pep8 errors with new hacking
Hacking has bumped the version of flake8 that it's using to 5.0 in its
6.0.1 release. This turns up quite a few pep8 errors lurking in our
code. Fix them.

Needed-by: https://review.opendev.org/c/openstack/hacking/+/874516
Change-Id: I3b9c7f9f5de757f818ec358c992ffb0e5f3e310f
2023-04-28 08:34:52 -04:00
Dan Smith 82deb0ce4b Stop ignoring missing compute nodes in claims
The resource tracker will silently ignore attempts to claim resources
when the node requested is not managed by this host. The misleading
"self.disabled(nodename)" check will fail if the nodename is not known
to the resource tracker, causing us to bail early with a NopClaim.
That means we also don't do additional setup like creating a migration
context for the instance, claim resources in placement, and handle
PCI/NUMA things. This behavior is quite old, and clearly doesn't make
sense in a world with things like placement. The bulk of the test
changes here are due to the fact that a lot of tests were relying on
this silent ignoring of a mismatching node, because they were passing
node names that weren't even tracked.

This change makes us raise an error if this happens so that we can
actually catch it, and avoid silently continuing with no resource
claim.

Change-Id: I416126ee5d10428c296fe618aa877cca0e8dffcf
2023-04-24 15:26:52 -07:00
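
A minimal sketch of the behavioral change, with a plain set standing in for the tracker's node map (the exception and claim types are stand-ins):

  class ComputeHostNotFound(Exception):
      pass

  def claim_resources(tracked_nodes, nodename):
      if nodename not in tracked_nodes:
          # Previously this returned a NopClaim, silently skipping the
          # migration context, placement allocation and PCI/NUMA handling.
          raise ComputeHostNotFound(f"node {nodename!r} not tracked here")
      return {'node': nodename}  # stand-in for a real claim object

  assert claim_resources({'node1'}, 'node1') == {'node': 'node1'}
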
Alexey Stupnikov dacae335e4 Process unlimited exceptions raised by unplug_vifs
Currently compute manager's _cleanup_allocated_networks() method
expects NotImplementedError or exception.NovaException when
calling self.driver.unplug_vifs.

In reality, other classes of exceptions can be raised. They can break
the nova-compute flow and leave the instance in an inconsistent state. This
patch switches from catching only exception.NovaException to catching all
kinds of exceptions.

Closes-bug: #2015092
Change-Id: Icaf3cc93edfea97ee4fa497bdeb5f7d631c8ae55
2023-04-03 17:37:14 +02:00
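
A minimal sketch of the widened handler (the real code is in the compute manager's _cleanup_allocated_networks()):

  import logging

  LOG = logging.getLogger(__name__)

  def cleanup_allocated_networks(unplug_vifs):
      try:
          unplug_vifs()
      except NotImplementedError:
          pass  # the driver simply does not support it
      except Exception:
          # Previously only exception.NovaException was caught, so any
          # other error aborted cleanup and left the instance inconsistent.
          LOG.exception("unplug_vifs failed during cleanup; continuing")

  def broken_unplug():
      raise ValueError("unexpected driver error")

  cleanup_allocated_networks(broken_unplug)  # logged, not raised
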
Zuul d443f8e4c4 Merge "Transport context to all threads" 2023-02-27 15:11:25 +00:00
Dan Smith d892905904 Protect against a deleted node id file
If we are starting up for the first time, we expect to generate and
write a node_uuid file if one does not exist. If we are upgrading,
we expect to do the same. However, if we are starting up not after an
upgrade and not for the first time, a missing compute_id file is an
error, and we should abort.

Because of the additional check that this adds, which is called from
a variety of places that don't have the stable compute node singleton
stubbed to make it happy, this mocks the check for any test that does
not specifically aim to exercise it.

Related to blueprint stable-compute-uuid
Change-Id: If83ce14b96e7d84ae38eba9d798754557d5abdfd
2023-02-01 09:20:59 -08:00
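
The decision table reduces to three cases; a minimal sketch using a plain file (the real check is wired through service startup and, as the message notes, is mocked in most tests):

  import os
  import uuid

  def ensure_node_uuid(state_dir, first_start_or_upgrade):
      path = os.path.join(state_dir, 'compute_id')
      if os.path.exists(path):
          with open(path) as f:
              return f.read().strip()
      if first_start_or_upgrade:
          node_uuid = str(uuid.uuid4())
          with open(path, 'w') as f:
              f.write(node_uuid)
          return node_uuid
      # Routine restart with a missing file: our identity was deleted,
      # so abort rather than silently invent a new one.
      raise SystemExit("compute_id file is missing; refusing to start")
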
Dan Smith 72370a188c Check our nodes for hypervisor_hostname changes
When we are loading our ComputeNode objects by UUID according to what
the virt driver reported, we can sanity check the DB records against
the virt driver's hostname. This covers the case where our CONF.host
has not changed but the hostname reported by the virt driver has,
assuming we can find the ComputeNode object(s) that match our
persistent node uuid.

Related to blueprint stable-compute-uuid
Change-Id: I41635210d7d6f46b437b06d2570a26a80ed8676a
2023-02-01 09:20:59 -08:00
Zuul ac6f895361 Merge "compute: enhance compute evacuate instance to support target state" 2023-02-01 11:49:38 +00:00
Sahid Orentino Ferdjaoui 8c2e765989 compute: enhance compute evacuate instance to support target state
Related to the bp/allowing-target-state-for-evacuate. This change
extends the compute API to accept a new argument targetState.

The targetState argument, when set, forces the state of the evacuated
instance on the destination host.

Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>
Change-Id: I9660d42937ad62d647afc6be965f166cc5631392
2023-01-31 11:29:01 +01:00
Dan Smith e258164f5a Detect host renames and abort startup
Except on ironic, if we are able to determine that our locally-
configured node has a hostname other than what we expect, we should
abort startup. Since we currently depend on the loose name-based
association of nodes, services, and instances, we need to make sure
we do not startup in an inconsistent configuration.

Related to blueprint stable-compute-uuid
Change-Id: I595b27a57516cffe1f172cf2fb736e1b11373a1d
2023-01-30 12:46:36 -08:00
Dan Smith 23c5f3d585 Make resource tracker use UUIDs instead of names
This makes the resource tracker look up and create ComputeNode objects
by uuid instead of nodename. For drivers like ironic that already
provide 'uuid' in the resources dict, we can use that. For those
that do not, we force the uuid to be the locally-persisted node
uuid, and use that to find/create the ComputeNode object.

A (happy) side-effect of this is that if we find a deleted compute
node object that matches that of our hypervisor, we undelete it
instead of re-creating one with a new uuid, which may clash with our
old one. This means we remove some of the special-casing of ironic
rebalance, although the tests for that still largely stay the same.

Change-Id: I6a582a38c302fd1554a49abc38cfeda7c324d911
2023-01-30 10:53:44 -08:00
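
A minimal sketch of the uuid-first find-or-create with the undelete side effect, using dicts as stand-ins for ComputeNode rows:

  def get_or_create_compute_node(rows, node_uuid, hostname):
      for row in rows:
          if row['uuid'] == node_uuid:
              # Undelete a matching soft-deleted row rather than minting
              # a new uuid that could clash with the old one.
              row['deleted'] = False
              return row
      row = {'uuid': node_uuid, 'host': hostname, 'deleted': False}
      rows.append(row)
      return row

  rows = [{'uuid': 'u1', 'host': 'n1', 'deleted': True}]
  assert get_or_create_compute_node(rows, 'u1', 'n1')['deleted'] is False
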
Dan Smith 53a925ff0f Persist existing node uuids locally
If we start up from an older service version, we need to look at the
database to determine what our node is and write it locally. This
leap of faith will be the last time we need to look up our node based
on our hostname. Instead of relying on this, operators can (ideally)
pre-populate the compute node identities for maximum safety.

Change-Id: I178700bf5ba172603593b1c9972520d05face849
2023-01-20 08:07:06 -08:00
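
A minimal sketch of the one-time migration, with injected callables for the local file and the hostname-based DB lookup:

  def ensure_local_node_uuid(read_local, write_local,
                             lookup_uuid_by_hostname, hostname):
      node_uuid = read_local()
      if node_uuid:
          return node_uuid
      # The final name-based lookup: resolve once, persist, never again.
      node_uuid = lookup_uuid_by_hostname(hostname)
      write_local(node_uuid)
      return node_uuid

  store = []
  got = ensure_local_node_uuid(lambda: store[0] if store else None,
                               store.append, lambda host: 'uuid-1',
                               'compute1')
  assert got == 'uuid-1' and store == ['uuid-1']
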
Dan Smith cf111d1001 Pass service ref to init_host(), if exists
This just adds the service ref to init_host() so that we can pass it
(if it exists) at startup. On the first run, this will be None, so
we know that we don't have an existing service record and thus do not
need to do any migration.

Related to blueprint stable-compute-uuid

Change-Id: I1491c9b234ef0c262b5ed01d4c138eba8dedff77
2023-01-20 07:22:02 -08:00
Pierre-Samuel Le Stang e5766446e5 Reproducer test of bug #1999674
This commit aims to show that an active instance which is hard rebooted
while nova compute is unavailable (service or host down) stays in
task_state REBOOTING_HARD after nova compute is available again.

Related-Bug: #1999674
Change-Id: Ic672ce509ca21715f74931b8a6c6990b1c20ce30
2023-01-05 17:23:17 +01:00
Zuul 1c46c4e9e5 Merge "Store allocated RP in InstancePCIRequest" 2022-12-21 19:20:09 +00:00
Balazs Gibizer f86f1800f0 Store allocated RP in InstancePCIRequest
After the scheduler selects a target host and allocates an allocation
candidate that passed the filters, nova needs to make sure that the PCI
claim allocates the real PCI devices from the RP allocated in
placement. Placement returns the request group - provider mapping for
each allocation candidate, so nova can map which InstancePCIRequest was
fulfilled from which RP in the selected allocation candidate. This
mapping is then recorded in the InstancePCIRequest object and used
during the PCI claim to filter for PCI pools that can be used to claim
PCI devices from.

blueprint: pci-device-tracking-in-placement
Change-Id: I18bb31e23cc014411db68c31317ed983886d1a8e
2022-12-21 16:17:34 +01:00
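
A rough sketch of the final filtering step, with dicts standing in for PCI pools and the recorded RP mapping:

  def filter_pools_by_rp(pools, allocated_rp_uuid):
      # Only pools backed by the RP that placement actually allocated
      # from are eligible to satisfy this request.
      return [p for p in pools if p['rp_uuid'] == allocated_rp_uuid]

  pools = [{'rp_uuid': 'rp-1', 'dev': '0000:81:00.1'},
           {'rp_uuid': 'rp-2', 'dev': '0000:82:00.1'}]
  assert filter_pools_by_rp(pools, 'rp-1')[0]['dev'] == '0000:81:00.1'
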
John Garbutt 8a476061c5 Ironic: retry when node not available
After a baremetal instance is deleted, and its allocation is removed
in placement, the ironic node might start cleaning. Eventually nova
will notice and update the inventory to be reserved.
During this window, a new instance may have already picked this
ironic node.

When that race happens today the build fails with an error:
"Failed to reserve node ..."

This change tries to ensure the remaining alternative hosts are
attempted before aborting the build.
Clearly the race is still there, but this makes it less painful.

Related-Bug: #1974070
Change-Id: Ie5cdc17219c86927ab3769605808cb9d9fa9fa4d
2022-12-15 16:33:43 +00:00
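
A minimal sketch of falling through to the alternates (the real logic threads through nova's build-retry machinery):

  def build_with_alternates(hosts, reserve_node):
      last_error = None
      for host in hosts:
          try:
              reserve_node(host)
              return host
          except IOError as err:  # e.g. "Failed to reserve node ..."
              last_error = err
      raise RuntimeError(f"all alternates exhausted: {last_error}")

  def reserve(host):
      if host != 'free':
          raise IOError(f"Failed to reserve node {host}")

  assert build_with_alternates(['cleaning', 'free'], reserve) == 'free'
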
Sean Mooney 8449b7caef [compute] always set instance.host in post_livemigration
This change adds a new _post_live_migration_update_host
function that wraps _post_live_migration and ensures
that, if we exit due to an exception, instance.host is set
to the destination host.

When we are in _post_live_migration the guest has already
started running on the destination host and we cannot revert.
Sometimes admins or users will hard reboot the instance, expecting
that to fix everything, when the vm enters the error state after
a failed migration. Previously this would end up recreating the
instance on the source node, leading to possible data corruption if
the instance used shared storage.

Change-Id: Ibc4bc7edf1c8d1e841c72c9188a0a62836e9f153
Partial-Bug: #1628606
2022-10-07 12:23:13 +00:00
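
A minimal sketch of the wrapper, with a dict standing in for the instance:

  def post_live_migration_update_host(post_live_migration, instance, dest):
      try:
          post_live_migration(instance, dest)
      except Exception:
          # The guest already runs on the destination and we cannot
          # revert, so never leave instance.host on the source: a later
          # hard reboot must not recreate the guest there.
          instance['host'] = dest
          raise

  inst = {'host': 'src'}

  def flaky_post(instance, dest):
      raise RuntimeError("late failure")

  try:
      post_live_migration_update_host(flaky_post, inst, 'dst')
  except RuntimeError:
      assert inst['host'] == 'dst'
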
Balazs Gibizer e43bf900dc Gracefully ERROR in _init_instance if vnic_type changed
If the vnic_type of a bound port changes from "direct" to "macvtap" and
the compute service is then restarted, then during _init_instance nova
tries to plug the vif of the changed port. However, as it now has the
macvtap vnic_type, nova tries to look up the netdev of the parent VF. But
that VF is consumed by the instance, so there is no such netdev on the host
OS. This error killed the compute service at startup due to an unhandled
exception. This patch adds the exception handler, logs an ERROR and
continues initializing the other instances on the host.

Also this patch adds a detailed ERROR log when nova detects that the
vnic_type changed during the _heal_instance_info_cache periodic task.

Closes-Bug: #1981813
Change-Id: I1719f8eda04e8d15a3b01f0612977164c4e55e85
2022-09-08 09:19:16 +02:00
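
A minimal sketch of the hardened startup loop (the instance objects and logger are stand-ins):

  def init_instances(instances, init_one, log_error):
      for inst in instances:
          try:
              init_one(inst)
          except Exception as err:
              # One broken instance (e.g. a changed vnic_type) must not
              # kill the whole compute service at startup.
              log_error(f"failed to initialize {inst}: {err}")

  def init_one(inst):
      if inst == 'bad-port-vm':
          raise LookupError("no netdev for parent VF")

  init_instances(['ok-vm', 'bad-port-vm', 'ok-vm2'], init_one, print)
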
Rajat Dhasmana 30aab9c234 Add support for volume backed server rebuild
This patch adds the plumbing for rebuilding a volume backed
instance in compute code. This functionality will be enabled
in a subsequent patch which adds a new microversion and the
external support for requesting it.

The flow of the operation is as follows:

1) Create an empty attachment
2) Detach the volume
3) Request cinder to reimage the volume
4) Wait for cinder to notify success to nova (via external events)
5) Update and complete the attachment

Related blueprint volume-backed-server-rebuild

Change-Id: I0d889691de1af6875603a9f0f174590229e7be18
2022-08-31 16:38:37 +05:30
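
A rough sketch of the five-step flow above, with a stub object standing in for the cinder calls and the external-event wait (method names are illustrative, not cinder's actual API):

  from types import SimpleNamespace

  def rebuild_volume_backed(volume_id, server_id, cinder, wait_for_event):
      attachment = cinder.create_attachment(volume_id, server_id)  # 1
      cinder.detach(volume_id)                                     # 2
      cinder.reimage(volume_id)                                    # 3
      wait_for_event('volume-reimaged', volume_id)                 # 4
      cinder.complete_attachment(attachment)                       # 5

  calls = []
  cinder = SimpleNamespace(
      create_attachment=lambda v, s: calls.append('attach') or 'att-1',
      detach=lambda v: calls.append('detach'),
      reimage=lambda v: calls.append('reimage'),
      complete_attachment=lambda a: calls.append('complete'))
  rebuild_volume_backed('vol-1', 'srv-1', cinder,
                        lambda event, v: calls.append('wait'))
  assert calls == ['attach', 'detach', 'reimage', 'wait', 'complete']
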
Balazs Gibizer 0d526d1f4b Reject PCI dependent device config
The PCI tracking in placement does not support the configuration where
both a PF and its child VFs are configured for nova usage. This patch
adds logic to detect and reject such a configuration. To be able to kill
the service if it is started with such a config, special exception handling
is added for the update_available_resource code path, similar to how a
failed reshape is handled.

blueprint: pci-device-tracking-in-placement
Change-Id: I708724465d2afaa37a65c231c64da88fc8b458eb
2022-08-25 10:00:10 +02:00
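
A minimal sketch of the rejection, representing each configured device as an (address, parent_address) pair where a PF has no parent:

  def validate_pci_config(devices):
      pf_addresses = {addr for addr, parent in devices if parent is None}
      for addr, parent in devices:
          if parent in pf_addresses:
              raise ValueError(
                  f"PF {parent} and its dependent VF {addr} are both "
                  "configured; this is not supported")

  validate_pci_config([('0000:81:00.0', None)])  # a PF alone is fine
  try:
      validate_pci_config([('0000:81:00.0', None),
                           ('0000:81:10.0', '0000:81:00.0')])
  except ValueError as err:
      print(err)
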
Balazs Gibizer 14e68ac6e9 Rename [pci]passthrough_whitelist to device_spec
A later patch in the pci-device-tracking-in-placement work
will extend the existing [pci]passthrough_whitelist config syntax.
So we take the opportunity here to deprecate the old non-inclusive
passthrough_whitelist name and introduce a better one.

All usage of CONF.pci.passthrough_whitelist is now changed over to
the new device_spec config. Also the in-tree documentation is updated
accordingly.

However the nova code still has a bunch of references to the
"whitelist" terminology. That will be handled in subsequent patches.

blueprint: pci-device-tracking-in-placement
Change-Id: I843032e113642416114f169069eebf6a56ed78dd
2022-08-10 17:08:35 +02:00
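
The rename keeps the JSON device-matching syntax; a hedged nova.conf example (the vendor/product ids are placeholders):

  [pci]
  # Deprecated spelling:
  # passthrough_whitelist = {"vendor_id": "8086", "product_id": "154d"}
  # New spelling, same syntax:
  device_spec = {"vendor_id": "8086", "product_id": "154d"}
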
Fabian Wiesel 646fc51732 Transport context to all threads
The nova.utils.spawn and spawn_n methods transport
the context (and profiling information) to the
newly created threads. But the same isn't done
when submitting work to thread-pools in the
ComputeManager.

The code doing that for spawn and spawn_n
is extracted to a new function
and called to submit the work to the thread-pools.

Closes-Bug: #1962574
Change-Id: I9085deaa8cf0b167d87db68e4afc4a463c00569c
2022-08-04 17:36:23 +05:30
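
A standalone sketch of the same pattern using stdlib contextvars (nova propagates its own RequestContext via its utility code, not contextvars, so this is an analogy rather than the actual implementation):

  import concurrent.futures
  import contextvars

  request_id = contextvars.ContextVar('request_id', default=None)
  executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

  def submit_with_context(fn, *args, **kwargs):
      # Capture the caller's context and run the work inside it, so the
      # pool thread sees the same request context as the submitter.
      ctx = contextvars.copy_context()
      return executor.submit(ctx.run, fn, *args, **kwargs)

  request_id.set('req-123')
  assert submit_with_context(request_id.get).result() == 'req-123'
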
Balazs Gibizer f8cf050a13 Remove double mocking
In py310 unittest.mock does not allow mocking the same function twice, as
the second mocking will fail to autospec the Mock object created by the
first mocking.

This patch manually fixes the double mocking.

Fixed cases:
1) one of the mocks was totally unnecessary, so it was removed
2) the second mock specialized the behavior of the first generic mock.
   In this case the second mock is replaced with the configuration of
   the first mock
3) a test case with two test steps mocked the same function for each
   step with overlapping mocks. Here the overlap was removed so that
   the two mocks exist independently

The get_connection injection in the libvirt functional test needed a
further tweak (yeah I know it has many already) to act like a single
mock (basically case #2) instead of a temporary re-mocking. Still the
globalness of the get_connection mocking warrants the special set/reset
logic there.

Change-Id: I3998d0d49583806ac1c3ae64f1b1fe343cefd20d
2022-08-02 15:31:15 +02:00
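
A tiny illustration of case #2 with unittest.mock: configure the single existing mock instead of patching the same target a second time:

  import os.path
  from unittest import mock

  with mock.patch('os.path.exists', autospec=True) as m_exists:
      # Specialize the one mock; re-patching the same target is what
      # breaks autospec under py310.
      m_exists.return_value = True
      assert os.path.exists('/definitely/not/there') is True
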
Stephen Finucane 89ef050b8c Use unittest.mock instead of third party mock
Now that we no longer support py27, we can use the standard library
unittest.mock module instead of the third party mock lib. Most of this
is autogenerated, as described below, but there is one manual change
necessary:

nova/tests/functional/regressions/test_bug_1781286.py
  We need to avoid using 'fixtures.MockPatch' since fixtures is using
  'mock' (the library) under the hood and a call to 'mock.patch.stop'
  found in that test will now "stop" mocks from the wrong library. We
  have discussed making this configurable but the option proposed isn't
  that pretty [1] so this is better.

The remainder was auto-generated with the following (hacky) script, with
one or two manual tweaks after the fact:

  import glob

  for path in glob.glob('nova/tests/**/*.py', recursive=True):
      with open(path) as fh:
          lines = fh.readlines()
      if 'import mock\n' not in lines:
          continue
      import_group_found = False
      create_first_party_group = False
      for num, line in enumerate(lines):
          line = line.strip()
          if line.startswith('import ') or line.startswith('from '):
              tokens = line.split()
              for lib in (
                  'ddt', 'six', 'webob', 'fixtures', 'testtools',
                  'neutron', 'cinder', 'ironic', 'keystone', 'oslo',
              ):
                  if lib in tokens[1]:
                      create_first_party_group = True
                      break
              if create_first_party_group:
                  break
              import_group_found = True
          if not import_group_found:
              continue
          if line.startswith('import ') or line.startswith('from '):
              tokens = line.split()
              if tokens[1] > 'unittest':
                  break
              elif tokens[1] == 'unittest' and (
                  len(tokens) == 2 or tokens[3] > 'mock'
              ):
                  break
          elif not line:
              break
      if create_first_party_group:
          lines.insert(num, 'from unittest import mock\n\n')
      else:
          lines.insert(num, 'from unittest import mock\n')
      del lines[lines.index('import mock\n')]
      with open(path, 'w+') as fh:
          fh.writelines(lines)

Note that we cannot remove mock from our requirements files yet due to
importing pypowervm unit test code in nova unit tests. This library
still uses the mock lib, and since we are importing test code and that
lib (correctly) only declares mock in its test-requirements.txt, mock
would not otherwise be installed and would cause errors while loading
nova unit test code.

[1] https://github.com/testing-cabal/fixtures/pull/49

Change-Id: Id5b04cf2f6ca24af8e366d23f15cf0e5cac8e1cc
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
2022-08-01 17:46:26 +02:00
Sean Mooney cd2c2f359b ignore deleted server groups in validation
This change simply catches the exception raised when
we look up a server group via a scheduler hint and the validation
upcall is enabled.

Change-Id: I858b4da35382a9f4dcf88f4b6db340e1f34eb82d
Closes-Bug: #1890244
2022-06-21 19:13:55 +01:00
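
A minimal sketch of the tolerant validation (the exception and lookup are stand-ins for nova's InstanceGroup API):

  class InstanceGroupNotFound(Exception):
      pass

  def validate_group_policy(get_group_by_hint, check_policy, group_hint):
      try:
          group = get_group_by_hint(group_hint)
      except InstanceGroupNotFound:
          # The group was deleted after the hint was recorded; there is
          # no policy left to validate, so carry on.
          return
      check_policy(group)

  def lookup_missing(hint):
      raise InstanceGroupNotFound(hint)

  validate_group_policy(lookup_missing, print, 'group-1')  # no error
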
Sean Mooney 84a84f7f2f add reproducer test for bug 1890244
This change adds a test to simulate validating
an instance group policy where the group has been
deleted but is still referenced in the scheduler hint.

Change-Id: I803e6286a773d9e53639ab0cd449fc72bb3be613
Related-Bug: #1890244
2022-06-21 12:04:20 +01:00
Zuul 93a65f06df Merge "Record SRIOV PF MAC in the binding profile" 2022-06-13 17:25:14 +00:00
Rajesh Tailor 2521810e55 Fix typos
This change fixes some of the typos in unit tests as well
as in the nova code-base.

Change-Id: I209bbb270baf889fcb2b9a4d1ce0ab4a962d0d0e
2022-05-30 17:40:00 +05:30