[Stable Only] hardware: Handle races during pinning
Due to how we do claiming of pinned CPUs and related NUMA "things", it's possible for claims to race. This raciness is usually not an issue since pinning with fail for the losing instance, which will just get rescheduled. This does mean that it's possible for an instance to land on a host with no CPUs at all though and this edge case is triggering a nasty bug made possible by Python's unusual scoping rules around for loops. >>> x = 5 >>> for y in range(x): ... pass ... >>> print(y) 4 'y' would be considered out of scope in the above for most other languages (JS and its even dumber scoping rules aside, I guess) and it leaves us with situations where the variable might never exist, i.e. the bug at hand: >>> x = 0 >>> for y in range(x): ... pass ... >>> print(y) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'y' is not defined Resolve this by adding a check to handle the "no CPUs at all" case and quick fail but also remove the reliance on this quirk of Python. This doesn't apply to stable/rocky since the issue was inadvertently resolved by changes I8982ab25338969cd98621f79b7fbec8af43d12c5 and I021ce59048d6292055af49457ba642022f648c81. However, those changes are significantly larger and backports have been previously rejected [1][2]. [1] https://review.openstack.org/#/c/588570/ [2] https://review.openstack.org/#/c/588571/ Change-Id: I6afc3af9f13e3c1cc312112eb28eb6e10d2a9e07 Signed-off-by: Stephen Finucane <sfinucan@redhat.com> Closes-Bug: #1821733
This commit is contained in:
parent
da9f9c962f
commit
50a0c6a4b9
|
@ -2398,6 +2398,26 @@ class CPUPinningCellTestCase(test.NoDBTestCase, _CPUPinningTestCaseBase):
|
|||
got_pinning = {x: x for x in range(0, 4)}
|
||||
self.assertEqual(got_pinning, inst_pin.cpu_pinning)
|
||||
|
||||
def test_get_pinning_host_siblings_fails(self):
|
||||
"""Validate that pinning fails if there are no free host CPUs.
|
||||
|
||||
This can happen due to raciness caused by nova-scheduler not claiming
|
||||
pinned CPU resources, meaning multiple instances can race, landing on
|
||||
the same host and attempting to claim the same resources.
|
||||
"""
|
||||
# This host has 8 cores but 2 of them are disabled via 'vcpu_pin_set',
|
||||
# meaning they have no thread siblings and aren't included in
|
||||
# 'siblings' below. This can break thread-aware checks, which assume a
|
||||
# sum(siblings) == sum(cpus).
|
||||
host_pin = objects.NUMACell(id=0, cpuset=set([0, 1, 2, 3, 4, 6]),
|
||||
memory=4096, memory_usage=0,
|
||||
siblings=[set([0, 1]), set([2, 3])],
|
||||
mempages=[], pinned_cpus=set([0, 1, 2, 3]))
|
||||
inst_pin = objects.InstanceNUMACell(cpuset=set([0, 1]),
|
||||
memory=2048)
|
||||
inst_pin = hw._numa_fit_instance_cell_with_pinning(host_pin, inst_pin)
|
||||
self.assertIsNone(inst_pin)
|
||||
|
||||
def test_get_pinning_require_policy_no_siblings(self):
|
||||
host_pin = objects.NUMACell(
|
||||
id=0,
|
||||
|
|
|
@ -693,6 +693,17 @@ def _pack_instance_onto_cores(available_siblings,
|
|||
for sib in available_siblings:
|
||||
for threads_no in range(1, len(sib) + 1):
|
||||
sibling_sets[threads_no].append(sib)
|
||||
|
||||
# Because we don't claim pinned CPUs in the scheduler, it is possible for
|
||||
# multiple instances to race and land on the same host. When this happens,
|
||||
# the CPUs that we thought were free may not longer be free. We should fail
|
||||
# fast in this scenario.
|
||||
if not sibling_sets:
|
||||
LOG.debug('No available siblings. This is likely due to a race '
|
||||
'caused by multiple instances attempting to claim the same '
|
||||
'resources')
|
||||
return
|
||||
|
||||
LOG.debug('Built sibling_sets: %(siblings)s', {'siblings': sibling_sets})
|
||||
|
||||
pinning = None
|
||||
|
@ -871,6 +882,7 @@ def _pack_instance_onto_cores(available_siblings,
|
|||
if (instance_cell.cpu_thread_policy !=
|
||||
fields.CPUThreadAllocationPolicy.REQUIRE and
|
||||
not pinning):
|
||||
sibling_set = sibling_sets[min(sibling_sets)]
|
||||
pinning = list(zip(sorted(instance_cell.cpuset),
|
||||
itertools.chain(*sibling_set)))
|
||||
|
||||
|
|
Loading…
Reference in New Issue