This finalizes the removal of the placement code from nova.
This change primarily removes code and makes fixes to cmd,
test and migration tooling to adapt to the removal.
Placement tests and documentation were already removed in
earlier patches.
A database migration that calls
consumer_obj.create_incomplete_consumers in nova-manage has been
removed.
A functional test which confirms the default incomplete
consumer user and project id has been changed so that its use of
conf.placement.incomplete_* (now removed) is replaced with a
constant. The placement server, running in the functional
test, provides its own config.
Placement-related configuration is updated to only register those
opts which are relevant on the nova side. This mostly means
ksa-related opts. Placement-database configuration is removed
from nova/conf/database.
tox.ini is updated to remove the group_regex required by the
placement gabbi tests. This should probably have gone when the
placement functional tests went, but was overlooked.
A release note is added which describes that this is cleanup
(the main action already happened) and points people to the
nova to placement upgrade instructions in case they haven't
done it yet.
Change-Id: I4181f39dea7eb10b84e6f5057938767b3e422aff
There are cases where ``root_provider_id`` of a resource provider is
set to NULL just after it is upgraded to the Rocky release. In such
cases getting allocation candidates raises a KeyError.
This patch fixes that bug for cases where there are no sharing or
nested providers in play.
Change-Id: I9639d852078c95de506110f24d3f35e7cf5e361e
Closes-Bug: #1799892
If there are multiple consumers having allocations against the same
resource provider, with different resource classes, the migration
attempts multiple INSERTs with the same consumer_id, which is not
allowed because of the database constraints.
This patch adds a simple GROUP BY in order to ensure that the
database server only provides us with unique values to avoid
trying to INSERT duplicate values.
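The effect of the fix can be sketched with an in-memory SQLite database (the schema here is illustrative, not the actual placement one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumers (uuid TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE allocations (consumer_id TEXT, "
             "resource_provider_id INT, resource_class_id INT)")
# One consumer with allocations of two different resource classes
# against the same provider.
conn.executemany("INSERT INTO allocations VALUES (?, ?, ?)",
                 [("c1", 1, 0), ("c1", 1, 1)])
# Without GROUP BY the SELECT yields "c1" twice and the INSERT would
# violate the primary key; grouping de-duplicates on the server side.
conn.execute("INSERT INTO consumers (uuid) "
             "SELECT consumer_id FROM allocations GROUP BY consumer_id")
rows = conn.execute("SELECT uuid FROM consumers").fetchall()
```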
Change-Id: I1acba5e65cd562472f29e354c6077f82844fa87d
Closes-Bug: #1798163
The use of the independent context on _ensure_aggregate appears to
be unnecessary. It causes file-based uses of SQLite dbs to fail
(with database locked errors, as reported in the associated bug,
1789633) and thus may mask issues with other databases. Adding the
independent context manager was the result of a series of "throw
stuff at the wall and see what sticks" patches, but it now looks
like it is not required, and in some situations causes problems.
Runs through the gate show that the behavior it was fixing (as
described in bug 1786703) is not happening.
Change-Id: I1f325d55ec256db34a4c3bbd230dcd8a91bce542
Related-Bug: #1786703
Closes-Bug: #1789633
This patch modifies the code paths for the non-granular request group
allocation candidates processing. It removes the giant multi-join SQL
query and replaces it with multiple calls to
_get_providers_with_resource(), logging the number of matched providers
for each resource class requested and filter (on required traits,
forbidden traits and aggregate membership).
Here are some examples of the debug output:
- A request for three resources with no aggregate or trait filters:
found 7 providers with available 5 VCPU
found 9 providers with available 1024 MEMORY_MB
found 5 providers after filtering by previous result
found 8 providers with available 1500 DISK_GB
found 2 providers after filtering by previous result
- The same request, but with a required trait that nobody has, shorts
out quickly:
found 0 providers after applying required traits filter (['HW_CPU_X86_AVX2'])
- A request for one resource with aggregates and forbidden (but no
required) traits:
found 2 providers after applying aggregates filter ([['3ed8fb2f-4793-46ee-a55b-fdf42cb392ca']])
found 1 providers after applying forbidden traits filter ([u'CUSTOM_TWO', u'CUSTOM_THREE'])
found 3 providers with available 4 VCPU
found 1 providers after applying initial aggregate and trait filters
Co-authored-by: Eric Fried <efried@us.ibm.com>
Closes-Bug: #1786519
Change-Id: If9ddb8a6d2f03392f3cc11136c4a0b026212b95b
Per the referenced bug, we weren't accounting for the scenario where a
reshape operation was removing *all* inventories for a provider (which
could be fairly common). With this fix, we do a three-stage lookup of
the provider object: If it's not in the inventories, we look in the
allocations; if it's not in the allocations, we look it up in the
database.
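The lookup order can be sketched like this (the helper names and the plain-dict containers are illustrative, not the actual reshape code):

```python
def lookup_provider(uuid, inventories, allocations, fetch_from_db):
    """Three-stage provider lookup: inventories first, then
    allocations, then the database."""
    if uuid in inventories:
        return inventories[uuid]
    if uuid in allocations:
        return allocations[uuid]
    # The reshape removed *all* inventories for this provider and no
    # allocation in the payload mentions it: fall back to a DB fetch.
    return fetch_from_db(uuid)

inv = {"p1": "from-inventories"}
alloc = {"p2": "from-allocations"}
hit_db = lookup_provider("p3", inv, alloc, lambda u: "from-db")
```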
Change-Id: I594bb64f87c61b7ffd39c19e0fd42c4c087a3a11
Closes-Bug: #1783130
When replacing a provider's set of aggregate associations, we were
issuing a call to:
DELETE resource_provider_aggregates WHERE resource_provider_id = $rp
and then a single call to:
INSERT INTO resource_provider_aggregates
SELECT $rp, aggs.id
FROM provider_aggregates AS aggs
WHERE aggs.uuid IN ($agg_uuids)
This patch changes the _set_aggregates() function in a few ways.
First, we grab the aggregate's internal ID value when creating new
aggregate records (or grabbing a provider's existing aggregate
associations). This eliminates the need for any join to
provider_aggregates in an INSERT/DELETE statement.
Second, instead of a multi-row INSERT .. SELECT statement, we do
single-shot INSERT ... VALUES statements, one for each added aggregate.
Third, we no longer DELETE all aggregate associations for the provider
in question. Instead, we issue single-shot DELETE statements for only
the aggregates that are being disassociated.
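The add/remove sets behind those single-shot statements amount to plain set arithmetic (a sketch of the idea, not the actual _set_aggregates() code):

```python
def aggregate_changes(existing_ids, desired_ids):
    """Compute which aggregate associations to add and which to
    remove, so each change can be a single-shot INSERT or DELETE."""
    existing = set(existing_ids)
    desired = set(desired_ids)
    # Internal aggregate IDs are already known at this point, so no
    # join against provider_aggregates is needed.
    return desired - existing, existing - desired

to_add, to_remove = aggregate_changes({1, 2, 3}, {2, 3, 4})
```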
Finally, I've added a number of log debug statements so that we can have
a little more information if this particular patch does not fix the
deadlock issue described in the associated bug.
Change-Id: I87e765305017eae1424005f7d6f419f42a2f8370
Closes-bug: #1786703
Somewhere in the past release, we started using extremely complex code
paths involving sharing providers, anchor providers, and nested resource
provider calculations when we absolutely don't need to do so.
There was a _has_provider_trees() function in the
nova/api/openstack/placement/objects/resource_provider.py file that used
to be used for top-level switching between a faster, simpler approach to
finding allocation candidates for a simple search of resources and
traits when no sharing providers and no nesting was used. That was
removed at some point and all code paths -- even for simple "get me
these amounts of these resources" when no trees or sharing providers are
present (which is the vast majority of OpenStack deployments) -- were
going through the complex tree-search-and-match queries and algorithms.
This patch changes that so that when there's a request for some
resources and there's no trees or sharing providers, we do the simple
code path. Hopefully this gets our performance for the simple, common
cases back to where we were pre-Rocky.
This change is a prerequisite for the following change which adds
debugging output to help diagnose which resource classes are running
out of inventory when GET /allocation_candidates returns 0 results.
That code is not possible without the changes here as they only
work if we can identify when a "simpler approach" is possible and
call that simpler code.
Related-Bug: #1786055
Partial-Bug: #1786519
Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
We were calling ResourceProvider.get_by_uuid() inside
_build_provider_summaries() main loop over all providers involved in the
resulting allocation candidates. This results in a query per provider
involved, which is quite obviously not going to perform well. This patch
modifies the _build_provider_summaries() function to make a single call
to a new _provider_ids_from_rp_ids() function instead of multiple calls
to ResourceProvider.get_by_uuid().
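The batching pattern looks roughly like this (illustrative SQLite schema and helper name, not the real ones):

```python
import sqlite3

def provider_ids_from_rp_ids(conn, rp_ids):
    """One IN-clause query for all providers instead of a query per
    provider inside the loop."""
    placeholders = ",".join("?" * len(rp_ids))
    sql = ("SELECT id, uuid FROM resource_providers "
           "WHERE id IN (%s)" % placeholders)
    return dict(conn.execute(sql, list(rp_ids)).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource_providers (id INT, uuid TEXT)")
conn.executemany("INSERT INTO resource_providers VALUES (?, ?)",
                 [(1, "u1"), (2, "u2"), (3, "u3")])
ids = provider_ids_from_rp_ids(conn, [1, 3])
```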
Change-Id: I0e0a44e833afece0775ec712fbdf9fcf4eae7a93
Related-bug: #1786055
We already have a fully loaded resource provider object in the loop, so
we don't need to create another one. Doing so has a very large
performance impact, especially when there are many resource providers
in the collection of summaries (which will be true in a large and
sparsely used cloud).
The code which creates the summaries used here as a data source has the
same expensive use of get_by_uuid in a loop. That will be fixed in a
separate patch.
Existing functional tests cover this code.
Change-Id: I6068db78240c33a1dcefedc0c94e76740fd8d6e2
Partial-Bug: #1786055
The `GET /resource_providers/{uuid}/allocations` API didn't
return all the allocations made by multiple users.
This was because placement wrongly used the project table
instead of the user table. This patch fixes it and adds a test case.
Change-Id: I7c808dec5de1204ced0d1f1b31d8398de8c51679
Closes-Bug: #1785382
Call ensure_rc_cache from deploy, so that we only try it once per
process.
A small number of unit tests needed an adjustment to either mock
properly or call the ensure_rc_cache in advance of their work.
Change-Id: I7499bba6ac6b463d8da46e10469121e62ee52ed1
1. Change the comments on forbidden traits.
2. We already changed the _get_provider_ids_matching result from
"provider IDs" to "tuples of (internal provider ID, root provider
ID)" in change I343a0cb19f4037ddde5c5fc96d0a053f699f5257.
3. Remove out-of-date comments about "handling nested providers"
from 039c94a6b9.
trivialfix
Change-Id: I7b23ef4c06be8963e43e0c23ac6cc149ffd2ddae
There is a typo in a comment. The comment is important for readers
to understand what the function is doing, so fix it.
Trivial-fix
Change-Id: I30f3a385565dc31651d136d80e499be181fa436e
1. Remove "GROUP by" line in _anchors_for_sharing_providers, because
this has been removed in Ib1738fb4a4664aa7b78398655fd23159a54f5f69.
2. Add reminder note when we are sure all root_provider_id values are
NOT NULL.
3. Fix the note in test_anchors_for_sharing_providers: s1 gets to r3
only via agg3.
trivialfix
Change-Id: Id8bfd83db58366047267ff0eeb2930a19bddbf4e
When getting allocation candidates with sharing providers, placement
creates a list of AllocationRequestResources to get all the
possible combinations of resource providers in the same aggregate.
However, the order of the list was arbitrary, which could cause
a bug later in the duplicate check of the combinations.
This patch ensures that the list is ordered by the resource
class id.
Note:
This bug is only exposed when it is tested with Python 3.6,
where dict objects preserve insertion order.
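The ordering idea can be sketched as follows (a simplified model, not the actual placement code):

```python
from itertools import product

def ordered_combinations(requests):
    """Group (resource_class_id, provider) entries and enumerate
    provider combinations in resource class id order, so duplicate
    checks see a deterministic ordering regardless of how the dict
    happened to be built."""
    by_rc = {}
    for rc_id, provider in requests:
        by_rc.setdefault(rc_id, []).append(provider)
    return list(product(*(by_rc[rc] for rc in sorted(by_rc))))

combos = ordered_combinations([(2, "p3"), (1, "p1"), (1, "p2")])
```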
Change-Id: I2e236fbbc3a4cfd3bd66d50198de643e06d62331
Closes-Bug: #1784577
There is a redundant join when we want to get ids from
_anchors_for_sharing_providers. The last outerjoin is used to get
rp.uuid according to rp.id; if we set get_id=True, we no longer need
this outer join.
So, we remove the redundant join in this patch.
Change-Id: Ib5fc6e4efae29dd88ce92df834700d2121ed8076
Closes-bug: #1784604
This change adds a fast retry loop around
AllocationList._set_allocations if a resource provider generation
conflict happens. It turns out that under high concurrency of allocation
claims being made on the same resource provider conflicts can be quite
common and client side retries are insufficient.
Because both consumer generation and resource provider generation
conflicts raised the same exception there was no way to distinguish
between the two, so a child of ConcurrentUpdateDetected has been
created as ResourceProviderConcurrentUpdateDetected. In the future
this will allow
us to send different error codes to the client as well, but that change
is not done here.
When the conflict is detected, all the resource providers in the
AllocationList are reloaded and the list objects refreshed.
Logging is provided to indicate:
* at debug that a retry is going to happen
* at warning that all the retries failed and the client is going to
see the conflict
The tests for this are a bit funky: some mocks are used to cause the
conflicts, then the real actions happen after a couple of iterations.
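The shape of the retry loop can be sketched like this (the loop body and helper names are illustrative; only the exception name mirrors the change):

```python
class ResourceProviderConcurrentUpdateDetected(Exception):
    pass

def set_allocations_with_retry(apply_fn, refresh_fn, attempts=3):
    """Fast server-side retry: on a provider generation conflict,
    refresh provider state and try again a few times."""
    for _ in range(attempts):
        try:
            return apply_fn()
        except ResourceProviderConcurrentUpdateDetected:
            # Debug-level: a retry is going to happen.
            refresh_fn()
    # Warning-level: retries exhausted; the client sees the conflict.
    raise ResourceProviderConcurrentUpdateDetected()

calls = {"n": 0}
def flaky_apply():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ResourceProviderConcurrentUpdateDetected()
    return "claimed"

result = set_allocations_with_retry(flaky_apply, lambda: None)
```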
Change-Id: Id614d609fc8f3ed2d2ff29a2b52143f53b3b1b9a
Closes-Bug: #1719933
Adds a nova.api.openstack.placement.objects.resource_provider.reshape()
function that accepts a dict, keyed by provider UUID, of inventory
information for a set of providers and an AllocationList object that
contains all of the rejiggered allocation records for all consumers on
the providers involved in the reshape operation.
The reshape() function is decorated with the placement API's DB
transaction context manager which will catch all exceptions and issue a
ROLLBACK of the single writer transaction that is involved in the myriad
sub-operations that happen inside reshape(). Likewise, a single COMMIT
will be executed in the writer transaction when reshape() completes
without an exception.
Change-Id: I527de486eda63b8272ffbfe42f6475907304556c
blueprint: reshape-provider-tree
Ever since we introduced support for setting multiple consumers in a
single POST /allocations, the AllocationList.delete_all() method has
been housing a latent bad assumption and bug.
The AllocationList.delete_all() method used to assume that the
AllocationList's Allocation objects were only ever for a single
consumer, and took a shortcut in deleting the allocation by deleting all
allocations with the "first" Allocation's consumer UUID:
```python
def delete_all(self):
    # Allocations can only have a single consumer, so take advantage of
    # that fact and do an efficient batch delete
    consumer_uuid = self.objects[0].consumer.uuid
    _delete_allocations_for_consumer(self._context, consumer_uuid)
    consumer_obj.delete_consumers_if_no_allocations(
        self._context, [consumer_uuid])
```
The problem with the above is that if you get all the allocations for a
single resource provider, using
AllocationList.get_all_by_resource_provider(), and there is more than
one consumer allocating resources against that provider, then calling
AllocationList.delete_all() will only delete *some* of the resource
provider's allocations, not all of them.
Luckily, the handler code has never used AllocationList.delete_all()
after calling AllocationList.get_all_by_resource_provider(), and so
we've not hit this latent bug in production.
However, in the next patch in this series (the reshaper DB work), we
*do* call AllocationList.delete_all() for allocation lists for each
provider involved in the reshape operation, which is why this fix is
important to get done correctly.
Note that this patch renames AllocationList.create_all() to
AllocationList.replace_all() to make it absolutely clear that all of
the allocations for all consumers in the list are first *deleted* by the
codebase and then re-created. We also remove the check in
AllocationList.create_all() that the Allocation objects in the list must
not have an 'id' field set. The reason for that is because in order to
properly implement AllocationList.delete_all() to call DELETE FROM
allocations WHERE id IN (<...>) we need the list of allocation record
internal IDs. These id field values are now properly set on the
Allocation objects when AllocationList.get_all_by_resource_provider()
and AllocationList.get_all_by_consumer_id() are called. This allows that
returned object to have delete_all() called on it and the DELETE
statement to work properly.
Change-Id: I12393b033054683bcc3e6f20da14e6243b4d5577
Closes-bug: #1781430
When updating a parent provider of a resource provider, placement
didn't update a root provider of another resource provider in the
same tree.
This patch fixes it to update the root provider field of all the
resource providers in the same tree.
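The intended behavior can be sketched with a small parent map (placement does this in SQL; the function here is only a model):

```python
def update_roots(parents):
    """Recompute the root for every provider by walking parent links;
    when a provider is re-parented, every provider in its subtree
    must end up with the new root."""
    roots = {}
    for node in parents:
        root = node
        while parents[root] is not None:
            root = parents[root]
        roots[node] = root
    return roots

parents = {"a": None, "b": "a", "c": "b", "x": None}
parents["b"] = "x"  # re-parent b (and implicitly its child c) under x
roots = update_roots(parents)
```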
Change-Id: Icdedc10cdd5ebfda672ca2d65a75bf0143aa769c
Closes-Bug: #1779818
We made the decision [1] to delete consumer records when those consumers
no longer had any allocations referring to them (as opposed to keeping
those consumer records around and incrementing the consumer generation
for them).
This patch adds a small check within the larger
AllocationList.create_all() and AllocationList.delete_all() DB
transactions that deletes consumer records when no allocation records
remain that reference that consumer. This patch does not, however,
attempt to clean up any "orphaned" consumer records that may have been
created in previous calls to PUT|POST /allocations that removed the last
remaining allocations for a consumer.
[1] https://goo.gl/DpAGbW
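The cleanup check can be sketched with SQLite (illustrative schema, not the real one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumers (uuid TEXT)")
conn.execute("CREATE TABLE allocations (consumer_id TEXT)")
conn.executemany("INSERT INTO consumers VALUES (?)", [("c1",), ("c2",)])
conn.execute("INSERT INTO allocations VALUES ('c1')")
# After a transaction touches consumers c1 and c2, drop any of them
# that no longer has allocation rows referring to it.
conn.execute(
    "DELETE FROM consumers WHERE uuid IN ('c1', 'c2') AND NOT EXISTS "
    "(SELECT 1 FROM allocations WHERE consumer_id = consumers.uuid)")
remaining = [r[0] for r in conn.execute("SELECT uuid FROM consumers")]
```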
Change-Id: Ic2b82146d28be64b363b0b8e2e8d180b515bc0a0
Closes-bug: #1780799
Placement had RP loop detection for RP creation but if an RP is created
without a parent (e.g. root RP) then the parent can be set later with a
PUT /resource_providers/{uuid} request by providing the UUID of the
parent. In this code path the loop detection was missing from the
validation. Moreover there are different loop cases for create than for
set. For create the only possible loop is when the RP being created
points to itself as a parent. However when the parent is provided later
in a PUT the RP being updated can have descendant RPs. Setting a parent
to a descendant also creates a loop.
This patch adds the missing check and returns HTTP 400 if a loop is
detected.
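Both loop cases reduce to one walk up the tree (a sketch of the validation, not the actual placement code):

```python
def would_create_loop(rp, new_parent, parents):
    """Return True if setting new_parent on rp would form a cycle.
    parents maps provider -> parent (None for roots)."""
    # Walk up from the proposed parent; reaching rp means new_parent
    # is rp itself or one of rp's descendants.
    node = new_parent
    while node is not None:
        if node == rp:
            return True
        node = parents.get(node)
    return False

parents = {"child": "rp", "grandchild": "child", "rp": None}
loop = would_create_loop("rp", "grandchild", parents)
```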
Closes-Bug: #1779635
Change-Id: I42c91f5f752f0a4fba8b1d95489fc3f87a1c5b6e
Add a change to _check_capacity_exceeded to also compare the amount
needed by a given allocation to a running total of amounts needed
against this class of resource on this resource provider.
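The running-total idea looks roughly like this (a sketch; the real _check_capacity_exceeded also accounts for reserved amounts and allocation ratios):

```python
def check_capacity(requests, capacity):
    """Compare each allocation amount *and* the running total per
    (provider, resource class) against capacity, rather than each
    amount in isolation."""
    used = {}
    for rp, rc, amount in requests:
        key = (rp, rc)
        used[key] = used.get(key, 0) + amount
        if used[key] > capacity[key]:
            raise ValueError("capacity exceeded for %s/%s" % key)
    return used

totals = check_capacity([("p1", "VCPU", 2), ("p1", "VCPU", 3)],
                        {("p1", "VCPU"): 8})
```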
Change-Id: Id8dde9a1f4b62112925616dfa54e77704109481c
Closes-Bug: #1778743
This patch adds new placement API microversion for handling consumer
generations.
Change-Id: I978fdea51f2d6c2572498ef80640c92ab38afe65
Co-Authored-By: Ed Leafe <ed@leafe.com>
Blueprint: add-consumer-generation
Traits sync had been tried any time a request that might involve
traits was handled. If the global was set, no syncing was done, but
lock handling was still happening.
This change moves the syncing into the deploy.load_app() handling.
This means that the syncing will be attempted any time a new WSGI
application is created. Most of the time this will be at the start of a
new process, but some WSGI servers have interesting threading models so
there's a (slim) possibility that it could be in a thread. Because of
this latter possibility, the locking is still in place.
Functional tests are updated to explicitly do the sync in their
setUp(). Some changes in fixtures are required to make sure that
the database is present prior to the sync.
While these changes are not strictly part of extracting placement, the
consolidation and isolation of database handling code makes where to put
this stuff a bit cleaner and more evident: an update_database() method
in deploy uses an empty DbContext class from db_api to call the
ensure_trait_sync method in resource_provider. update_database is in
deploy because it is an app deployment task and because putting it in
db_api leads to circular import problems.
blueprint placement-extract
Closes-Bug: #1756151
Change-Id: Ic87518948ed5bf4ab79f9819cd94714e350ce265
The placement database connection (which can use the same connection
string as the api database) needs to be managed from its own module so
that the nova db files are not imported as this eventually leads to
importing object-related and other code which the placement service does
not need.
The wsgi application is now responsible for initializing the database
configuration.
Fixtures and a number of tests are updated to reflect the new
location of the placement engine.
The original parse_args needed to import RPC and database related
modules that are not required in the placement context. A new
_parse_args method is added to wsgi.py which does the minimal required
work. Because we no longer use the central parse_args, the placement
db sync in nova/cmd/manage.py must make a manual configuration of the
context manager.
blueprint placement-extract
Change-Id: I2fff528060ec52a4a2e26a6484bdf18359b95f77
This patch adds optional `rp_ids` argument to
_provider_ids_matching_aggregates() for optimization to further
winnow results to a set of resource provider IDs in case that
we've already looked up the providers that have appropriate
inventory capacity getting allocation candidates.
Change-Id: I2a4a5c4bbaefd71ef102f25b8f35522287d2783d
When getting allocation candidates, provider_summaries currently
includes only the providers that are in allocation_requests.
This patch fixes it to include all the providers in the trees in play.
Change-Id: I108dceb13bdefc541b272ea953acc1dec2945647
Blueprint: placement-return-all-resources
If 'connection' is set in the 'placement_database' conf group use
that as the connection URL for the placement database. Otherwise if
it is None, the default, then use the entire api_database conf group
to configure a database connection.
When placement_database.connection is not None a replica of the
structure of the API database is used, using the same migrations
used for the API database.
A placement_context_manager is added and used by the OVO objects in
nova.api.openstack.placement.objects.*. If there is no separate
placement database, this is still used, but points to the API
database.
nova.test and nova.test.fixtures are adjusted to add awareness of
the placement database.
This functionality is being provided to allow deployers to choose
between establishing a new database now or requiring a migration
later. The default is migration later. A reno is added to explain
the existence of the configuration setting.
This change returns the behavior removed by the revert in commit
39fb302fd9 but done in a more
appropriate way.
Note that with the advent of the nova-status command, which checks
to see if placement is "ready", the tests here had to be adjusted.
If we do allow a separate database the code will now check the
separate database (if configured), but nothing is done with regard
to migrating from the api to placement database or checking that.
blueprint placement-extract
Change-Id: I7e1e89cd66397883453935dcf7172d977bf82e84
Implements: blueprint optional-placement-database
Co-Authored-By: Roman Podoliaka <rpodolyaka@mirantis.com>
Removes the consumer_id, project_id and user_id fields from the
Allocation object definition. These values are now found in the Consumer
object that is embedded in the Allocation object which is now
non-nullable.
Modifies the serialization in the allocation handler to output
Allocation.consumer.project.external_id and
Allocation.consumer.user.external_id when appropriate for the
microversion.
Calls the create_incomplete_consumers() method during
AllocationList.get_all_by_consumer_id() and
AllocationList.get_all_by_resource_provider() to online-migrate missing
consumer records.
Change-Id: Icae5038190ab8c7bbdb38d54ae909fcbf9048912
blueprint: add-consumer-generation