This is the first in a series of commits to add support for codespell. This continues the process completed in ironic-python-agent.
Future commits will add a tox target, CI support, and potentially a git-blame-ignore-revs file if there are lots of spelling mistakes that could clutter git blame.
Change-Id: Id328ff64c352e85b58181e9d9e35973a8706ab7a
I've seen a situation where heartbeats managed to completely saturate
the conductor workers, so that no API requests could come through that
required interaction with the conductor (i.e. everything other than
reads). Add periodic tasks for a large (thousands) number of nodes, and
you get a completely locked up Ironic.
This change reserves 5% (configurable) of the threads for API requests.
This is done by splitting one executor into two, of which the latter is
only used by normal _spawn_worker calls and only when the former is
exhausted. This allows an operator to apply a remediation, e.g. abort
some deployments or outright power off some nodes.
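The split-executor idea can be sketched roughly like this; a minimal illustration using stdlib thread pools, not Ironic's actual futurist-based implementation, and all names here are hypothetical:

```python
# Sketch: split one worker pool into a general pool and a small reserved
# pool. Normal work spills into the reserved pool only when the general
# pool is exhausted; background/periodic work never touches the reserve.
import concurrent.futures
import threading


class SplitExecutor:
    def __init__(self, total_workers=100, reserved_percentage=5):
        reserved = max(1, total_workers * reserved_percentage // 100)
        general = total_workers - reserved
        # Track free general workers so we can detect exhaustion without
        # queueing (stdlib executors otherwise queue unboundedly).
        self._general_slots = threading.BoundedSemaphore(general)
        self._general = concurrent.futures.ThreadPoolExecutor(max_workers=general)
        self._reserved = concurrent.futures.ThreadPoolExecutor(max_workers=reserved)

    def submit(self, reserved_ok, fn, *args):
        if self._general_slots.acquire(blocking=False):
            fut = self._general.submit(fn, *args)
            fut.add_done_callback(lambda f: self._general_slots.release())
            return fut, 'general'
        if reserved_ok:
            # Only normal _spawn_worker-style calls are allowed to spill
            # over, keeping headroom for operator remediation requests.
            return self._reserved.submit(fn, *args), 'reserved'
        raise RuntimeError('no free workers')
```

With a scheme like this, even when periodic tasks saturate the general pool, an API-triggered abort or power-off can still find a thread.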
Partial-Bug: #2038438
Change-Id: Iacc62d33ffccfc11694167ee2a7bc6aad82c1f2f
Sending signal ``SIGUSR2`` to a conductor process will now trigger a
drain shutdown. This is similar to a ``SIGTERM`` graceful shutdown but
the timeout is determined by ``[DEFAULT]drain_shutdown_timeout`` which
defaults to ``1800`` seconds. This is enough time for running tasks on
existing reserved nodes to either complete or reach their own failure
timeout.
During the drain period the conductor needs to be removed from the hash
ring to prevent new tasks from starting. Other conductors also need to
not fail reserved nodes on the draining conductor which would appear to
be orphaned. This is achieved by running the conductor keepalive
heartbeat for this period, but setting the ``online`` state to
``False``.
When this feature was proposed, SIGINT was suggested as the signal to
use to trigger a drain shutdown. However this is already used by
oslo_service fast exit[1] so using this for drain would be a change in
existing behaviour.
[1] https://opendev.org/openstack/oslo.service/src/branch/master/oslo_service/service.py#L340
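A minimal sketch of how such a drain handler could be wired up; the class and attribute names are hypothetical, and the real conductor integrates with oslo.service rather than handling the signal directly like this:

```python
import os
import signal
import time

DRAIN_TIMEOUT = 1800  # seconds, mirroring [DEFAULT]drain_shutdown_timeout


class DrainableConductor:
    def __init__(self):
        self.online = True
        signal.signal(signal.SIGUSR2, self._drain_handler)

    def _drain_handler(self, signum, frame):
        # Go offline in the hash ring, but keep the keepalive heartbeat
        # running so other conductors do not orphan our reserved nodes.
        self.online = False


conductor = DrainableConductor()
os.kill(os.getpid(), signal.SIGUSR2)  # equivalent of: kill -USR2 <pid>
time.sleep(0.1)  # give the interpreter a chance to run the handler
```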
Change-Id: I777898f5a14844c9ac9967168f33d55c4f97dfb9
This adds an `online` argument to the conductor touch methods so that
touch can be called with `online=False`. When called periodically this
allows the conductor `updated_at` to be within the threshold to avoid
locked nodes being failed as orphans by another conductor.
This will be used by drain shutdown (and graceful shutdown) so that
tasks can complete on existing locked nodes within the shutdown timeout,
while the conductor is also removed from the hash ring so new tasks are
not started on that conductor.
This change introduces the API but the existing behaviour won't change
until BaseConductorManager.del_host() no longer calls keepalive_halt().
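A touch that keeps `updated_at` fresh while flagging the conductor offline could look roughly like this; an illustrative sketch against a simplified schema, not Ironic's actual DB API:

```python
import sqlite3


def touch_conductor(conn, hostname, online=True):
    # Refresh updated_at so peers don't treat our locked nodes as
    # orphans, while online=False keeps us out of the hash ring.
    conn.execute(
        "UPDATE conductors SET updated_at = CURRENT_TIMESTAMP, online = ? "
        "WHERE hostname = ?", (1 if online else 0, hostname))
    conn.commit()


conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE conductors (hostname TEXT, updated_at TEXT, online INTEGER)")
conn.execute("INSERT INTO conductors VALUES ('cond1', NULL, 1)")
touch_conductor(conn, 'cond1', online=False)
row = conn.execute(
    "SELECT updated_at, online FROM conductors WHERE hostname='cond1'").fetchone()
```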
Change-Id: Iedd62193fac1009137b9ee47a6ef5a9a8576f261
Especially in a single-conductor environment, the number of threads
should be larger than max_concurrent_deploy, otherwise the latter cannot
be reached in practice or will cause issues with heartbeats.
On the other hand, this change fixes an issue with how we use futurist.
Due to a misunderstanding, we ended up setting the workers pool size to
100 and then also allowing 100 more requests to be queued.
To put it shortly, this change moves from 100 threads + 100 queued to
300 threads and no queue.
Partial-Bug: #2038438
Change-Id: I1aeeda89a8925fbbc2dae752742f0be4bc23bee0
Julia, who has been looking at Metal3 container logs a lot, is tired of
not being able to easily figure out where on the timeline of the git
repo the container's source code came from. So instead of guessing
and having to figure it out based upon behavior/logs, just log the
version the software believes itself to be, using the existing version
info, similar to how we do it with IPA.
Change-Id: I3c76ddfb89b92d3d4bc29b7ccab4362604775568
Adds storage of the json-rpc port number to the conductor hostname
to enable rpc clients to understand which rpc services they need to
connect to.
Depends-On: https://review.opendev.org/c/openstack/ironic-lib/+/879211
Change-Id: I6021152c83ab5025a9a9e6d8d24c64278c4c1053
Disables internal heartbeat mechanism when ironic has been
configured to utilize a SQLite database backend.
This is done to lessen the possibility of a
"database is locked" error, which can occur when two
distinct threads attempt to write to the database
at the same time with open writers.
The conductor keepalive heartbeat process was identified as
a major source of these write operations, as it was writing
every ten seconds by default, which would also collide with
periodic tasks.
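The gating decision can be sketched as a small helper keyed off the database connection URL; the helper name is hypothetical, not Ironic's actual code:

```python
def heartbeat_enabled(db_connection_url):
    # Hypothetical helper: skip the keepalive heartbeat entirely when the
    # backend is SQLite, which allows only a single writer at a time and
    # raises "database is locked" under concurrent writes.
    return not db_connection_url.startswith('sqlite')
```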
Change-Id: I7b6d7a78ba2910f22673ad8e72e255f321d3fdff
Instead of clearing existing reservations at the beginning of
del_host, wait for the tasks holding them to go to completion. This
check continues indefinitely until the conductor process exits due to
one of:
- All reservations for this conductor are released
- CONF.graceful_shutdown_timeout has elapsed
- The process manager (systemd, kubernetes) sends SIGKILL after the
configured graceful period
Because the default values of [DEFAULT]graceful_shutdown_timeout and
[conductor]heartbeat_timeout are the same (60s) no other conductor
will claim a node as an orphan until this conductor exits.
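The wait described above amounts to a bounded polling loop; a hedged sketch with hypothetical names, where the process manager's SIGKILL remains the backstop beyond the timeout:

```python
import time


def wait_for_reservations(get_reservation_count, timeout=60, poll=1.0):
    # Block del_host until every reservation held by this conductor is
    # released, or the graceful timeout elapses.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_reservation_count() == 0:
            return True
        time.sleep(poll)
    return False
```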
Change-Id: Ib8db915746228cd87272740825aaaea1fdf953c7
Currently when a conductor is stopped, the rpc service stops
responding to requests as soon as self.manager.del_host returns. This
means that until the hash ring is reset on the whole cluster, requests
can be sent to a service which is stopped.
This change waits for the remaining seconds to delay stopping until
CONF.hash_ring_reset_interval has elapsed. This will improve the
reliability of the cluster when scaling down or rolling out updates.
This delay only occurs when there is more than one online conductor,
to allow fast restarts on single-node ironic installs (bifrost,
metal3).
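The delay computation described above can be sketched as follows; a hypothetical helper, not the actual patch:

```python
def shutdown_delay(seconds_elapsed, hash_ring_reset_interval, online_conductors):
    # Single-conductor installs (bifrost, metal3) restart immediately;
    # otherwise wait out the remainder of the hash ring reset interval
    # so no peer routes requests to an already-stopped service.
    if online_conductors <= 1:
        return 0
    return max(0, hash_ring_reset_interval - seconds_elapsed)
```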
Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
One of the major changes in SQLAlchemy 2.0 is the removal
of autocommit support. It turns out Ironic was using this quite
aggressively without even really being aware of it.
* Moved the declarative_base to ORM, as noted in the SQLAlchemy 2.0
changes[0].
* Console testing made us aware of issues around locking where session
synchronization, when autocommit was enabled, defaulted to False. The
result of this is that two sessions could see different results on
different threads, and one could still attempt to lock based upon
prior information. Inherently, while this basically worked, it was
also sort of broken behavior. This resulted in locking being
rewritten to use the style mandated in the SQLAlchemy 2.0 migration
documentation. This ultimately matters because locking is *heavily*
relied upon in Ironic, and in unit testing with SQLite there are
no transactions, which means we can get some data inconsistency
in unit testing as well if we're reliant upon the database to
precisely and exactly return what we committed.[1]
* Begins changing the query.one()/query.all() style to use explicit
select statements as part of the new style mandated for migration
to SQLAlchemy 2.0.
* Instead of using field label strings for joined queries, use the
object format, which makes much more sense now, and is part of
the items required for eventual migration to 2.0.
* DB queries involving Traits are now loaded using SelectInLoad
as opposed to Joins. The now-deprecated ORM queries were quietly
de-duplicating rows and providing consistent sets from the
resulting joined table responses, at the cost of much higher CPU
load when processing the results on the client.
Prior performance testing has informed us this should be a minimal
overhead impact, and these queries should no longer be in
transactions with the database servers, which should offset the
shift in load pattern. The reason we cannot continue to deduplicate
locally in our code is that we carry Dict data sets which cannot
be hashed for deduplication. Most projects have handled this by
treating them as Text and then converting, but without a massive
rewrite, this seems to be the viable middle ground.
* Adds an explicit mapping for traits and tags on the Node object
to point directly to the NodeTrait and NodeTag classes. This
supersedes the prior usage of a backref to make the association.
* Splits the SQLAlchemy class model Node into Node and NodeBase, which
allows high-performance queries to skip querying for ``tags``
and ``traits``. Otherwise the aforementioned lookups would
always execute, as they are now also properties on the Node
class. This is more common of a SQLAlchemy model, but Ironic's model
has been a bit more rigid to date.
* Adds a ``start_consoles`` and ``start_allocations`` option to the
conductor ``init_host`` method. This allows unit tests to be
executed and launched with the service context, while *not* also
creating race conditions which resulted in failed tests.
* The db API ``_paginate_query`` wrapper now contains additional
logic to handle traditional ORM query responses and the newer style
of unified query responses. This is due to differences in queries
and handling, which was also part of the driver for the creation of
``NodeBase``, as SQLAlchemy will only create an object if a base
object is referenced. Also, by default, everything returned with the
unified interface in 1.4 is a tuple.
* Also modified one unit test which counted time.sleep calls, a known
pattern which can create failures that are ultimately noise.
Ultimately, I have labelled the remaining places where SQLAlchemy
deprecation/removal warnings are raised, which need to be addressed.
[0] https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
[1] https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#transaction-isolation-level-autocommit
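The Query-to-select() migration at the heart of this change can be illustrated with a toy model (illustrative table and column names, not Ironic's actual schema):

```python
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()  # 1.4+: lives in sqlalchemy.orm


class Node(Base):
    __tablename__ = 'nodes'
    id = sa.Column(sa.Integer, primary_key=True)
    uuid = sa.Column(sa.String(36))


engine = sa.create_engine('sqlite://')
Base.metadata.create_all(engine)

with orm.Session(engine) as session:
    session.add(Node(id=1, uuid='abc'))
    session.commit()

    # Legacy 1.x Query style (deprecated on the road to 2.0):
    legacy = session.query(Node).filter_by(uuid='abc').one()

    # 2.0 style: an explicit select(). With the unified interface the
    # result rows come back as tuples, so the entity must be unpacked.
    row = session.execute(sa.select(Node).where(Node.uuid == 'abc')).one()
    node = row[0]
    uuid_value = node.uuid
    same_object = node is legacy  # identity map: one instance per row
```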
Change-Id: Ie0f4b8a814eaef1e852088d12d33ce1eab408e23
This change makes it easier to configure power and management interfaces
(and thus vendor drivers) by figuring out reasonable defaults.
Story: #2009316
Task: #43717
Change-Id: I8779603e566be5a84daf6f680c0bbe2f191923d9
Host preparation file writing is already done by the __init__ method
of iPXEBoot. This change moves place_loaders_for_boot calls to
iPXEBoot and PXEBoot to be consistent, and to only write the files
when that driver is enabled.
This will mean multiple writes of the same file when subclasses of
these drivers are also enabled, but this overhead will be negligible.
Change-Id: I7e17f4d1a54cd6c5d1a4bf006a0d42db8d123a46
Adds capability to copy bootloader assets from the system OS
into the network boot folders on conductor startup.
Change-Id: Ica8f9472d0a2409cf78832166c57f2bb96677833
* Adds periodic task to purge node_history entries based upon
provided configuration.
* Adds recording of node history entries for errors in the
core conductor code.
* Also changes the rescue abort behavior to remove the notice
from being recorded as an error, as this is a likely bug in
behavior for any process or service evaluating the node
last_error field.
* Makes use of a semi-free form event_type field to help
provide some additional context into what is going on and
why. For example if deployments are repeatedly failing,
then perhaps it is a configuration issue, as opposed to
a general failure. If a conductor has no resources, then
the failure, in theory would point back to the conductor
itself.
Story: 2002980
Task: 42960
Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
Prevent each driver from coming online one at a time, so that
/driver returns nothing until all interfaces are registered.
Story: #2008423
Task: #41368
Change-Id: I6ef3e6e36b96106faf4581509d9219e5c535a6d8
A partner performing some testing recognized a case where, if a request
is sent to the Ironic conductor while it is in the process of starting,
and the request makes it in to be processed, the operation later
fails with errors such as a NodeNotLocked exception. Notably, they were
able to reproduce this by requesting the attachment or detachment of
a VIF at the same time as restarting the conductor.
In part, this condition is due to the conductor being restarted
while the conductors table still includes the conductor being
restarted, and the webserver has not possibly had a chance to
observe that the conductor is in the process of restarting, as the
hash ring is still valid.
In short: incoming RPC requests can come in during the initialization
window, and as such we should not remove locks while the conductor
could possibly already be receiving work.
As such, we've added a ``prepare_host`` method which initializes
the conductor database connection and removes the stale locks.
Under normal operating conditions, the database client is reused.
rhbz# 1847305
Change-Id: I8e759168f1dc81cdcf430f3e33be990731595ec3
If reservations are issued to a conductor with one hostname, such as
'hostname', and the process is then restarted with 'new_hostname', the
queries would not match the nodes, and the nodes would effectively
become inaccessible until the reservation is cleared.
This patch clears the reservation when stopping the
ironic-conductor service to avoid the nodes becoming inaccessible.
Ref to: https://review.opendev.org/#/c/711765/
Change-Id: Id31cd30564ff26df0bbe4976ffe3f268b0dd3d7b
Since we've dropped support for Python 2.7, it's time to look at
the bright future that Python 3.x will bring and stop forcing
compatibility with older versions.
This patch removes the six library from requirements, not
looking back.
Change-Id: Ib546f16965475c32b2f8caabd560e2c7d382ac5a
This change adds an option to publish the endpoint via mDNS on start
up and clean it up on tear down.
Story: #2005393
Task: #30383
Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda
`hash_distribution_replicas` was deprecated in the Stein cycle (12.1.0).
Story: #1680160
Task: #30033
Change-Id: Iddc59ed113fb9808f8c8564433475638491be84f
This change allows allocations that were not finished because of
conductor restarting or crashing to be finished after start up.
Change-Id: I016e08dcb59613b59ae753ef7d3bc9ac4a4a950a
Story: #2004341
Task: #29544
If a node is in an ING state such as INSPECTING or RESCUING, and
the conductor goes down, when the conductor comes back the node
gets stuck in that ING state.
The cases for (DEPLOYING, CLEANING) are already processed
as expected, but not (INSPECTING, RESCUING, UNRESCUING, VERIFYING,
ADOPTING). DELETING cannot be transitioned to a 'fail' state.
Change-Id: Ie6886ea78fac8bae81675dabf467939deb1c4460
Story: #2003147
Task: #23282
This changes the calculation for keys in the hash ring manager to be of
the form "<conductor_group>:<driver>", instead of just driver. This is
used when the RPC version pin is 1.47 or greater (1.47 was created to
handle this).
When finding an RPC topic, we use the conductor group marked on the node
as part of this calculation. However, this becomes a problem when we
don't have a node that we're looking up a topic for. In this case we
look for a conductor in any group which has the driver loaded, and use a
temporary hash ring that does not use conductor groups to find a
conductor.
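The key scheme can be sketched as a tiny helper; hypothetical code, not the actual hash ring manager:

```python
def hash_ring_key(driver, conductor_group='', use_groups=True):
    # "<conductor_group>:<driver>" when the RPC pin is 1.47 or greater;
    # a plain driver name otherwise (or for the temporary group-less
    # ring used when no node is available to supply a group).
    if use_groups:
        return '%s:%s' % (conductor_group, driver)
    return driver
```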
This also begins the API work, as the API must be aware of the new hash
ring calculation. However, exposing the conductor_group field and adding
a microversion is left for a future patch.
Story: 2001795
Task: 22641
Change-Id: Iaf71348666b683518fc6ce4769112459d98938f2
Adds the fields and bumps the objects versions. Excludes the field from
the node API for now.
Also adds the conductor_group config option, and populates the field in
the conductors table.
Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api.
Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6
Story: 2001795
Task: 22640
Task: 22642
* removes any bits related to loading classic drivers from
the drivers factory code
* removes exceptions that only happen when classic drivers
can be loaded
* removes the BaseDriver, moves the useful functionality to
the BareDriver class
* /v1/drivers/?type=classic now always returns an empty list
* removes the migration updating classic drivers to hardware
types
The documentation will be updated separately.
Change-Id: I8ee58dfade87ae2a2544c5dcc27702c069f5089d
In Ocata, python-swiftclient stopped supporting the temp URL structure
used when radosgw was set as the endpoint_type, meaning only Newton and
older versions of python-swiftclient will work. Newton is deprecated,
so remove the option.
This breaks the deprecation cycle, but since it has not been working
for so long it needs to just be dropped.
Change-Id: Ibdc93b049b7e1ae34cac9e1f599786439c46a685
Also a few related errors based on some earlier investigation
may have been pulled in along the lines of E305.
Story: #2001985
Change-Id: Ifb2d3b481202fbd8cbb472e02de0f14f4d0809fd
Currently we only collect periodic tasks from interfaces used in enabled
classic drivers. Meaning, periodics are not collected from interfaces
that are only used in hardware types. This patch corrects it.
This patch does not enable collection of periodic tasks from hardware
types, since we did not collect them from classic drivers. I don't
remember the reason for that, and we may want to fix it later.
Change-Id: Ib1963f3f67a758a6b2405387bfe7b3e30cc31ed8
Story: #2001884
Task: #14357
If a conductor dies while holding a reservation, the node can get
stuck in its current state. Currently the conductor that takes
over the node only cleans it up if it's in the DEPLOYING state.
This change applies the same logic for all nodes:
1. Reservation is cleared by the conductor that took over the node
no matter what provision state.
2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
maintenance on.
3. Target power state is cleared as well.
The reservation is cleared even for nodes in maintenance mode,
otherwise it's impossible to move them out of maintenance.
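The takeover rules above can be sketched on a plain dict standing in for a node record; illustrative code only, with hypothetical state names matching the description:

```python
CLEANING_STATES = {'cleaning', 'clean wait'}


def take_over(node):
    # 1. Clear the dead conductor's reservation regardless of provision
    #    state (even in maintenance, or the node can never leave it).
    node['reservation'] = None
    # 3. Clear any pending target power state as well.
    node['target_power_state'] = None
    # 2. Abort in-progress cleaning into CLEAN FAIL with maintenance on.
    if node['provision_state'] in CLEANING_STATES:
        node['provision_state'] = 'clean failed'
        node['maintenance'] = True
    return node
```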
Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
Closes-Bug: #1588901
When a conductor managing a node dies abruptly mid-cleaning, the
node will get stuck in the CLEANING state.
This also moves _start_service() before creating CLEANING nodes
in tests. Finally, it adds autospec to a few places where the tests
fail in a mysterious way otherwise.
Change-Id: Ia7bce4dff57569707de4fcf3002eac241a5aa85b
Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
Partial-Bug: #1651092
When the heartbeat thread of the ironic-conductor server is reporting
a heartbeat, it can be interrupted by database exceptions other than
'DBConnectionError'. So add 'Exception' in
_conductor_service_record_keepalive to catch all possible exceptions
raised from the database, to ensure the heartbeat thread does not
exit. Also log the exception information. When the database recovers
from an exception, the heartbeat thread will continue to report
heartbeats.
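One iteration of such a guarded heartbeat loop could look like this; a hypothetical sketch, not the actual _conductor_service_record_keepalive:

```python
def keepalive_iteration(touch, log):
    # Catch any database error, not just DBConnectionError, so the
    # heartbeat thread survives and resumes once the database recovers.
    try:
        touch()
        return True
    except Exception as exc:
        log('conductor keepalive failed: %s' % exc)
        return False
```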
Change-Id: I0dc3ada945275811ef7272d500823e0a57011e8f
Closes-Bug: #1696296
If the conductor is being stopped, it tries to wait for completion of
all periodic tasks which are already in the running state. If there
are many nodes assigned to the conductor this may take a long time,
and the oslo service library can kill the thread by timeout. This
patch adds code that stops iterating over nodes in periodic tasks if
the conductor is being stopped. These changes reduce the probability
of getting locked nodes after shutdown, and reduce shutdown time.
Closes-Bug: #1701495
Change-Id: If6ea48d01132817a6f47560d3f6ee1756ebfab39
Currently config drive can be stored in swift with keystone
authentication. This change allows ironic to store the config drive in
ceph radosgw and use radosgw authentication mechanism that is not
currently supported. It uses swift API compatibility for ceph radosgw.
New options:
[deploy]/configdrive_use_object_store
[deploy]/object_store_endpoint_type
Deprecations:
[conductor]/configdrive_use_swift
Replaced by: [deploy]/configdrive_use_object_store
[glance]/temp_url_endpoint_type
Replaced by: [deploy]/object_store_endpoint_type
Change-Id: I9204c718505376cfb73632b0d0f31cea00d5e4d8
Closes-Bug: #1642719
The i18n team has decided not to translate the logs because it seems
like it's not very useful.
This patch removes translation of log messages from ironic/conductor.
Change-Id: I0fabef88f2d1bc588150f02cac0f5e975965fc29
Partial-Bug: #1674374
This changes driver_factory.default_interface() so that instead
of returning None if there is no calculated default interface,
it raises exception.NoValidDefaultForInterface.
This is a follow up to 6206c47720.
Change-Id: I0c3d5d75b5a37af02f3660968cf3f2c669e52019
Partial-Bug: #1524745
This adds additional constraints to the help messages for the
enabled_*_interfaces config options. It also checks if they are
empty at conductor startup, and if any are empty, errors out
with a better error message than previously provided.
Change-Id: I97fc318ce00291d5e43b70423930981c2f5a2de0
Partial-Bug: #1524745
This causes the conductor to fail to start up if a default interface
implementation cannot be found for any dynamic driver. This avoids
problems later where building a task object to operate on a node
could fail for the same reason.
This also removes a RAID interface test that turned out to be an
invalid test, but we couldn't tell it was invalid until we had
changed the start up behavior of the conductor.
Note that this release note doesn't actually note a change between
releases, but rather is mostly for my use when I come back to combine
many of the release notes for this feature later.
Change-Id: I39d3c30a6beda2e496ff85119281fdf4de191560
Partial-Bug: #1524745
This changes the driver loading validation in the conductor
startup to check for at least one classic *or* dynamic driver.
Previously the conductor would fail to start if no classic drivers
were loaded. This allows the conductor to use only dynamic
drivers, without loading any classic drivers.
It also now checks classic driver names against dynamic driver
names, and fails to start if there is a conflict there. This
would totally break the hash ring and cause mass confusion,
so we cannot allow it.
Change-Id: Id368690697f90471d09f16eaa4925338dadebd0f
Partial-Bug: #1524745