194 Commits

Author SHA1 Message Date
Zuul ddca532f52 Merge "Fix the confusion around service_reboot/servicing_reboot" 2024-04-15 23:38:00 +00:00
Dmitry Tantsur 004e78c413
Fix the confusion around service_reboot/servicing_reboot
We ended up using two names for the same flag (and forgot it in one
place completely). To not just fix the issue but also prevent it in the
future, refactor asynchronous steps handling into a new helper module
with constants and helper functions.

I've settled on servicing_reboot as opposed to service_reboot because
that's the value we currently set (but not read), so it provides
better compatibility when backporting.

Remove excessive mocking in the Redfish unit tests.

Change-Id: I32b5f860b5d10864ce68f8d5f1dac3f76cd158d6
2024-04-12 18:09:54 +02:00
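
The helper-module idea described in the commit above can be sketched roughly as follows. This is an illustrative sketch only, not the actual module: the function names `set_async_flag` and `pop_async_flag` are hypothetical, and the only name taken from the commit message is the `servicing_reboot` flag. It shows how keeping the flag name in a single constant prevents one code path from writing `servicing_reboot` while another reads `service_reboot`.

```python
# Hypothetical helper module for asynchronous step flags (an assumption for
# illustration; only the "servicing_reboot" value comes from the commit text).
SERVICING_REBOOT = "servicing_reboot"


def set_async_flag(driver_internal_info, flag):
    """Return a copy of the info dict with the given flag set to True."""
    updated = dict(driver_internal_info)
    updated[flag] = True
    return updated


def pop_async_flag(driver_internal_info, flag):
    """Return (was_set, info_without_flag) without mutating the input."""
    updated = dict(driver_internal_info)
    was_set = bool(updated.pop(flag, False))
    return was_set, updated
```

Because both the writer and the reader go through the same constant, a misspelled flag name becomes a `NameError` at import or call time instead of a silently ignored key.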
Dmitry Tantsur 6c8673c1b4
Fix servicing clean-up
Serious issues:
- Nothing powers on nodes after servicing, so they end up active and
  powered off in the end.
- Restoring power state was done three times.

Minor issues:
- Function _tear_down_node_servicing is called twice causing a traceback.
- Furthermore, process_event('done') is also called in another place
  in deploy utils.
- Make sure nodes are never considered for fast-track when servicing, it
  prevents clean-up of virtual media devices.

Change-Id: I92fd7a0009a816e93e316e4674c7509b61a474d4
2024-04-12 10:48:57 +02:00
Zuul df9e1ba80e Merge "[codespell] Fixing Spelling Mistakes" 2024-03-14 17:13:05 +00:00
Sharpz7 949387bd80 [codespell] Fixing Spelling Mistakes
This is the first in a series of commits to add support for codespell. This is continuing the process completed in ironic-python-agent.

Future commits will add a tox target, CI support and potentially a git-blame-ignore-revs file if there are lots of spelling mistakes that could clutter git blame.

Change-Id: Id328ff64c352e85b58181e9d9e35973a8706ab7a
2024-02-12 19:58:56 +00:00
Sharpz7 41ee6aa2ff Ensure all errors are passed during cleaning
Related Bug: https://bugs.launchpad.net/ironic/+bug/1628422

This change makes sure that the caught error is passed through to node_history_record()

Change-Id: I9b78ec37f37024d04928403bbf0b85ed96906441
2024-02-10 00:25:11 +00:00
Dmitry Tantsur 607b8734e4
Cache firmware components on the transition to "manageable"
Automated cleaning is not guaranteed to be enabled, and in any case it's
too late to cache the components at that point: firmware upgrades may
happen before the transition to "available".

Change-Id: I6b74970fffcc150c167830bef195f284a8c6f197
2023-12-14 09:51:47 +01:00
Dmitry Tantsur 23745d97fe
Fix two severe errors in the firmware caching code
First, it tries to create components even if the current version is not
known and fails with a database constraint error (because the initial
version cannot be NULL). Can be reproduced with sushy-tools before
37f118237a

Second, unexpected exceptions are not handled in the caching code, so
any of them will cause the node to get stuck in cleaning forever.

On top of that, the caching code is missing a metrics decorator.

This change does not update any unit tests because none currently exist.

Change-Id: Iaa242ca6aa6138fcdaaf63b763708e2f1e559cb0
2023-12-08 18:11:03 +01:00
Dmitry Tantsur 0902912217 Generic API for attaching/detaching virtual media
This patch allows attaching or detaching a generic image as a
virtual media device after a node has been provisioned.

Closes-Bug: #2033288
Change-Id: I97b68047d769f6fb686c53e89084b5874e02b8c7
2023-11-23 09:55:09 +01:00
Iury Gregory Melo Ferreira 4eb0dbf7b5 RedfishFirmware Interface
Change-Id: I75b2433fade0c36522024c16608d61cd663b38d5
2023-09-20 13:09:38 -03:00
Julia Kreger 2366a4b86e Adds service steps
A huge list of initial work for service steps

* Adds service_step verb
* Adds service_step db/object/API field on the node object for the
  status.
* Increments the API version to 1.87 for both changes.
* Increments the RPC API version to 1.57.
* Adds initial testing to help ensure that supplied steps
  are passed through and executed.

Does not:

* Have tests for starting the agent ramdisk, although this is
  relatively boilerplate.
* Have a collection of pre-decorated steps available for immediate
  consumption.

Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994
2023-08-16 06:34:08 -07:00
Dmitry Tantsur 0370f5ac97 Migrate the inspector's /continue API
This change creates all the parts necessary for processing inspection data:

* New API /v1/continue_inspection

Depending on the API version, either behaves like the inspector's API
or (new version) adds the lookup functionality on top.

The lookup process is migrated from ironic-inspector with minor changes.
It takes MAC addresses, BMC addresses and (optionally) a node UUID and
tries to find a single node in INSPECTWAIT state that satisfies all
of these. Any failure results in HTTP 404.

To make lookup faster, the resolved BMC addresses are cached in advance.

* New RPC continue_inspection

Essentially, checks the provision state again and delegates to the
inspect interface.

* New inspect interface call continue_inspection

The base version does nothing. Since we don't yet have in-band
inspection in Ironic proper, the only actual implementation is added
to the existing "inspector" interface that works by doing a call
to ironic-inspector.

Story: #2010275
Task: #46208
Change-Id: Ia3f5bb9d1845d6b8fab30232a72b5a360a5a56d2
2023-06-07 10:57:08 +02:00
Julia Kreger 75b881bd31 Fix DB/Lock session handling issues
Prior to this fix, we have been unable to run the Metal3 CI job
with SQLAlchemy's internal autocommit setting enabled. However
that setting is deprecated and needs to be removed.

Investigating our DB queries and request patterns, we were able
to identify some queries which generally resulted in the
underlying task and lock being held longer because the output
was not actually returned, which is something we've generally
had to fix in some places previously. Doing some of these
changes did drastically reduce the number of errors encountered
with the Metal3 CI job, however it did not eliminate them
entirely.

Upon further investigation, we determined that the underlying
issue occurs when an external semi-random reader, such as Metal3
polling endpoints, is present. In that situation we could be blocked
from updating the database: to open a write lock, we need the active
readers not to be interacting with the database. With a random
reader of sorts, the only realistic option we have is to enable the
Write Ahead Log[0]. We didn't have to do this with SQLAlchemy
previously because autocommit behavior hid the complexities from us,
but in order to move to SQLAlchemy 2.0, we do need to remove
autocommit.

Additionally, adds two unit tests for get_node_with_token rpc
method, which apparently we missed or lost somewhere along the
way. Also, adds notes to two Database interactions to suggest
we look at them in the future as they may not be the most
efficient path forward.

[0]: https://www.sqlite.org/wal.html

Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912
2023-05-01 15:35:33 -07:00
Chris Krelle 510a612eed Add ability to power off nodes in clean failed
We have seen duplicate IP issues when leaving clean failed nodes
powered on. This patch allows operators to power down nodes that
enter clean failed state.

Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
2023-04-24 16:20:54 -07:00
Zuul 821ce8c319 Merge "Wipe Agent Token when cleaning timeout occurs" 2023-03-14 19:27:16 +00:00
Zuul 718d52c792 Merge "Clean out agent token even if power is already off" 2023-03-13 23:00:46 +00:00
Julia Kreger bcf6c12269 Clean out agent token even if power is already off
While investigating a very curious report, I discovered that
if somehow the power was *already* turned off to a node, say
through an incorrect BMC *or* human action, and Ironic were
to pick it up (as it does by default, because it checks before
applying the power state), then it would not wipe the token
information, preventing the agent from connecting on the next
action/attempt/operation.

We now remove the token on all calls to conductor
utilities node_power_action method when appropriate, even
if no other work is required.

Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1
2023-03-02 15:02:23 +00:00
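
The ordering problem described in the commit above can be illustrated with a minimal sketch. It is not the conductor's real code: the `Node` dataclass, the `wipe_agent_token` helper, the `agent_secret_token` key and the simplified `node_power_action` signature are all assumptions made for the example. The point is only that the token wipe has to happen before the "already in the requested state" early return.

```python
from dataclasses import dataclass, field

POWER_ON = "power on"
POWER_OFF = "power off"


@dataclass
class Node:
    """Toy stand-in for a node object (hypothetical, for illustration)."""
    power_state: str
    driver_internal_info: dict = field(default_factory=dict)


def wipe_agent_token(node):
    # The key name is an assumption; the commit only speaks of "the token".
    node.driver_internal_info.pop("agent_secret_token", None)


def node_power_action(node, new_state):
    """Apply a power state; return True if the state actually changed."""
    if new_state == POWER_OFF:
        # Wipe the token first, so that the early return below cannot skip it.
        wipe_agent_token(node)
    if node.power_state == new_state:
        return False  # nothing else to do, but the token is already gone
    node.power_state = new_state
    return True
```

With the wipe placed after the early return, a node that was powered off out of band would keep a stale token, which is the failure mode the commit message describes.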
Julia Kreger 47b5909486 Wipe Agent Token when cleaning timeout occurs
In a relatively odd turn of events, should cleaning
have started, but then timed out due to lost communications
or a hard failure of the machine, an agent token could
previously be orphaned preventing re-cleaning.

We now explicitly remove the token in this case.

Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638
2023-03-02 06:33:18 -08:00
Dmitry Tantsur 9a0fa631ca Do not move nodes to CLEAN FAILED with empty last_error
When cleaning fails, we power off the node, unless it has been running
a clean step already. This happens when aborting cleaning or on a boot
failure. This change makes sure that the power action does not wipe
the last_error field, resulting in a node with provision_state=CLEANFAIL
and last_error=None for several seconds. I've hit this in Metal3.

Also when aborting cleaning, make sure last_error is set during
the transition to CLEANFAIL, not when the clean up thread starts
running.

While here, make sure to log the current step in all cases, not only
when aborting a non-abortable step.

Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f
Story: #2010603
Task: #47476
2023-03-01 11:16:46 +01:00
Zuul 5d2283137c Merge "Make anaconda non-image deploys sane" 2022-07-14 01:28:00 +00:00
Julia Kreger e78f123ff8 Make anaconda non-image deploys sane
Ironic has a lot of logic built up around use of images for filesystems,
however several recent additions, such as the ``ramdisk`` and ``anaconda``
deployment interfaces have started to break this mold.

In working with some operators attempting to utilize the anaconda
deployment interface outside the context of full OpenStack, we discovered
some issues which needed to be made simpler to help remove the need to
route around data validation checks for things that are not required.

Standalone users also have the ability to point to a URL with anaconda,
whereas operators using OpenStack can only do so with customized kickstart
files. While this is okay, the disparity in configuration checking
was also creating additional issues.

In this, we discovered we were not really graceful with redirects,
so we're now a little more graceful with them.

Story: 2009939
Story: 2009940
Task: 44834
Task: 44833
Change-Id: I8b0a50751014c6093faa26094d9f99e173dcdd38
2022-07-11 07:41:06 -07:00
Dmitry Tantsur e09919caba Move logging out of skip_automated_cleaning
Simple boolean functions should not have logging as a side effect.
This one is also used in deploy_utils without logging.

Change-Id: Iaa398f09cec06a8417c595acac19b0b9f3f3a871
2022-07-06 17:00:11 +02:00
Zuul a4bf31de61 Merge "Auto-populate lessee for deployments" 2022-07-02 02:56:54 +00:00
Dmitry Tantsur 65583e6417 No deploy_kernel/ramdisk with the ramdisk deploy and no cleaning
Ramdisk deploys don't use IPA, no need to provide it. Cleaning may need
the agent, so only skip verification if cleaning is disabled.

Other boot interfaces may need fixing as well, I haven't checked them.

Change-Id: Ia2739311f065e19ba539fe3df7268075d6075787
2022-06-23 19:49:16 +02:00
Julia Kreger c3f397149a Auto-populate lessee for deployments
Adds a configuration option and capability to automatically
record the lessee for a deployment based upon the original
auth_token information provided in the request context.

Additional token information is now shared through the context
which is extended in the same fashion as most other projects
saving request token information to their RequestContext,
instead of triggering excess API calls in the background to
Keystone to try and figure out requestor's information.

Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23
2022-05-23 16:21:19 -07:00
Harald Jensås 4cf0147e86 Exclude current conductor from offline_conductors
In some cases the current conductor may have failed to update
the heartbeat timestamp due to a failure or resource starvation.
When this occurs the dbapi get_offline_conductors method will
include the current conductor in its return value.

In this scenario the conductor may end up forcefully removing
node reservations or allocations from itself, triggering a takeover
which fails ongoing operations.

This change adds a wrapper to exclude the current conductor.
The wrapper will log a warning to raise the issue.

Related-Bug: #1970484
Story: 2010016
Task: 45204
Change-Id: I6a8f38934b475f792433be6f0882540b82ca26c1
2022-04-28 10:28:26 +02:00
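
A wrapper of the kind described in the commit above might look roughly like this. It is a sketch under stated assumptions: the function name `offline_conductors_excluding_self` and its parameters are hypothetical, and the database call is passed in as a plain callable rather than using the real dbapi object.

```python
import logging

LOG = logging.getLogger(__name__)


def offline_conductors_excluding_self(get_offline_conductors, own_hostname):
    """Return offline conductor hostnames, never including our own.

    ``get_offline_conductors`` is any callable returning an iterable of
    hostnames (a stand-in for the dbapi method named in the commit).
    """
    offline = list(get_offline_conductors())
    if own_hostname in offline:
        # Being listed as offline ourselves points at a missed heartbeat;
        # warn about it instead of acting on our own reservations.
        LOG.warning("Conductor %s is listed as offline; its heartbeat may "
                    "not have been updated in time.", own_hostname)
        offline = [host for host in offline if host != own_hostname]
    return offline
```

For example, `offline_conductors_excluding_self(lambda: ["c1", "c2"], "c2")` returns only `["c1"]` and logs a warning about `c2`.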
Dmitry Tantsur daa7dba331 Shorten error messages in commonly used modules
* Do not mention "deploy driver", it's not a thing.
* Be careful with the pattern "Error: %s" or "Reason: %s". It is good
  for long introductory sentences, but looks poor for shorter ones and
  becomes really problematic when several instances are concatenated.

This change updates deploy_utils, agent code and conductor modules.

Change-Id: Ie1efea02b5f1a174e9ef8c5253ce9754a60b4c56
2022-02-17 19:16:52 +01:00
Dmitry Tantsur a813c769e8 Explicit parameter to distinguish partition/whole-disk images
Using kernel/ramdisk makes no sense with local boot, we need a better
way. We already have an internal image_type instance parameter, let's
make it public.

Glance support will be added in the next patch.

Change-Id: I4ce5f7a2317d952f976194d2022328f4afbb0258
2022-01-28 19:13:13 +01:00
Zuul 19cafb55e1 Merge "Allow enabling fast-track per node" 2021-12-15 16:39:28 +00:00
Dmitry Tantsur 2a6cdf4b24 Allow enabling fast-track per node
This is useful when some nodes need the "agent" power interface, while
the others can be deployed normally.

Change-Id: Ief7df40c83ef03d0ec5ae92d09ceffd39d3c12a3
2021-12-08 14:26:51 +01:00
Steve Baker d5eb6ee567 Refactor driver_internal_info updates to methods
Making updates to driver_internal_info can result in hard to read code
due to the requirement to assign the whole driver_internal_info back to
the node to trigger the expected update operation. This change
replaces driver_internal_info update operations with new
methods:
- set_driver_internal_info
- del_driver_internal_info
- timestamp_driver_internal_info

This change defines the functions and moves core conductor logic to
use them. Subsequent changes in this series will move drivers to use
the new functions.

Change-Id: Ib8917c3c674e77cd3aba6a1e73c65162e3ee1141
2021-12-03 14:49:33 +13:00
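
The motivation for the three methods named in the commit above can be shown with a toy model. This is a sketch, not the real node object: the `Node` class, its `changed` set and the exact method signatures are assumptions. It models an object that only notices a field change on assignment, which is why in-place dictionary edits go unrecorded and why the helpers reassign the whole dictionary internally.

```python
import datetime


class Node:
    """Toy object that tracks field changes only on assignment (assumption)."""

    def __init__(self):
        self._driver_internal_info = {}
        self.changed = set()

    @property
    def driver_internal_info(self):
        return self._driver_internal_info

    @driver_internal_info.setter
    def driver_internal_info(self, value):
        self._driver_internal_info = dict(value)
        self.changed.add("driver_internal_info")

    def set_driver_internal_info(self, key, value):
        info = dict(self._driver_internal_info)
        info[key] = value
        self.driver_internal_info = info  # reassign to record the change

    def del_driver_internal_info(self, key):
        info = dict(self._driver_internal_info)
        removed = info.pop(key, None)
        self.driver_internal_info = info
        return removed

    def timestamp_driver_internal_info(self, key):
        now = datetime.datetime.now(datetime.timezone.utc)
        self.set_driver_internal_info(key, now.isoformat())
```

In this model, `node.driver_internal_info["x"] = 1` leaves `node.changed` empty, whereas `node.set_driver_internal_info("x", 1)` records the field as changed in a single readable call.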
Zuul ef5c1a3a44 Merge "Demote three warning messages" 2021-10-08 11:10:40 +00:00
Dmitry Tantsur dec673784b Demote three warning messages
These 3 messages do not convey a lot of useful information to the
operators and definitely do not represent a potential issue that
warrants a warning.

Change-Id: I77f5802125f79c945eb05a278f7ce53696df830a
2021-10-06 10:53:41 +02:00
Jacob Anders b385d9ae5b Add support for verify steps
This change adds support for verify steps in Ironic. Verify steps
allow executing actions on transition from "verifying" to "manageable"
state and can perform actions such as cleaning BMC job queue or
resetting the BMC on supported platforms. Verify steps are similar
to deploy and clean steps, just simpler.

Story: 2009025
Task: 42751
Change-Id: Iee27199a0315b8609e629bac272998c28274802b
2021-09-30 20:46:17 +10:00
Julia Kreger d17749249c Record node history and manage events in db
* Adds periodic task to purge node_history entries based upon
  provided configuration.
* Adds recording of node history entries for errors in the
  core conductor code.
* Also changes the rescue abort behavior to remove the notice
  from being recorded as an error, as this is a likely bug in
  behavior for any process or service evaluating the node
  last_error field.
* Makes use of a semi-free form event_type field to help
  provide some additional context into what is going on and
  why. For example if deployments are repeatedly failing,
  then perhaps it is a configuration issue, as opposed to
  a general failure. If a conductor has no resources, then
  the failure, in theory would point back to the conductor
  itself.

Story: 2002980
Task: 42960

Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
2021-09-10 14:47:27 -07:00
Cenne bc95c92f7c Add api endpoints for changing boot_mode and secure_boot state
Done:
  - Node API endpoints expose
  - RPC methods
  - Conductor Manager methods
  - Conductor utils new methods
  - RBAC new policies
  - Node API tests
  - Manager Tests (+ some testing for utils methods)
  - RBAC tests
  - Docs (api-ref)
  - REST API version history
  - Releasenotes

Story: 2008567
Task: 41709

Change-Id: I2d72389edf546b99c536c6b130ca85ababf80591
2021-08-23 19:38:58 +02:00
Cenne b03ff30f93 Fixes missing argument for log format string
Story: 2008567
Change-Id: Id5bcfad5cd4514dd710232d75fbd729856f16b17
2021-07-27 11:49:19 +02:00
Cenne 46ff51487a Add `boot_mode` and `secure_boot` to node object and expose in api
* add fields to Node object
  * expose them at endpoint `/v1/nodes/{node_ident}/states`
  * update states on powersync / entering managed state.
  * tests
  * update api endpoint info in api-ref

Story: 2008567
Task: 41709

Change-Id: Iddd1421a6fa37d69da56658a2fefa5bc8cfd15e4
2021-07-08 15:04:15 +02:00
Bob Fournier e15440370c Include bios registry fields in bios API
Provide the fields in the BIOS setting API -
``/v1/nodes/{node}/bios/{setting}``, and in the BIOS setting list API
when details are requested - ``/v1/nodes/<node>/bios?detail=True``.

Story: #2008571
Task: #42483
Change-Id: Ie86ec57e428e2bb2efd099a839105e51a94824ab
2021-05-27 12:15:20 -04:00
Dmitry Tantsur 172d1b22df Delay rendering configdrive
When the configdrive input is JSON (meta_data, etc), delay the rendering
until the ISO image is actually used. It has two benefits:
1) Avoid storing a large ISO image in instance_info,
2) Allow deploy steps to access the original user's input.

Fix configdrive masking to correctly mask dicts.

Story: #2008875
Task: #42419
Change-Id: I86d30bbb505b8c794bfa6412606f4516f8885aa9
2021-05-19 15:17:49 +02:00
Dmitry Tantsur c6e8281f85 Wipe agent tokens on inspection start and abort
Also make sure the pregenerated flag is always reset.

Change-Id: I73aaa803d3eb84ddac59a778e998836a645217eb
2021-04-08 13:42:25 +02:00
Dmitry Tantsur 30a85bd0ce API to force manual cleaning without booting IPA
Adds a new argument disable_ramdisk to the manual cleaning API.
Only steps that are marked with requires_ramdisk=False can be
run in this mode. Cleaning prepare/tear down is not done.

Some steps (like redfish BIOS) currently require IPA to detect
a successful reboot. They are not marked with requires_ramdisk
just yet.

Change-Id: Icacac871603bd48536188813647bc669c574de2a
Story: #2008491
Task: #41540
2021-03-16 16:08:46 +01:00
Riccardo Pittau d5b5356d60 [trivial] fix typos in conductor
Change-Id: Ib431c3507cb4bdbd9ba30b58e30b078e855e7754
2021-02-23 17:54:53 +01:00
Zuul 766d8f11b4 Merge "Add 'deploy steps' parameter for provisioning API" 2021-02-12 16:01:33 +00:00
Zuul af29f398cc Merge "Don't mark an agent as alive if rebooted" 2021-02-08 09:24:47 +00:00
Derek Higgins 4287951d71 Don't mark an agent as alive if rebooted
If 'agent_url' has been cleared from internal_info
it indicates that the node has been powered off.

Change-Id: Idba486c98e1e92d35fca2e2d156866566acb9e40
Story: 2008583
Task: 41736
2021-02-04 13:01:50 +00:00
Aija Jauntēva 3138acc836 Add 'deploy steps' parameter for provisioning API
Story: 2008043
Task: 40705
Change-Id: I3dc2d42b3edd2a9530595e752895e9d113f76ea8
2021-02-03 11:47:53 -05:00
Dmitry Tantsur b8a2dcaf86 Trivial: log the newly detected vendor
Change-Id: Ib751316a98d7a1c4469b405117c8e1fd1f296757
2021-02-03 17:31:54 +01:00
Dmitry Tantsur a5f7d75ba2 Apply force_persistent_boot_device to all boot interfaces
For some (likely historical) reasons we only use it for PXE and iPXE,
but the same logic applies to any boot interface (since it depends
on how the management interface and the BMC work, not on the boot
method). This change moves its handling to conductor utils.

Change-Id: I948beb4053034d3c1b4c5b7c64100e41f6022739
2021-02-01 13:37:20 +01:00
Dmitry Tantsur 121b3348c8 Refactor vendor detection and add Redfish implementation
Get rid of the TODO in the code and prepare for more management
interfaces supporting detect_vendor(). Vendor detecting now runs
during transition to manageable and on power state sync (essentially
same as before but for all drivers not only IPMI).

Update the IPMI implementation to no longer hide exceptions since
they're not handled on the upper level. Simplify the regex and fix
the docstring.

Add the Redfish implementation as a foundation for future
vendor-specific changes.

Change-Id: Ie521cf2295613dde5842cbf9a053540a40be4b9c
2021-01-28 16:41:45 +01:00