Commit Graph

109 Commits

Author SHA1 Message Date
Zuul ddca532f52 Merge "Fix the confusion around service_reboot/servicing_reboot" 2024-04-15 23:38:00 +00:00
Dmitry Tantsur 004e78c413
Fix the confusion around service_reboot/servicing_reboot
We ended up using two names for the same flag (and forgot it in one
place completely). To not just fix the issue but also prevent it in the
future, refactor asynchronous steps handling into a new helper module
with constants and helper functions.

I've settled on servicing_reboot as opposed to service_reboot because
that's the value we currently set (but not read), so it provides
better compatibility when backporting.

Remove excessive mocking in the Redfish unit tests.

Change-Id: I32b5f860b5d10864ce68f8d5f1dac3f76cd158d6
2024-04-12 18:09:54 +02:00
Dmitry Tantsur 6c8673c1b4
Fix servicing clean-up
Serious issues:
- Nothing powers on nodes after servicing, so they end up active and
  powered off in the end.
- Restoring power state was done three times.

Minor issues:
- Function _tear_down_node_servicing is called twice causing a traceback.
- Furthermore, process_event('done') is also called in another place
  in deploy utils.
- Make sure nodes are never considered for fast-track when servicing, it
  prevents clean-up of virtual media devices.

Change-Id: I92fd7a0009a816e93e316e4674c7509b61a474d4
2024-04-12 10:48:57 +02:00
Julia Kreger 2366a4b86e Adds service steps
A huge list of initial work for service steps

* Adds service_step verb
* Adds service_step db/object/API field on the node object for the
  status.
* Increments the API version to 1.87 for both changes.
* Increments the RPC API version to 1.57.
* Adds initial testing to facilitate ensurance that supplied steps
  are passed through and executed upon.

Does not:

* Have tests for starting the agent ramdisk, although this is
  relatively boiler plate.
* Have a collection of pre-decorated steps available for immediate
  consumption.

Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994
2023-08-16 06:34:08 -07:00
Stephen Finucane 7083545731 tests: Replace invalid UUIDs
Fix the warnings that oslo.versionedobjects has been emitting for years
now.

Change-Id: I53bd78d8b70f276d2ea8569f0ab1e7ce04f52fea
Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
2023-04-19 10:44:27 +01:00
Zuul 821ce8c319 Merge "Wipe Agent Token when cleaning timeout occcurs" 2023-03-14 19:27:16 +00:00
Zuul 718d52c792 Merge "Clean out agent token even if power is already off" 2023-03-13 23:00:46 +00:00
Julia Kreger bcf6c12269 Clean out agent token even if power is already off
While investigating a very curious report, I discovered that
if somehow the power was *already* turned off to a node, say
through an incorrect BMC *or* human action, and Ironic were
to pick it up (as it does by default, because it checks before
applying the power state, then it would not wipe the token
information, preventing the agent from connecting on the next
action/attempt/operation.

We now remove the token on all calls to conductor
utilities node_power_action method when appropriate, even
if no other work is required.

Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1
2023-03-02 15:02:23 +00:00
Julia Kreger 47b5909486 Wipe Agent Token when cleaning timeout occcurs
In a relatively odd turn of events, should cleaning
have started, but then timed out due to lost communications
or a hard failure of the machine, an agent token could
previously be orphaned preventing re-cleaning.

We now explicitly remove the token in this case.

Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638
2023-03-02 06:33:18 -08:00
Dmitry Tantsur 9a0fa631ca Do not move nodes to CLEAN FAILED with empty last_error
When cleaning fails, we power off the node, unless it has been running
a clean step already. This happens when aborting cleaning or on a boot
failure. This change makes sure that the power action does not wipe
the last_error field, resulting in a node with provision_state=CLEANFAIL
and last_error=None for several seconds. I've hit this in Metal3.

Also when aborting cleaning, make sure last_error is set during
the transition to CLEANFAIL, not when the clean up thread starts
running.

While here, make sure to log the current step in all cases, not only
when aborting a non-abortable step.

Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f
Story: #2010603
Task: #47476
2023-03-01 11:16:46 +01:00
Julia Kreger c3f397149a Auto-populate lessee for deployments
Adds a configuration option and capability to automatically
record the lessee for a deployment based upon the original
auth_token information provided in the request context.

Additional token information is now shared through the context
which is extended in the same fashion as most other projects
saving request token information to their RequestContext,
instead of triggering excess API calls in the background to
Keystone to try and figure out requestor's information.

Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23
2022-05-23 16:21:19 -07:00
Harald Jensås 4cf0147e86 Exclude current conductor from offline_conductors
In some cases the current conductor may have failed to updated
the heartbeat timestamp due to failure of resource starvation.
When this occurs the dbapi get_offline_conductors method will
include the current conductor in its return value.

In this scenario the conductor may end up forcefully remove
node reservations or allocations from itself, triggering takeover
which fail on-going operations.

This change adds a wrapper to exclude the current conductor.
The wrapper will log a warning to raise the issue.

Related-Bug: #1970484
Stroy: 2010016
Task: 45204
Change-Id: I6a8f38934b475f792433be6f0882540b82ca26c1
2022-04-28 10:28:26 +02:00
Zuul 19cafb55e1 Merge "Allow enabling fast-track per node" 2021-12-15 16:39:28 +00:00
Dmitry Tantsur 2a6cdf4b24 Allow enabling fast-track per node
This is useful when some nodes need the "agent" power interface, while
the others can be deployed normally.

Change-Id: Ief7df40c83ef03d0ec5ae92d09ceffd39d3c12a3
2021-12-08 14:26:51 +01:00
Steve Baker d5eb6ee567 Refactor driver_internal_info updates to methods
Making updates to driver_internal_info can result in hard to read code
due the requirement to assign the whole driver_internal_info back to
the node to trigger the expected update operation. This change
replaces driver_internal_info update operations with a new
methods:
- set_driver_internal_info
- del_driver_internal_info
- timestamp_driver_internal_info

This change defines the functions and moves core conductor logic to
use them. Subsequent changes in this series will move drivers to use
the new functions.

Change-Id: Ib8917c3c674e77cd3aba6a1e73c65162e3ee1141
2021-12-03 14:49:33 +13:00
Dmitry Tantsur dec673784b Demote three warning messages
These 3 messages do not convey a lot of useful information to the
operators and definitely do not represent a potential issue that
warrants a warning.

Change-Id: I77f5802125f79c945eb05a278f7ce53696df830a
2021-10-06 10:53:41 +02:00
Julia Kreger d17749249c Record node history and manage events in db
* Adds periodic task to purge node_history entries based upon
  provided configuration.
* Adds recording of node history entries for errors in the
  core conductor code.
* Also changes the rescue abort behavior to remove the notice
  from being recorded as an error, as this is a likely bug in
  behavior for any process or service evaluating the node
  last_error field.
* Makes use of a semi-free form event_type field to help
  provide some additional context into what is going on and
  why. For example if deployments are repeatedly failing,
  then perhaps it is a configuration issue, as opposed to
  a general failure. If a conductor has no resources, then
  the failure, in theory would point back to the conductor
  itself.

Story: 2002980
Task: 42960

Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
2021-09-10 14:47:27 -07:00
Riccardo Pittau 7aec74d4e6 Increase version of hacking and pycodestyle
Fix errors in unit tests

Change-Id: Ib5a75fc5e5b6b5661d36e3a27796c3c02ed90056
2021-07-28 11:33:15 +02:00
Cenne 46ff51487a Add `boot_mode` and `secure_boot` to node object and expose in api
* add fields to Node object
  * expose them at endpoint `/v1/nodes/{node_ident}/states`
  * update states on powersync / entering managed state.
  * tests
  * update api endpoint info in api-ref

Story: 2008567
Task: 41709

Change-Id: Iddd1421a6fa37d69da56658a2fefa5bc8cfd15e4
2021-07-08 15:04:15 +02:00
Dmitry Tantsur 172d1b22df Delay rendering configdrive
When the configdrive input is JSON (meta_data, etc), delay the rendering
until the ISO image is actually used. It has two benefits:
1) Avoid storing a large ISO image in instance_info,
2) Allow deploy steps to access the original user's input.

Fix configdrive masking to correctly mask dicts.

Story: #2008875
Task: #42419
Change-Id: I86d30bbb505b8c794bfa6412606f4516f8885aa9
2021-05-19 15:17:49 +02:00
Zuul af29f398cc Merge "Don't mark an agent as alive if rebooted" 2021-02-08 09:24:47 +00:00
Derek Higgins 4287951d71 Don't mark an agent as alive if rebooted
If 'agent_url' has been cleared from internal_info
it indicates that the node has been powered off.

Change-Id: Idba486c98e1e92d35fca2e2d156866566acb9e40
Story: 2008583
Task: 41736
2021-02-04 13:01:50 +00:00
Dmitry Tantsur a5f7d75ba2 Apply force_persistent_boot_device to all boot interfaces
For some (likely historical) reasons we only use it for PXE and iPXE,
but the same logic applies to any boot interface (since it depends
on how the management interface and the BMC work, not on the boot
method). This change moves its handling to conductor utils.

Change-Id: I948beb4053034d3c1b4c5b7c64100e41f6022739
2021-02-01 13:37:20 +01:00
Dmitry Tantsur 121b3348c8 Refactor vendor detection and add Redfish implementation
Get rid of the TODO in the code and prepare for more management
interfaces supporting detect_vendor(). Vendor detecting now runs
during transition to manageable and on power state sync (essentially
same as before but for all drivers not only IPMI).

Update the IPMI implementation to no longer hide exceptions since
they're not handled on the upper level. Simplify the regex and fix
the docstring.

Add the Redfish implementation as a foundation for future
vendor-specific changes.

Change-Id: Ie521cf2295613dde5842cbf9a053540a40be4b9c
2021-01-28 16:41:45 +01:00
Aija Jauntēva 870181f3ae Update `cleaning_error_handler`
Update `cleaning_error_handler` to match with
`deploying_error_handler` that logs all errors and optionally
separates between logged message and `last_error`.

Logged message usually contains node's uuid as there is no
context for node in stream of log entries. `last_error`
usually does not contain node's uuid as it is already
displayed in the context of node.

Impact:
* There were messages that were only added to node's last_error.
Now they are going to be logged too.
* No need to log explicitly before `cleaning_error_handler`. Such
occurrences have been removed.
* Where there were different message for log and last_error it
is kept. Where there was only 1 message, it is left as it is to
be both logged and updated in `last_error`.
* Exception logging is replaced with error logging with traceback.

Story: 2008307
Task: 41198
Change-Id: I813228fb47a51ee6c45b420322acabdf565ff752
2020-11-13 05:17:40 -05:00
Dmitry Tantsur e39858dd8c Wiping agent tokens on reboot via API - take 2
Because of using an incorrect variable, reboot was treated as power on,
and the token was not wiped.

Change-Id: I656450c2bedc3dc0d20a70de78cc29bf64d5fe85
Story: #2008097
Task: #40799
2020-10-05 17:36:45 +02:00
Julia Kreger 68cea19d99 Minor agent version code cleanup
Internal to the conductor, we have long determiend that if no
agent version is submitted by the client, that the client is a
version 3.0.0 agent.

With agent token, we're effectively requiring 6.1.0 or later
to be leveraged, so all agents should be sending versions.

An agent not sending versions is thus unsupported and would no
longer work without agent token support.

Redudnant code has thus been removed. Additionally the conductor
utility is_agent_token_supported has been removed since it is now
redundant.

Change-Id: Id6c8d1df08c3ac7af61ed7d05d274f3099003582
2020-09-29 10:02:10 -07:00
Dmitry Tantsur bc628ac6ef Route conductor notification RPC to the same conductor
RPC continue_node_{deploy,clean} are called from a conductor handling
the node, so they don't have to go through the hash ring. This avoids
situation when take over happens in the middle of a deploy/clean step
processing, breaking it.

Eventually, we should stop using RPC for that at all, but that will be
a much more invasive change.

Story: #2008200
Task: #40984
Change-Id: I76293f8ec30d5957b99bdbce5b70e87e8378d135
2020-09-25 17:34:58 +02:00
Zuul 35882c438d Merge "Also wipe agent token on manual power off or reboot" 2020-09-17 08:10:38 +00:00
Dmitry Tantsur bc04a42a96 Also wipe agent token on manual power off or reboot
We have a check in the code that is never true for manual power
actions because of what happens in the conductor manager. Remove it.

Change-Id: I50b7b78a41188c41e4944894851f1d12684f824a
2020-09-14 16:09:54 +02:00
Dmitry Tantsur 2b676a6864 Accept and use a TLS certificate from the agent
Accepts the certificate from a heartbeat and stores its path in
driver_internal_info for further usage by the agent client (or
any 3rd party deploy implementations).

Similarly to agent_url, the certificate is protected from further
changes (unless the local copy does not exist) and is removed
on reboot or tear down (unless fast-tracking).

Change-Id: I81b326116e62cd86ad22b533f55d061e5ed53e96
Story: #2007214
Task: #40603
2020-09-09 17:27:30 +02:00
Dmitry Tantsur 7dd611dc5e Ensure in-band deploy steps are present in time for fast-track deployments
Currently we load in-band deploy steps on the first heartbeat in DEPLOYWAIT.
With fast-track, however, this heartbeat may happen way too late, because
the node is up and heartbeating before it's moved to DEPLOYWAIT. This
results in in-band deploy steps not loaded. This change forces a refresh
of deploy steps in deploy.deploy if fast-track is used.

Also make sure that cached steps are cleared on reboot or power off since
they may become outdated next time the agent boots.

Change-Id: I003543452717183b9b3359e368757bcd824a5ce7
2020-08-21 14:01:57 +02:00
Zuul c9a0bce01b Merge "Follow-up on blocking port deletions" 2020-07-18 04:22:06 +00:00
Julia Kreger ba0dc574bc Follow-up on blocking port deletions
A recent comment on https://review.opendev.org/#/c/665835
pointed out that we should likely make some changes and a fix
a missing check for the introspection_vif_port_id which was
likely introduced after this functionality was originally
written.

Also adds some documentation on the subject since we lack
docs even pointing out how to delete a port. :\

Change-Id: I0ba8a3741eefa80eb56e25a1b339f8433b3fc0dc
2020-07-16 12:47:07 -07:00
Dmitry Tantsur 2d4d375358 Wipe agent token during reboot or power off
Otherwise a reboot during fast-track will leave the newly booted
agent without an ability to request a token.

Change-Id: I963276efae5599bfed6cbb4df18e8dd3bd1b9839
2020-07-15 10:54:31 +02:00
Riccardo Pittau a58ca1f7a6 Enforce autospec in test_utils
And remove the corresponding H210 filter.

Change-Id: Iaec44cc66ba654e478381260044776fd838b7527
2020-06-24 09:06:52 +02:00
Zuul a2ad31ddef Merge "Fix agent token and URL handling during fast-track deployment" 2020-06-19 13:50:27 +00:00
Zuul 5822b2fb7d Merge "Block port deletions where vif is present" 2020-06-18 11:50:29 +00:00
Dmitry Tantsur 5909163924 Fix agent token and URL handling during fast-track deployment
We wipe these fields on some conditions, most notable - on starting
the deployment. Make the removal of these fields to always go through
the helpers in conductor/utils (and remove an unused one).

Change-Id: Idb952588bb8a6d5131764f29c6225762ba5d55cc
2020-06-16 18:06:50 +02:00
Julia Kreger 64674bf0ed Block port deletions where vif is present
We must not allow users to be able to delete
ports when a VIF record is present. If they are
able to, then the user risks orphaning ports in
neutron which could result in address depletion
prevent the port from being able to be used again.

Co-Authored-By: Madhuri Kumari <madhuri.kumari@intel.com>
Story: 2006710
Task: 37074
Change-Id: I9f8eaf94a375984ea2f567f0a87c2ed1ed5cb3b7
2020-06-02 14:13:24 -07:00
Iury Gregory Melo Ferreira d6e7552457 Switch to unittest mock
Python3 have a standard library for mock in the unittest module,
let's drop the mock requirement and switch tests to unittest mock.

Change-Id: I4f1b3e25c8adbc24cdda51c73da3b66967f7ef23
2020-04-30 19:04:17 +02:00
Zuul d927dd0d0a Merge "Generalize clean step functions to support deploy steps" 2020-03-28 20:51:49 +00:00
Zuul 9cfdd6a0d9 Merge "Hash the rescue_password" 2020-03-26 17:14:13 +00:00
Dmitry Tantsur 16bce7f18d Generalize clean step functions to support deploy steps
This change updates various agent cleaning functions to also
support deploy steps via a new step_type argument. It does not
yet make it possible to run in-band deploy steps.

The agent's get_clean_steps call has been modified to not fail
if there are no cached steps. This is going to be normal for
both cleaning and deploy in the future, and it is impossible
to hit now because of the way cleaning is started.

Change-Id: I789b130e7e490e23924338a024397973957272ac
Story: #2006963
2020-03-25 13:04:11 +01:00
Julia Kreger fcaefdbe74 Hash the rescue_password
In order to provide increased security, it is necessary
to hash the rescue password in advance of it being stored
into the database and to provide some sort of control for
hash strength.

This change IS incompatible with prior IPA versions with
regard to use of the rescue feature, but I fully expect
we will backport the change to IPA on to stable branches
and perform a release as it is a security improvement.

Change-Id: I1e118467a536229de6f7c245c1c48f0af38dcef2
Story: 2006777
Task: 27301
2020-03-24 20:11:43 +00:00
Iury Gregory Melo Ferreira 1425adbb05 Do not use random to generate token
To avoid problems with FIPS 140-2 let's generate
the token using the secrets module instead of random.

Change-Id: I90c3c94112d093e2309414b9902f58d31d925ad3
Story: 2007444
Task: 39104
2020-03-20 20:29:58 +01:00
Julia Kreger bb3b2349f9 Pre-shared agent token
In order to improve security of the lookup/heartbeat
endpoints, we need to generate and provide temporary tokens
to the initial callers, if supported, to facilitate the
verification of commands.

This is the first patch in an entire series which utimately
enables the endpoint communication to be better secured.

The idea behind this started in private story 2006634 which
is locked as a security related filing covering multiple
aspects of ironic/ironic-python-agent interaction centered
around miss-use and generally exposed endpoints. That story
will remain marked as a private bug because it has several
different items covered, some of which did not prove to be
actually exploitable, but spawned stories 2006777, 2006773,
2007025, and is ultimately similar to Story 1526748.

Operationally this is a minimally invasive security
enhancement to lay the foundation to harden interactions
with the agent. This will take place over a series of
patches to both Ironic and the Ironic-Python-Agent.

Also see "Security of /heartbeat and /lookup endpoints"
in http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010789.html

Story: 2007025
Task: 37818

Change-Id: I0118007cac3d6548e9d41c5e615a819150b6ef1a
2020-02-20 14:33:32 -08:00
Dmitry Tantsur d949318409 Split deployment-related functions from manager.py into a new module
With more than 4000 lines of code (more than 8000 of unit tests) the
current manager.py is barely manageable. In preparation for the new
deployment API, this change moves the free-standing deploy-related
functions to utils.py and the new deployments.py.

Change-Id: Ic3f369a7fa72d09263b0670ae78980913b024962
Story: #2006910
2020-02-06 13:03:57 +01:00
Julia Kreger 931c125982 Block ability update callback_url
A malicious user with:

* API access normally reserved for the provisioning,
  cleaning, rescue networks.
* Insight about a node, such as a MAC address, or baremetal node
  UUID.
* Insight into the state of the node, such as the access provided
  to Compute API users, or other Bare Metal API users.

Can submit an erroneous ``heartbeat`` to the ironic-api endpoint
with a ``callback_url`` that is not of the actual intended agent.
This can potentially cause a rescue, cleaning, or deployment
operation to be derailed, or at worst commands to be sent to
to an endpoint the malicious user controls.

Story: 2006773
Task: 37295
Change-Id: I1a5e3c2b34d45c06fb74e82d0f30735ce9041914
2019-11-15 22:09:08 +00:00
Mark Goddard 3e51982252 Don't resume deployment or cleaning on heartbeat when polling
Some drivers use a periodic task to poll for completion of a deploy
or clean step. The iDRAC RAID driver is one example of this. In
https://review.opendev.org/#/c/676152, the agent heartbeat handler was
modified to resume deployment if not currently in the core deploy step.
This makes sense for the ilo driver, which does not poll for completion
of RAID configuration, but the iDRAC driver polls the lifecycle
controller's job queue, and expects to be able to resume deployment once
the job is complete. However, there is now a race between the agent
heartbeat as the node boots up, and the job queue poller.

This change adds new flags, cleaning_polling and deployment_polling,
which can be used by a driver to signal that they are polling for
completion of a deploy step, and that the agent heartbeat should not be
used for this purpose.

We also add here some more cleanup of the cleaning and deployment step
metadata in driver_internal_info, since if these fields are left in
place they may affect subsequent cleaning or deployment steps.

Change-Id: I34591440ab993a80a0cc88be6e10e33f1ae4a660
Story: 2003817
Task: 36563
2019-09-21 12:52:28 +01:00