ironic

Commit Graph

Author	SHA1	Message	Date
Zuul	ddca532f52	Merge "Fix the confusion around service_reboot/servicing_reboot"	2024-04-15 23:38:00 +00:00
Dmitry Tantsur	004e78c413	Fix the confusion around service_reboot/servicing_reboot We ended up using two names for the same flag (and forgot it in one place completely). To not just fix the issue but also prevent it in the future, refactor asynchronous steps handling into a new helper module with constants and helper functions. I've settled on servicing_reboot as opposed to service_reboot because that's the value we currently set (but not read), so it provides better compatibility when backporting. Remove excessive mocking in the Redfish unit tests. Change-Id: I32b5f860b5d10864ce68f8d5f1dac3f76cd158d6	2024-04-12 18:09:54 +02:00
Dmitry Tantsur	6c8673c1b4	Fix servicing clean-up Serious issues: - Nothing powers on nodes after servicing, so they end up active and powered off in the end. - Restoring power state was done three times. Minor issues: - Function _tear_down_node_servicing is called twice causing a traceback. - Furthermore, process_event('done') is also called in another place in deploy utils. - Make sure nodes are never considered for fast-track when servicing, it prevents clean-up of virtual media devices. Change-Id: I92fd7a0009a816e93e316e4674c7509b61a474d4	2024-04-12 10:48:57 +02:00
Julia Kreger	2366a4b86e	Adds service steps A huge list of initial work for service steps * Adds service_step verb * Adds service_step db/object/API field on the node object for the status. * Increments the API version to 1.87 for both changes. * Increments the RPC API version to 1.57. * Adds initial testing to facilitate ensurance that supplied steps are passed through and executed upon. Does not: * Have tests for starting the agent ramdisk, although this is relatively boiler plate. * Have a collection of pre-decorated steps available for immediate consumption. Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994	2023-08-16 06:34:08 -07:00
Stephen Finucane	7083545731	tests: Replace invalid UUIDs Fix the warnings that oslo.versionedobjects has been emitting for years now. Change-Id: I53bd78d8b70f276d2ea8569f0ab1e7ce04f52fea Signed-off-by: Stephen Finucane <sfinucan@redhat.com>	2023-04-19 10:44:27 +01:00
Zuul	821ce8c319	Merge "Wipe Agent Token when cleaning timeout occcurs"	2023-03-14 19:27:16 +00:00
Zuul	718d52c792	Merge "Clean out agent token even if power is already off"	2023-03-13 23:00:46 +00:00
Julia Kreger	bcf6c12269	Clean out agent token even if power is already off While investigating a very curious report, I discovered that if somehow the power was already turned off to a node, say through an incorrect BMC or human action, and Ironic were to pick it up (as it does by default, because it checks before applying the power state, then it would not wipe the token information, preventing the agent from connecting on the next action/attempt/operation. We now remove the token on all calls to conductor utilities node_power_action method when appropriate, even if no other work is required. Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1	2023-03-02 15:02:23 +00:00
Julia Kreger	47b5909486	Wipe Agent Token when cleaning timeout occcurs In a relatively odd turn of events, should cleaning have started, but then timed out due to lost communications or a hard failure of the machine, an agent token could previously be orphaned preventing re-cleaning. We now explicitly remove the token in this case. Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638	2023-03-02 06:33:18 -08:00
Dmitry Tantsur	9a0fa631ca	Do not move nodes to CLEAN FAILED with empty last_error When cleaning fails, we power off the node, unless it has been running a clean step already. This happens when aborting cleaning or on a boot failure. This change makes sure that the power action does not wipe the last_error field, resulting in a node with provision_state=CLEANFAIL and last_error=None for several seconds. I've hit this in Metal3. Also when aborting cleaning, make sure last_error is set during the transition to CLEANFAIL, not when the clean up thread starts running. While here, make sure to log the current step in all cases, not only when aborting a non-abortable step. Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f Story: #2010603 Task: #47476	2023-03-01 11:16:46 +01:00
Julia Kreger	c3f397149a	Auto-populate lessee for deployments Adds a configuration option and capability to automatically record the lessee for a deployment based upon the original auth_token information provided in the request context. Additional token information is now shared through the context which is extended in the same fashion as most other projects saving request token information to their RequestContext, instead of triggering excess API calls in the background to Keystone to try and figure out requestor's information. Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23	2022-05-23 16:21:19 -07:00
Harald Jensås	4cf0147e86	Exclude current conductor from offline_conductors In some cases the current conductor may have failed to updated the heartbeat timestamp due to failure of resource starvation. When this occurs the dbapi get_offline_conductors method will include the current conductor in its return value. In this scenario the conductor may end up forcefully remove node reservations or allocations from itself, triggering takeover which fail on-going operations. This change adds a wrapper to exclude the current conductor. The wrapper will log a warning to raise the issue. Related-Bug: #1970484 Stroy: 2010016 Task: 45204 Change-Id: I6a8f38934b475f792433be6f0882540b82ca26c1	2022-04-28 10:28:26 +02:00
Zuul	19cafb55e1	Merge "Allow enabling fast-track per node"	2021-12-15 16:39:28 +00:00
Dmitry Tantsur	2a6cdf4b24	Allow enabling fast-track per node This is useful when some nodes need the "agent" power interface, while the others can be deployed normally. Change-Id: Ief7df40c83ef03d0ec5ae92d09ceffd39d3c12a3	2021-12-08 14:26:51 +01:00
Steve Baker	d5eb6ee567	Refactor driver_internal_info updates to methods Making updates to driver_internal_info can result in hard to read code due the requirement to assign the whole driver_internal_info back to the node to trigger the expected update operation. This change replaces driver_internal_info update operations with a new methods: - set_driver_internal_info - del_driver_internal_info - timestamp_driver_internal_info This change defines the functions and moves core conductor logic to use them. Subsequent changes in this series will move drivers to use the new functions. Change-Id: Ib8917c3c674e77cd3aba6a1e73c65162e3ee1141	2021-12-03 14:49:33 +13:00
Dmitry Tantsur	dec673784b	Demote three warning messages These 3 messages do not convey a lot of useful information to the operators and definitely do not represent a potential issue that warrants a warning. Change-Id: I77f5802125f79c945eb05a278f7ce53696df830a	2021-10-06 10:53:41 +02:00
Julia Kreger	d17749249c	Record node history and manage events in db * Adds periodic task to purge node_history entries based upon provided configuration. * Adds recording of node history entries for errors in the core conductor code. * Also changes the rescue abort behavior to remove the notice from being recorded as an error, as this is a likely bug in behavior for any process or service evaluating the node last_error field. * Makes use of a semi-free form event_type field to help provide some additional context into what is going on and why. For example if deployments are repeatedly failing, then perhaps it is a configuration issue, as opposed to a general failure. If a conductor has no resources, then the failure, in theory would point back to the conductor itself. Story: 2002980 Task: 42960 Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166	2021-09-10 14:47:27 -07:00
Riccardo Pittau	7aec74d4e6	Increase version of hacking and pycodestyle Fix errors in unit tests Change-Id: Ib5a75fc5e5b6b5661d36e3a27796c3c02ed90056	2021-07-28 11:33:15 +02:00
Cenne	46ff51487a	Add `boot_mode` and `secure_boot` to node object and expose in api * add fields to Node object * expose them at endpoint `/v1/nodes/{node_ident}/states` * update states on powersync / entering managed state. * tests * update api endpoint info in api-ref Story: 2008567 Task: 41709 Change-Id: Iddd1421a6fa37d69da56658a2fefa5bc8cfd15e4	2021-07-08 15:04:15 +02:00
Dmitry Tantsur	172d1b22df	Delay rendering configdrive When the configdrive input is JSON (meta_data, etc), delay the rendering until the ISO image is actually used. It has two benefits: 1) Avoid storing a large ISO image in instance_info, 2) Allow deploy steps to access the original user's input. Fix configdrive masking to correctly mask dicts. Story: #2008875 Task: #42419 Change-Id: I86d30bbb505b8c794bfa6412606f4516f8885aa9	2021-05-19 15:17:49 +02:00
Zuul	af29f398cc	Merge "Don't mark an agent as alive if rebooted"	2021-02-08 09:24:47 +00:00
Derek Higgins	4287951d71	Don't mark an agent as alive if rebooted If 'agent_url' has been cleared from internal_info it indicates that the node has been powered off. Change-Id: Idba486c98e1e92d35fca2e2d156866566acb9e40 Story: 2008583 Task: 41736	2021-02-04 13:01:50 +00:00
Dmitry Tantsur	a5f7d75ba2	Apply force_persistent_boot_device to all boot interfaces For some (likely historical) reasons we only use it for PXE and iPXE, but the same logic applies to any boot interface (since it depends on how the management interface and the BMC work, not on the boot method). This change moves its handling to conductor utils. Change-Id: I948beb4053034d3c1b4c5b7c64100e41f6022739	2021-02-01 13:37:20 +01:00
Dmitry Tantsur	121b3348c8	Refactor vendor detection and add Redfish implementation Get rid of the TODO in the code and prepare for more management interfaces supporting detect_vendor(). Vendor detecting now runs during transition to manageable and on power state sync (essentially same as before but for all drivers not only IPMI). Update the IPMI implementation to no longer hide exceptions since they're not handled on the upper level. Simplify the regex and fix the docstring. Add the Redfish implementation as a foundation for future vendor-specific changes. Change-Id: Ie521cf2295613dde5842cbf9a053540a40be4b9c	2021-01-28 16:41:45 +01:00
Aija Jauntēva	870181f3ae	Update `cleaning_error_handler` Update `cleaning_error_handler` to match with `deploying_error_handler` that logs all errors and optionally separates between logged message and `last_error`. Logged message usually contains node's uuid as there is no context for node in stream of log entries. `last_error` usually does not contain node's uuid as it is already displayed in the context of node. Impact: * There were messages that were only added to node's last_error. Now they are going to be logged too. * No need to log explicitly before `cleaning_error_handler`. Such occurrences have been removed. * Where there were different message for log and last_error it is kept. Where there was only 1 message, it is left as it is to be both logged and updated in `last_error`. * Exception logging is replaced with error logging with traceback. Story: 2008307 Task: 41198 Change-Id: I813228fb47a51ee6c45b420322acabdf565ff752	2020-11-13 05:17:40 -05:00
Dmitry Tantsur	e39858dd8c	Wiping agent tokens on reboot via API - take 2 Because of using an incorrect variable, reboot was treated as power on, and the token was not wiped. Change-Id: I656450c2bedc3dc0d20a70de78cc29bf64d5fe85 Story: #2008097 Task: #40799	2020-10-05 17:36:45 +02:00
Julia Kreger	68cea19d99	Minor agent version code cleanup Internal to the conductor, we have long determiend that if no agent version is submitted by the client, that the client is a version 3.0.0 agent. With agent token, we're effectively requiring 6.1.0 or later to be leveraged, so all agents should be sending versions. An agent not sending versions is thus unsupported and would no longer work without agent token support. Redudnant code has thus been removed. Additionally the conductor utility is_agent_token_supported has been removed since it is now redundant. Change-Id: Id6c8d1df08c3ac7af61ed7d05d274f3099003582	2020-09-29 10:02:10 -07:00
Dmitry Tantsur	bc628ac6ef	Route conductor notification RPC to the same conductor RPC continue_node_{deploy,clean} are called from a conductor handling the node, so they don't have to go through the hash ring. This avoids situation when take over happens in the middle of a deploy/clean step processing, breaking it. Eventually, we should stop using RPC for that at all, but that will be a much more invasive change. Story: #2008200 Task: #40984 Change-Id: I76293f8ec30d5957b99bdbce5b70e87e8378d135	2020-09-25 17:34:58 +02:00
Zuul	35882c438d	Merge "Also wipe agent token on manual power off or reboot"	2020-09-17 08:10:38 +00:00
Dmitry Tantsur	bc04a42a96	Also wipe agent token on manual power off or reboot We have a check in the code that is never true for manual power actions because of what happens in the conductor manager. Remove it. Change-Id: I50b7b78a41188c41e4944894851f1d12684f824a	2020-09-14 16:09:54 +02:00
Dmitry Tantsur	2b676a6864	Accept and use a TLS certificate from the agent Accepts the certificate from a heartbeat and stores its path in driver_internal_info for further usage by the agent client (or any 3rd party deploy implementations). Similarly to agent_url, the certificate is protected from further changes (unless the local copy does not exist) and is removed on reboot or tear down (unless fast-tracking). Change-Id: I81b326116e62cd86ad22b533f55d061e5ed53e96 Story: #2007214 Task: #40603	2020-09-09 17:27:30 +02:00
Dmitry Tantsur	7dd611dc5e	Ensure in-band deploy steps are present in time for fast-track deployments Currently we load in-band deploy steps on the first heartbeat in DEPLOYWAIT. With fast-track, however, this heartbeat may happen way too late, because the node is up and heartbeating before it's moved to DEPLOYWAIT. This results in in-band deploy steps not loaded. This change forces a refresh of deploy steps in deploy.deploy if fast-track is used. Also make sure that cached steps are cleared on reboot or power off since they may become outdated next time the agent boots. Change-Id: I003543452717183b9b3359e368757bcd824a5ce7	2020-08-21 14:01:57 +02:00
Zuul	c9a0bce01b	Merge "Follow-up on blocking port deletions"	2020-07-18 04:22:06 +00:00
Julia Kreger	ba0dc574bc	Follow-up on blocking port deletions A recent comment on https://review.opendev.org/#/c/665835 pointed out that we should likely make some changes and a fix a missing check for the introspection_vif_port_id which was likely introduced after this functionality was originally written. Also adds some documentation on the subject since we lack docs even pointing out how to delete a port. :\ Change-Id: I0ba8a3741eefa80eb56e25a1b339f8433b3fc0dc	2020-07-16 12:47:07 -07:00
Dmitry Tantsur	2d4d375358	Wipe agent token during reboot or power off Otherwise a reboot during fast-track will leave the newly booted agent without an ability to request a token. Change-Id: I963276efae5599bfed6cbb4df18e8dd3bd1b9839	2020-07-15 10:54:31 +02:00
Riccardo Pittau	a58ca1f7a6	Enforce autospec in test_utils And remove the corresponding H210 filter. Change-Id: Iaec44cc66ba654e478381260044776fd838b7527	2020-06-24 09:06:52 +02:00
Zuul	a2ad31ddef	Merge "Fix agent token and URL handling during fast-track deployment"	2020-06-19 13:50:27 +00:00
Zuul	5822b2fb7d	Merge "Block port deletions where vif is present"	2020-06-18 11:50:29 +00:00
Dmitry Tantsur	5909163924	Fix agent token and URL handling during fast-track deployment We wipe these fields on some conditions, most notable - on starting the deployment. Make the removal of these fields to always go through the helpers in conductor/utils (and remove an unused one). Change-Id: Idb952588bb8a6d5131764f29c6225762ba5d55cc	2020-06-16 18:06:50 +02:00
Julia Kreger	64674bf0ed	Block port deletions where vif is present We must not allow users to be able to delete ports when a VIF record is present. If they are able to, then the user risks orphaning ports in neutron which could result in address depletion prevent the port from being able to be used again. Co-Authored-By: Madhuri Kumari <madhuri.kumari@intel.com> Story: 2006710 Task: 37074 Change-Id: I9f8eaf94a375984ea2f567f0a87c2ed1ed5cb3b7	2020-06-02 14:13:24 -07:00
Iury Gregory Melo Ferreira	d6e7552457	Switch to unittest mock Python3 have a standard library for mock in the unittest module, let's drop the mock requirement and switch tests to unittest mock. Change-Id: I4f1b3e25c8adbc24cdda51c73da3b66967f7ef23	2020-04-30 19:04:17 +02:00
Zuul	d927dd0d0a	Merge "Generalize clean step functions to support deploy steps"	2020-03-28 20:51:49 +00:00
Zuul	9cfdd6a0d9	Merge "Hash the rescue_password"	2020-03-26 17:14:13 +00:00
Dmitry Tantsur	16bce7f18d	Generalize clean step functions to support deploy steps This change updates various agent cleaning functions to also support deploy steps via a new step_type argument. It does not yet make it possible to run in-band deploy steps. The agent's get_clean_steps call has been modified to not fail if there are no cached steps. This is going to be normal for both cleaning and deploy in the future, and it is impossible to hit now because of the way cleaning is started. Change-Id: I789b130e7e490e23924338a024397973957272ac Story: #2006963	2020-03-25 13:04:11 +01:00
Julia Kreger	fcaefdbe74	Hash the rescue_password In order to provide increased security, it is necessary to hash the rescue password in advance of it being stored into the database and to provide some sort of control for hash strength. This change IS incompatible with prior IPA versions with regard to use of the rescue feature, but I fully expect we will backport the change to IPA on to stable branches and perform a release as it is a security improvement. Change-Id: I1e118467a536229de6f7c245c1c48f0af38dcef2 Story: 2006777 Task: 27301	2020-03-24 20:11:43 +00:00
Iury Gregory Melo Ferreira	1425adbb05	Do not use random to generate token To avoid problems with FIPS 140-2 let's generate the token using the secrets module instead of random. Change-Id: I90c3c94112d093e2309414b9902f58d31d925ad3 Story: 2007444 Task: 39104	2020-03-20 20:29:58 +01:00
Julia Kreger	bb3b2349f9	Pre-shared agent token In order to improve security of the lookup/heartbeat endpoints, we need to generate and provide temporary tokens to the initial callers, if supported, to facilitate the verification of commands. This is the first patch in an entire series which utimately enables the endpoint communication to be better secured. The idea behind this started in private story 2006634 which is locked as a security related filing covering multiple aspects of ironic/ironic-python-agent interaction centered around miss-use and generally exposed endpoints. That story will remain marked as a private bug because it has several different items covered, some of which did not prove to be actually exploitable, but spawned stories 2006777, 2006773, 2007025, and is ultimately similar to Story 1526748. Operationally this is a minimally invasive security enhancement to lay the foundation to harden interactions with the agent. This will take place over a series of patches to both Ironic and the Ironic-Python-Agent. Also see "Security of /heartbeat and /lookup endpoints" in http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010789.html Story: 2007025 Task: 37818 Change-Id: I0118007cac3d6548e9d41c5e615a819150b6ef1a	2020-02-20 14:33:32 -08:00
Dmitry Tantsur	d949318409	Split deployment-related functions from manager.py into a new module With more than 4000 lines of code (more than 8000 of unit tests) the current manager.py is barely manageable. In preparation for the new deployment API, this change moves the free-standing deploy-related functions to utils.py and the new deployments.py. Change-Id: Ic3f369a7fa72d09263b0670ae78980913b024962 Story: #2006910	2020-02-06 13:03:57 +01:00
Julia Kreger	931c125982	Block ability update callback_url A malicious user with: * API access normally reserved for the provisioning, cleaning, rescue networks. * Insight about a node, such as a MAC address, or baremetal node UUID. * Insight into the state of the node, such as the access provided to Compute API users, or other Bare Metal API users. Can submit an erroneous ``heartbeat`` to the ironic-api endpoint with a ``callback_url`` that is not of the actual intended agent. This can potentially cause a rescue, cleaning, or deployment operation to be derailed, or at worst commands to be sent to to an endpoint the malicious user controls. Story: 2006773 Task: 37295 Change-Id: I1a5e3c2b34d45c06fb74e82d0f30735ce9041914	2019-11-15 22:09:08 +00:00
Mark Goddard	3e51982252	Don't resume deployment or cleaning on heartbeat when polling Some drivers use a periodic task to poll for completion of a deploy or clean step. The iDRAC RAID driver is one example of this. In https://review.opendev.org/#/c/676152, the agent heartbeat handler was modified to resume deployment if not currently in the core deploy step. This makes sense for the ilo driver, which does not poll for completion of RAID configuration, but the iDRAC driver polls the lifecycle controller's job queue, and expects to be able to resume deployment once the job is complete. However, there is now a race between the agent heartbeat as the node boots up, and the job queue poller. This change adds new flags, cleaning_polling and deployment_polling, which can be used by a driver to signal that they are polling for completion of a deploy step, and that the agent heartbeat should not be used for this purpose. We also add here some more cleanup of the cleaning and deployment step metadata in driver_internal_info, since if these fields are left in place they may affect subsequent cleaning or deployment steps. Change-Id: I34591440ab993a80a0cc88be6e10e33f1ae4a660 Story: 2003817 Task: 36563	2019-09-21 12:52:28 +01:00

1 2 3

109 Commits