ironic

Commit Graph

Author	SHA1	Message	Date
Zuul	ddca532f52	Merge "Fix the confusion around service_reboot/servicing_reboot"	2024-04-15 23:38:00 +00:00
Dmitry Tantsur	004e78c413	Fix the confusion around service_reboot/servicing_reboot We ended up using two names for the same flag (and forgot it in one place completely). To not just fix the issue but also prevent it in the future, refactor asynchronous steps handling into a new helper module with constants and helper functions. I've settled on servicing_reboot as opposed to service_reboot because that's the value we currently set (but not read), so it provides better compatibility when backporting. Remove excessive mocking in the Redfish unit tests. Change-Id: I32b5f860b5d10864ce68f8d5f1dac3f76cd158d6	2024-04-12 18:09:54 +02:00
Dmitry Tantsur	6c8673c1b4	Fix servicing clean-up Serious issues: - Nothing powers on nodes after servicing, so they end up active and powered off in the end. - Restoring power state was done three times. Minor issues: - Function _tear_down_node_servicing is called twice causing a traceback. - Furthermore, process_event('done') is also called in another place in deploy utils. - Make sure nodes are never considered for fast-track when servicing, it prevents clean-up of virtual media devices. Change-Id: I92fd7a0009a816e93e316e4674c7509b61a474d4	2024-04-12 10:48:57 +02:00
Zuul	df9e1ba80e	Merge "[codespell] Fixing Spelling Mistakes"	2024-03-14 17:13:05 +00:00
Sharpz7	949387bd80	[codespell] Fixing Spelling Mistakes This is the first in a series of commits to add support for codespell. This is continuning the process completed in ironic-python-agent. Future Commits will add a Tox Target, CI support and potentially a git-blame-ignore-revs file if their are lots of spelling mistakes that could clutter git blame. Change-Id: Id328ff64c352e85b58181e9d9e35973a8706ab7a	2024-02-12 19:58:56 +00:00
Sharpz7	41ee6aa2ff	Ensure all errors are passed during cleaning Related Bug: https://bugs.launchpad.net/ironic/+bug/1628422 This change makes sure that the caught error is passed through to node_history_record() Change-Id: I9b78ec37f37024d04928403bbf0b85ed96906441	2024-02-10 00:25:11 +00:00
Dmitry Tantsur	607b8734e4	Cache firwmare components on the transition to "manageable" Automated cleaning is not guaranteed to be enabled, and in any case it's too late to cache the components at that point: firwmare upgrades may happen before the transition to "available". Change-Id: I6b74970fffcc150c167830bef195f284a8c6f197	2023-12-14 09:51:47 +01:00
Dmitry Tantsur	23745d97fe	Fix two severe errors in the firmware caching code First, it tries to create components even if the current version is not known and fails with a database constraint error (because the initial version cannot be NULL). Can be reproduced with sushy-tools before `37f118237a` Second, unexpected exceptions are not handled in the caching code, so any of them will cause the node to get stuck in cleaning forever. On top of that, the caching code is missing a metrics decorator. This change does not update any unit tests because none currently exist. Change-Id: Iaa242ca6aa6138fcdaaf63b763708e2f1e559cb0	2023-12-08 18:11:03 +01:00
Dmitry Tantsur	0902912217	Generic API for attaching/detaching virtual media This patch allows to attach or detach a generic image as virtual media device after a node has been provisioned. Closes-Bug: #2033288 Change-Id: I97b68047d769f6fb686c53e89084b5874e02b8c7	2023-11-23 09:55:09 +01:00
Iury Gregory Melo Ferreira	4eb0dbf7b5	RedfishFirmware Interface Change-Id: I75b2433fade0c36522024c16608d61cd663b38d5	2023-09-20 13:09:38 -03:00
Julia Kreger	2366a4b86e	Adds service steps A huge list of initial work for service steps * Adds service_step verb * Adds service_step db/object/API field on the node object for the status. * Increments the API version to 1.87 for both changes. * Increments the RPC API version to 1.57. * Adds initial testing to facilitate ensurance that supplied steps are passed through and executed upon. Does not: * Have tests for starting the agent ramdisk, although this is relatively boiler plate. * Have a collection of pre-decorated steps available for immediate consumption. Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994	2023-08-16 06:34:08 -07:00
Dmitry Tantsur	0370f5ac97	Migrate the inspector's /continue API This change creates all necessary parts to processing inspection data: * New API /v1/continue_inspection Depending on the API version, either behaves like the inspector's API or (new version) adds the lookup functionality on top. The lookup process is migrated from ironic-inspector with minor changes. It takes MAC addresses, BMC addresses and (optionally) a node UUID and tries to find a single node in INSPECTWAIT state that satisfies all of these. Any failure results in HTTP 404. To make lookup faster, the resolved BMC addresses are cached in advance. * New RPC continue_inspection Essentially, checks the provision state again and delegates to the inspect interface. * New inspect interface call continue_inspection The base version does nothing. Since we don't yet have in-band inspection in Ironic proper, the only actual implementation is added to the existing "inspector" interface that works by doing a call to ironic-inspector. Story: #2010275 Task: #46208 Change-Id: Ia3f5bb9d1845d6b8fab30232a72b5a360a5a56d2	2023-06-07 10:57:08 +02:00
Julia Kreger	75b881bd31	Fix DB/Lock session handling issues Prior to this fix, we have been unable to run the Metal3 CI job with SQLAlchemy's internal autocommit setting enabled. However that setting is deprecated and needs to be removed. Investigating our DB queries and request patterns, we were able to identify some queries which generally resulted in the underlying task and lock being held longer because the output was not actually returned, which is something we've generally had to fix in some places previously. Doing some of these changes did drastically reduce the number of errors encountered with the Metal3 CI job, however it did not eliminate them entirely. Further investigation, we were able to determine that the underlying issue we were encountering was when we had an external semi-random reader, such as Metal3 polling endpoints, we could reach a situation where we would be blocked from updating the database as to open a write lock, we need the active readers not to be interacting with the database, and with a random reader of sorts, the only realistic option we have is to enable the Write Ahead Log[0]. We didn't have to do this with SQLAlchemy previously because autocommit behavior hid the complexities from us, but in order to move to SQLAlchemy 2.0, we do need to remove autocommit. Additionally, adds two unit tests for get_node_with_token rpc method, which apparently we missed or lost somewhere along the way. Also, adds notes to two Database interactions to suggest we look at them in the future as they may not be the most efficient path forward. [0]: https://www.sqlite.org/wal.html Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912	2023-05-01 15:35:33 -07:00
Chris Krelle	510a612eed	Add ablity to power off nodes in clean failed We have seen duplicate ip issues when leaving clean failed nodes powered on. This patch allows operators to power down nodes that enter clean failed state. Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13	2023-04-24 16:20:54 -07:00
Zuul	821ce8c319	Merge "Wipe Agent Token when cleaning timeout occcurs"	2023-03-14 19:27:16 +00:00
Zuul	718d52c792	Merge "Clean out agent token even if power is already off"	2023-03-13 23:00:46 +00:00
Julia Kreger	bcf6c12269	Clean out agent token even if power is already off While investigating a very curious report, I discovered that if somehow the power was already turned off to a node, say through an incorrect BMC or human action, and Ironic were to pick it up (as it does by default, because it checks before applying the power state, then it would not wipe the token information, preventing the agent from connecting on the next action/attempt/operation. We now remove the token on all calls to conductor utilities node_power_action method when appropriate, even if no other work is required. Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1	2023-03-02 15:02:23 +00:00
Julia Kreger	47b5909486	Wipe Agent Token when cleaning timeout occcurs In a relatively odd turn of events, should cleaning have started, but then timed out due to lost communications or a hard failure of the machine, an agent token could previously be orphaned preventing re-cleaning. We now explicitly remove the token in this case. Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638	2023-03-02 06:33:18 -08:00
Dmitry Tantsur	9a0fa631ca	Do not move nodes to CLEAN FAILED with empty last_error When cleaning fails, we power off the node, unless it has been running a clean step already. This happens when aborting cleaning or on a boot failure. This change makes sure that the power action does not wipe the last_error field, resulting in a node with provision_state=CLEANFAIL and last_error=None for several seconds. I've hit this in Metal3. Also when aborting cleaning, make sure last_error is set during the transition to CLEANFAIL, not when the clean up thread starts running. While here, make sure to log the current step in all cases, not only when aborting a non-abortable step. Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f Story: #2010603 Task: #47476	2023-03-01 11:16:46 +01:00
Zuul	5d2283137c	Merge "Make anaconda non-image deploys sane"	2022-07-14 01:28:00 +00:00
Julia Kreger	e78f123ff8	Make anaconda non-image deploys sane Ironic has a lot of logic built up around use of images for filesystems, however several recent additions, such as the ``ramdisk`` and ``anaconda`` deployment interfaces have started to break this mold. In working with some operators attempting to utilzie the anaconda deployment interface outside the context of full OpenStack, we discovered some issues which needed to be make simpler to help remove the need to route around data validation checks for things that are not required. Standalong users also have the ability to point to a URL with anaconda, where as Operators using OpenStack can only do so with customized kickstart files. While this is okay, the disparity in configuraiton checking was also creating additional issues. In this, we discovered we were not really graceful with redirects, so we're now a little more graceful with them. Story: 2009939 Story: 2009940 Task: 44834 Task: 44833 Change-Id: I8b0a50751014c6093faa26094d9f99e173dcdd38	2022-07-11 07:41:06 -07:00
Dmitry Tantsur	e09919caba	Move logging out of skip_automated_cleaning Simply boolean functions should not have logging as a side effect. This one is also used in deploy_utils without logging. Change-Id: Iaa398f09cec06a8417c595acac19b0b9f3f3a871	2022-07-06 17:00:11 +02:00
Zuul	a4bf31de61	Merge "Auto-populate lessee for deployments"	2022-07-02 02:56:54 +00:00
Dmitry Tantsur	65583e6417	No deploy_kernel/ramdisk with the ramdisk deploy and no cleaning Ramdisk deploys don't use IPA, no need to provide it. Cleaning may need the agent, so only skip verification if cleaning is disabled. Other boot interfaces may need fixing as well, I haven't checked them. Change-Id: Ia2739311f065e19ba539fe3df7268075d6075787	2022-06-23 19:49:16 +02:00
Julia Kreger	c3f397149a	Auto-populate lessee for deployments Adds a configuration option and capability to automatically record the lessee for a deployment based upon the original auth_token information provided in the request context. Additional token information is now shared through the context which is extended in the same fashion as most other projects saving request token information to their RequestContext, instead of triggering excess API calls in the background to Keystone to try and figure out requestor's information. Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23	2022-05-23 16:21:19 -07:00
Harald Jensås	4cf0147e86	Exclude current conductor from offline_conductors In some cases the current conductor may have failed to updated the heartbeat timestamp due to failure of resource starvation. When this occurs the dbapi get_offline_conductors method will include the current conductor in its return value. In this scenario the conductor may end up forcefully remove node reservations or allocations from itself, triggering takeover which fail on-going operations. This change adds a wrapper to exclude the current conductor. The wrapper will log a warning to raise the issue. Related-Bug: #1970484 Stroy: 2010016 Task: 45204 Change-Id: I6a8f38934b475f792433be6f0882540b82ca26c1	2022-04-28 10:28:26 +02:00
Dmitry Tantsur	daa7dba331	Shorten error messages in commonly used modules * Do not mention "deploy driver", it's not a thing. * Be careful with the pattern "Error: %s" or "Reason: %s". It is good for long introductory sentences, but looks poor for shorter ones and becomes really problematic when several instances are concatenated. This change updates deploy_utils, agent code and conductor modules. Change-Id: Ie1efea02b5f1a174e9ef8c5253ce9754a60b4c56	2022-02-17 19:16:52 +01:00
Dmitry Tantsur	a813c769e8	Explicit parameter to distinguish partition/whole-disk images Using kernel/ramdisk makes no sense with local boot, we need a better way. We already have an internal image_type instance parameter, let's make it public. Glance support will be added in the next patch. Change-Id: I4ce5f7a2317d952f976194d2022328f4afbb0258	2022-01-28 19:13:13 +01:00
Zuul	19cafb55e1	Merge "Allow enabling fast-track per node"	2021-12-15 16:39:28 +00:00
Dmitry Tantsur	2a6cdf4b24	Allow enabling fast-track per node This is useful when some nodes need the "agent" power interface, while the others can be deployed normally. Change-Id: Ief7df40c83ef03d0ec5ae92d09ceffd39d3c12a3	2021-12-08 14:26:51 +01:00
Steve Baker	d5eb6ee567	Refactor driver_internal_info updates to methods Making updates to driver_internal_info can result in hard to read code due the requirement to assign the whole driver_internal_info back to the node to trigger the expected update operation. This change replaces driver_internal_info update operations with a new methods: - set_driver_internal_info - del_driver_internal_info - timestamp_driver_internal_info This change defines the functions and moves core conductor logic to use them. Subsequent changes in this series will move drivers to use the new functions. Change-Id: Ib8917c3c674e77cd3aba6a1e73c65162e3ee1141	2021-12-03 14:49:33 +13:00
Zuul	ef5c1a3a44	Merge "Demote three warning messages"	2021-10-08 11:10:40 +00:00
Dmitry Tantsur	dec673784b	Demote three warning messages These 3 messages do not convey a lot of useful information to the operators and definitely do not represent a potential issue that warrants a warning. Change-Id: I77f5802125f79c945eb05a278f7ce53696df830a	2021-10-06 10:53:41 +02:00
Jacob Anders	b385d9ae5b	Add support for verify steps This change adds support for verify steps in Ironic. Verify steps allow executing actions on transition from "verifying" to "managable" state and can perform actions such as cleaning BMC job queue or resetting the BMC on supported platforms. Verify steps are similar to deploy and clean steps, just simpler. Story: 2009025 Task: 42751 Change-Id: Iee27199a0315b8609e629bac272998c28274802b	2021-09-30 20:46:17 +10:00
Julia Kreger	d17749249c	Record node history and manage events in db * Adds periodic task to purge node_history entries based upon provided configuration. * Adds recording of node history entries for errors in the core conductor code. * Also changes the rescue abort behavior to remove the notice from being recorded as an error, as this is a likely bug in behavior for any process or service evaluating the node last_error field. * Makes use of a semi-free form event_type field to help provide some additional context into what is going on and why. For example if deployments are repeatedly failing, then perhaps it is a configuration issue, as opposed to a general failure. If a conductor has no resources, then the failure, in theory would point back to the conductor itself. Story: 2002980 Task: 42960 Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166	2021-09-10 14:47:27 -07:00
Cenne	bc95c92f7c	Add api endpoints for changing boot_mode and secure_boot state Done: - Node API endpoints expose - RPC methods - Conductor Manager methods - Conductor utils new methods - RBAC new policies - Node API tests - Manager Tests (+ some testing for utils methods) - RBAC tests - Docs (api-ref) - REST API version history - Releasenotes Story: 2008567 Task: 41709 Change-Id: I2d72389edf546b99c536c6b130ca85ababf80591	2021-08-23 19:38:58 +02:00
Cenne	b03ff30f93	Fixes missing argument for log format string Story: 2008567 Change-Id: Id5bcfad5cd4514dd710232d75fbd729856f16b17	2021-07-27 11:49:19 +02:00
Cenne	46ff51487a	Add `boot_mode` and `secure_boot` to node object and expose in api * add fields to Node object * expose them at endpoint `/v1/nodes/{node_ident}/states` * update states on powersync / entering managed state. * tests * update api endpoint info in api-ref Story: 2008567 Task: 41709 Change-Id: Iddd1421a6fa37d69da56658a2fefa5bc8cfd15e4	2021-07-08 15:04:15 +02:00
Bob Fournier	e15440370c	Include bios registry fields in bios API Provide the fields in the BIOS setting API - ``/v1/nodes/{node}/bios/{setting}``, and in the BIOS setting list API when details are requested - ``/v1/nodes/<node>/bios?detail=True``. Story: #2008571 Task: #42483 Change-Id: Ie86ec57e428e2bb2efd099a839105e51a94824ab	2021-05-27 12:15:20 -04:00
Dmitry Tantsur	172d1b22df	Delay rendering configdrive When the configdrive input is JSON (meta_data, etc), delay the rendering until the ISO image is actually used. It has two benefits: 1) Avoid storing a large ISO image in instance_info, 2) Allow deploy steps to access the original user's input. Fix configdrive masking to correctly mask dicts. Story: #2008875 Task: #42419 Change-Id: I86d30bbb505b8c794bfa6412606f4516f8885aa9	2021-05-19 15:17:49 +02:00
Dmitry Tantsur	c6e8281f85	Wipe agent tokens on inspection start and abort Also make sure the pregenerated flag is always reset. Change-Id: I73aaa803d3eb84ddac59a778e998836a645217eb	2021-04-08 13:42:25 +02:00
Dmitry Tantsur	30a85bd0ce	API to force manual cleaning without booting IPA Adds a new argument disable_ramdisk to the manual cleaning API. Only steps that are marked with requires_ramdisk=False can be run in this mode. Cleaning prepare/tear down is not done. Some steps (like redfish BIOS) currently require IPA to detect a successful reboot. They are not marked with requires_ramdisk just yet. Change-Id: Icacac871603bd48536188813647bc669c574de2a Story: #2008491 Task: #41540	2021-03-16 16:08:46 +01:00
Riccardo Pittau	d5b5356d60	[trivial] fix typos in conductor Change-Id: Ib431c3507cb4bdbd9ba30b58e30b078e855e7754	2021-02-23 17:54:53 +01:00
Zuul	766d8f11b4	Merge "Add 'deploy steps' parameter for provisioning API"	2021-02-12 16:01:33 +00:00
Zuul	af29f398cc	Merge "Don't mark an agent as alive if rebooted"	2021-02-08 09:24:47 +00:00
Derek Higgins	4287951d71	Don't mark an agent as alive if rebooted If 'agent_url' has been cleared from internal_info it indicates that the node has been powered off. Change-Id: Idba486c98e1e92d35fca2e2d156866566acb9e40 Story: 2008583 Task: 41736	2021-02-04 13:01:50 +00:00
Aija Jauntēva	3138acc836	Add 'deploy steps' parameter for provisioning API Story: 2008043 Task: 40705 Change-Id: I3dc2d42b3edd2a9530595e752895e9d113f76ea8	2021-02-03 11:47:53 -05:00
Dmitry Tantsur	b8a2dcaf86	Trivial: log the newly detected vendor Change-Id: Ib751316a98d7a1c4469b405117c8e1fd1f296757	2021-02-03 17:31:54 +01:00
Dmitry Tantsur	a5f7d75ba2	Apply force_persistent_boot_device to all boot interfaces For some (likely historical) reasons we only use it for PXE and iPXE, but the same logic applies to any boot interface (since it depends on how the management interface and the BMC work, not on the boot method). This change moves its handling to conductor utils. Change-Id: I948beb4053034d3c1b4c5b7c64100e41f6022739	2021-02-01 13:37:20 +01:00
Dmitry Tantsur	121b3348c8	Refactor vendor detection and add Redfish implementation Get rid of the TODO in the code and prepare for more management interfaces supporting detect_vendor(). Vendor detecting now runs during transition to manageable and on power state sync (essentially same as before but for all drivers not only IPMI). Update the IPMI implementation to no longer hide exceptions since they're not handled on the upper level. Simplify the regex and fix the docstring. Add the Redfish implementation as a foundation for future vendor-specific changes. Change-Id: Ie521cf2295613dde5842cbf9a053540a40be4b9c	2021-01-28 16:41:45 +01:00

194 Commits