ironic

Commit Graph

Author	SHA1	Message	Date
Dmitry Tantsur	adec0f6f01	Add a reserved workers pool (5% by default) I've seen a situation where heartbeats managed to completely saturate the conductor workers, so that no API requests could come through that required interaction with the conductor (i.e. everything other than reads). Add periodic tasks for a large (thousands) number of nodes, and you get a completely locked up Ironic. This change reserves 5% (configurable) of the threads for API requests. This is done by splitting one executor into two, of which the latter is only used by normal _spawn_worker calls and only when the former is exhausted. This allows an operator to apply a remediation, e.g. abort some deployments or outright power off some nodes. Partial-Bug: #2038438 Change-Id: Iacc62d33ffccfc11694167ee2a7bc6aad82c1f2f	2023-12-07 13:47:39 +01:00
Dmitry Tantsur	203660a0be	Fix _by_arch documentation and un-deprecate the options without it First, the _by_arch options are not a replacement for plain options: the cpu_arch property is neither required nor standardized. This is why older options with *_by_arch equivalents are not deprecated. Second, the example in the documentation is wrong: oslo.config does not use Python dictionaries. Which makes me suspect that the feature has never been properly tested (indeed, it's not used in the devstack CI, and Bifrost uses the older options). Change-Id: If1e633930909ce9d80e14f3ec3daa0bf8d48b7f0	2023-11-27 18:12:58 +01:00
Dmitry Tantsur	224cdd726c	Bump workers_pool_size to 300 and remove queueing of tasks Especially in a single-conductor environment, the number of threads should be larger than max_concurrent_deploy, otherwise the latter cannot be reached in practice or will cause issues with heartbeats. On the other hand, this change fixes an issue with how we use futurist. Due to a misunderstanding, we ended up setting the workers pool size to 100 and then also allowing 100 more requests to be queued. To put it shortly, this change moves from 100 threads + 100 queued to 300 threads and no queue. Partial-Bug: #2038438 Change-Id: I1aeeda89a8925fbbc2dae752742f0be4bc23bee0	2023-10-05 08:51:51 +02:00
Zuul	8be7efdeab	Merge "Introduce default kernel/ramdisks by arch"	2023-08-29 04:11:32 +00:00
Bifrost	3c5e05a8a4	Introduce default kernel/ramdisks by arch Introduce config to allow setting default ramdisks per-architecture. The hierarchy of the parameters is: Node config > config by architecture > general config Change-Id: I95dfece3e8f7bcd3121ac808985cb61997877a51	2023-08-28 17:25:37 +01:00
Julia Kreger	2366a4b86e	Adds service steps A huge list of initial work for service steps * Adds service_step verb * Adds service_step db/object/API field on the node object for the status. * Increments the API version to 1.87 for both changes. * Increments the RPC API version to 1.57. * Adds initial testing to ensure that supplied steps are passed through and executed upon. Does not: * Have tests for starting the agent ramdisk, although this is relatively boilerplate. * Have a collection of pre-decorated steps available for immediate consumption. Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994	2023-08-16 06:34:08 -07:00
Julia Kreger	8fc8372e74	Add wait step Adds a wait step to allow for finer grained workflows and forcing interruptions which may be needed in some cases with specialized hardware. Change-Id: Idc338b761ebe35a4635022a324ca5acbf29fc462	2023-07-24 22:42:20 +00:00
Julia Kreger	013ac0cb41	execute on child node support Allows steps to be executed on child nodes, and adds the reserved power_on, power_off, and reboot step names. Change-Id: I4673214d2ed066aa8b95a35513b144668ade3e2b	2023-05-24 15:42:46 -07:00
Chris Krelle	510a612eed	Add ability to power off nodes in clean failed We have seen duplicate IP issues when leaving clean failed nodes powered on. This patch allows operators to power down nodes that enter clean failed state. Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13	2023-04-24 16:20:54 -07:00
Julia Kreger	82b8ec7a39	Get conductor metric data This change adds the capability for the ironic-conductor and standalone service process to transmit timer and counter metrics to the message bus notifier which may be consumed by a ceilometer, ironic-prometheus-exporter, or other consumer of metrics event data on to the message bus. This functionality is not presently supported on dedicated API services such as those running as an ``ironic-api`` application process, or Ironic WSGI application. This is due to the lack of an internal trigger mechanism to transmit the data in a metrics update to the message bus and/or notifier plugin. This change requires ironic-lib 5.4.0 to collect and ship metrics via the message bus. Depends-On: https://review.opendev.org/c/openstack/ironic-lib/+/865311 Change-Id: If6941f970241a22d96e06d88365f76edc4683364	2023-02-23 11:39:07 -08:00
Julia Kreger	9a8b1d149c	Concurrent Destructive/Intensive ops limits Provide the ability to limit resource intensive or potentially wide scale operations which could be a symptom of a highly destructive and unplanned operation in progress. The idea behind this change is to help guard the overall deployment to prevent an overall resource exhaustion situation, or prevent an attacker with valid credentials from putting an entire deployment into a potentially disastrous cleaning situation since ironic only otherwise limits concurrency based upon running tasks by conductor. Story: 2010007 Task: 45140 Change-Id: I642452cd480e7674ff720b65ca32bce59a4a834a	2022-09-20 06:47:38 -07:00
Julia Kreger	c3f397149a	Auto-populate lessee for deployments Adds a configuration option and capability to automatically record the lessee for a deployment based upon the original auth_token information provided in the request context. Additional token information is now shared through the context which is extended in the same fashion as most other projects saving request token information to their RequestContext, instead of triggering excess API calls in the background to Keystone to try and figure out requestor's information. Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23	2022-05-23 16:21:19 -07:00
Jacob Anders	c7a6c69f12	Follow up to Add support for verify steps This is a follow up to commit `b385d9ae5b` shortening log messages, removing unnecessary validations and fixing a typo. Change-Id: Iedb32b5e571c554e19c78c8b7ef9be05d1909242	2021-10-07 20:48:45 +10:00
Zuul	63e4174146	Merge "Add support for verify steps"	2021-10-06 11:42:53 +00:00
Jacob Anders	b385d9ae5b	Add support for verify steps This change adds support for verify steps in Ironic. Verify steps allow executing actions on transition from "verifying" to "manageable" state and can perform actions such as cleaning BMC job queue or resetting the BMC on supported platforms. Verify steps are similar to deploy and clean steps, just simpler. Story: 2009025 Task: 42751 Change-Id: Iee27199a0315b8609e629bac272998c28274802b	2021-09-30 20:46:17 +10:00
Dmitry Tantsur	db4b4c08d9	Clean up caches periodically Currently they're only cleaned up on demand, which can lead to unnecessary disk usage on deployments that are not actively used. Story: #2008909 Task: #42500 Change-Id: Id5b58d1d1b2bbd2988db7a08d4ccfe2166033147	2021-09-22 15:19:24 +02:00
Julia Kreger	d17749249c	Record node history and manage events in db * Adds periodic task to purge node_history entries based upon provided configuration. * Adds recording of node history entries for errors in the core conductor code. * Also changes the rescue abort behavior to remove the notice from being recorded as an error, as this is a likely bug in behavior for any process or service evaluating the node last_error field. * Makes use of a semi-free form event_type field to help provide some additional context into what is going on and why. For example if deployments are repeatedly failing, then perhaps it is a configuration issue, as opposed to a general failure. If a conductor has no resources, then the failure, in theory would point back to the conductor itself. Story: 2002980 Task: 42960 Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166	2021-09-10 14:47:27 -07:00
Jacob Anders	1523ae1ce4	Generic way to configure clean step priorities This change adds a generic method of configuring clean step priorities instead of making changes in Ironic code every time a new clean step is introduced. Change-Id: I56b9a878724d27af2ac05232a1680017de4d8df5 Story: 1618014	2021-03-31 14:11:49 +10:00
Zuul	436bf9cc7b	Merge "If the "[conductor]XXX_timeout" is less than 0，disable periodic task"	2020-06-05 12:05:23 +00:00
Zuul	4688d62fd4	Merge "Mark more configuration options as reloadable"	2020-05-19 16:56:52 +00:00
e	f464e78efe	If the "[conductor]XXX_timeout" is less than 0，disable periodic task Setting "[conductor]clean_callback_timeout", "[conductor]inspect_wait_timeout" or "[conductor]inspect_wait_timeout" to a negative value will cause an error on start up from now on. Change-Id: Id3bef9a753be7f0c468ea3033698f0e9cd276a64 Story: 2007600 Task: 39576	2020-05-11 01:08:07 +00:00
Dmitry Tantsur	50c81cdbc9	Mark more configuration options as reloadable I have only checked the main configuration options and two generic drivers, leaving everything else untouched. Added are options that are possible and make sense to reload. Change-Id: I74c629bcaf50da7f829f0ec8c526d936b9d40b36	2020-05-06 16:22:51 +02:00
Kaifeng Wang	3e6dfdb3b9	Remove [conductor]api_url It was deprecated long before. Change-Id: I05d8a90dbf6e92ef230b1a9624c6816fc96c6a7f	2020-05-01 07:39:21 +00:00
Julia Kreger	fcaefdbe74	Hash the rescue_password In order to provide increased security, it is necessary to hash the rescue password in advance of it being stored into the database and to provide some sort of control for hash strength. This change IS incompatible with prior IPA versions with regard to use of the rescue feature, but I fully expect we will backport the change to IPA on to stable branches and perform a release as it is a security improvement. Change-Id: I1e118467a536229de6f7c245c1c48f0af38dcef2 Story: 2006777 Task: 27301	2020-03-24 20:11:43 +00:00
Julia Kreger	4383303dfb	Extend power sync timeout for Ericsson SDI Ericsson SDI uses cached power state data in what is returned to the API user, and typically this takes longer than 30 seconds to provide an updated value to the API user. As such, if we do not extend the timeout value, such users of Ironic's Redfish API interface will have deployments fail as the power state timeout will be encountered. Change-Id: I0aa5131504b60b13d43c73c9a3be1f50f7855cbc	2020-03-11 17:19:49 -07:00
Dmitry Tantsur	fcb793682d	Add an option to abort cleaning and deployment if node is in maintenance We don't prevent cleaning to happen for nodes in maintenance mode. However, cleaning cannot succeed in this case, as we disable processing heartbeats. This change adds a new configuration option that will cause such node to enter CLEAN FAIL on the first heartbeat. The same is done for deployment and automated cleaning during providing. Finally, elevate the log level for such heartbeats from debug to warning, as it may be a sign of a problem (especially if the new option is off). Change-Id: I9f3ee44f39c448eb2609c5989acd36e7da844ef4 Story: #1563644 Task: #9171	2019-09-17 13:04:40 +02:00
Ilya Etingof	9fab96fc37	Add Redfish Virtual Media Boot support This patch introduces standard Redfish virtual media boot support to ironic. The patch implements basic boot interface features along with devstack plugin support for virtual media boot. Functionally, redfish boot interface supports the same set of features as PXE. Unlike other virtual media boot implementations (e.g. iLo), this patch does not require user-built deploy/rescue/boot ISO images for virtual media boot. Instead, ironic will build necessary images out of common kernel/ramdisk pair (though user needs to provide ESP image). Story: 1526753 Task: 10389 Co-Authored-By: Shivanand Tendulker <stendulker@gmail.com> Change-Id: I0db0a64c5ccf260f5a0695dbe994af1e11f71517	2019-08-14 14:19:03 +02:00
Dmitry Tantsur	f06240f7dd	Allow configuring global deploy and rescue kernel/ramdisk The devstack plugin was updated to configure basic ops before ironic starts, so that we can put links to deploy images in the ironic.conf. Change-Id: I305fc3712b1ac0cf2fe64569729e236c7b614bb4 Story: #2006175 Task: #35699	2019-08-06 15:31:19 +02:00
Dmitry Tantsur	db7d9bb1f0	Trivial: correct configuration option copy-pasted from inspector Change-Id: Iec065e54f0ca50515180fb5f2051380bde329ab0	2019-05-29 08:36:24 +02:00
Dmitry Tantsur	c36a01a439	Publish baremetal endpoint via mdns This change adds an option to publish the endpoint via mDNS on start up and clean it up on tear down. Story: #2005393 Task: #30383 Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda	2019-05-23 17:11:50 +02:00
Zuul	a03e55f2d4	Merge "Do not try to create temporary URLs with zero lifetime"	2019-04-17 15:38:29 +00:00
Dmitry Tantsur	2039138cfe	Do not try to create temporary URLs with zero lifetime A new option is introduced for that, which defaults to the deploy timeout if it is set and to 1800 seconds if it is not. Change-Id: I10e02919e40d25bd4411f2b6f98f9317d1cfb187 Story: #1653112 Task: #9707	2019-04-15 14:27:36 +02:00
Julia Kreger	68ba345520	Make it possible to send sensor data for all nodes Presently the data collection defaults to only permit sensor data to be collected and transmitted as notifications for instances deployed via nova, however standalone operators or general data center operators may find the sensor data useful to identify undeployed failing hardware and overall check the hardware health. Adds a boolean to control the filter being set for a deployed node. Change-Id: I345f6e3a9f47d8d09ea488d64927fd0c5fb7dfc7	2019-04-02 16:00:22 -07:00
Kaifeng Wang	7e9ff1d6bf	Follow up to available node protection Turns [deploy]allow_deleting_available_nodes to a mutable option, so that it can be changed without service restart. Change-Id: Ia6d51994441ec7367bc2eba76c47d5f3c425a837 Story: 2005060 Task: 29604	2019-03-13 09:44:21 +08:00
Zuul	5d25189b13	Merge "Add option to protect available nodes from accidental deletion"	2019-03-05 14:21:08 +00:00
Arne Wiebalck	885ddb4362	Add option to protect available nodes from accidental deletion Ironic allows deleting nodes which are in state 'available'. As bringing nodes into 'available' comes at an operational cost (i.e. enroll, inspect, clean, ...), this patch proposes a new option 'allow_deleting_available_nodes' to support the protection of available nodes against accidental removal. Change-Id: I08d31b5ddbad626811c971389e634a450aeaf066 Story: #2005060 Task: #29604	2019-02-27 20:50:17 +00:00
Dmitry Tantsur	6885c674cb	Allocation API: taking over allocations of offline conductors This change allows conductors to periodically check and take over allocations that were processed by conductors that went offline. Change-Id: Ia7b9b5bc485a66215def4a76c6682c47342b86d9 Story: #2004341 Task: #28474	2019-02-19 17:36:16 +01:00
Zuul	0516356ddb	Merge "Parallelize periodic power sync calls follow up"	2019-02-02 12:08:28 +00:00
Zuul	c9aa36b78c	Merge "Parallelize periodic power sync calls"	2019-02-02 11:57:17 +00:00
Ilya Etingof	821d5fef73	Parallelize periodic power sync calls follow up Fixes quite a few assorted nits. Follows up Change-Id I80297c877d9a87d3bd8fc30d0ed65cd443f200b3 Change-Id: I92faffb4fac349a77e3597f3668c75bbc2c2397d	2019-01-23 16:06:30 +01:00
Ilya Etingof	7448603ab8	Parallelize periodic power sync calls Node power sync is performed from a periodic task. In that task all nodes are iterated over and power sync call is performed. While the power sync call itself is non-blocking relative to other concurrent I/O tasks, iteration over the nodes seems sequential meaning that nodes power sync is performed one node at a time. If the above observation holds, large-scale settings may never be able to power sync all their nodes properly, throttling at walking all the active nodes in 60 second period. This patch distributes power sync calls over a bunch of green threads each working on a portion of the nodes to be taken care of. Change-Id: I80297c877d9a87d3bd8fc30d0ed65cd443f200b3	2019-01-22 18:29:45 +01:00
Kaifeng Wang	cc73bb21fd	Limit the timeout value of heartbeat_timeout Setting the heartbeat_timeout value too high will cause OverflowError, which affects places where the delta value is used to check the online status of a conductor. This involves get_offline_conductors, _filter_active_conductors, and descendant callers like the _check_orphan_nodes periodic task, /v1/drivers endpoint, etc. Limit the max value to 10 years as [1] does. [1] https://review.openstack.org/#/c/631538 Story: 2004807 Task: 29021 Change-Id: I2b47fa6747e2f97c6910be708c328bed9daba455	2019-01-22 13:57:30 +08:00
Zuul	c80e912b3a	Merge "Disable periodic tasks if interval set to 0"	2018-10-02 23:46:09 +00:00
Kaifeng Wang	3907f8af4f	Remove inspecting state support from inspect_hardware Returning INSPECTING state from InspectInterface.inspect_hardware was deprecated and removed in this patch. This also removed the deprecated configuration option [conductor]inspect_timeout. Change-Id: I636e11a80451aa3a44d7f4b30295257d57028c34 Story: #1725211 Task: #26177	2018-09-18 08:58:22 +08:00
Ruby Loo	f39aae0c95	Disable periodic tasks if interval set to 0 For periodic tasks that are specified with the decorator @periodics.periodic(), a ValueError exception was raised if a value <= 0 was specified for any of the spacing values (taken from configuration options). Specifying a value <= 0 used to work, to disable the task altogether. It broke when we switched to using the futurist package (some time in the Mitaka cycle). This fixes it so that setting these configuration options to 0 (or a negative value) will disable the periodic tasks: - [conductor]sync_power_state_interval: sync power states for the nodes - [conductor]check_provision_state_interval: - check deployments and time out if the deployment takes too long - check the status of cleaning a node and time out if it takes too long - check the status of inspecting a node and time out if it takes too long - check for and handle nodes that are taken over by new conductors (if an old conductor disappeared) - [conductor]send_sensor_data_interval: send sensor data to ceilometer - [conductor]sync_local_state_interval: refresh a conductor's copy of the consistent hash ring. If any mappings have changed, determines which, if any, nodes need to be "taken over". The ensuing actions could include preparing a PXE environment, updating the DHCP server, and so on. - [oneview]periodic_check_interval: - check for nodes taken over by OneView users - check for nodes freed by OneView users Change-Id: I62708e239295344d0dcf0bff7dd68ec8c34ab9a0 Story: #2002059 Task: #19708	2018-08-14 01:19:02 +00:00
Jim Rollenhagen	7929361a0b	Add conductor_group field to config, node and conductor objects Adds the fields and bumps the objects versions. Excludes the field from the node API for now. Also adds the conductor_group config option, and populates the field in the conductors table. Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api. Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6 Story: 2001795 Task: 22640 Task: 22642	2018-07-18 21:50:29 +00:00
Kaifeng Wang	0a1b165ba5	Power fault recovery: apply fault This patch implements setting and using the fault field. For each case currently maintenance is set to True, the fault is set accordingly. A periodic task is added to check power state for nodes in maintenance due to power failure, maintenance is cleared if the power state of a node can be retrieved. When a node is taken out of maintenance by user, the fault is cleared (if there is any). Story: #1596107 Task: #10469 Change-Id: Ic4ab20af9022a2d06bdac567e7a098f3ba08570a Partial-Bug: #1596107	2018-05-27 23:28:39 +08:00
Kaifeng Wang	6df82ee2bc	Implementation of inspect wait state This patch implements the inspect wait state feature. Changes covered in this patch: * Added state and transitions, state diagram regenerated. * inspector and oneview inspect interface now return INSPECTWAIT instead of INSPECTING. Move node to inspect wait if inspect interface returns INSPECTING or INSPECTWAIT. * Add a timeout option to conductor, and a periodic task to check timeout in the inspect wait state. Story: #1725211 Task: #10630 Partial-Bug: #1725211 Change-Id: Ie76bfdad5966014a4dae826919ff5705462c743b	2018-04-10 11:21:46 +08:00
Shivanand Tendulker	b9b4a55a41	Update description for config params of 'rescue' interface As ``rescue`` API implementation has merged, we can let users use configuration parameters related to this feature. This patch fixes the help messages for the same. Change-Id: Idab95302011c3bb3f1db560a6a3f9481371e7671 Partial-bug: #1526449	2018-01-27 09:20:04 +00:00
Jay Faulkner	a9bc2e6ddf	Add rescuewait timeout periodic task Ensure nodes don't get stuck in rescuewait forever when a rescue ramdisk fails to boot and start heartbeating. Change-Id: I15a92c0f619505e25768dc2fbc1b2a796f0b38fa Related-bug: #1526449 Co-Authored-By: Jay Faulkner <jay@jvf.cc> Co-Authored-By: Mario Villaplana <mario.villaplana@gmail.com> Co-Authored-By: Jesse J. Cook <jesse.j.cook@member.fsf.org> Co-Authored-By: Aparna <aparnavtce@gmail.com> Co-Authored-By: Shivanand Tendulker <stendulker@gmail.com>	2018-01-22 11:38:36 -05:00
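
The reserved workers pool described in commit adec0f6f01 (one executor split into two, with the second used only by ordinary worker spawns once the first is exhausted) might look roughly like the sketch below. This is an illustration under stated assumptions, not Ironic's code: it uses the standard library instead of futurist, and `WorkerPools`, `_BoundedPool`, `NoFreeWorker`, `allow_reserved` and the rounding of the 5% share are all hypothetical names and choices. In line with commit 224cdd726c, a full pool rejects work rather than queueing it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NoFreeWorker(Exception):
    """Raised when no pool can accept the task (hypothetical name)."""


class _BoundedPool:
    """Thread pool that rejects work instead of queueing it."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)
        self._executor = ThreadPoolExecutor(max_workers=size)

    def try_submit(self, func, *args):
        # Non-blocking acquire: fail fast when every thread is busy.
        if not self._slots.acquire(blocking=False):
            return None
        try:
            future = self._executor.submit(func, *args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the task finishes, whatever its outcome.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._executor.shutdown(wait=True)


class WorkerPools:
    """Main pool plus a small reserve (split arithmetic is an assumption)."""

    def __init__(self, total=300, reserved_percent=5):
        reserved = max(1, total * reserved_percent // 100)
        self.main = _BoundedPool(total - reserved)
        self.reserved = _BoundedPool(reserved)

    def spawn(self, func, *args, allow_reserved=False):
        # Everything tries the main pool first.
        future = self.main.try_submit(func, *args)
        # Only callers that opt in (e.g. API-triggered work) may fall
        # back to the reserve; heartbeats and periodic tasks may not.
        if future is None and allow_reserved:
            future = self.reserved.try_submit(func, *args)
        if future is None:
            raise NoFreeWorker("no free conductor worker")
        return future

    def shutdown(self):
        self.main.shutdown()
        self.reserved.shutdown()
```

With such a split, a flood of heartbeats can fill the main pool, yet a call like `pools.spawn(power_off, node, allow_reserved=True)` (where `power_off` and `node` are placeholders) still finds a thread, which is what lets an operator apply a remediation.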

59 Commits