I've seen a situation where heartbeats managed to completely saturate
the conductor workers, so that no API requests could come through that
required interaction with the conductor (i.e. everything other than
reads). Add periodic tasks for a large (thousands) number of nodes, and
you get a completely locked up Ironic.
This change reserves 5% (configurable) of the threads for API requests.
This is done by splitting one executor into two, of which the latter is
only used by normal _spawn_worker calls and only when the former is
exhausted. This allows an operator to apply a remediation, e.g. abort
some deployments or outright power off some nodes.
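A minimal sketch of the split described above, using only the standard
library rather than the executor Ironic actually uses. The class and
method names (BoundedPool, SplitExecutor, spawn_periodic, spawn_worker)
are assumptions made for illustration:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NoFreeWorker(Exception):
    """Raised when no pool can accept the task right now."""


class BoundedPool:
    """Thread pool that refuses new work once all slots are taken."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)
        self._executor = ThreadPoolExecutor(max_workers=size)

    def try_submit(self, fn, *args):
        # Non-blocking acquire: return None instead of queueing the task.
        if not self._slots.acquire(blocking=False):
            return None
        try:
            future = self._executor.submit(fn, *args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the task has finished.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._executor.shutdown(wait=True)


class SplitExecutor:
    """One main pool plus a small pool reserved for API-driven work."""

    def __init__(self, total, reserved_percent=5):
        reserved = max(1, total * reserved_percent // 100)
        self._main = BoundedPool(total - reserved)
        self._reserved = BoundedPool(reserved)

    def spawn_periodic(self, fn, *args):
        # Periodic tasks and heartbeats may only use the main pool.
        future = self._main.try_submit(fn, *args)
        if future is None:
            raise NoFreeWorker()
        return future

    def spawn_worker(self, fn, *args):
        # API requests fall back to the reserved pool when main is full.
        future = self._main.try_submit(fn, *args)
        if future is None:
            future = self._reserved.try_submit(fn, *args)
        if future is None:
            raise NoFreeWorker()
        return future

    def shutdown(self):
        self._main.shutdown()
        self._reserved.shutdown()
```

With total=20 and the default 5%, one thread stays available to
spawn_worker even when the other 19 are busy with periodic work.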
Partial-Bug: #2038438
Change-Id: Iacc62d33ffccfc11694167ee2a7bc6aad82c1f2f
First, the *_by_arch options are not a replacement for plain options:
the cpu_arch property is neither required nor standardized. This is why
older options with *_by_arch equivalents are not deprecated.
Second, the example in the documentation is wrong: oslo.config does not
use Python dictionaries. This makes me suspect that the feature has
never been properly tested (indeed, it's not used in the devstack CI,
and Bifrost uses the older options).
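For illustration, dictionary-typed oslo.config values are written as
comma-separated key:value pairs rather than as a Python dict literal.
The sketch below is a simplified stand-in for that parsing, not
oslo.config's own code; parse_by_arch and the example values are
hypothetical:

```python
def parse_by_arch(raw):
    """Parse 'arch:value,arch:value' into a dict (simplified sketch)."""
    result = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first colon only, so values may contain colons.
        arch, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected 'arch:value', got {pair!r}")
        result[arch.strip()] = value.strip()
    return result


# Hypothetical option value as it would appear in the config file.
ramdisks = parse_by_arch("x86_64:ramdisk-x86,aarch64:ramdisk-arm")
```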
Change-Id: If1e633930909ce9d80e14f3ec3daa0bf8d48b7f0
Especially in a single-conductor environment, the number of threads
should be larger than max_concurrent_deploy, otherwise the latter cannot
be reached in practice or will cause issues with heartbeats.
On the other hand, this change fixes an issue with how we use futurist.
Due to a misunderstanding, we ended up setting the workers pool size to
100 and then also allowing 100 more requests to be queued.
To put it shortly, this change moves from 100 threads + 100 queued to
300 threads and no queue.
Partial-Bug: #2038438
Change-Id: I1aeeda89a8925fbbc2dae752742f0be4bc23bee0
Introduce config to allow setting default ramdisks per-architecture.
The hierarchy of the parameters is:
Node config > config by architecture > general config
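The lookup order can be sketched as follows. This is an illustration
of the hierarchy only; the function name, its parameters and the
deploy_ramdisk key are assumptions, not the code added by this change:

```python
def resolve_ramdisk(driver_info, cpu_arch, by_arch, default):
    """Pick a ramdisk: node config > config by architecture > general."""
    node_value = driver_info.get("deploy_ramdisk")
    if node_value:
        return node_value
    # cpu_arch may be missing on a node, so guard before the lookup.
    if cpu_arch and cpu_arch in by_arch:
        return by_arch[cpu_arch]
    return default
```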
Change-Id: I95dfece3e8f7bcd3121ac808985cb61997877a51
A huge list of initial work for service steps
* Adds service_step verb
* Adds service_step db/object/API field on the node object for the
status.
* Increments the API version to 1.87 for both changes.
* Increments the RPC API version to 1.57.
* Adds initial testing to help ensure that supplied steps
are passed through and executed.
Does not:
* Have tests for starting the agent ramdisk, although this is
relatively boilerplate.
* Have a collection of pre-decorated steps available for immediate
consumption.
Change-Id: I5b9dd928f24dff7877a4ab8dc7b743058cace994
Adds a wait step to allow for finer grained workflows
and forcing interruptions which may be needed in some
cases with specialized hardware.
Change-Id: Idc338b761ebe35a4635022a324ca5acbf29fc462
Allows steps to be executed on child nodes, and adds
the reserved power_on, power_off, and reboot step names.
Change-Id: I4673214d2ed066aa8b95a35513b144668ade3e2b
We have seen duplicate IP issues when leaving clean failed nodes
powered on. This patch allows operators to power down nodes that
enter clean failed state.
Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
This change adds the capability for the ironic-conductor
and standalone service process to transmit timer and counter
metrics to the message bus notifier, where they may be consumed by
ceilometer, ironic-prometheus-exporter, or any other consumer of
metrics event data on the message bus.
This functionality is not presently supported on dedicated API
services such as those running as an ``ironic-api`` application
process, or Ironic WSGI application. This is due to the lack of
an internal trigger mechanism to transmit the data in a metrics
update to the message bus and/or notifier plugin.
This change requires ironic-lib 5.4.0 to collect and ship metrics via
the message bus.
Depends-On: https://review.opendev.org/c/openstack/ironic-lib/+/865311
Change-Id: If6941f970241a22d96e06d88365f76edc4683364
Provide the ability to limit resource intensive or potentially
wide scale operations which could be a symptom of a highly
destructive and unplanned operation in progress.
The idea behind this change is to help guard the overall deployment
against resource exhaustion, and to prevent an attacker with valid
credentials from putting an entire deployment into a potentially
disastrous cleaning situation, since ironic otherwise only limits
concurrency based upon running tasks by conductor.
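A rough sketch of such a guard, assuming the caller already has the
provision states of all nodes at hand. The exception, the function and
the set of states counted are illustrative assumptions, not the code
in this change:

```python
class ConcurrentActionLimit(Exception):
    """Raised when too many nodes are already in the guarded states."""


# Provision states counted as "cleaning in progress" for this sketch.
CLEANING_STATES = frozenset({"cleaning", "clean wait"})


def check_concurrency_limit(provision_states, guarded_states, limit):
    """Refuse a new operation once `limit` nodes are already in flight."""
    in_progress = sum(1 for s in provision_states if s in guarded_states)
    if in_progress >= limit:
        raise ConcurrentActionLimit(
            f"{in_progress} nodes already in progress, limit is {limit}")
    return in_progress
```

The check counts nodes deployment-wide, so it complements the existing
per-conductor limit on running tasks rather than replacing it.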
Story: 2010007
Task: 45140
Change-Id: I642452cd480e7674ff720b65ca32bce59a4a834a
Adds a configuration option and capability to automatically
record the lessee for a deployment based upon the original
auth_token information provided in the request context.
Additional token information is now shared through the context
which is extended in the same fashion as most other projects
saving request token information to their RequestContext,
instead of triggering excess API calls in the background to
Keystone to try and figure out the requestor's information.
Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23
This is a follow up to commit b385d9ae5b
shortening log messages, removing unnecessary validations and fixing
a typo.
Change-Id: Iedb32b5e571c554e19c78c8b7ef9be05d1909242
This change adds support for verify steps in Ironic. Verify steps
allow executing actions on transition from "verifying" to "manageable"
state and can perform actions such as cleaning BMC job queue or
resetting the BMC on supported platforms. Verify steps are similar
to deploy and clean steps, just simpler.
Story: 2009025
Task: 42751
Change-Id: Iee27199a0315b8609e629bac272998c28274802b
Currently they're only cleaned up on demand, which can lead to
unnecessary disk usage on deployments that are not actively used.
Story: #2008909
Task: #42500
Change-Id: Id5b58d1d1b2bbd2988db7a08d4ccfe2166033147
* Adds periodic task to purge node_history entries based upon
provided configuration.
* Adds recording of node history entries for errors in the
core conductor code.
* Also changes the rescue abort behavior to remove the notice
from being recorded as an error, as this is a likely bug in
behavior for any process or service evaluating the node
last_error field.
* Makes use of a semi-free form event_type field to help
provide some additional context into what is going on and
why. For example if deployments are repeatedly failing,
then perhaps it is a configuration issue, as opposed to
a general failure. If a conductor has no resources, then
the failure, in theory would point back to the conductor
itself.
Story: 2002980
Task: 42960
Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
This change adds a generic method of configuring clean step
priorities instead of making changes in Ironic code every time a new
clean step is introduced.
Change-Id: I56b9a878724d27af2ac05232a1680017de4d8df5
Story: 1618014
"[conductor]clean_callback_timeout","[conductor]inspect_wait_timeout"
will cause an error on start up from now on if set to a negative value.
Change-Id: Id3bef9a753be7f0c468ea3033698f0e9cd276a64
Story: 2007600
Task: 39576
I have only checked the main configuration options and two generic
drivers, leaving everything else untouched. The options added are
those that are possible and make sense to reload.
Change-Id: I74c629bcaf50da7f829f0ec8c526d936b9d40b36
In order to provide increased security, it is necessary
to hash the rescue password in advance of it being stored
into the database and to provide some sort of control for
hash strength.
This change IS incompatible with prior IPA versions with
regard to use of the rescue feature, but I fully expect
we will backport the change to IPA on to stable branches
and perform a release as it is a security improvement.
Change-Id: I1e118467a536229de6f7c245c1c48f0af38dcef2
Story: 2006777
Task: 27301
Ericsson SDI uses cached power state data in what is returned
to the API user, and typically this takes longer than 30 seconds
to provide an updated value to the API user.
As such, if we do not extend the timeout value, such users of
Ironic's Redfish API interface will have deployments fail as
the power state timeout will be encountered.
Change-Id: I0aa5131504b60b13d43c73c9a3be1f50f7855cbc
We don't prevent cleaning from happening for nodes in maintenance mode.
However, cleaning cannot succeed in this case, as we disable processing
heartbeats. This change adds a new configuration option that will
cause such a node to enter CLEAN FAIL on the first heartbeat.
The same is done for deployment and automated cleaning during providing.
Finally, elevate the log level for such heartbeats from debug to warning,
as it may be a sign of a problem (especially if the new option is off).
Change-Id: I9f3ee44f39c448eb2609c5989acd36e7da844ef4
Story: #1563644
Task: #9171
This patch introduces standard Redfish virtual media boot
support to ironic.
The patch implements basic boot interface features along with
devstack plugin support for virtual media boot. Functionally,
redfish boot interface supports the same set of features as PXE.
Unlike other virtual media boot implementations (e.g. iLo), this
patch does not require user-built deploy/rescue/boot ISO images
for virtual media boot. Instead, ironic will build necessary images
out of common kernel/ramdisk pair (though user needs to provide
ESP image).
Story: 1526753
Task: 10389
Co-Authored-By: Shivanand Tendulker <stendulker@gmail.com>
Change-Id: I0db0a64c5ccf260f5a0695dbe994af1e11f71517
The devstack plugin was updated to configure basic ops before
ironic starts, so that we can put links to deploy images
in the ironic.conf.
Change-Id: I305fc3712b1ac0cf2fe64569729e236c7b614bb4
Story: #2006175
Task: #35699
This change adds an option to publish the endpoint via mDNS on start
up and clean it up on tear down.
Story: #2005393
Task: #30383
Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda
A new option is introduced for that, which defaults to the deploy timeout
if it is set and to 1800 seconds if it is not.
Change-Id: I10e02919e40d25bd4411f2b6f98f9317d1cfb187
Story: #1653112
Task: #9707
Presently the data collection defaults to only permit
sensor data to be collected and transmitted as notifications
for instances deployed via nova, however standalone operators
or general data center operators may find the sensor data
useful to identify undeployed failing hardware and overall
check the hardware health.
Adds a boolean to control the filter being set for a deployed
node.
Change-Id: I345f6e3a9f47d8d09ea488d64927fd0c5fb7dfc7
Turns [deploy]allow_deleting_available_nodes into a mutable option,
so that it can be changed without service restart.
Change-Id: Ia6d51994441ec7367bc2eba76c47d5f3c425a837
Story: 2005060
Task: 29604
Ironic allows deleting nodes which are in state 'available'.
As bringing nodes into 'available' comes at an operational cost
(i.e. enroll, inspect, clean, ...), this patch proposes a new
option 'allow_deleting_available_nodes' to support the protection
of available nodes against accidental removal.
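As a sketch, the protection amounts to one extra check before a node
is deleted. Only the check described here is modelled; the exception
and function names are hypothetical and other deletion rules are
ignored:

```python
class InvalidState(Exception):
    """Raised when a node may not be deleted in its current state."""


def ensure_deletable(provision_state, allow_deleting_available_nodes=True):
    """Reject deletion of 'available' nodes when the option is off."""
    if provision_state == "available" and not allow_deleting_available_nodes:
        raise InvalidState("deleting nodes in state 'available' is disabled")
```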
Change-Id: I08d31b5ddbad626811c971389e634a450aeaf066
Story: #2005060
Task: #29604
This change allows conductors to periodically check and take over
allocations that were processed by conductors that went offline.
Change-Id: Ia7b9b5bc485a66215def4a76c6682c47342b86d9
Story: #2004341
Task: #28474
Node power sync is performed from a periodic task. In that task
all nodes are iterated over and a power sync call is performed.
While the power sync call itself is non-blocking relative to
other concurrent I/O tasks, iteration over the nodes seems
sequential, meaning that node power sync is performed one node
at a time.
If the above observation holds, large-scale settings may never
be able to power sync all their nodes properly, being throttled by
walking all the active nodes within a 60 second period.
This patch distributes power sync calls over a bunch of green
threads each working on a portion of the nodes to be taken care
of.
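The idea can be sketched with a shared queue drained by several
workers. This sketch uses ordinary OS threads from the standard
library instead of green threads, and sync_power_states, sync_one and
workers are names assumed for illustration:

```python
import queue
import threading


def sync_power_states(node_ids, sync_one, workers=8):
    """Run sync_one(node_id) for every node using several workers."""
    pending = queue.Queue()
    for node_id in node_ids:
        pending.put(node_id)

    failures = []
    failures_lock = threading.Lock()

    def drain():
        # Each worker pulls nodes until the shared queue is empty.
        while True:
            try:
                node_id = pending.get_nowait()
            except queue.Empty:
                return
            try:
                sync_one(node_id)
            except Exception as exc:
                # One failing node must not stop the rest of the sweep.
                with failures_lock:
                    failures.append((node_id, exc))

    threads = [threading.Thread(target=drain) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failures
```

Pulling from a shared queue, rather than pre-splitting the node list,
keeps workers busy even when some nodes respond much slower than
others.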
Change-Id: I80297c877d9a87d3bd8fc30d0ed65cd443f200b3
Setting the heartbeat_timeout value too high will cause OverflowError,
which affects places where the delta value is used to check the
online status of a conductor. This involves get_offline_conductors,
_filter_active_conductors, and their callers, such as the
_check_orphan_nodes periodic task, the /v1/drivers endpoint, etc.
Limit the max value to 10 years as [1] does.
[1] https://review.openstack.org/#/c/631538
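To illustrate the failure mode: Python's datetime.timedelta raises
OverflowError once the value exceeds what it can represent, so an
unbounded timeout breaks any "now minus timeout" computation. The
sketch below bounds the value first; offline_cutoff and the exact
constant are assumptions for illustration, not the option definition
from this change:

```python
import datetime

# Roughly ten years in seconds; an illustrative upper bound.
MAX_HEARTBEAT_TIMEOUT = 10 * 365 * 24 * 60 * 60


def offline_cutoff(now, heartbeat_timeout):
    """Return the time before which a conductor counts as offline."""
    if not 0 < heartbeat_timeout <= MAX_HEARTBEAT_TIMEOUT:
        raise ValueError(
            f"heartbeat_timeout must be in 1..{MAX_HEARTBEAT_TIMEOUT}")
    # Safe: the bounded value cannot overflow timedelta.
    return now - datetime.timedelta(seconds=heartbeat_timeout)
```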
Story: 2004807
Task: 29021
Change-Id: I2b47fa6747e2f97c6910be708c328bed9daba455
Returning the INSPECTING state from InspectInterface.inspect_hardware
was deprecated and is removed in this patch.
This also removed the deprecated configuration option
[conductor]inspect_timeout.
Change-Id: I636e11a80451aa3a44d7f4b30295257d57028c34
Story: #1725211
Task: #26177
For periodic tasks that are specified with the decorator @periodics.periodic(),
a ValueError exception was raised if a value <= 0 was specified for any of the
spacing values (taken from configuration options).
Specifying a value <= 0 used to work, to disable the task altogether. It broke
when we switched to using the futurist package (some time in the Mitaka cycle).
This fixes it so that setting these configuration options to 0 (or a negative
value) will disable the periodic tasks:
- [conductor]sync_power_state_interval: sync power states for the nodes
- [conductor]check_provision_state_interval:
- check deployments and time out if the deployment takes too long
- check the status of cleaning a node and time out if it takes too long
- check the status of inspecting a node and time out if it takes too long
- check for and handle nodes that are taken over by new conductors (if an
old conductor disappeared)
- [conductor]send_sensor_data_interval: send sensor data to ceilometer
- [conductor]sync_local_state_interval: refresh a conductor's copy of the
consistent hash ring. If any mappings have changed, determines which,
if any, nodes need to be "taken over". The ensuing actions could include
preparing a PXE environment, updating the DHCP server, and so on.
- [oneview]periodic_check_interval:
- check for nodes taken over by OneView users
- check for nodes freed by OneView users
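The approach can be sketched with a stand-in decorator that validates
its spacing only when the task is enabled. The decorator below is a
hypothetical simplification written for this example, not futurist's
implementation, and the interval value is a sample:

```python
def periodic(spacing, enabled=True):
    """Hypothetical stand-in for a periodic-task decorator."""
    def decorator(func):
        if not enabled:
            # Disabled tasks are returned untouched and never scheduled.
            return func
        if spacing <= 0:
            raise ValueError("spacing must be greater than zero")
        func._periodic_spacing = spacing
        return func
    return decorator


def is_scheduled(func):
    """Report whether the decorator registered the function."""
    return hasattr(func, "_periodic_spacing")


# Sample configuration value: 0 (or a negative number) disables the task.
sync_power_state_interval = 0


@periodic(spacing=sync_power_state_interval,
          enabled=sync_power_state_interval > 0)
def sync_power_states():
    """Placeholder body for the periodic task."""
```

Deriving enabled from the same option keeps a single knob for the
operator: a positive value sets the interval, anything else turns the
task off.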
Change-Id: I62708e239295344d0dcf0bff7dd68ec8c34ab9a0
Story: #2002059
Task: #19708
Adds the fields and bumps the objects versions. Excludes the field from
the node API for now.
Also adds the conductor_group config option, and populates the field in
the conductors table.
Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api.
Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6
Story: 2001795
Task: 22640
Task: 22642
This patch implements setting and using the fault field.
For each case where maintenance is currently set to True, the fault is
set accordingly. A periodic task is added to check power state for nodes
in maintenance due to power failure; maintenance is cleared if the
power state of a node can be retrieved.
When a node is taken out of maintenance by user, the fault is
cleared (if there is any).
Story: #1596107
Task: #10469
Change-Id: Ic4ab20af9022a2d06bdac567e7a098f3ba08570a
Partial-Bug: #1596107
This patch implements the inspect wait state feature.
Changes covered in this patch:
* Added state and transitions, state diagram regenerated.
* The inspector and oneview inspect interfaces now return INSPECTWAIT instead of
INSPECTING. Move node to inspect wait if inspect interface returns
INSPECTING or INSPECTWAIT.
* Add a timeout option to conductor, and a periodic task to check timeout
in the inspect wait state.
Story: #1725211
Task: #10630
Partial-Bug: #1725211
Change-Id: Ie76bfdad5966014a4dae826919ff5705462c743b
As ``rescue`` API implementation has merged, we can let users use
configuration parameters related to this feature. This patch fixes
the help messages for the same.
Change-Id: Idab95302011c3bb3f1db560a6a3f9481371e7671
Partial-bug: #1526449
Ensure nodes don't get stuck in rescuewait forever when
a rescue ramdisk fails to boot and start heartbeating.
Change-Id: I15a92c0f619505e25768dc2fbc1b2a796f0b38fa
Related-bug: #1526449
Co-Authored-By: Jay Faulkner <jay@jvf.cc>
Co-Authored-By: Mario Villaplana <mario.villaplana@gmail.com>
Co-Authored-By: Jesse J. Cook <jesse.j.cook@member.fsf.org>
Co-Authored-By: Aparna <aparnavtce@gmail.com>
Co-Authored-By: Shivanand Tendulker <stendulker@gmail.com>