Use node.fault field for power fault recovery work
This updates the spec to use a node.fault field as opposed to a node.maintenance_type field, to keep track of any fault that is detected by ironic. (ironic will still put the node in maintenance mode upon detection of a fault.) A few other changes were made to the spec to clarify things. Story: #1596107 Change-Id: I675c46e068654019073f5cb3bf4c290291ba405f
This commit is contained in:
parent
39ce029c8f
commit
d9cf8fefe6
|
@ -8,18 +8,17 @@
|
|||
Power fault recovery
|
||||
====================
|
||||
|
||||
https://bugs.launchpad.net/ironic/+bug/1596107
|
||||
|
||||
This spec proposes adding a new node field to differentiate different
|
||||
maintenance types. And, if possible, recover node from maintenance
|
||||
state if the node is in maintenance due to power sync failure, and power
|
||||
sync is succeed.
|
||||
https://storyboard.openstack.org/#!/story/1596107
|
||||
|
||||
This spec proposes adding a new node field to reflect any faults detected
|
||||
by the system. For nodes put into maintenance mode due to power-sync
|
||||
failures, this proposes a mechanims to try to recover the nodes from
|
||||
the failure.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Currently, ironic's API and data model makes no differentiation between ironic
|
||||
Currently, ironic's API and data model do not differentiate between ironic
|
||||
setting maintenance on a node (for instance, when cleaning fails, or when the
|
||||
power status loop cannot contact the BMC) and an operator setting maintenance
|
||||
on a node (an explicit API action).
|
||||
|
@ -29,47 +28,67 @@ This situation makes it difficult to apply any power fault recovery mechanism.
|
|||
Proposed change
|
||||
===============
|
||||
|
||||
A string field named ``maintenance_type`` will be added to ironic data model,
|
||||
it will be used to record different maintenance types.
|
||||
A string field named ``fault`` will be added to ironic's data model for a Node.
|
||||
It will be used to record any detected faults.
|
||||
|
||||
The field will be set to following values in different cases:
|
||||
The field will be set to one of the following values, depending on the
|
||||
situation (when ironic puts the node into maintenance due to one of these):
|
||||
|
||||
* ``manual``: when a node is explicitly put into maintenance from ironic API.
|
||||
* ``power failure``: when a node is entering maintenance due to power sync
|
||||
failure exceeded max retries.
|
||||
* ``clean failure``: when a node is entering maintenance due to failure of a
|
||||
* ``power failure``: when a node is put into maintenance due to power sync
|
||||
failures that exceed max retries.
|
||||
* ``clean failure``: when a node is put into maintenance due to failure of a
|
||||
cleaning operation.
|
||||
* ``rescue abort failure``: when a node is entering maintenance due to failure
|
||||
* ``rescue abort failure``: when a node is put into maintenance due to failure
|
||||
of cleaning up during rescue abort.
|
||||
* None: there is no fault detected for the node; this is the default value.
|
||||
Since the field is set to a fault when ironic puts the node into maintenance
|
||||
due to a fault, it gets set to None when the node is taken out of
|
||||
maintenance (by ironic or the user).
|
||||
|
||||
The field will be set to None if not in maintenance.
|
||||
For nodes that were in maintenance prior to upgrading to this release,
|
||||
the field will also be None, since we don't know for sure, if (or what)
|
||||
there had been a fault. Even if there had been, we don't know whether the
|
||||
operator wants the node in maintenance for other reasons as well.
|
||||
|
||||
Add a periodic task to conductor to filter out nodes in maintenance and
|
||||
``maintenance_type`` is ``power failure``, trying to get power status, and
|
||||
bring nodes out of maintenance once the power status is retrieved
|
||||
successfully.
|
||||
A periodic task will be added to the conductor. This will operate on nodes
|
||||
with ``node.fault == power failure``. It will try to get the node's power
|
||||
state. If successful, the node will be taken out of maintenance.
|
||||
|
||||
Add a new integer option named ``[conductor]power_failure_recovery_interval``
|
||||
to config how many seconds between each run of periodic task. Set to 0 will
|
||||
disable power fault recovery, the default value is 300 (5 minutes).
|
||||
An integer configuration option named
|
||||
``[conductor]power_failure_recovery_interval`` will be added. This is the
|
||||
interval (number of seconds) between each run of the periodic task. The
|
||||
default value is 300 seconds (5 minutes). Setting it to 0 will disable
|
||||
power fault recovery.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Proposals have already been made and rejected to use the combination of
|
||||
Proposals have already been made and rejected to use a combination of
|
||||
maintenance being set and a given maintenance_reason or last_error to do
|
||||
self-healing due to inconsistency.
|
||||
|
||||
Specific faults support [#]_ is proposed but it's too big and doesn't reach
|
||||
a consensus.
|
||||
Specific faults support [#]_ is proposed but it's too big and to-date, no
|
||||
consensus has been reached on it.
|
||||
|
||||
An earlier version of this spec had proposed a ``maintenance_type``
|
||||
field for the node, instead of a ``fault`` field. We prefer the
|
||||
``fault`` field since it better reflects that it is about faults.
|
||||
``maintenance_type`` caused the field to be tied in with the
|
||||
existing ``maintenance`` and ``maintenance_reason`` fields.
|
||||
Using ``fault`` makes it distinct from ``maintenance``; it will
|
||||
make it easier to move forward in the future, should we decide to
|
||||
provide more enhancements that are fault-related, or to separate
|
||||
the notion and handling of faults from maintenance (which some
|
||||
would argue should only be operator-initiated).
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
A string field named ``maintenance_type`` will be added to ironic node table.
|
||||
A string field named ``fault`` will be added to ironic's node table.
|
||||
|
||||
The field will be added to ``Node`` object, object version will be increased.
|
||||
The field will be added to ``Node`` object; object version will be
|
||||
incremented.
|
||||
|
||||
State Machine Impact
|
||||
--------------------
|
||||
|
@ -79,10 +98,10 @@ None
|
|||
REST API impact
|
||||
---------------
|
||||
|
||||
The field ``maintenance_type`` will be added to ``Node`` API object, ironic
|
||||
The field ``fault`` will be added to ``Node`` API object, ironic
|
||||
API version will be bumped.
|
||||
|
||||
Node related API endpoints will be affected:
|
||||
Node-related API endpoints will be affected:
|
||||
|
||||
* POST /v1/nodes
|
||||
* GET /v1/nodes
|
||||
|
@ -90,11 +109,11 @@ Node related API endpoints will be affected:
|
|||
* GET /v1/nodes/{node_ident}
|
||||
* PATCH /v1/nodes/{node_ident}
|
||||
|
||||
For clients with older microversion, the ``maintenance_type`` is hidden from
|
||||
the response, this will be done in the ``hide_fields_in_newer_versions``.
|
||||
For requests with older microversion, ``node.fault`` is hidden from
|
||||
the response. This will be done in the ``hide_fields_in_newer_versions()``.
|
||||
|
||||
Field ``maintenance_type`` is read only from API, and can only be modified
|
||||
internally. Any POST/PATCH request to Node with ``maintenance_type`` set will
|
||||
Field ``fault`` is read-only from the API, and can only be modified
|
||||
internally. Any POST/PATCH request to Node with ``fault`` set will
|
||||
get 400 (Bad Request).
|
||||
|
||||
There is no other impact to API behaviors.
|
||||
|
@ -111,7 +130,7 @@ None
|
|||
"openstack baremetal" CLI
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The CLI will be updated to support field ``maintenance_type``, guarded
|
||||
The CLI will be updated to support field ``fault``, guarded
|
||||
by ironic API microversion.
|
||||
|
||||
RPC API impact
|
||||
|
@ -127,7 +146,9 @@ None
|
|||
Nova driver impact
|
||||
------------------
|
||||
|
||||
None
|
||||
None, although we should update the ironic-virt driver code to
|
||||
also look at this new node.fault field, in addition to the
|
||||
``node.maintenance`` field.
|
||||
|
||||
Ramdisk impact
|
||||
--------------
|
||||
|
@ -185,13 +206,12 @@ Other contributors:
|
|||
Work Items
|
||||
----------
|
||||
|
||||
* Update db layer to include the ``maintenance_type`` field.
|
||||
* Change places where nodes entering/leaving maintenance to set
|
||||
``maintenance_type`` accordingly.
|
||||
* Update db layer to include the ``fault`` field.
|
||||
* Change places where nodes enter/leave maintenance, to set
|
||||
``fault`` accordingly.
|
||||
* Add ``[conductor]power_failure_recovery_interval`` option to ironic
|
||||
configuration, add periodic task to handle power recovery.
|
||||
* API change.
|
||||
* Handle maintenance data migration on db upgrade.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
@ -211,15 +231,23 @@ Upgrades and Backwards Compatibility
|
|||
|
||||
ironic API change is guarded by microversion.
|
||||
|
||||
dbsync and db api will be updated to facilitate populating the
|
||||
``maintenance_type`` field to ``manual`` during data migration for nodes
|
||||
previously been put under maintenance.
|
||||
When upgrading, any nodes in maintenance will have ``node.fault`` set to None.
|
||||
This is because there is no easy/guaranteed way to determine if a node had
|
||||
been (previously) put into maintenance due to one of our internal faults.
|
||||
Furthermore, even if we could determine whether it was due to a fault, we don't
|
||||
know whether the operator wants to keep the node in maintenance for other
|
||||
reasons besides the fault.
|
||||
|
||||
Should the operator want to take advantage of the power fault recovery
|
||||
mechanism, they could take the nodes out of maintenance. If there are still
|
||||
issues, ironic will do its thing -- detect the fault and try to recover.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Update api-reference.
|
||||
* New option will generated by config generator.
|
||||
* New option will be generated by config generator.
|
||||
|
||||
References
|
||||
==========
|
||||
|
|
Loading…
Reference in New Issue