From d9cf8fefe6d1703d2222a78a81bfb711fff2e226 Mon Sep 17 00:00:00 2001 From: Ruby Loo Date: Tue, 15 May 2018 11:46:53 -0400 Subject: [PATCH] Use node.fault field for power fault recovery work This updates the spec to use a node.fault field as opposed to a node.maintenance_type field, to keep track of any fault that is detected by ironic. (ironic will still put the node in maintenance mode upon detection of a fault.) A few other changes were made to the spec to clarify things. Story: #1596107 Change-Id: I675c46e068654019073f5cb3bf4c290291ba405f --- specs/approved/power-fault-recovery.rst | 116 +++++++++++++++--------- 1 file changed, 72 insertions(+), 44 deletions(-) diff --git a/specs/approved/power-fault-recovery.rst b/specs/approved/power-fault-recovery.rst index 46052535..80ede087 100644 --- a/specs/approved/power-fault-recovery.rst +++ b/specs/approved/power-fault-recovery.rst @@ -8,18 +8,17 @@ Power fault recovery ==================== -https://bugs.launchpad.net/ironic/+bug/1596107 - -This spec proposes adding a new node field to differentiate different -maintenance types. And, if possible, recover node from maintenance -state if the node is in maintenance due to power sync failure, and power -sync is succeed. +https://storyboard.openstack.org/#!/story/1596107 +This spec proposes adding a new node field to reflect any faults detected +by the system. For nodes put into maintenance mode due to power-sync +failures, this proposes a mechanims to try to recover the nodes from +the failure. Problem description =================== -Currently, ironic's API and data model makes no differentiation between ironic +Currently, ironic's API and data model do not differentiate between ironic setting maintenance on a node (for instance, when cleaning fails, or when the power status loop cannot contact the BMC) and an operator setting maintenance on a node (an explicit API action). @@ -29,47 +28,67 @@ This situation makes it difficult to apply any power fault recovery mechanism. Proposed change =============== -A string field named ``maintenance_type`` will be added to ironic data model, -it will be used to record different maintenance types. +A string field named ``fault`` will be added to ironic's data model for a Node. +It will be used to record any detected faults. -The field will be set to following values in different cases: +The field will be set to one of the following values, depending on the +situation (when ironic puts the node into maintenance due to one of these): -* ``manual``: when a node is explicitly put into maintenance from ironic API. -* ``power failure``: when a node is entering maintenance due to power sync - failure exceeded max retries. -* ``clean failure``: when a node is entering maintenance due to failure of a +* ``power failure``: when a node is put into maintenance due to power sync + failures that exceed max retries. +* ``clean failure``: when a node is put into maintenance due to failure of a cleaning operation. -* ``rescue abort failure``: when a node is entering maintenance due to failure +* ``rescue abort failure``: when a node is put into maintenance due to failure of cleaning up during rescue abort. +* None: there is no fault detected for the node; this is the default value. + Since the field is set to a fault when ironic puts the node into maintenance + due to a fault, it gets set to None when the node is taken out of + maintenance (by ironic or the user). -The field will be set to None if not in maintenance. + For nodes that were in maintenance prior to upgrading to this release, + the field will also be None, since we don't know for sure, if (or what) + there had been a fault. Even if there had been, we don't know whether the + operator wants the node in maintenance for other reasons as well. -Add a periodic task to conductor to filter out nodes in maintenance and -``maintenance_type`` is ``power failure``, trying to get power status, and -bring nodes out of maintenance once the power status is retrieved -successfully. +A periodic task will be added to the conductor. This will operate on nodes +with ``node.fault == power failure``. It will try to get the node's power +state. If successful, the node will be taken out of maintenance. -Add a new integer option named ``[conductor]power_failure_recovery_interval`` -to config how many seconds between each run of periodic task. Set to 0 will -disable power fault recovery, the default value is 300 (5 minutes). +An integer configuration option named +``[conductor]power_failure_recovery_interval`` will be added. This is the +interval (number of seconds) between each run of the periodic task. The +default value is 300 seconds (5 minutes). Setting it to 0 will disable +power fault recovery. Alternatives ------------ -Proposals have already been made and rejected to use the combination of +Proposals have already been made and rejected to use a combination of maintenance being set and a given maintenance_reason or last_error to do self-healing due to inconsistency. -Specific faults support [#]_ is proposed but it's too big and doesn't reach -a consensus. +Specific faults support [#]_ is proposed but it's too big and to-date, no +consensus has been reached on it. + +An earlier version of this spec had proposed a ``maintenance_type`` +field for the node, instead of a ``fault`` field. We prefer the +``fault`` field since it better reflects that it is about faults. +``maintenance_type`` caused the field to be tied in with the +existing ``maintenance`` and ``maintenance_reason`` fields. +Using ``fault`` makes it distinct from ``maintenance``; it will +make it easier to move forward in the future, should we decide to +provide more enhancements that are fault-related, or to separate +the notion and handling of faults from maintenance (which some +would argue should only be operator-initiated). Data model impact ----------------- -A string field named ``maintenance_type`` will be added to ironic node table. +A string field named ``fault`` will be added to ironic's node table. -The field will be added to ``Node`` object, object version will be increased. +The field will be added to ``Node`` object; object version will be +incremented. State Machine Impact -------------------- @@ -79,10 +98,10 @@ None REST API impact --------------- -The field ``maintenance_type`` will be added to ``Node`` API object, ironic +The field ``fault`` will be added to ``Node`` API object, ironic API version will be bumped. -Node related API endpoints will be affected: +Node-related API endpoints will be affected: * POST /v1/nodes * GET /v1/nodes @@ -90,11 +109,11 @@ Node related API endpoints will be affected: * GET /v1/nodes/{node_ident} * PATCH /v1/nodes/{node_ident} -For clients with older microversion, the ``maintenance_type`` is hidden from -the response, this will be done in the ``hide_fields_in_newer_versions``. +For requests with older microversion, ``node.fault`` is hidden from +the response. This will be done in the ``hide_fields_in_newer_versions()``. -Field ``maintenance_type`` is read only from API, and can only be modified -internally. Any POST/PATCH request to Node with ``maintenance_type`` set will +Field ``fault`` is read-only from the API, and can only be modified +internally. Any POST/PATCH request to Node with ``fault`` set will get 400 (Bad Request). There is no other impact to API behaviors. @@ -111,7 +130,7 @@ None "openstack baremetal" CLI ~~~~~~~~~~~~~~~~~~~~~~~~~ -The CLI will be updated to support field ``maintenance_type``, guarded +The CLI will be updated to support field ``fault``, guarded by ironic API microversion. RPC API impact @@ -127,7 +146,9 @@ None Nova driver impact ------------------ -None +None, although we should update the ironic-virt driver code to +also look at this new node.fault field, in addition to the +``node.maintenance`` field. Ramdisk impact -------------- @@ -185,13 +206,12 @@ Other contributors: Work Items ---------- -* Update db layer to include the ``maintenance_type`` field. -* Change places where nodes entering/leaving maintenance to set - ``maintenance_type`` accordingly. +* Update db layer to include the ``fault`` field. +* Change places where nodes enter/leave maintenance, to set + ``fault`` accordingly. * Add ``[conductor]power_failure_recovery_interval`` option to ironic configuration, add periodic task to handle power recovery. * API change. -* Handle maintenance data migration on db upgrade. Dependencies ============ @@ -211,15 +231,23 @@ Upgrades and Backwards Compatibility ironic API change is guarded by microversion. -dbsync and db api will be updated to facilitate populating the -``maintenance_type`` field to ``manual`` during data migration for nodes -previously been put under maintenance. +When upgrading, any nodes in maintenance will have ``node.fault`` set to None. +This is because there is no easy/guaranteed way to determine if a node had +been (previously) put into maintenance due to one of our internal faults. +Furthermore, even if we could determine whether it was due to a fault, we don't +know whether the operator wants to keep the node in maintenance for other +reasons besides the fault. + +Should the operator want to take advantage of the power fault recovery +mechanism, they could take the nodes out of maintenance. If there are still +issues, ironic will do its thing -- detect the fault and try to recover. + Documentation Impact ==================== * Update api-reference. -* New option will generated by config generator. +* New option will be generated by config generator. References ==========