Merge "Add progress details for recovery workflow"
This commit is contained in:
commit
d4c06a0eeb
|
@ -0,0 +1,411 @@
|
||||||
|
..
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||||
|
License.
|
||||||
|
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||||
|
|
||||||
|
============================================
|
||||||
|
Add progress details for recovery workflows
|
||||||
|
============================================
|
||||||
|
|
||||||
|
https://blueprints.launchpad.net/masakari/+spec/progress-details-recovery-workflows
|
||||||
|
|
||||||
|
This blueprint proposes to have a feature that notifies events for recovery
|
||||||
|
workflows.
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
Currently, Masakari doesn't send any events during recovery operation request
|
||||||
|
received by Masakari monitor.
|
||||||
|
|
||||||
|
It would be useful to receive events at each stage of task of recovery
|
||||||
|
workflow along with completion status and progress details so that operator
|
||||||
|
will come to know about what's happening during execution.
|
||||||
|
|
||||||
|
Use Cases
|
||||||
|
---------
|
||||||
|
|
||||||
|
Operators will be able to know following things by detailed progress details
|
||||||
|
captured during each event of recovery:
|
||||||
|
|
||||||
|
* Beginning/End of each task of recovery flow
|
||||||
|
* Errors of failure of process recovery
|
||||||
|
* Progress details which will contain the details of each task
|
||||||
|
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
Masakari Recovery Workflow is a certain set of tasks executed to recover
|
||||||
|
from failure. Masakari supports three types of recovery failures:
|
||||||
|
|
||||||
|
* instance-failure
|
||||||
|
* process-failure
|
||||||
|
* host-failure
|
||||||
|
|
||||||
|
For each of these failures, Masakari executes a workflow to recover from
|
||||||
|
failure. Currently Masakari uses taskflow library to execute the workflow
|
||||||
|
which consists of recovery actions which are predefined and are executed
|
||||||
|
linearly. Proposing here to record these recovery actions with the help of
|
||||||
|
Taskflow persistence feature. Masakari will persist the flow so that it can be
|
||||||
|
resumed, restarted or rolled-back on engine failure.
|
||||||
|
|
||||||
|
Taskflow supports persistence of workflow which helps to persist each task
|
||||||
|
details in the database. For more details please refer `persistence-doc`_
|
||||||
|
|
||||||
|
Taskflow has below three tables where workflow/task details are getting
|
||||||
|
stored:
|
||||||
|
|
||||||
|
* logbooks
|
||||||
|
* flowdetails
|
||||||
|
* atomdetails
|
||||||
|
|
||||||
|
In particular, for each flow there is a corresponding flowdetails
|
||||||
|
record, and for each task there is a corresponding atomdetails record. These
|
||||||
|
form the basic level of information about how a flow will be persisted.
|
||||||
|
|
||||||
|
With the help of importing persistence package `taskflow_persistence`_ and by
|
||||||
|
accessing Masakari storage via masakari engine, able to import Taskflow tables
|
||||||
|
into Masakari. In taskflow library there is workflow, and each workflow has
|
||||||
|
task which has state and status. With the help of `notifier_method`_ will
|
||||||
|
update progress details for detailed execution flow for each task of recovery.
|
||||||
|
|
||||||
|
Saved recovery task details (failures, successes, intermediary results) going
|
||||||
|
to render on Horizon on tabular format which helps operators to understand
|
||||||
|
progress/status of recovery. Each flow execution details stored with scale
|
||||||
|
0 to 1, so that operator will able to get progress completion along with
|
||||||
|
detailed information of each task.
|
||||||
|
|
||||||
|
Explaining below the how actions/events that going to be recorded for
|
||||||
|
‘instance-failure recovery workflow’ along with progress details:
|
||||||
|
|
||||||
|
* Stop Instance Task: Below listed are possible events along with progress
|
||||||
|
details that will be recorded:
|
||||||
|
|
||||||
|
* Starting of Stop instance task::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 0.50,
|
||||||
|
"progress_data": "Started execution of StopInstanceTask <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Skipping recovery event if an instance is not HA_Enabled and
|
||||||
|
"process_all_instances" config option is also disabled::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Skipping recovery for instance <INSTANCE_UUID> as it is not Ha_Enabled"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Ignored recovery event if an instance VM state is either in 'paused',
|
||||||
|
'rescued'::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Ignoring recovery for instance <INSTANCE_UUID> as it is in paused/rescued state"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Stop instance event::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Finished execution of StopInstanceTask <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Failure event in case failed to stop instance::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Failed to stop instance <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Start Instance Task: Below listed are possible events along with progress
|
||||||
|
details that will be recorded:
|
||||||
|
|
||||||
|
* Start instance event::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 0.5,
|
||||||
|
"progress_data": "Started execution of StartInstanceTask <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Finish of Start instance event::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Finished execution of StartInstanceTask <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Failure event in case failed to start instance or if invalid state of it::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Failed to start instance <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Confirm Instance Active Task: Below listed are possible events along with
|
||||||
|
progress details that will be recorded:
|
||||||
|
|
||||||
|
* Start of Confirm instance event::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 0.5,
|
||||||
|
"progress_data": "Confirming instance <INSTANCE_UUID> is Active"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Finish of Confirm instance started event::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Confirmed instance <INSTANCE_UUID> is Active"
|
||||||
|
}
|
||||||
|
|
||||||
|
* Failure event in case failed to confirm instance::
|
||||||
|
|
||||||
|
"progress_details" = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_data": "Failed to confirm instance <INSTANCE_UUID>"
|
||||||
|
}
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
Events are emitted only when masakari engine starts processing received
|
||||||
|
notifications by executing recovery workflow.
|
||||||
|
|
||||||
|
Mentioning below the database entries that going to be recorded for
|
||||||
|
‘instance-failure recovery workflow’::
|
||||||
|
|
||||||
|
LogBook: 'instance_recovery'
|
||||||
|
- uuid = 68e86fda-25ba-4b1d-a9fc-d999bc1c796e
|
||||||
|
- created_at = 2019-01-08 08:15:21
|
||||||
|
- updated_at = 2019-01-08 08:15:21
|
||||||
|
- meta: {"notification_uuid": "9ca38361-eef9-4fca-a1fe-49ef0c7e23e8"}
|
||||||
|
FlowDetail: 'instance_recovery_engine'
|
||||||
|
- uuid = 6a780ae7-9c63-42d9-8510-aa020d7ee566
|
||||||
|
- state = SUCCESS
|
||||||
|
TaskDetail: 'StopInstanceTask'
|
||||||
|
- uuid = c165b8c2-5123-4489-99c1-97eafff72d24
|
||||||
|
- state = SUCCESS
|
||||||
|
- version = 1.0
|
||||||
|
- failure = False
|
||||||
|
- meta: {}
|
||||||
|
- results: <CONTEXT_DETAILS>
|
||||||
|
TaskDetail: 'StopInstanceTask'
|
||||||
|
- uuid = c165b8c2-5123-4489-99c1-97eafff72d24
|
||||||
|
- state = SUCCESS
|
||||||
|
- version = 1.0
|
||||||
|
- failure = False
|
||||||
|
- meta:
|
||||||
|
+ progress = 100.00%
|
||||||
|
+ progress_details = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_details": {
|
||||||
|
"at_progress": 1,
|
||||||
|
"details": {
|
||||||
|
"progress_details": [
|
||||||
|
"progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
- results: NULL
|
||||||
|
TaskDetail: 'StartInstanceTask'
|
||||||
|
- uuid = a4155556-fb5a-44f8-b8aa-ab8ecfe8f1ce
|
||||||
|
- state = SUCCESS
|
||||||
|
- version = 1.0
|
||||||
|
- failure = False
|
||||||
|
- meta:
|
||||||
|
+ progress = 100.00%
|
||||||
|
+ progress_details = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_details": {
|
||||||
|
"at_progress": 1,
|
||||||
|
"details": {
|
||||||
|
"progress_details": [
|
||||||
|
"progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
- results: NULL
|
||||||
|
TaskDetail: 'ConfirmInstanceActiveTask'
|
||||||
|
- uuid = 0ea82633-599b-422d-8fd2-df2057efb29d
|
||||||
|
- state = SUCCESS
|
||||||
|
- version = 1.0
|
||||||
|
- failure = False
|
||||||
|
- meta:
|
||||||
|
+ progress = 100.00%
|
||||||
|
+ progress_details = {
|
||||||
|
"progress": 1,
|
||||||
|
"progress_details": {
|
||||||
|
"at_progress": 1,
|
||||||
|
"details": {
|
||||||
|
"progress_details": [
|
||||||
|
"progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
- results: NULL
|
||||||
|
|
||||||
|
|
||||||
|
Mentioning below how the recorded data will be used to render task details
|
||||||
|
in tabular format for ‘instance-failure recovery workflow’ on Horizon::
|
||||||
|
|
||||||
|
* Stop Instance Task
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
Request ID Action Start Time Message
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b StopInstanceTask Jan 10 2019, 10:40 a.m Started execution of StopInstanceTask <INSTANCE_UUID>
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b StopInstanceTask Jan 10 2019, 10:41 a.m Finished execution of StopInstanceTask <INSTANCE_UUID>
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
|
||||||
|
* Start Instance Task
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
Request ID Action Start Time Message
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b StartInstanceTask Jan 10 2019, 10:41 a.m Starting instance <INSTANCE_UUID>
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b StartInstanceTask Jan 10 2019, 10:42 a.m Started instance <INSTANCE_UUID>
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
|
||||||
|
* Confirm Instance Active Task
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
Request ID Action Start Time Message
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b ConfirmInstanceActiveTask Jan 10 2019, 10:43 a.m Confirming instance is Active <INSTANCE_UUID>
|
||||||
|
req-679033b7-1755-4929-bf85-eb3bfaef7e0b ConfirmInstanceActiveTask Jan 10 2019, 10:43 a.m Confirmed instance is Active <INSTANCE_UUID>
|
||||||
|
============================================ ========================== ========================== ====================================================
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
Send Versioned notifications similar to the other OpenStack services for
|
||||||
|
recovery workflows.
|
||||||
|
|
||||||
|
Data model impact
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Below tables will get added into Masakari Database
|
||||||
|
|
||||||
|
* alembic_version
|
||||||
|
* logbooks
|
||||||
|
* flowdetails
|
||||||
|
* atomdetails
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
alembic_version here stores version information of taskflow database
|
||||||
|
version, not of Masakari database.
|
||||||
|
Masakaari database as of now is not under alembic control.
|
||||||
|
|
||||||
|
For example in case of ‘instance-failure recovery workflow’, data will be
|
||||||
|
stored in below columns
|
||||||
|
|
||||||
|
* logbooks: Parent table, one entry for each notification received.
|
||||||
|
* flowdetails: Child table for logbooks, one entry for each notification received.
|
||||||
|
* atomdetails: Child table for flowdetails, one entry for each task of recovery.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
Foreign key association is not there for taskflow persistence tables.
|
||||||
|
If we delete logbook entry, respective child entries also got deleted.
|
||||||
|
|
||||||
|
REST API impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
A new microversion will be created to add event details to GET
|
||||||
|
/notifications/<notification_uuid> API.
|
||||||
|
|
||||||
|
Security impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Notifications impact
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
Masakari recovery failure doesn't support event notification feature.
|
||||||
|
This spec will add this feature.
|
||||||
|
|
||||||
|
Other end user impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Performance Impact
|
||||||
|
------------------
|
||||||
|
|
||||||
|
There will be a slight performance impact due to the overhead for storing
|
||||||
|
events during processing of each recovery failure into database.
|
||||||
|
|
||||||
|
Other deployer impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
|
||||||
|
Developer impact
|
||||||
|
----------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Primary assignee:
|
||||||
|
|
||||||
|
* Jayashri Bidwe <Jayashri.Bidwe@nttdata.com>
|
||||||
|
* Vrushali Kamde <Vrushali.Kamde@nttdata.com>
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
* Fetch backend as Masakari backend for each taskflow
|
||||||
|
* Execute taskflow with all details at each task that required
|
||||||
|
* Populate meta with progress status
|
||||||
|
* Update the notification API for GET /notifications/<notification_uuid> in a
|
||||||
|
new microversion to pass the stored event related information of recovery
|
||||||
|
failure
|
||||||
|
* Update unit tests for code coverage
|
||||||
|
* Add documentation on how to use this feature at Horizon
|
||||||
|
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
|
||||||
|
Testing
|
||||||
|
=======
|
||||||
|
|
||||||
|
No need to write tempest tests as unit tests are sufficient to check
|
||||||
|
whether the events are sent or not for recovery operations.
|
||||||
|
|
||||||
|
|
||||||
|
Documentation Impact
|
||||||
|
====================
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
.. _`persistence-doc`: https://docs.openstack.org/taskflow/latest/user/persistence.html
|
||||||
|
.. _`taskflow_persistence`: https://github.com/openstack/taskflow/tree/master/taskflow/persistence
|
||||||
|
.. _`notifier_method`: https://github.com/openstack/taskflow/blob/master/taskflow/types/notifier.py#L186
|
||||||
|
|
||||||
|
|
||||||
|
History
|
||||||
|
=======
|
||||||
|
|
||||||
|
.. list-table:: Revisions
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Release Name
|
||||||
|
- Description
|
||||||
|
* - Stein
|
||||||
|
- Introduced
|
Loading…
Reference in New Issue