author: shilpa.devharakar <shilpa.devharakar@nttdata.com> 2019-01-04 16:15:36 +0000
committer: shilpa.devharakar <shilpa.devharakar@nttdata.com> 2019-01-28 14:06:26 +0000
commit: af589037e73f5435e3f3231050f2d236ca78c813
tree: 7330070e7320c0c2a082722b6834cd211388e193
parent: 2b1a6f4a982b2bf03844543729ed733fd4875dd6

Add progress details for recovery workflow

Taskflow supports persistence of task which helps to persist each task
details in the database. Using this functionality, Masakari will store task
details for recovery failures.

Change-Id: I4fe394f473a93aedc9e167bbde3dd196cfc89559
Implements: bp progress-details-recovery-workflows

Notes (review):
    Code-Review+2: Sampath Priyankara (samP) <sam47priya@gmail.com>
    Workflow+1: Sampath Priyankara (samP) <sam47priya@gmail.com>
    Verified+2: Zuul
    Submitted-by: Zuul
    Submitted-at: Tue, 26 Feb 2019 04:03:16 +0000
    Reviewed-on: https://review.openstack.org/632079
    Project: openstack/masakari-specs
    Branch: refs/heads/master

-rw-r--r-- specs/stein/approved/progress-details-for-recovery-workflows.rst 411
1 files changed, 411 insertions, 0 deletions
diff --git a/specs/stein/approved/progress-details-for-recovery-workflows.rst b/specs/stein/approved/progress-details-for-recovery-workflows.rst
new file mode 100644
index 0000000..d5569be
--- /dev/null
+++ b/specs/stein/approved/progress-details-for-recovery-workflows.rst
@@ -0,0 +1,411 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Add progress details for recovery workflows
============================================

https://blueprints.launchpad.net/masakari/+spec/progress-details-recovery-workflows

This blueprint proposes a feature that reports events for recovery
workflows.

Problem description
===================

Currently, Masakari does not emit any events while it processes a recovery
request received from a Masakari monitor.

It would be useful to receive an event at each stage of every task in a
recovery workflow, along with its completion status and progress details, so
that operators know what is happening during execution.

Use Cases
---------

From the progress details captured at each recovery event, operators will be
able to see:

* The beginning and end of each task in the recovery flow
* Errors that cause the recovery process to fail
* Progress details describing each task


Proposed change
===============

A Masakari recovery workflow is a set of tasks executed to recover from a
failure. Masakari handles three types of failure:

* instance-failure
* process-failure
* host-failure

For each of these failures, Masakari executes a workflow to recover from the
failure. Masakari currently uses the taskflow library to execute the
workflow, which consists of predefined recovery actions that are executed
linearly. This spec proposes recording these recovery actions with the
Taskflow persistence feature. Masakari will persist the flow so that it can
be resumed, restarted or rolled back on engine failure.

Taskflow supports workflow persistence, which stores the details of each task
in the database. For more details, please refer to `persistence-doc`_.

Taskflow stores workflow and task details in the following three tables:

* logbooks
* flowdetails
* atomdetails

In particular, for each flow there is a corresponding flowdetails
record, and for each task there is a corresponding atomdetails record. These
form the basic level of information about how a flow will be persisted.
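
The relationship between the three record types can be pictured with a small
sketch. It uses plain dataclasses rather than the real Taskflow models; the
class names, fields and the ``build_instance_recovery_logbook`` helper are
assumptions made for illustration only, not Taskflow's or Masakari's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AtomDetail:
    """One record per task (hypothetical stand-in for an atomdetails row)."""
    name: str
    state: str = "PENDING"
    meta: Dict[str, object] = field(default_factory=dict)


@dataclass
class FlowDetail:
    """One record per flow; owns the atom details of its tasks."""
    name: str
    state: str = "PENDING"
    atoms: List[AtomDetail] = field(default_factory=list)


@dataclass
class LogBook:
    """Top-level container grouping the flow details of one recovery."""
    name: str
    meta: Dict[str, object] = field(default_factory=dict)
    flows: List[FlowDetail] = field(default_factory=list)


def build_instance_recovery_logbook(notification_uuid: str) -> LogBook:
    """Build the record tree for one instance-failure recovery."""
    task_names = [
        "StopInstanceTask",
        "StartInstanceTask",
        "ConfirmInstanceActiveTask",
    ]
    flow = FlowDetail(
        name="instance_recovery_engine",
        atoms=[AtomDetail(name=name) for name in task_names],
    )
    return LogBook(
        name="instance_recovery",
        meta={"notification_uuid": notification_uuid},
        flows=[flow],
    )


book = build_instance_recovery_logbook("9ca38361-eef9-4fca-a1fe-49ef0c7e23e8")
```

In this sketch one logbook holds one flow detail, which in turn holds one atom
detail per recovery task.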

By importing the Taskflow persistence package (`taskflow_persistence`_) and
accessing Masakari storage through the Masakari engine, the Taskflow tables
can be created in the Masakari database. In the taskflow library, a workflow
consists of tasks, and each task has a state and a status. The
`notifier_method`_ will be used to update the progress details of each
recovery task as it executes.

The saved recovery task details (failures, successes, intermediary results)
will be rendered in Horizon in tabular format, which helps operators
understand the progress and status of a recovery. The execution details of
each flow are stored with progress on a scale of 0 to 1, so that operators
can see how far the recovery has progressed along with detailed information
about each task.
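
The notify-and-record idea can be sketched without Taskflow. In the example
below, ``MiniNotifier``, ``RecordingTask``, ``update_details`` and the
``update_progress`` event name are all hypothetical names chosen for this
sketch; the real implementation would rely on Taskflow's notifier and
persistence layer instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List

UPDATE_PROGRESS = "update_progress"  # event name assumed for this sketch

Callback = Callable[[str, dict], None]


class MiniNotifier:
    """Tiny stand-in for a notifier: maps event names to callbacks."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callback]] = defaultdict(list)

    def register(self, event: str, callback: Callback) -> None:
        self._listeners[event].append(callback)

    def notify(self, event: str, details: dict) -> None:
        for callback in self._listeners[event]:
            callback(event, details)


class RecordingTask:
    """Hypothetical task that reports its progress through a notifier."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.notifier = MiniNotifier()

    def update_details(self, progress_data: str, progress: float = 0.0) -> None:
        # Keep progress on the 0 to 1 scale described in the text.
        clamped = min(max(progress, 0.0), 1.0)
        self.notifier.notify(
            UPDATE_PROGRESS,
            {"progress": clamped, "progress_data": progress_data},
        )


# Stand-in for the "meta" column of the task's atomdetails record.
meta = {"progress": 0.0, "progress_details": []}


def save_progress(event: str, details: dict) -> None:
    """Listener that stores the latest progress and appends the event."""
    meta["progress"] = details["progress"]
    meta["progress_details"].append(details)


task = RecordingTask("StopInstanceTask")
task.notifier.register(UPDATE_PROGRESS, save_progress)
task.update_details("Started execution of StopInstanceTask <INSTANCE_UUID>", 0.5)
task.update_details("Finished execution of StopInstanceTask <INSTANCE_UUID>", 1.5)
```

The second call passes an out-of-range value on purpose to show that the
sketch clamps progress to the 0 to 1 scale before recording it.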

The following describes the actions and events that will be recorded for the
‘instance-failure recovery workflow’, along with their progress details:

* Stop Instance Task: the following events, with their progress details, may
  be recorded:

  * Start of the Stop instance task::

      "progress_details" = {
          "progress": 0.50,
          "progress_data": "Started execution of StopInstanceTask <INSTANCE_UUID>"
      }

  * Recovery skipped because the instance is not HA_Enabled and the
    "process_all_instances" config option is also disabled::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Skipping recovery for instance <INSTANCE_UUID> as it is not Ha_Enabled"
      }

  * Recovery ignored because the instance VM state is either 'paused' or
    'rescued'::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Ignoring recovery for instance <INSTANCE_UUID> as it is in paused/rescued state"
      }

  * Finish of the Stop instance task::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Finished execution of StopInstanceTask <INSTANCE_UUID>"
      }

  * Failure event if the instance could not be stopped::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Failed to stop instance <INSTANCE_UUID>"
      }

* Start Instance Task: the following events, with their progress details, may
  be recorded:

  * Start of the Start instance task::

      "progress_details" = {
          "progress": 0.5,
          "progress_data": "Started execution of StartInstanceTask <INSTANCE_UUID>"
      }

  * Finish of the Start instance task::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Finished execution of StartInstanceTask <INSTANCE_UUID>"
      }

  * Failure event if the instance could not be started or is in an invalid
    state::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Failed to start instance <INSTANCE_UUID>"
      }

* Confirm Instance Active Task: the following events, with their progress
  details, may be recorded:

  * Start of the Confirm instance task::

      "progress_details" = {
          "progress": 0.5,
          "progress_data": "Confirming instance <INSTANCE_UUID> is Active"
      }

  * Finish of the Confirm instance task::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Confirmed instance <INSTANCE_UUID> is Active"
      }

  * Failure event if the instance could not be confirmed as active::

      "progress_details" = {
          "progress": 1,
          "progress_data": "Failed to confirm instance <INSTANCE_UUID>"
      }

.. note::
   Events are emitted only when the Masakari engine starts processing
   received notifications by executing a recovery workflow.
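
The Stop Instance Task events listed above can be read as a small decision
sequence. The sketch below is one possible reading, not the actual Masakari
task: the ``stop_instance_events`` function, its parameters, the order of the
checks and the assumption that a start event always precedes the outcome are
all illustrative choices.

```python
from typing import Dict, List

# VM states for which recovery is ignored, as listed in the text.
IGNORED_VM_STATES = ("paused", "rescued")


def event(progress: float, message: str) -> Dict[str, object]:
    """Build one progress_details entry."""
    return {"progress": progress, "progress_data": message}


def stop_instance_events(
    instance_uuid: str,
    ha_enabled: bool,
    process_all_instances: bool,
    vm_state: str,
    stop_succeeded: bool = True,
) -> List[Dict[str, object]]:
    """Return the events one run of the stop task would record."""
    events = [
        event(0.5, f"Started execution of StopInstanceTask {instance_uuid}")
    ]
    if not ha_enabled and not process_all_instances:
        # Assumed check order: HA flag first, then VM state.
        events.append(event(
            1.0,
            f"Skipping recovery for instance {instance_uuid} "
            "as it is not Ha_Enabled",
        ))
    elif vm_state in IGNORED_VM_STATES:
        events.append(event(
            1.0,
            f"Ignoring recovery for instance {instance_uuid} "
            "as it is in paused/rescued state",
        ))
    elif not stop_succeeded:
        events.append(event(1.0, f"Failed to stop instance {instance_uuid}"))
    else:
        events.append(event(
            1.0, f"Finished execution of StopInstanceTask {instance_uuid}"
        ))
    return events


skipped = stop_instance_events("uuid-1", False, False, "active")
finished = stop_instance_events("uuid-2", True, False, "active")
```

Each call returns two entries: the start event at progress 0.5 and a single
outcome event at progress 1.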

The database entries that will be recorded for the
‘instance-failure recovery workflow’ are shown below::

    LogBook: 'instance_recovery'
     - uuid = 68e86fda-25ba-4b1d-a9fc-d999bc1c796e
     - created_at = 2019-01-08 08:15:21
     - updated_at = 2019-01-08 08:15:21
     - meta: {"notification_uuid": "9ca38361-eef9-4fca-a1fe-49ef0c7e23e8"}
     FlowDetail: 'instance_recovery_engine'
      - uuid = 6a780ae7-9c63-42d9-8510-aa020d7ee566
      - state = SUCCESS
      TaskDetail: 'StopInstanceTask'
       - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
       - state = SUCCESS
       - version = 1.0
       - failure = False
       - meta: {}
       - results: <CONTEXT_DETAILS>
      TaskDetail: 'StopInstanceTask'
       - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
       - state = SUCCESS
       - version = 1.0
       - failure = False
       - meta:
         + progress = 100.00%
         + progress_details = {
               "progress": 1,
               "progress_details": {
                   "at_progress": 1,
                   "details": {
                       "progress_details": [
                           "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
                       ]
                   }
               }
           }
       - results: NULL
      TaskDetail: 'StartInstanceTask'
       - uuid = a4155556-fb5a-44f8-b8aa-ab8ecfe8f1ce
       - state = SUCCESS
       - version = 1.0
       - failure = False
       - meta:
         + progress = 100.00%
         + progress_details = {
               "progress": 1,
               "progress_details": {
                   "at_progress": 1,
                   "details": {
                       "progress_details": [
                           "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
                       ]
                   }
               }
           }
       - results: NULL
      TaskDetail: 'ConfirmInstanceActiveTask'
       - uuid = 0ea82633-599b-422d-8fd2-df2057efb29d
       - state = SUCCESS
       - version = 1.0
       - failure = False
       - meta:
         + progress = 100.00%
         + progress_details = {
               "progress": 1,
               "progress_details": {
                   "at_progress": 1,
                   "details": {
                       "progress_details": [
                           "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
                       ]
                   }
               }
           }
       - results: NULL


The following shows how the recorded data will be used to render task details
in tabular format for the ‘instance-failure recovery workflow’ in Horizon::

    * Stop Instance Task
    ============================================  ==========================  ==========================  ========================================================
    Request ID                                    Action                      Start Time                  Message
    ============================================  ==========================  ==========================  ========================================================
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:40 a.m      Started execution of StopInstanceTask <INSTANCE_UUID>
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:41 a.m      Finished execution of StopInstanceTask <INSTANCE_UUID>
    ============================================  ==========================  ==========================  ========================================================

    * Start Instance Task
    ============================================  ==========================  ==========================  ========================================================
    Request ID                                    Action                      Start Time                  Message
    ============================================  ==========================  ==========================  ========================================================
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:41 a.m      Starting instance <INSTANCE_UUID>
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:42 a.m      Started instance <INSTANCE_UUID>
    ============================================  ==========================  ==========================  ========================================================

    * Confirm Instance Active Task
    ============================================  ==========================  ==========================  ========================================================
    Request ID                                    Action                      Start Time                  Message
    ============================================  ==========================  ==========================  ========================================================
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirming instance is Active <INSTANCE_UUID>
    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirmed instance is Active <INSTANCE_UUID>
    ============================================  ==========================  ==========================  ========================================================
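
Turning recorded events into rows like these amounts to flattening the
per-task event lists. The sketch below shows one way to do that; the
``to_rows`` helper and the ``timestamp`` key on each event are assumptions
for illustration, since the spec does not define where the "Start Time"
value is stored.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, str]  # (request id, action, start time, message)


def to_rows(request_id: str, recorded: Dict[str, List[dict]]) -> List[Row]:
    """Flatten per-task events into table rows, keeping task order."""
    rows: List[Row] = []
    for action, events in recorded.items():
        for ev in events:
            # "timestamp" is a hypothetical key added for this sketch.
            rows.append(
                (request_id, action, ev["timestamp"], ev["progress_data"])
            )
    return rows


recorded = {
    "StopInstanceTask": [
        {
            "timestamp": "Jan 10 2019, 10:40 a.m",
            "progress_data":
                "Started execution of StopInstanceTask <INSTANCE_UUID>",
        },
        {
            "timestamp": "Jan 10 2019, 10:41 a.m",
            "progress_data":
                "Finished execution of StopInstanceTask <INSTANCE_UUID>",
        },
    ],
}

rows = to_rows("req-679033b7-1755-4929-bf85-eb3bfaef7e0b", recorded)
for row in rows:
    print(" | ".join(row))
```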

Alternatives
------------

Send versioned notifications for recovery workflows, similar to the other
OpenStack services.

Data model impact
-----------------

The following tables will be added to the Masakari database:

* alembic_version
* logbooks
* flowdetails
* atomdetails

.. note::
   The alembic_version table stores the version of the taskflow database
   schema, not that of the Masakari database.
   The Masakari database is currently not under alembic control.

For example, in the case of the ‘instance-failure recovery workflow’, data
will be stored in the following tables:

* logbooks: Parent table, one entry for each notification received.
* flowdetails: Child table of logbooks, one entry for each notification received.
* atomdetails: Child table of flowdetails, one entry for each task of the recovery.

.. note::
   The taskflow persistence tables have no foreign key association.
   If a logbook entry is deleted, its child entries are also deleted.

REST API impact
---------------

A new microversion will be created to add event details to the GET
/notifications/<notification_uuid> API.

Security impact
---------------

None

Notifications impact
--------------------

Masakari does not currently support event notifications for recovery
failures. This spec adds that feature.

Other end user impact
---------------------

None

Performance Impact
------------------

There will be a slight performance impact from the overhead of storing
events in the database while each recovery failure is processed.

Other deployer impact
---------------------

None


Developer impact
----------------

None


Implementation
==============

Assignee(s)
-----------

Primary assignee:

* Jayashri Bidwe <Jayashri.Bidwe@nttdata.com>
* Vrushali Kamde <Vrushali.Kamde@nttdata.com>

Work Items
----------

* Fetch the Masakari backend as the persistence backend for each taskflow
* Execute the taskflow with all the details required at each task
* Populate meta with progress status
* Update the notification API for GET /notifications/<notification_uuid> in a
  new microversion to return the stored event information for a recovery
  failure
* Update unit tests for code coverage
* Add documentation on how to use this feature in Horizon


Dependencies
============

None


Testing
=======

Tempest tests are not needed, as unit tests are sufficient to check
whether events are sent for recovery operations.


Documentation Impact
====================

None


References
==========

.. _`persistence-doc`: https://docs.openstack.org/taskflow/latest/user/persistence.html
.. _`taskflow_persistence`: https://github.com/openstack/taskflow/tree/master/taskflow/persistence
.. _`notifier_method`: https://github.com/openstack/taskflow/blob/master/taskflow/types/notifier.py#L186


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced