authorsampathP <sam47priya@gmail.com>2016-10-17 17:42:27 +0900
committerAdam Spiers <aspiers@suse.com>2017-02-08 20:48:53 +0000
commit8a4a70db74d767b9ed6915cea9b44a68e30845e7 (patch)
tree1d78fbc878d25c499dc5fe3e68eaf44e1f62dff1
parente243a2c5450a80a6916c1a4c6e1ea7349900e684 (diff)
VM Recovery
The purpose of this spec is to describe a method for recovering VMs from VM failures. Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
Notes
Notes (review): Code-Review+1: Sampath Priyankara (samP) <sam47priya@gmail.com> Code-Review+2: Adam Spiers <aspiers@suse.com> Workflow+1: Adam Spiers <aspiers@suse.com> Verified+2: Jenkins Submitted-by: Jenkins Submitted-at: Wed, 26 Apr 2017 07:33:14 +0000 Reviewed-on: https://review.openstack.org/387262 Project: openstack/openstack-resource-agents-specs Branch: refs/heads/master
-rw-r--r--specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst200
1 files changed, 200 insertions, 0 deletions
diff --git a/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst b/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
new file mode 100644
index 0000000..58c4dae
--- /dev/null
+++ b/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
@@ -0,0 +1,200 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
VM Recovery
==========================================

The purpose of this spec is to describe a method to recover
individual virtual machines that are marked as failed by
the VM monitoring component.

Problem description
===================

VM failures can be detected by the VM monitoring method discussed in
the `VM monitoring spec`__.

__ https://review.openstack.org/#/c/352217/

When a VM failure event is detected, appropriate recovery actions must
be taken. Those recovery actions should be decided using configurable
policies based on inputs such as the state of storage (shared or
otherwise), the status of the VM, and the cause of the VM failure.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs to meet high SLA requirements. There are several types
of VM failure events that can occur in OpenStack clouds.
We need to make sure the system can detect such events and recover
from them. Possible VM failure events include:

- VM crashes

- VM hangs

Possible recovery methods include:

- VM restart (stop and start)

- VM restart on a different host

Scope
-----

This spec only addresses recovery from isolated failures of individual
VMs. Monitoring of the VMs, and detection of and recovery from wider
failures, such as failure of a whole compute host, will be covered by
separate specs, and are therefore out of scope for this spec.

This spec has the following goals:

1. Encourage all implementations of VM recovery, whether upstream or
   downstream, to receive failure notifications in a standardized
   manner. This will allow cloud vendors and operators to implement
   HA of the compute plane via a collection of compatible components
   (of which one is compute node monitoring), whilst not being tied to
   any one implementation.

2. Suggest appropriate actions which can be taken for each failure
   case.

3. Provide details of and recommend a specific implementation which
   for the most part already exists and is proven to work.

4. Identify gaps with that implementation and corresponding future
   work required.

Proposed change
===============

VM monitors send failure events to a recovery workflow service. This
workflow service can analyze the content of the failure event message
and execute the appropriate recovery action. The workflow service
could also handle advanced recovery options, such as enforcing a
maximum restart threshold, falling back to the next recovery action,
or executing multiple workflows.

If a VM crashes, the first approach to recovery is to stop and start
the VM via nova-api. The maximum restart threshold should be
configurable. A threshold of 0 means that the VM is not restarted in
place and recovery proceeds directly to the next method. If the
restart fails, or the threshold is 0, the service should try to
restart the VM on a different host. The threshold could even be -1, to
indicate an infinite number of retries on this host, preventing the VM
from ever being restarted on a different host. This might be
desirable in certain configurations where there is no shared storage
for ephemeral disks, and rebuilding a disk from a glance image during
``nova evacuate`` is undesirable.
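
The threshold semantics described above can be sketched as a small
decision function. This is only an illustration under stated
assumptions: the names ``Action``, ``next_recovery_action``,
``max_restarts`` and ``failed_restarts`` are hypothetical and not part
of any existing service, and a positive threshold N is assumed to mean
"at most N restart attempts on the current host".

```python
from enum import Enum


class Action(Enum):
    """Hypothetical recovery actions, named after the methods in this spec."""

    RESTART_ON_SAME_HOST = "restart_on_same_host"
    RESTART_ON_DIFFERENT_HOST = "restart_on_different_host"


def next_recovery_action(max_restarts: int, failed_restarts: int) -> Action:
    """Pick the next recovery action for a crashed VM.

    max_restarts: configured threshold. -1 means retry on this host
        forever, 0 means never restart in place, and N > 0 is assumed
        to mean at most N restart attempts on this host.
    failed_restarts: restart attempts on this host that already failed.
    """
    if max_restarts < -1:
        raise ValueError("max_restarts must be -1, 0 or a positive integer")
    if failed_restarts < 0:
        raise ValueError("failed_restarts must not be negative")
    if max_restarts == -1:
        # Infinite retries: the VM never leaves its current host.
        return Action.RESTART_ON_SAME_HOST
    if failed_restarts < max_restarts:
        return Action.RESTART_ON_SAME_HOST
    # Threshold is 0 or exhausted: fall through to the next method.
    return Action.RESTART_ON_DIFFERENT_HOST
```

With a threshold of 0 the function always selects a different host,
and with -1 it never does, which matches the two special cases in the
paragraph above.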

If a VM hangs due to an I/O error, the recovery service may be
required to automatically disable the ``nova-compute`` service on that
host and restart the VM on a different host. It could also migrate
other VMs from the host, in order to preempt further I/O errors.
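
As a rough sketch, the I/O error case could be expressed as an ordered
plan of steps that a workflow engine then executes. The function name,
the step names, and the ``migrate_others`` switch are all hypothetical;
the sketch only orders the actions named above and does not call any
real Nova API.

```python
from typing import List, Sequence, Tuple

# A step is a (hypothetical action name, target) pair.
Step = Tuple[str, str]


def plan_io_error_recovery(
    host: str,
    failed_vm: str,
    other_vms: Sequence[str] = (),
    migrate_others: bool = False,
) -> List[Step]:
    """Build the ordered recovery steps for a VM hung by an I/O error."""
    steps: List[Step] = [
        # Disable nova-compute first so no new VMs land on the host.
        ("disable_compute_service", host),
        # Then recover the failed VM elsewhere.
        ("restart_on_different_host", failed_vm),
    ]
    if migrate_others:
        # Optionally move the remaining VMs to preempt further errors.
        steps.extend(("migrate", vm) for vm in other_vms if vm != failed_vm)
    return steps
```

Keeping the plan as plain data separates the policy decision from its
execution, so the same plan could be handed to whichever workflow
service is chosen.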

Implementation
==============

There are at least three possible ways to implement the proposed
change:

1. Use Masakari as the recovery workflow service

   VM monitors send the failure events to Masakari using Masakari's
   notification API. Masakari will execute pre-defined recovery actions.

2. Use Mistral as the recovery workflow service

   VM monitors call the Mistral workflow to execute the appropriate
   recovery actions.

3. Use Masakari as the recovery engine and Mistral as the workflow service

   VM monitors send the failure events to Masakari, and Masakari will
   analyze the content of the failure event message and call a Mistral
   workflow to execute recovery actions.


Data model impact
-----------------

None

REST API impact
---------------

The HTTP API of the VM recovery workflow service needs to be able to
receive events in the format in which the VM monitor sends them.
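
This spec does not define the event format. Purely as an illustration
of what agreeing on one would involve, the sketch below encodes and
validates a JSON event. Every field name here (``instance_uuid``,
``hostname``, ``failure_type``, ``generated_time``) is a hypothetical
placeholder and should not be read as the schema of Masakari, Mistral,
or any existing VM monitor.

```python
import json
from dataclasses import asdict, dataclass, fields

# Failure types taken from the use cases in this spec.
FAILURE_TYPES = ("crash", "hang")


@dataclass(frozen=True)
class VMFailureEvent:
    """Hypothetical failure event; all field names are assumptions."""

    instance_uuid: str
    hostname: str
    failure_type: str    # one of FAILURE_TYPES
    generated_time: str  # e.g. an ISO 8601 UTC timestamp


def encode_event(event: VMFailureEvent) -> bytes:
    """Serialize an event as the monitor would before sending it."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def decode_event(body: bytes) -> VMFailureEvent:
    """Parse and validate a request body on the receiving service."""
    data = json.loads(body.decode("utf-8"))
    expected = {f.name for f in fields(VMFailureEvent)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError("event must contain exactly: %s" % sorted(expected))
    if not all(isinstance(value, str) for value in data.values()):
        raise ValueError("all event fields must be strings")
    if data["failure_type"] not in FAILURE_TYPES:
        raise ValueError("unknown failure_type: %r" % data["failure_type"])
    return VMFailureEvent(**data)
```

Sharing one encoder and decoder between sender and receiver is one way
to keep both sides on the same format, whichever schema is eventually
chosen.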

Security impact
---------------

Ideally it should be possible for the VM monitor to send instance
event data securely to the recovery workflow service (e.g. via TLS),
without relying on the security of the admin network over which the
data is sent.
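
A minimal sketch of the monitor side, using only the Python standard
library, might look as follows. The function names and the endpoint
URL are hypothetical; the sketch assumes the recovery workflow service
is reachable over HTTPS and presents a certificate signed by a CA the
monitor trusts.

```python
import ssl
import urllib.request
from typing import Optional


def build_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a client context that verifies the server certificate.

    ca_file: optional path to a CA bundle; system CAs are used if None.
    """
    # The default context requires a valid certificate and checks that
    # it matches the server hostname.
    return ssl.create_default_context(cafile=ca_file)


def build_event_request(url: str, body: bytes) -> urllib.request.Request:
    """Prepare a POST request, refusing plain HTTP endpoints."""
    if not url.lower().startswith("https://"):
        raise ValueError("event endpoint must use https")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(url: str, body: bytes, ca_file: Optional[str] = None) -> int:
    """Send one event over TLS and return the HTTP status code."""
    request = build_event_request(url, body)
    context = build_tls_context(ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return resp.status
```

For example, ``send_event("https://recovery.example.test/events", body)``
would fail with a certificate error rather than silently falling back
to an unverified connection; the host name is a placeholder.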

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------


Developer impact
----------------

Documentation Impact
--------------------

The service should be documented in the |ha-guide|_.

.. |ha-guide| replace:: OpenStack High Availability Guide
.. _ha-guide: http://docs.openstack.org/ha-guide/

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>


Work Items
==========

  WIP

Dependencies
============


Testing
=======


Documentation Impact
====================



References
==========



History
=======