VM Recovery

The purpose of this spec is to describe a method for recovering VMs from VM failures. Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
2016-10-17 17:42:27 +09:00 · 2016-10-17 17:42:27 +09:00 · 35195b4de0
parent 468d526263
commit 35195b4de0
1 changed files with 172 additions and 0 deletions
--- a/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
+++ b/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
@ -0,0 +1,172 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 ==========================================
 VM Recovery
 ==========================================
 The purpose of this spec is to describe a method to recover
 individual virtual machines that are marked as failed by
 the VM monitoring method.
 Problem description
 ===================
 VM failures can be detected by the VM monitoring method discussed in the
 `vm monitoring spec`__.
 __ https://review.openstack.org/#/c/352217/
 When a VM failure event is detected, the appropriate recovery action must
 be taken to recover the VM. The recovery action is selected based on the
 shared disk state, the status of the VM, the cause of the VM failure,
 and similar factors. These recovery actions should be configurable.
 This spec describes the appropriate actions to take for each cause of
 failure.
 Use Cases
 ---------
 As a cloud operator, I would like to provide my users with highly
 available VMs to meet high SLA requirements. There are several types
 of VM failure events that can occur in OpenStack clouds.
 We need to make sure such events can be detected and recovered
 by the system. Possible VM failure events include:
 - VM Crashes.
 - VM Hangs.
 Possible recovery methods include:
 - VM restart (stop and start)
 - VM restart on different host
 - Migrate VM (Live/Cold)
 If a VM crashes, the first recovery approach is to stop and start the
 VM through nova-api.
 The maximum restart threshold should be configurable. It may be set to
 0, which means do not restart and go to the next recovery method.
 If the restart fails or the threshold is 0, the recovery service should
 try to restart the VM on a different host.
 If a VM hangs due to an I/O error, the recovery service should disable
 the ``nova-compute`` service on that host and restart the VM on a
 different host. It could also migrate other VMs from the host, in
 order to pre-empt any further I/O errors.
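 The crash and I/O hang cases above can be sketched as a small planning
 function that turns a failure type and a configurable restart threshold
 into an ordered list of recovery actions. This is only an illustrative
 sketch: the action names, the failure type strings, and the
 ``plan_recovery`` function are hypothetical and are not defined by this
 spec or by any OpenStack API.

```python
from typing import List

# Hypothetical action identifiers; the spec does not define concrete names.
RESTART = "restart"
RESTART_ON_OTHER_HOST = "restart_on_other_host"
DISABLE_COMPUTE_SERVICE = "disable_compute_service"
MIGRATE_OTHER_VMS = "migrate_other_vms"


def plan_recovery(failure: str, max_restarts: int = 1,
                  evacuate_host: bool = False) -> List[str]:
    """Return the ordered recovery actions to try for a failure type.

    failure       -- "crash" or "io_hang" (assumed labels)
    max_restarts  -- maximum restart threshold; 0 skips in-place restarts
    evacuate_host -- for I/O hangs, also migrate the other VMs off the host
    """
    if max_restarts < 0:
        raise ValueError("max_restarts must be >= 0")

    if failure == "crash":
        # Try in-place restarts up to the threshold, then fall back to
        # restarting the VM on a different host.
        return [RESTART] * max_restarts + [RESTART_ON_OTHER_HOST]

    if failure == "io_hang":
        # Take the host out of scheduling first, then move the VM.
        plan = [DISABLE_COMPUTE_SERVICE, RESTART_ON_OTHER_HOST]
        if evacuate_host:
            plan.append(MIGRATE_OTHER_VMS)
        return plan

    raise ValueError("unknown failure type: %s" % failure)
```

 With a threshold of 0 the plan for a crash contains only the restart on
 a different host, which matches the "do not restart and go to the next
 recovery method" behaviour described above.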
 Proposed change
 ===============
 VM monitors send failure events to a recovery workflow service.
 This workflow service analyzes the content of the failure event message
 and executes the appropriate recovery action. It could also
 handle advanced recovery options such as a maximum restart threshold,
 executing the next recovery action, or executing multiple workflows.
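 As a rough illustration of the proposed flow, the sketch below shows a
 workflow service that looks at a failure event and runs the registered
 recovery actions in order until one succeeds. It assumes an event is a
 plain dictionary with a ``failure_type`` key and that each action is a
 callable returning ``True`` on success. The class name, the event
 fields, and the action signature are all hypothetical and are not part
 of Masakari, Mistral, or Nova.

```python
from typing import Callable, Dict, List

# An action receives the failure event and reports whether it succeeded.
Action = Callable[[dict], bool]


class RecoveryWorkflowService:
    """Hypothetical dispatcher from failure events to recovery actions."""

    def __init__(self) -> None:
        self._workflows: Dict[str, List[Action]] = {}

    def register(self, failure_type: str, actions: List[Action]) -> None:
        """Configure the ordered recovery actions for one failure type."""
        self._workflows[failure_type] = list(actions)

    def handle(self, event: dict) -> bool:
        """Run actions for the event in order; stop at the first success."""
        actions = self._workflows.get(event.get("failure_type"), [])
        for action in actions:
            if action(event):
                return True
        # No action configured, or every action failed.
        return False


# Example wiring with stand-in actions that only record what was attempted.
attempted: List[str] = []


def failing_restart(event: dict) -> bool:
    attempted.append("restart")
    return False  # simulate a restart that does not bring the VM back


def restart_on_other_host(event: dict) -> bool:
    attempted.append("restart_on_other_host")
    return True


service = RecoveryWorkflowService()
service.register("crash", [failing_restart, restart_on_other_host])
recovered = service.handle({"failure_type": "crash", "instance_id": "vm-1"})
```

 Keeping the mapping from failure type to actions in a registry is one
 way to make the recovery actions configurable, as required in the
 problem description. A real service would replace the stand-in actions
 with calls to the chosen recovery engine.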
 Alternatives
 ------------
 There are three alternatives to the proposed change:
 1. Use Masakari as the recovery workflow service
   VM monitors send the failure events to Masakari using the Masakari
   notification API. Masakari will execute predefined recovery actions.
 2. Use Mistral as the recovery workflow service
   VM monitors call the Mistral workflow to execute the appropriate
   recovery actions.
 3. Use Masakari as the recovery engine and Mistral as the workflow service
   VM monitors send the failure events to Masakari, and Masakari will
   analyze the content of the failure event message and call a Mistral
   workflow to execute recovery actions.
 Data model impact
 -----------------
 None
 REST API impact
 ---------------
 None
 Security impact
 ---------------
 None
 Other end user impact
 ---------------------
 None
 Performance Impact
 ------------------
 None
 Other deployer impact
 ---------------------
 Developer impact
 ----------------
 Implementation
 ==============
 WIP
 Assignee(s)
 -----------
 Primary assignee:
  <launchpad-id or None>
 Other contributors:
  <launchpad-id or None>
 Work Items
 ----------
 Dependencies
 ============
 Testing
 =======
 Documentation Impact
 ====================
 References
 ==========
 History
 =======