VM Recovery
The purpose of this spec is to describe a method for recovering VMs from VM failures. Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
parent 468d526263
commit 35195b4de0
@@ -0,0 +1,172 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
VM Recovery
==========================================

The purpose of this spec is to describe a method to recover
individual virtual machines that are marked as failed by
the VM monitoring method.

Problem description
===================
VM failures can be detected by the VM monitoring method discussed in the
`vm monitoring spec`__.

__ https://review.openstack.org/#/c/352217/

When a VM failure event is detected, the system must take an appropriate
action to recover the VM. The action is selected based on the state of
the shared disk, the status of the VM, the cause of the VM failure, and
so on. These recovery actions should be configurable.
This spec describes the appropriate actions to take for each cause of
failure.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs to meet high SLA requirements. Several types of VM
failure events can occur in OpenStack clouds, and the system needs to
detect and recover from them. Possible VM failure events include:

- VM crashes.

- VM hangs.

Possible recovery methods include:

- VM restart (stop and start)

- VM restart on a different host

- VM migration (live or cold)

If a VM crashes, the first recovery approach is to stop and start the
VM through nova-api.
The maximum restart threshold should be configurable. It may be set to
0, which means do not restart the VM in place and go to the next
recovery method.
If the restart fails or the threshold is 0, the recovery service should
try to restart the VM on a different host.

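The escalation described above can be sketched as follows. This is a minimal illustration, not part of the spec: the function name ``recover_crashed_vm`` and the two callables are hypothetical stand-ins for whatever the recovery service uses to talk to nova-api, and each callable is assumed to return ``True`` on success.

```python
from typing import Callable


def recover_crashed_vm(
    restart_in_place: Callable[[], bool],
    restart_on_other_host: Callable[[], bool],
    max_restarts: int,
) -> str:
    """Return the name of the step that recovered the VM, or "failed".

    Both callables are hypothetical hooks; they are assumed to return
    True when the VM came back up and False otherwise.
    """
    # A threshold of 0 makes this loop a no-op, so the in-place restart
    # is skipped and recovery moves straight to the next method.
    for _ in range(max_restarts):
        if restart_in_place():
            return "restart"

    # Either the threshold was 0 or every in-place attempt failed.
    if restart_on_other_host():
        return "restart_on_different_host"
    return "failed"
```

With ``max_restarts=2`` and a restart hook that always fails, the sketch calls the in-place restart twice and then falls through to the different-host restart.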
If a VM hangs due to an I/O error, the recovery service should disable
the ``nova-compute`` service on that host and restart the VM on a
different host. It could also migrate the other VMs off the host in
order to pre-empt any further I/O errors.

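The ordering of the I/O hang recovery steps might look like the sketch below. The ``Cloud`` protocol is a hypothetical facade invented for this example; it does not correspond to a real Nova client class, and the optional draining of the host is controlled by an assumed ``drain_host`` flag.

```python
from typing import List, Protocol


class Cloud(Protocol):
    """Hypothetical facade over the compute API (not a real client)."""

    def disable_compute_service(self, host: str) -> None: ...
    def restart_on_other_host(self, vm_id: str) -> None: ...
    def list_vms(self, host: str) -> List[str]: ...
    def migrate(self, vm_id: str) -> None: ...


def recover_from_io_hang(
    cloud: Cloud, host: str, vm_id: str, drain_host: bool = False
) -> List[str]:
    """Run the I/O hang recovery steps in order and return a step log."""
    steps: List[str] = []

    # Disable nova-compute first so no new VMs are scheduled to the host.
    cloud.disable_compute_service(host)
    steps.append(f"disable:{host}")

    # Bring the hung VM back up on a different host.
    cloud.restart_on_other_host(vm_id)
    steps.append(f"restart_elsewhere:{vm_id}")

    # Optionally move the remaining VMs away to pre-empt further errors.
    if drain_host:
        for other in cloud.list_vms(host):
            if other != vm_id:
                cloud.migrate(other)
                steps.append(f"migrate:{other}")
    return steps
```

Disabling the service before touching any VM is a design choice of this sketch: it keeps the scheduler from placing new instances on a host that is already suspected of I/O problems.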
Proposed change
===============

VM monitors send failure events to a recovery workflow service.
This workflow service analyzes the content of the failure event message
and executes the appropriate recovery action. It could also handle
advanced recovery options such as a maximum restart threshold,
falling through to the next recovery action, or executing multiple
workflows.

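One way to picture the workflow service is as a dispatcher that maps the cause in a failure event to an ordered list of recovery actions. The sketch below assumes a hypothetical event format (a dict with a ``cause`` key) and a hypothetical policy table; neither is defined by this spec.

```python
from typing import Callable, Dict, List, Optional

# A recovery action takes the failure event and reports success.
Action = Callable[[dict], bool]

# Hypothetical, operator-configurable policy:
# failure cause -> recovery actions to try, in order.
DEFAULT_POLICY: Dict[str, List[str]] = {
    "crash": ["restart", "restart_on_different_host"],
    "io_hang": ["restart_on_different_host"],
}


def handle_failure_event(
    event: dict,
    actions: Dict[str, Action],
    policy: Dict[str, List[str]] = DEFAULT_POLICY,
) -> Optional[str]:
    """Try each configured action for the event's cause in order.

    Returns the name of the first action that succeeds, or None when
    the cause is unknown or every configured action fails.
    """
    for name in policy.get(event.get("cause", ""), []):
        if actions[name](event):
            return name
    return None
```

Keeping the policy as data rather than code is what would make the recovery actions configurable: an operator could reorder or drop actions per cause without changing the dispatcher.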
Alternatives
------------

There are three alternatives to the proposed change:

1. Use Masakari as the recovery workflow service

   VM monitors send the failure events to Masakari using the Masakari
   notification API. Masakari will execute predefined recovery actions.

2. Use Mistral as the recovery workflow service

   VM monitors call the Mistral workflow to execute the appropriate
   recovery actions.

3. Use Masakari as the recovery engine and Mistral as the workflow service

   VM monitors send the failure events to Masakari, and Masakari will
   analyze the content of the failure event message and call a Mistral
   workflow to execute recovery actions.

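For the alternatives that route events through Masakari, a VM monitor would need to build a notification body. The sketch below only constructs such a body as a Python dict. The field names (``type``, ``hostname``, ``generated_time``, ``payload``, ``event``, ``instance_uuid``, ``vir_domain_event``) are assumptions modeled on the Masakari notifications API and should be checked against its API reference before use; the helper name ``build_vm_notification`` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def build_vm_notification(
    hostname: str,
    instance_uuid: str,
    event: str = "LIFECYCLE",
    vir_domain_event: str = "STOPPED_FAILED",
    now: Optional[datetime] = None,
) -> dict:
    """Build a VM failure notification body (assumed field names)."""
    # Allow the caller to inject a timestamp so the output is testable.
    generated = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "notification": {
            "type": "VM",
            "hostname": hostname,
            "generated_time": generated,
            "payload": {
                "event": event,
                "instance_uuid": instance_uuid,
                "vir_domain_event": vir_domain_event,
            },
        }
    }
```

Sending the body (authentication, endpoint discovery, retries) is deliberately left out, since those details depend on the deployment.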
Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

Developer impact
----------------

Implementation
==============

WIP

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

Dependencies
============

Testing
=======

Documentation Impact
====================

References
==========

History
=======