From abc0d5a3ddffbd7f549857bdf0a2e3a4033da42f Mon Sep 17 00:00:00 2001
From: Michele Baldessari
Date: Thu, 20 Oct 2016 11:57:54 +0200
Subject: [PATCH] Instance HA Specification

Change-Id: I431ddb209e7a13c39b2a9645d39e122db2d9dd30
---
 specs/queens/instance-ha.rst | 145 +++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 specs/queens/instance-ha.rst

diff --git a/specs/queens/instance-ha.rst b/specs/queens/instance-ha.rst
new file mode 100644
index 00000000..0e0da811
--- /dev/null
+++ b/specs/queens/instance-ha.rst
@@ -0,0 +1,145 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================
+Instance High Availability
+==========================
+
+Include the URL of your launchpad blueprint:
+
+https://blueprints.launchpad.net/tripleo/+spec/instance-ha
+
+A feature frequently requested by operators and customers is the ability to
+automatically resurrect VMs that were running on a compute node that failed
+(due to hardware failures, networking issues or general server problems).
+Currently we have a downstream-only procedure, consisting of many manual
+steps, to configure Instance HA:
+https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview
+
+What we would like to implement here is an opt-in, automated
+deployment of a cloud that has Instance HA support.
+
+Problem Description
+===================
+
+Currently, if a compute node has a hardware failure or a kernel panic, all the
+instances that were running on the node are lost, and manual intervention
+is needed to resurrect these instances on another compute node.
+
+Proposed Change
+===============
+
+Overview
+--------
+
+The proposed change is to add a few puppet-tripleo profiles that help
+us configure the pacemaker resources needed for Instance HA. Unlike in previous iterations,
+we won't need to move nova-compute resources under pacemaker's management. We managed to
+achieve the same result without touching the compute nodes (except for setting
+up pacemaker_remote on the computes, but that support exists already).
+
+Alternatives
+------------
+
+There are a few specs that model host recovery:
+
+* Host Recovery - https://review.openstack.org/#/c/386554/
+* Instances auto evacuation - https://review.openstack.org/#/c/257809
+
+The first spec uses pacemaker in a very similar way but is too new
+and too high level for us to comment on at this point in time.
+The second one has been stalled for a long time and it looks like there
+is no consensus yet on the approaches needed. The long-term goal is
+to morph the Instance HA deployment into the spec that gets accepted.
+We are actively working on both specs as well. In any case, we have
+discussed this with SUSE and NTT and agreed on a long-term plan,
+of which this spec is the first step for TripleO.
+
+Security Impact
+---------------
+
+No additional security impact.
+
+Other End User Impact
+---------------------
+
+End users are not impacted, except that VMs can be resurrected
+automatically on a non-failed compute node.
+
+Performance Impact
+------------------
+
+There are no performance-related impacts compared to a current deployment.
+
+Other Deployer Impact
+---------------------
+
+This change does not affect default deployments. It adds a boolean
+and some additional profiles so that a deployer can have a cloud configured with Instance
+HA support out of the box.
+
+* One top-level parameter to enable the Instance HA deployment
+
+* Although fencing configuration is already supported by TripleO, we will need
+  to improve it so that no extra command is needed to generate the
+  fencing parameters.
+
+* Upgrades are impacted by this change in the sense that we will need to test
+  them with Instance HA enabled.
+
+Developer Impact
+----------------
+
+No developer impact is planned.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  michele
+
+Other contributors:
+  cmsj, abeekhof
+
+Work Items
+----------
+
+* Make the fencing configuration fully automated (this is mostly done already; we still need
+  oooq integration and some optimization)
+
+* Add the logic and needed resources on the control plane
+
+* Test the upgrade path when Instance HA is configured
+
+
+Testing
+=======
+
+Testing this manually is fairly simple:
+
+* Deploy with Instance HA configured and two compute nodes
+
+* Spawn a test VM
+
+* Crash the compute node where the VM is running
+
+* Observe the VM being resurrected on the other compute node
+
+Testing this in CI is doable but might be more challenging due to resource constraints.
+
+Documentation Impact
+====================
+
+A section under advanced configuration is needed explaining the deployment of
+a cloud that supports Instance HA.
+
+References
+==========
+
+* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/