From abc0d5a3ddffbd7f549857bdf0a2e3a4033da42f Mon Sep 17 00:00:00 2001
From: Michele Baldessari <michele@acksyn.org>
Date: Thu, 20 Oct 2016 11:57:54 +0200
Subject: [PATCH] Instance HA Specification

Change-Id: I431ddb209e7a13c39b2a9645d39e122db2d9dd30
---
 specs/queens/instance-ha.rst | 145 +++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 specs/queens/instance-ha.rst

diff --git a/specs/queens/instance-ha.rst b/specs/queens/instance-ha.rst
new file mode 100644
index 00000000..0e0da811
--- /dev/null
+++ b/specs/queens/instance-ha.rst
@@ -0,0 +1,145 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================
+Instance High Availability
+==========================
+
+Launchpad blueprint:
+
+https://blueprints.launchpad.net/tripleo/+spec/instance-ha
+
+Operators and customers frequently ask for the ability to automatically
+resurrect VMs that were running on a compute node that failed (due to
+hardware failures, networking issues or general server problems).
+Currently we have a downstream-only procedure which consists of many manual
+steps to configure Instance HA:
+https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview
+
+This spec proposes an optional, opt-in automated deployment of a cloud
+that has Instance HA support.
+
+Problem Description
+===================
+
+Currently, if a compute node has a hardware failure or a kernel panic, all the
+instances that were running on the node are lost and manual intervention
+is needed to resurrect these instances on another compute node.
+
+Proposed Change
+===============
+
+Overview
+--------
+
+The proposed change adds a few puppet-tripleo profiles that configure the
+pacemaker resources needed for Instance HA. Unlike in previous iterations,
+we won't need to move nova-compute resources under pacemaker's management. We
+achieve the same result without touching the compute nodes (except for setting
+up pacemaker_remote on the computes, but that support already exists).
+
+Alternatives
+------------
+
+There are a few specs that are modeling host recovery:
+
+* Host Recovery - https://review.openstack.org/#/c/386554/
+* Instances auto evacuation - https://review.openstack.org/#/c/257809
+
+The first spec uses pacemaker in a very similar way, but it is too new
+and too high level to comment on at this point in time.
+The second one has been stalled for a long time and it looks like there
+is no consensus yet on the approaches needed. The long-term goal is
+to morph the Instance HA deployment into the spec that gets accepted.
+We are actively working on both specs as well. In any case, we have
+discussed this with SUSE and NTT and agreed on a long-term plan,
+of which this spec is the first step for TripleO.
+
+Security Impact
+---------------
+
+No additional security impact.
+
+Other End User Impact
+---------------------
+
+End users are not impacted except for the fact that VMs can be resurrected
+automatically on a non-failed compute node.
+
+Performance Impact
+------------------
+
+There is no performance impact compared to a current deployment.
+
+Other Deployer Impact
+---------------------
+
+This change does not affect default deployments. It adds a boolean parameter
+and some additional profiles so that a deployer can have a cloud configured with
+Instance HA support out of the box.
+
+* One top-level parameter to enable the Instance HA deployment
+
+* Although fencing configuration is already supported by TripleO, we will need
+  to improve it so that no extra command is needed to generate the
+  fencing parameters.
+
+* Upgrades are impacted by this change in the sense that they will need to be
+  tested with Instance HA enabled.
+
+Developer Impact
+----------------
+
+No developer impact is planned.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  michele
+
+Other contributors:
+  cmsj, abeekhof
+
+Work Items
+----------
+
+* Make the fencing configuration fully automated (this is mostly done already;
+  we still need oooq integration and some optimization)
+
+* Add the logic and needed resources on the control plane
+
+* Test the upgrade path when Instance HA is configured
+
+
+Testing
+=======
+
+Testing this manually is fairly simple:
+
+* Deploy with Instance HA configured and two compute nodes
+
+* Spawn a test VM
+
+* Crash the compute node where the VM is running
+
+* Observe the VM being resurrected on the other compute node
+
+Testing this in CI is doable but might be a bit more challenging due to resource constraints.
+
+Documentation Impact
+====================
+
+A section under advanced configuration is needed explaining the deployment of
+a cloud that supports Instance HA.
+
+References
+==========
+
+* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/