======================================
High Availability for Virtual Machines
======================================

Cross Project Spec - None

User Story Tracker - None

Problem description
-------------------

*Problem Definition*
++++++++++++++++++++

Enterprise customers are moving their application workloads into OpenStack
clouds, for example to consolidate virtual estates, and benefit from increased
manageability and other economies of scale which OpenStack can bring.

However, it's typically impractical to re-architect all applications into a
purely cloud-native model at once. Therefore some applications, or parts
thereof, are deployed on non-disposable VMs in a pet model. This requires high
availability of such VMs. Even though VM volumes can be stored on a shared
storage system, such as NFS or Ceph, to improve the availability, VM state on
each hypervisor is not easily replicated to other hypervisors. Therefore, the
system must be able to recover the VM from failure events, preferably in an
automated and cost-effective manner.

Even for applications architected in a cloud-native "cattle" model which can
tolerate failures of individual VMs, at scale it is too impractical and costly
to have to manually recover every failure. Ideally this auto-recovery would be
implemented in the application or PaaS layer, to maximise integration with the
rest of the application. However, even if a new feature implemented in the
OpenStack layer primarily targeted auto-recovery of pets, it could also serve
as a cheap alternative for auto-recovery of cattle.

Opportunity/Justification
+++++++++++++++++++++++++

Many enterprise customers require highly available VMs in order to satisfy their
workload SLAs. For example, this is a critical requirement for NTT customers.

Requirements Specification
--------------------------

Use Cases
+++++++++

As a cloud operator, I would like to provide my users with highly available
VMs to meet high SLA requirements. There are several types of failure
events that can occur in OpenStack clouds. We need to make sure such events
can be detected and recovered by the system. Possible failure events include:

* VM crashes.

  For example, with the KVM hypervisor, the ``qemu-kvm`` process could crash.

* VM hangs.

  For example, an issue with a VM's block storage (either its
  ephemeral disk or an associated Cinder volume) could cause the VM to
  hang, and the QEMU layer to emit a ``BLOCK_IO_ERROR`` which would
  bubble up through ``libvirt`` and could be detected and handled by
  an automated recovery process.

* ``nova-compute`` service crashes or becomes unresponsive.

* Compute host crashes or hangs.

* Hypervisor fails, e.g. libvirtd process dies or becomes unresponsive.

* Network component fails.

  There are many ways a network component could fail, e.g. NIC
  configuration error, NIC driver failure, NIC hardware failure, cable
  failure, switch failure and so on. Any production environment aiming
  for high availability already requires considerable redundancy at
  the network level, especially for voting nodes within a cluster,
  whose quorum needs protecting against network partitions. Whilst this
  redundancy will keep most network hardware failures invisible to
  OpenStack, the remainder still need defending against. However, in
  order to fulfill this user story we don't need to be able to
  pinpoint the cause of a network failure; it's enough to recognise
  which network connection failed, and then react accordingly.

* Availability Zone failure

* Data Center / Region failure

  Failure of a whole region or data center is obviously much more severe,
  requiring recovery of not just compute nodes but also OpenStack services in
  the control plane.  It needs to be covered by a Disaster Recovery plan,
  which will vary greatly for each cloud depending on its architecture,
  supported workloads, required SLAs, and organizational structure.  A
  general solution to Disaster Recovery is therefore a problem of considerable
  complexity, so it makes sense to keep it out of scope for this user
  story, which should instead be viewed as a necessary and manageable step on
  the long road to that solution.

As a cloud operator, I need to reserve a certain number of hypervisors so that
they can be used as failover hosts in case of a host failure event. This is
required for planning in order to meet an expected SLA. The number of failover
hosts depends on the expectation of VM availability (SLA), the size of the host
pool (failover segment), the probability of host failures and the MTTR of a
host failure, all of which are managed by the cloud operator.
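
As a rough illustration of how these inputs interact, the following sketch
estimates the number of failover hosts for one segment. It assumes a simple
binomial model that is not part of this user story: hosts fail independently,
and each host is unavailable with probability MTTR / (MTBF + MTTR). The
function names, the MTBF input and the example figures are all hypothetical.

```python
from math import comb


def host_unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state probability that one host is down (assumed model)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def reserved_hosts(segment_size: int, p_down: float, target: float) -> int:
    """Smallest r such that, with probability >= target, at most r hosts
    of the segment are down at the same time (binomial assumption)."""
    cumulative = 0.0
    for r in range(segment_size + 1):
        # Probability that exactly r of the segment's hosts are down.
        cumulative += (comb(segment_size, r)
                       * p_down ** r
                       * (1.0 - p_down) ** (segment_size - r))
        if cumulative >= target:
            return r
    return segment_size


if __name__ == "__main__":
    # Hypothetical figures: 20 hosts, 990 h between failures, 10 h to repair.
    p = host_unavailability(mtbf_hours=990.0, mttr_hours=10.0)
    print(reserved_hosts(segment_size=20, p_down=p, target=0.99))
```

A real deployment would also have to account for correlated failures (for
example a shared rack or switch), which this model deliberately ignores.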

The host pool (failover segment) is a pre-defined boundary within which the
VMs of a failed host can find a healthy host to fail over to. These boundaries
can be defined as, for example, "hosts sharing the same storage" or "host
aggregates".

As a cloud operator, I need to perform host maintenance. I need to temporarily
and safely disable the HA mechanism for the affected hosts in order to perform
the maintenance task. Disabling the HA mechanism for a host means that all
alerts from that host shall be ignored and no recovery action shall be taken.
Recovery actions are not limited to fencing; stopping and starting a nova
server, or restarting a process on the host, may also be recovery actions.
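
A minimal sketch of this behaviour follows. The ``HaController`` class, the
alert kinds and the action names are hypothetical and only illustrate the
idea that alerts from a host in maintenance trigger no recovery action; they
do not correspond to an existing OpenStack component.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Alert:
    host: str
    kind: str  # hypothetical kinds: "host_down", "process_down", "vm_down"


@dataclass
class HaController:
    """Hypothetical alert dispatcher, for illustration only."""
    in_maintenance: Set[str] = field(default_factory=set)

    def start_maintenance(self, host: str) -> None:
        self.in_maintenance.add(host)

    def end_maintenance(self, host: str) -> None:
        self.in_maintenance.discard(host)

    def handle(self, alert: Alert) -> Optional[str]:
        """Return the recovery action to take, or None to ignore the alert."""
        if alert.host in self.in_maintenance:
            return None  # HA is disabled for this host: take no action
        actions = {
            "host_down": "fence_and_evacuate",
            "process_down": "restart_process",
            "vm_down": "stop_and_start_server",
        }
        return actions.get(alert.kind)
```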

As a cloud operator, I need to respond to customer issues and perform
troubleshooting. I need to know the history of what the HA mechanism did, and
when, where and how it did it. This information is used to better understand
the state of the system.
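
One possible shape for such a history is sketched below, purely as an
assumption: an append-only log whose records carry the what, when, where and
how of each action. The ``HaEvent`` and ``EventHistory`` names and their
fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class HaEvent:
    when: datetime  # time at which the event was recorded
    where: str      # host or VM identifier
    what: str       # detected failure, e.g. "host_failure"
    how: str        # recovery action taken, e.g. "evacuate"


class EventHistory:
    """Hypothetical in-memory, append-only event log."""

    def __init__(self) -> None:
        self._events: List[HaEvent] = []

    def record(self, where: str, what: str, how: str,
               when: Optional[datetime] = None) -> HaEvent:
        event = HaEvent(when or datetime.now(timezone.utc), where, what, how)
        self._events.append(event)
        return event

    def query(self, where: Optional[str] = None,
              since: Optional[datetime] = None) -> List[HaEvent]:
        """Return events, optionally filtered by location and start time."""
        return [e for e in self._events
                if (where is None or e.where == where)
                and (since is None or e.when >= since)]
```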

N.B. This user story concerns high availability, not 100% availability.
Therefore some service interruption is usually expected when failures occur.
The goal of the user story is to reduce that interruption via automated recovery.

Usage Scenario Examples
+++++++++++++++++++++++

* Recovery from VM failure

  Monitor the VM externally (i.e. as a black box, without requiring
  any knowledge of or invasive changes to the internals of the
  VM). Detect VM failure and notify the system to recover the VM on the same
  hypervisor, or if that fails, on another hypervisor.

  Note that failures of the VM which are undetectable from outside it
  are out of scope of this user story, since they would require invasive
  monitoring inside the VM, and there is no general solution to this which
  would work across all guest operating systems and workloads.

* Recovery from ``nova-compute`` failure

  Monitor the provisioning process (the ``nova-compute`` service). Detect
  process failure and notify the system to restart the service.

  If the provisioning process cannot be restarted, the system should prevent
  any new VM instance from being scheduled onto the hypervisor host on which
  the process runs. The operator can then evacuate all VMs on this host to
  another healthy host and shut the host down. Prior to evacuation, the host
  must be fenced to prevent two instances of the same VM from writing to the
  same shared storage, which would lead to data corruption.

* Recovery from hypervisor host failure

  Monitor the hypervisor host. When failure is detected, resurrect
  all VMs from the failed host onto new hosts. This enables an application
  workload to resume if its state is stored in a volume, even though the
  in-memory state is lost. If shared storage is used for instance volumes,
  these volumes survive outside the failed hypervisor host. However, shared
  storage is not required: if it is not available, the VM instances will be
  automatically rebuilt from their original image, as per standard nova
  evacuate behaviour.

  The design of the infrastructure, and the boundaries of subsystems such as
  shared storage, may restrict where VM instances can be deployed and which
  hosts are candidates for failover. To restart VM instances via the nova
  evacuate API without rebuilding them from their image, the original
  hypervisor host and the target hypervisor host need to connect to the same
  shared storage. Therefore, a cloud operator defines segments of hypervisor
  hosts and assigns failover hosts to each segment. These segments can be
  defined based on the shared storage boundaries or any other limitations
  critical for selecting the failover host.

* Recovery from network failure

  Typically the cloud infrastructure uses multiple networks, e.g.

  - an administrative network used for internal traffic such as the message bus,
    database connections, and Pacemaker cluster communication

  - various neutron networks

  - storage networks

  - remote control of physical hardware via IPMI / iLO / DRAC or similar

  Failures on these networks should not necessarily be handled in the same
  way.  For example:

  - If a compute host loses connection to the storage network, its VMs cannot
    continue to function correctly, so automatic fencing and resurrection is
    probably the only reasonable response.

  - If it loses connection to the admin network, its VMs should still continue
    to function correctly, so the cloud operator might prefer to receive
    alerts via email/SMS instead of any fencing and automated resurrection
    which would be needlessly disruptive.

  - If the compute host loses connection to the project (tenant) network, then
    it may be possible to fix this with minimal downtime by automatically
    migrating the VMs to another compute host.

  The desired response will vary from cloud to cloud, and should therefore be
  configurable.

* Capacity Reservation

  In order to ensure the uptime of VM instances, the operator needs to ensure
  that a certain amount of host capacity is reserved to cater for a failure
  event. If there is not enough host capacity when a host failure event
  happens, the VMs on the failed host cannot be evacuated to another host. It
  is assumed that there are equivalent hosts within the fault boundaries. If
  not, more complicated logic (e.g. SR-IOV, DMTC, QoS requirements) will be
  required in order to reserve the capacity.

  The host capacity of the overall system is typically fragmented into segments
  due to the underlying component’s scalability and each segment has a limited
  capacity. To increase resource efficiency, high utilization of host capacity
  is preferred. However, as resources are consumed on demand, each segment
  tends to reach nearly full capacity if the system doesn’t provide a way to
  reserve a portion of host capacity. Therefore, a function to reserve host
  capacity for failover events is important in order to achieve high
  availability of VMs.

* Host Maintenance

  A host has to be temporarily and safely removed from the overall system for
  maintenance such as hardware upgrades and firmware updates. After the host
  is put into maintenance mode, and before the maintenance work starts, live
  migration of its VMs should be triggered. During maintenance, the monitoring
  function on the host should be disabled and monitoring alerts for the host
  should be ignored. No recovery action should be triggered for VM instances
  on the host while it is running. The host should be excluded from reserved
  hosts as well.

* Event History

  A history of past events such as process failures, VM failures and host
  failures is useful for determining the maintenance work a host requires.
  An easy mechanism to track past events can save the operator time during
  system diagnosis. APIs exposing this history can also be used to generate
  health or SLA reports on VM availability.
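
The per-network responses described under "Recovery from network failure"
could be expressed as a configurable mapping, sketched below. The network
names, the ``Response`` values and the fallback for unknown networks are
assumptions made for illustration; the defaults simply mirror the three
examples given in that scenario.

```python
from enum import Enum
from typing import Dict, Optional


class Response(Enum):
    FENCE_AND_RESURRECT = "fence_and_resurrect"
    ALERT_ONLY = "alert_only"
    MIGRATE = "migrate"


# Defaults mirroring the storage, admin and project network examples.
DEFAULT_POLICY: Dict[str, Response] = {
    "storage": Response.FENCE_AND_RESURRECT,
    "admin": Response.ALERT_ONLY,
    "project": Response.MIGRATE,
}


def response_for(network: str,
                 overrides: Optional[Dict[str, Response]] = None) -> Response:
    """Pick the response to a lost connection on the given network.

    Operator overrides take precedence over the defaults. Unknown networks
    fall back to alerting, the least disruptive choice (an assumption).
    """
    policy = {**DEFAULT_POLICY, **(overrides or {})}
    return policy.get(network, Response.ALERT_ONLY)
```

An operator who prefers fencing on loss of the admin network would pass
``{"admin": Response.FENCE_AND_RESURRECT}`` as ``overrides``.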

Related User Stories
++++++++++++++++++++

* `Quotas, Usage Plans, and Capacity Management <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/draft/capacity_management.html>`_

  The concept of capacity reservation is shared with this story. The difference
  is that that story provides reservation for users, whereas this VM-HA story
  provides reservation for specific contexts of resource inquiry such as
  an instance evacuation, not an instance creation.

*Requirements*
++++++++++++++

* Flexible configuration of which VMs require HA

  Ideally it should be possible to configure which VMs require HA at
  several different levels of granularity, e.g. per VM, per flavor,
  per project, per availability zone, per host aggregate, per region,
  per cell.  A policy configuring a requirement or non-requirement for
  HA at a finer level of granularity should be able to override
  configuration set at a coarser level.  For example, an availability
  zone could be configured to require HA for all VMs inside it, but
  VMs booted within the availability zone with a flavor configured as
  not requiring HA would override the configuration at the
  availability zone level.

  However, it does not make sense to support configuration per compute
  host, since then VMs would inherit the HA feature
  non-deterministically, depending on whether ``nova-scheduler``
  happened to boot them on an HA compute host or a non-HA compute
  host.

* An ability to non-intrusively monitor VMs for failure

* An ability to monitor provisioning processes on the compute host for failure

  Provisioning processes include ``nova-compute``, associated backend
  hypervisor processes such as ``libvirtd``, and any other dependent
  services, e.g. ``neutron-openvswitch-agent`` if Open vSwitch is in use.

* An ability to monitor hypervisor host failure

* An ability to automatically restart VMs due to VM failure

  The restart should first be attempted on the same compute host, and if that
  fails, it should be attempted elsewhere.

* An ability to restart provisioning processes

* An ability to automatically resurrect VMs from a failed hypervisor host
  and restart them on another available host

  The host must be fenced (typically via a STONITH mechanism) prior to the
  resurrection process, to ensure that there are never multiple instances of
  the same VM accidentally running concurrently and conflicting with each
  other.  The conflict could cause data corruption, e.g. if both instances are
  writing to the same non-clustered filesystem backed by a virtual disk on
  shared storage, but it could also cause service-level failures even without
  shared storage.  For example, a VM on a failing host could still be
  unexpectedly communicating on a project network even when the host is
  unreachable via the cluster network, and this could conflict with
  another instance of the same VM resurrected on another compute host.

* An ability to disable the ``nova-compute`` service of a failed host so
  that ``nova-scheduler`` will not attempt to provision new VMs to that
  host before ``nova`` notices.

* An ability to make sure the target host for VM evacuation is aligned with the
  underlying system boundaries and limitations

* An ability to reserve hypervisor host capacity and update the capacity in the
  event of a host failure

* An ability for the operator to coordinate with host maintenance tasks

* An ability to check the history of failure and recovery actions
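
The override rule in the first requirement (finer-grained HA configuration
wins over coarser-grained configuration) can be sketched as follows. The
ordering of levels in ``LEVELS`` is an assumption, since this user story does
not fix a complete ordering, and the function name and the default of "no HA"
are likewise hypothetical.

```python
from typing import Dict, Optional

# Coarsest to finest. This ordering is assumed, not defined by the story.
LEVELS = ("region", "cell", "availability_zone", "host_aggregate",
          "project", "flavor", "vm")


def ha_required(settings: Dict[str, Optional[bool]],
                default: bool = False) -> bool:
    """Resolve whether a VM requires HA.

    ``settings`` maps a level name to True (HA required), False (HA not
    required) or None (not configured at that level).
    """
    result = default
    for level in LEVELS:
        value = settings.get(level)
        if value is not None:
            result = value  # a finer level overrides any coarser one
    return result
```

With this ordering, ``ha_required({"availability_zone": True, "flavor":
False})`` evaluates to ``False``, matching the availability zone and flavor
example in the requirement.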

*External References*
+++++++++++++++++++++

* `Automatic Evacuation (Etherpad) <https://etherpad.openstack.org/p/automatic-evacuation>`_

* `Instance Auto-Evacuation Cross Project Spec (In Review) <https://review.openstack.org/#/c/257809>`_

* `Instance HA Discussion (Etherpad) <https://etherpad.openstack.org/p/newton-instance-ha>`_

* `High Availability for Pets and Hypervisors (Video) <https://youtu.be/lddtWUP_IKQ>`_

* `Masakari (GitHub) <https://github.com/ntt-sic/masakari>`_

* `Masakari API Design <https://github.com/ntt-sic/masakari/wiki/Masakari-API-Design>`_

*Rejected User Stories / Usage Scenarios*
-----------------------------------------

None.

Glossary
--------

* **MTTR** - Mean Time To Repair

* `Availability <https://en.wikipedia.org/wiki/Availability>`_ -
  ratio of the expected value of the uptime of a system
  to the aggregate of the expected values of up and down time.
  Not to be confused with
  `reliability <https://en.wikipedia.org/wiki/Reliability_engineering>`_.

* `High Availability <https://en.wikipedia.org/wiki/High_availability>`_ -
  a characteristic of a system which aims to ensure an agreed level of
  operational performance for a higher than normal period.  Not to be
  confused with 100% availability, which is sometimes described as
  `fault tolerance <https://en.wikipedia.org/wiki/Fault_tolerance>`_.

* `Pets and cattle
  <http://www.theregister.co.uk/2013/03/18/servers_pets_or_cattle_cern/>`_ -
  a metaphor commonly used in the OpenStack community to describe the
  difference between two service architecture models: cloud-native,
  stateless, disposable instances with built-in resilience in the
  application layer (cattle), vs. legacy, stateful instances with no
  built-in resilience (pets).