[HA Guide] Update for the current predominant architectures

Begin updating the guide to reflect current architecture best practices.
Remove the keepalived architecture, as its use is increasingly rare.

Change-Id: Id62d09707611f4706620b00f7800b80138afe98d
Andrew Beekhof 2016-09-20 12:31:35 +10:00
parent 5a45c9ce62
commit c5c825fd88
3 changed files with 27 additions and 107 deletions

Binary image file removed (52 KiB); preview not shown.


@@ -1,96 +0,0 @@
============================
The keepalived architecture
============================
High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:
.. image:: /figures/keepalived-arch.jpg
:width: 100%
Depending on the method used to communicate with the service, one of the
following availability strategies is used:
- Keepalived, for the HAProxy instances.
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed over a TCP socket that can be load balanced (a minimal
  configuration sketch appears below).
- Built-in application clustering, when available from the application.
Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
when they can coexist and coordinate by other means. RPC in
``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
active/passive mode.
Known issues with cinder-volume make it advisable to run it
active/passive for now; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to
another node.
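To make the first two strategies concrete, the following is a minimal,
hypothetical sketch: keepalived holds a virtual IP on one controller, and
HAProxy, bound to that VIP, load balances a TCP service across the
controllers. The interface name, router ID, priority, service name,
addresses, and ports are placeholders, not values taken from this guide.

.. code-block:: none

   # keepalived.conf: a minimal VRRP instance holding the HAProxy VIP
   vrrp_instance haproxy_vip {
       state BACKUP
       interface eth0
       virtual_router_id 51
       priority 101
       virtual_ipaddress {
           192.0.2.10/24
       }
   }

   # haproxy.cfg: load balancing one TCP service behind that VIP
   listen dashboard_cluster
       bind 192.0.2.10:443
       balance source
       option tcpka
       server controller1 192.0.2.11:443 check inter 2000 rise 2 fall 5
       server controller2 192.0.2.12:443 check inter 2000 rise 2 fall 5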
Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~
This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following sections describe these limitations.
#. Keepalived and network partitions
   In the case of a network partition, two or more nodes running
   keepalived may claim to hold the same VIP, which can lead to
   undesired behaviour. Since keepalived uses VRRP over multicast to
   elect a master (VIP owner), a network partition in which keepalived
   nodes cannot communicate results in the VIP existing on two nodes.
   When the network partition is resolved, the duplicate VIPs should
   also be resolved. Note that this network partition problem with VRRP
   is a known limitation of this architecture.
#. Cinder-volume as a single point of failure
   There are currently concerns over the ability of the cinder-volume
   service to run as a fully active-active service. During the Mitaka
   timeframe, this is being worked on; see:
   https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
   Thus, cinder-volume runs on only one of the controller nodes, even
   though it is configured on all of them. If the node running
   cinder-volume fails, the service should be started on a surviving
   controller node (a sketch of this manual recovery follows the list).
#. Neutron-lbaas-agent as a single point of failure
The current design of the neutron LBaaS agent using the HAProxy
driver does not allow high availability for the project load
balancers. The neutron-lbaas-agent service will be enabled and
running on all controllers, allowing for load balancers to be
distributed across all nodes. However, a controller node failure
will stop all load balancers running on that node until the service
is recovered or the load balancer is manually removed and created
again.
#. Service monitoring and recovery required
   An external service monitoring infrastructure is required to check
   OpenStack service health and notify operators of any failure. This
   architecture does not provide any facility for that, so the
   OpenStack deployment must be integrated with an existing monitoring
   environment.
#. Manual recovery after a full cluster restart
   Some support services used by RDO or RHEL OSP use their own form of
   application clustering. Usually, these services maintain a cluster
   quorum that may be lost if all cluster nodes restart simultaneously,
   for example during a power outage. Each service then requires its
   own procedure to regain quorum.
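As a hedged illustration of the manual recovery described in the
cinder-volume limitation above, the following console sketch checks the
service status and then starts cinder-volume on a surviving controller.
The ``openstack-cinder-volume`` unit name is the one used by RDO packages
and may differ on other distributions.

.. code-block:: console

   # With admin credentials, check where cinder-volume is reported up:
   $ openstack volume service list
   # On a surviving controller node, start the service manually:
   $ sudo systemctl start openstack-cinder-volume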
If you find any or all of these limitations concerning, you are
encouraged to refer to the
:doc:`Pacemaker HA architecture<intro-ha-arch-pacemaker>` instead.


@@ -42,21 +42,37 @@ Networking for high availability.
Common deployment architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are primarily two HA architectures in use today.
There are primarily two recommended architectures for making OpenStack
highly available.
One uses a cluster manager such as Pacemaker or Veritas to co-ordinate
the actions of the various services across a set of machines. Since
we are focused on FOSS, we will refer to this as the Pacemaker
architecture.
Both use a cluster manager such as Pacemaker or Veritas to
orchestrate the actions of the various services across a set of
machines. Since we are focused on FOSS, we will refer to these as
Pacemaker architectures.
The other is optimized for Active/Active services that do not require
any inter-machine coordination. In this setup, services are started by
your init system (systemd in most modern distributions) and a tool is
used to move IP addresses between the hosts. The most common package
for doing this is keepalived.
The architectures differ in the sets of services managed by the
cluster.
Traditionally, Pacemaker has been positioned as an all-encompassing
solution. However, as OpenStack services have matured, they are
increasingly able to run in an active/active configuration and
gracefully tolerate the disappearance of the APIs on which they
depend.
With this in mind, some vendors are restricting Pacemaker's use to
services that must operate in an active/passive mode (such as
cinder-volume), those with multiple states (for example, Galera) and
those with complex bootstrapping procedures (such as RabbitMQ).
The majority of services, needing no real orchestration, are handled
by systemd on each node. This approach avoids the need to coordinate
service upgrades or location changes with the cluster and has the
added advantage of more easily scaling beyond Corosync's 16-node
limit. However, it will generally require the addition of an
enterprise monitoring solution such as Nagios or Sensu for those
wanting centralized failure reporting.
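A rough sketch of this split, assuming RDO service names and the ``pcs``
command line (both of which vary between distributions), places only the
active/passive service under Pacemaker and leaves the remaining services
to systemd on every controller:

.. code-block:: console

   # Managed by Pacemaker because it must run active/passive:
   $ pcs resource create cinder-volume systemd:openstack-cinder-volume
   # Services that can run active/active are simply enabled under systemd:
   $ systemctl enable --now openstack-nova-conductor httpd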
.. toctree::
:maxdepth: 1
intro-ha-arch-pacemaker.rst
intro-ha-arch-keepalived.rst