Support HT-HA deployments
Some applications require Congress to be highly available (HA). Some applications require a Congress policy engine to handle a high volume of queries (high throughput - HT). This spec describes how we can support several deployment schemes that address some HA and HT requirements. blueprint high-availability-design Change-Id: I0ce2fc3895f022d3b363c9674c6dafbbc7447c75
This commit is contained in:
parent
f982d19f70
commit
4379a8c44f
|
@ -0,0 +1,746 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================================================
|
||||
Support high availability, high throughput deployment
|
||||
=====================================================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/congress/+spec/high-availability-design
|
||||
|
||||
Some applications require Congress to be highly available (HA). Some
|
||||
applications require a Congress policy engine to handle a high volume of
|
||||
queries (high throughput - HT). This proposal describes how we can support
|
||||
several deployment schemes that address several HA and HT requirements.
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
This spec aims to address three main problems:
|
||||
|
||||
1. Congress is not currently able to provide high query throughput because
|
||||
all queries are handled by a single, single-threaded policy engine instance.
|
||||
2. Congress is not currently able to failover quickly when its policy engine
|
||||
becomes unavailable.
|
||||
3. If the policy engine and a push datasource driver both crash, Congress is
|
||||
not currently able to restore the latest data state upon restart or
|
||||
failover.
|
||||
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Implement the required code changes and create deployment guides for the
|
||||
following reference deployments.
|
||||
|
||||
Warm standby for all Congress components in single process
|
||||
----------------------------------------------------------
|
||||
|
||||
- Downtime: ~1 minute (start a new Congress instance and ingest data from
|
||||
scratch)
|
||||
- Reliability: action executions may be lost during downtime.
|
||||
- Performance considerations: uniprocessing query throughput
|
||||
- Code changes: minimal
|
||||
|
||||
Active-active PE replication, DSDs warm-standby
|
||||
----------------------------------------------------
|
||||
|
||||
Run N instances of Congress policy engine in active-active configuration. One
|
||||
datasource driver per physical datasource published data on oslo-messaging to
|
||||
all policy engines.
|
||||
|
||||
::
|
||||
|
||||
+-------------------------------------+ +--------------+
|
||||
| Load Balancer (eg. HAProxy) | <----+ Push client |
|
||||
+----+-------------+-------------+----+ +--------------+
|
||||
| | |
|
||||
PE | PE | PE | all+DSDs node
|
||||
+---------+ +---------+ +---------+ +-----------------+
|
||||
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
||||
| | API | | | | API | | | | API | | | | DSD | | DSD | |
|
||||
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
||||
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
||||
| | PE | | | | PE | | | | PE | | | | DSD | | DSD | |
|
||||
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
||||
+---------+ +---------+ +---------+ +--------+--------+
|
||||
| | | |
|
||||
| | | |
|
||||
+--+----------+-------------+--------+--------+
|
||||
| |
|
||||
| |
|
||||
+-------+----+ +------------------------+-----------------+
|
||||
| Oslo Msg | | DBs (policy, config, push data, exec log)|
|
||||
+------------+ +------------------------------------------+
|
||||
|
||||
|
||||
- Downtime: < 1s for queries, ~2s for reactive enforcement
|
||||
- Deployment considerations:
|
||||
|
||||
- Cluster manager (eg. Pacemaker + Corosync) can be used to manage warm
|
||||
standby
|
||||
- Does not require global leader election
|
||||
- Performance considerations:
|
||||
|
||||
- Multi-process, multi-node query throughput
|
||||
- No redundant data-pulling load on datasources
|
||||
- DSDs node separate from PE, allowing high load DSDs to operate more
|
||||
smoothly and avoid affecting PE performance.
|
||||
- PE nodes are symmetric in configuration, making it easy to load balance
|
||||
evenly.
|
||||
|
||||
- Code changes:
|
||||
|
||||
- New synchronizer and harness to support two different node types:
|
||||
API+PE node and all-DSDs node
|
||||
|
||||
Details
|
||||
###########################################################################
|
||||
|
||||
- Datasource drivers (DSDs):
|
||||
|
||||
- One datasource driver per physical datasource.
|
||||
- All DSDs run in a single DSE node (process)
|
||||
- Push DSDs: optionally persist data in push data DB, so a new snapshot can
|
||||
be obtained whenever needed.
|
||||
|
||||
- Policy engine (PE):
|
||||
|
||||
- Replicate policy engine in active-active configuration.
|
||||
- Policy synchronized across PE instances via Policy DB
|
||||
- Every instance subscribes to the same data on oslo-messaging.
|
||||
- Reactive enforcement:
|
||||
All PE instances initiate reactive policy actions, but each DSD locally
|
||||
selects a "leader" to "listen to". The DSD ignores execution requests
|
||||
initiated by all other PE instances.
|
||||
|
||||
- Every PE instance computes the required reactive enforcement actions and
|
||||
initiate the corresponding execution requests over oslo-messaging.
|
||||
- Each DSD locally picks PE instance as leader (say the first instance the
|
||||
DSD hears from in the asymmetric node deployment, or the PE instance on
|
||||
the same node as the DSD in a symmetric node deployment) and executes
|
||||
only requests from that PE.
|
||||
- If heartbeat contact is lost with the leader, the DSD selects a new
|
||||
leader.
|
||||
- Each PE instance is unaware of whether it is a "leader"
|
||||
|
||||
- API:
|
||||
|
||||
- Each node has an active API service
|
||||
- Each API service routes requests for PE to its associated intranode PE
|
||||
- Requests for any other service(eg. get data source status) are routed to
|
||||
DSE2, which will be fielded by some active instance of the service on some
|
||||
node
|
||||
- Details:
|
||||
- in API models, replace every invoke_rpc with a conditional:
|
||||
|
||||
- if the target is policy engine, target same-node PE
|
||||
eg.::
|
||||
self.invoke_rpc(
|
||||
caller, 'get_row_data', args, server=self.node.pe.rpc_server)
|
||||
|
||||
- otherwise, invoke_rpc stays as is, routing to DSE2
|
||||
eg.::
|
||||
self.invoke_rpc(caller, 'get_row_data', args)
|
||||
|
||||
- Load balancer:
|
||||
|
||||
- Layer 7 load balancer (e.g. HAProxy) distributes incoming API calls among
|
||||
the nodes (each running API service).
|
||||
- load balancer optionally configured to use sticky session to pin each API
|
||||
caller to a particular node. This configuration avoids the experience of
|
||||
going back in time.
|
||||
|
||||
- External components (load balancer, DBs, and oslo messaging bus) can be made
|
||||
highly available using standard solutions (e.g. clustered LB, Galera MySQL
|
||||
cluster, HA rabbitMQ)
|
||||
|
||||
|
||||
Dealing with missed actions during failover
|
||||
###########################################################################
|
||||
|
||||
When a leader fails (global or local), it takes time for the failure to be
|
||||
detected and a new leader anointed. During the failover, reactive enforcement
|
||||
actions expected to be triggered would be missed. Four proposed approaches
|
||||
are discussed below.
|
||||
|
||||
- Tolerate missed actions during failover: for same applications, it may be
|
||||
acceptable to miss actions during failover.
|
||||
|
||||
- Re-execute recent actions after failover
|
||||
|
||||
- Each PE instance remembers its recent action requests (including the
|
||||
requests a follower PE computed but did not send)
|
||||
- On failover, the DSD requests the recent action requests from the new
|
||||
leader and executes them (within a certain recency window)
|
||||
- Duplicate execution expected on failover.
|
||||
- Re-execute recent unmatched actions after failover (possible future work)
|
||||
|
||||
- We can reduce the number of duplicate executions on failover by attempting
|
||||
to match a new leader's recent action requests with the already executed
|
||||
requests, and only additionally executing those unmatched.
|
||||
- DSD logs all recent successfully executed action requests in DB
|
||||
- Request matching can be done by a combination of the following information:
|
||||
|
||||
- the action requested
|
||||
- the timestamp of the request
|
||||
- the sequence number of the data update that triggered the action
|
||||
- the full derivation tree that triggers the action
|
||||
- Matching is imperfect, but still helpful
|
||||
|
||||
- Log and replay data updates (possible future work)
|
||||
|
||||
- Log every data update from every data source, and let a new leader replay
|
||||
the updates where the previous leader left off to generate the needed
|
||||
action requests.
|
||||
- The logging can be directly supported by transport or by additional DB
|
||||
|
||||
- kafka naturally supports this model
|
||||
- hard to do directly with oslo-messaging + RabbitMQ
|
||||
|
||||
- Leaderless de-duplication (possible future work)
|
||||
|
||||
- If a very good matching method is implemented for re-execution for recent
|
||||
unmatched actions after failover, it is possible to go one stop further
|
||||
and simply operate in this mode full time.
|
||||
- Each incoming action request is matched against all recently executed
|
||||
action requests.
|
||||
|
||||
- Discard if matched.
|
||||
- Execute if unmatched.
|
||||
- Eliminates the need for selecting leader (global or local) and improves
|
||||
failover speed
|
||||
|
||||
We propose to focus first on supporting the first two options
|
||||
(deployers' choice). The more complex options may be implemented and supported
|
||||
in future work.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We first discuss the main decision points before detailing several alternative
|
||||
deployments.
|
||||
|
||||
For active-active replication of the PE, here are the main decision points:
|
||||
|
||||
A. node configurations
|
||||
|
||||
- Options:
|
||||
|
||||
1. single node-type (every node has API+PE+DSDs).
|
||||
2. two node-types (API+PE nodes, all-DSDs node). [proposed target]
|
||||
3. many node-types (API+PE nodes, all DSDs in separate nodes).
|
||||
|
||||
- Discussions: The many node-types configuration is most flexible and has
|
||||
the best support for high-load DSDs, but it also requires the most work to
|
||||
dev and to deploy.
|
||||
We propose to target the two node-types configuration because it gives
|
||||
reasonable support for high-load DSDs while keeping both the development
|
||||
and the deployment complexities low.
|
||||
|
||||
B. global vs local leader for action execution
|
||||
|
||||
- Options:
|
||||
|
||||
1. global leader: Pacemaker anoints a global leader among PE instances;
|
||||
only the leader sends action-execution requests.
|
||||
2. local leader: every PE instance sends action-execution requests, but
|
||||
each receiving DSD locally picks a "leader" to listen to.
|
||||
[proposed target]
|
||||
|
||||
- Discussions: Because there is a single active DSD for a given data source,
|
||||
it is a natural spot to locally choose a "leader" among the PE instances
|
||||
sending reactive enforcement action execution requests.
|
||||
We propose to target the local leader style because it avoids the
|
||||
development and deployment complexities associated with global leader
|
||||
election.
|
||||
Furthermore, because all PE instances perform reactive enforcement and send
|
||||
action execution requests, the redundancy opens up the possibility for
|
||||
zero disruption to reactive enforcement when a PE intance fails.
|
||||
|
||||
C. DSD redundancy
|
||||
|
||||
- Options:
|
||||
|
||||
1. warm standby: only one set of DSDs running at a given time; backup
|
||||
instances ready to launch.
|
||||
2. hot standby: multiple instances running, but only one set is active.
|
||||
3. active-active: multiple instances active.
|
||||
|
||||
- Discussions:
|
||||
|
||||
- For pull DSDs, we propose to target warm standby seems most appropriate
|
||||
because warm startup time is low (seconds) relative to frequency of data
|
||||
pulls.
|
||||
- For push DSDs, warm standby is generally sufficient except for use cases
|
||||
that demand sub-second latency even during a failover. Those use cases
|
||||
would require active-active replication of the push DSDs. But even with
|
||||
active-active replication of push DSDs, other unsolved issues in
|
||||
action-execution prevent us from delivering sub-second end-to-end latency
|
||||
(push data to triggered action executed) during failover (see leaderless
|
||||
de-duplication approach for sub-second action execution failover).
|
||||
Since we cannot yet realize the benefit of active-active replication of
|
||||
push DSDs, we propose to target a warm-standby deployment, leaving
|
||||
active-active replication as potential future work.
|
||||
|
||||
Active-active PE replication, DSDs hot-standby, all components in one node
|
||||
###########################################################################
|
||||
|
||||
Run N instances of single-process Congress.
|
||||
|
||||
One instance is selected as leader by Pacemaker. Only the leader has active
|
||||
datasource drivers (which pull data, accept push data, and accept RPC calls
|
||||
from the API service), but all instances subscribes to and processes data on
|
||||
oslo-messaging. Queries are load balanced among instances.
|
||||
|
||||
::
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| Load Balancer (eg. HAProxy) |
|
||||
+----+------------------------+------------------------+----------------+
|
||||
| | |
|
||||
| leader | follower | follower
|
||||
+---------------------+ +---------------------+ +---------------------+
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| | API | |DSD (on) | | | | API | |DSD (off)| | | | API | |DSD (off)| |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| | PE | |DSD (on) | | | | PE | |DSD (off)| | | | PE | |DSD (off)| |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
+---------------------+ +---------------------+ +---------------------+
|
||||
| | |
|
||||
| | |
|
||||
+---------+--------------+-------------------+----+
|
||||
| |
|
||||
| |
|
||||
+--------------+-----------+ +--------------------+---------------------+
|
||||
| Oslo Msg | | DBs (policy, config, push data, exec log)|
|
||||
+--------------------------+ +------------------------------------------+
|
||||
|
||||
|
||||
|
||||
- Downtime: < 1s for queries, ~2s for reactive policy
|
||||
- Deployment considerations:
|
||||
|
||||
- Requires cluster manager (Pacemaker) and cluster messaging (Corosync)
|
||||
- Relatively simple Pacemaker deployment because every node is identical
|
||||
- Requires leader election (handled by Pacemaker+Corosync)
|
||||
- Easy to start new DSD (make API call, all instances sync via DB)
|
||||
- Performance considerations:
|
||||
|
||||
- [Pro] Multi-process query throughput
|
||||
- [Pro] No redundant data-pulling load on datasources
|
||||
- [Con] If some data source drivers have high demand (say telemetry data),
|
||||
performance may suffer when deployed in the same python process as other
|
||||
Congress components.
|
||||
- [Con] Because the leader has the added load of active DSDs, PE performance
|
||||
may be reduced, making it harder to evenly load balance across instances.
|
||||
- Code changes:
|
||||
|
||||
- Add API call to designate a node as leader or follower
|
||||
- Custom resource agent that allows Pacemaker to start, stop, promote, and
|
||||
demote Congress instances
|
||||
|
||||
Details
|
||||
++++++++++++++
|
||||
|
||||
- Pull datasource drivers (pull DSD):
|
||||
|
||||
- One active datasource driver per physical datasource.
|
||||
- Only leader node has active DSDs (active polling loop and active
|
||||
RPC server)
|
||||
- On node failure, new leader node activates DSDs.
|
||||
|
||||
- Push datasource drivers (push DSD):
|
||||
|
||||
- One active datasource driver per physical datasource.
|
||||
- Only leader node has active DSDs (active RPC server)
|
||||
- On node failure, new leader node activates DSDs.
|
||||
- Persist data in push data DB, so a new snapshot can be obtained.
|
||||
|
||||
- Policy engine (PE):
|
||||
|
||||
- Replicate policy engine in active-active configuration.
|
||||
- Policy synchronized across PE instances via Policy DB
|
||||
- Every instance subscribes to the same data on oslo-messaging.
|
||||
- Reactive enforcement: See later section "Reactive enforcement architecture
|
||||
for active-active deployments"
|
||||
|
||||
- API:
|
||||
|
||||
- Add new API calls for designating the receiving node as leader or follower.
|
||||
The call must complete all tasks before returning (eg. start/stop RPC)
|
||||
- Each node has an active API service
|
||||
- Each API service routes requests for PE to its associated intranode PE,
|
||||
bypassing DSE2.
|
||||
- Requests for any other service(eg. get data source status) are routed to
|
||||
DSE2, which will be fielded by some active instance of the service on some
|
||||
node
|
||||
- Details:
|
||||
- in API models, replace every invoke_rpc with a conditional:
|
||||
|
||||
- if the target is policy engine, target same-node PE
|
||||
eg.::
|
||||
self.invoke_rpc(
|
||||
caller, 'get_row_data', args,
|
||||
server=self.node.pe.rpc_server)
|
||||
|
||||
- otherwise, invoke_rpc stays as is, routing to DSE2
|
||||
eg.::
|
||||
self.invoke_rpc(caller, 'get_row_data', args)
|
||||
|
||||
- Load balancer:
|
||||
|
||||
- Layer 7 load balancer (e.g. HAProxy) distributes incoming API calls among
|
||||
the nodes (each running API service).
|
||||
- load balancer optionally configured to use sticky session to pin each API
|
||||
caller to a particular node. This configuration avoids the experience of
|
||||
going back in time.
|
||||
|
||||
- Each DseNode is monitored and managed by a cluster manager (eg. Pacemaker)
|
||||
- External components (load balancer, DBs, and oslo messaging bus) can be made
|
||||
highly available using standard solutions (e.g. clustered LB, Galera MySQL
|
||||
cluster, HA rabbitMQ)
|
||||
|
||||
- Global leader election with Pacemaker:
|
||||
|
||||
- A resource agent contains the scripts that tells a Congress instance it is
|
||||
a leader or follower.
|
||||
- Pacemaker decides which Congress instance to promote to leader (master).
|
||||
- Pacemaker promotes (demotes) the appropriate Congress instance to leader
|
||||
(follower) via the resource agent.
|
||||
- Fencing:
|
||||
|
||||
- If the leader node stops responding, and a new node is promoted to
|
||||
leader, it is possible that the unresponsive node is still doing work
|
||||
(eg. listening on oslo-messaging, issuing action requests).
|
||||
- It is generally not a catastrophe if for a time there is more than one
|
||||
Congress node doing the work of a leader. (Potential effects may include:
|
||||
duplicate action requests and redundant data source pulls)
|
||||
- Pacemaker can be configured with strict fencing and STONITH for
|
||||
deployments that require it. (deployers' choice)
|
||||
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_what_is_stonith
|
||||
|
||||
- In case of network partitions:
|
||||
|
||||
- Pacemaker can be configured to stop each node that is not part of a
|
||||
cluster reaching quorum, or to allow each partition to continue
|
||||
operating. (deployers' choice)
|
||||
http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-options.html
|
||||
|
||||
Active-active PE replication, DSDs warm-standby, each DSD in its own node
|
||||
###########################################################################
|
||||
|
||||
Run N instances of Congress policy engine in active-active configuration. One
|
||||
datasource driver per physical datasource published data on oslo-messaging to
|
||||
all policy engines.
|
||||
|
||||
::
|
||||
|
||||
+-------------------------------------+
|
||||
| Load Balancer (eg. HAProxy) |
|
||||
+----+-------------+-------------+----+
|
||||
| | |
|
||||
| | |
|
||||
+---------+ +---------+ +---------+
|
||||
| +-----+ | | +-----+ | | +-----+ |
|
||||
| | API | | | | API | | | | API | |
|
||||
| +-----+ | | +-----+ | | +-----+ |
|
||||
| +-----+ | | +-----+ | | +-----+ |
|
||||
| | PE | | | | PE | | | | PE | |
|
||||
| +-----+ | | +-----+ | | +-----+ |
|
||||
+---------+ +---------+ +---------+
|
||||
| | |
|
||||
| | |
|
||||
+------------------------------------------+-----------------+
|
||||
| | | | |
|
||||
| | | | |
|
||||
+----+-------------+-------------+---+ +-------+--------+ +-----+-----+
|
||||
| Oslo Msg | | Push Data DB | | DBs |
|
||||
+----+-------------+-------------+---+ ++---------------+ +-----------+
|
||||
| | | | (DBs may be combined)
|
||||
+----------+ +----------+ +----------+
|
||||
| +------+ | | +------+ | | +------+ |
|
||||
| | Poll | | | | Poll | | | | Push | |
|
||||
| | Drv | | | | Drv | | | | Drv | |
|
||||
| | DS 1 | | | | DS 2 | | | | DS 3 | |
|
||||
| +------+ | | +------+ | | +------+ |
|
||||
+----------+ +----------+ +----------+
|
||||
| | |
|
||||
+---+--+ +---+--+ +---+--+
|
||||
| | | | | |
|
||||
| DS 1 | | DS 2 | | DS 3 |
|
||||
| | | | | |
|
||||
+------+ +------+ +------+
|
||||
|
||||
- Downtime: < 1s for queries, ~2s for reactive policy
|
||||
- Deployment considerations:
|
||||
|
||||
- Requires cluster manager (Pacemaker) and cluster messaging (Corosync)
|
||||
- More complex Pacemaker deployment because there are many different
|
||||
kinds of nodes
|
||||
- Does not require global leader election (but that's not a big saving if
|
||||
we're running Pacemaker+Corosync anyway)
|
||||
- Performance considerations:
|
||||
|
||||
- [Pro] Multi-process query throughput
|
||||
- [Pro] No redundant data-pulling load on datasources
|
||||
- [Pro] Each DSD can run in its own node, allowing high load DSDs to operate
|
||||
more smoothly and avoid affecting PE performance.
|
||||
- [Pro] PE nodes are symmetric in configuration, making it easy to load
|
||||
balance evenly.
|
||||
|
||||
- Code changes:
|
||||
|
||||
- New synchronizer and harness and DB schema to support per node
|
||||
configuration
|
||||
|
||||
Hot standby for all Congress components in single process
|
||||
###########################################################################
|
||||
|
||||
Run N instances of single-process Congress, as proposed in:
|
||||
https://blueprints.launchpad.net/congress/+spec/basic-high-availability
|
||||
|
||||
A floating IP points to the primary instance which handles all queries and
|
||||
requests, failing over when primary instance is down.
|
||||
All instances ingest and process data to stay up to date.
|
||||
|
||||
::
|
||||
|
||||
+---------------+
|
||||
| Floating IP | - - - - - - + - - - - - - - - - - - -+
|
||||
+----+----------+ | |
|
||||
|
|
||||
| | |
|
||||
+---------------------+ +---------------------+ +---------------------+
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| | API | | DSD | | | | API | | DSD | | | | API | | DSD | |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| | PE | | DSD | | | | PE | | DSD | | | | PE | | DSD | |
|
||||
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
||||
| +-----------------+ | | +-----------------+ | | +-----------------+ |
|
||||
| | Oslo Msg | | | | Oslo Msg | | | | Oslo Msg | |
|
||||
| +-----------------+ | | +-----------------+ | | +-----------------+ |
|
||||
+---------------------+ +---------------------+ +---------------------+
|
||||
| | |
|
||||
| | |
|
||||
+----------+------------------------+----------------------+------------+
|
||||
| Databases |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
|
||||
- Downtime: < 1s for queries, ~2s for reactive policy
|
||||
- Feature limitations:
|
||||
|
||||
- Limited support for action execution: each action execution is triggered
|
||||
N times)
|
||||
- Limited support for push drivers: push updates only primary instance
|
||||
(optional DB-sync to non-primary instances)
|
||||
|
||||
- Deployment considerations:
|
||||
|
||||
- Very easy to deploy. No need for cluster manager. Just start N independent
|
||||
instances of Congress (in-process messaging) and setup floating IP.
|
||||
- Performance considerations:
|
||||
|
||||
- Performance characteristics similar to single-instance Congress
|
||||
- [Con] uniprocessing query throughput (optional load balancer can be added
|
||||
to balance queries between instances)
|
||||
- [Con] Extra load on data sources from replicated data source drivers
|
||||
pulling same data N times
|
||||
- Code changes:
|
||||
|
||||
- (optional) DB-sync of pushed data to non-primary instances
|
||||
|
||||
|
||||
Policy
|
||||
------
|
||||
|
||||
Not applicable
|
||||
|
||||
Policy actions
|
||||
--------------
|
||||
|
||||
Not applicable
|
||||
|
||||
|
||||
Data sources
|
||||
------------
|
||||
|
||||
Not applicable
|
||||
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
No impact
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
No impact
|
||||
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
No major security impact identified compared to a non-HA distributed
|
||||
deployment.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
No impact
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
Proposed changes generally transparent to end user. Some exceptions:
|
||||
|
||||
- Different PE instances may be out-of-sync in their data and policies
|
||||
(eventual consistency). The issue is generally made transparent to the end
|
||||
user by making each user sticky to a particular PE instance. But if
|
||||
a PE instance goes down, the end user reaches a different instance and may
|
||||
experience out-of-sync artifacts.
|
||||
|
||||
Performance impact
|
||||
------------------
|
||||
|
||||
- In single node deployment, there is generally no performance impact.
|
||||
- Increased latency due to network communication required by multi-node
|
||||
deployment
|
||||
- Increased reactive enforcement latency if action executions are persistently
|
||||
logged to facilitate smoother failover
|
||||
- PE replication can achieve greater query throughput
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
- New config settings:
|
||||
|
||||
- set DSE node type to one of the following:
|
||||
|
||||
- PE+API node
|
||||
- a DSDs node
|
||||
- all-in-one node (backward compatible default)
|
||||
|
||||
- set reactive enforcement failover strategy:
|
||||
|
||||
- do not attempt to recover missed actions (backward compatible default)
|
||||
- after failover, repeat recent action requests
|
||||
- after failover, repeat recent action requests not matched to logged
|
||||
executions
|
||||
|
||||
- Proposed changes have no impact on existing single-node deployments.
|
||||
100% backward compatibility expected.
|
||||
- Changes only have effect if deployer chooses to set up a multi-node
|
||||
deployment with the appropriate configs.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
No major impact identified.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Work to be assigned and tracked via launchpad.
|
||||
|
||||
|
||||
Work items
|
||||
----------
|
||||
|
||||
Items with order dependency:
|
||||
|
||||
1. API routing.
|
||||
Change API routing to support active-active PE replication, routing PE-bound
|
||||
RPCs to the PE instance on the same node as the receiving API server.
|
||||
Changes expected to be confined to congress/api/*
|
||||
2. Reactive enforcement.
|
||||
Change datasource_driver.py:ExecutionDriver class to handle action execution
|
||||
requests from replicated PE-nodes (locally choose leader to follow)
|
||||
3. (low priority) Missed actions mitigation.
|
||||
|
||||
- Implement changes to mitigate missed actions during DSD failover
|
||||
- Implement changes to mitigate missed actions during PE failover
|
||||
|
||||
Items without dependency:
|
||||
|
||||
- Push data persistence.
|
||||
Change datasource_driver.py:PushedDataSourceDriver class to support
|
||||
persistence of pushed data. Corresponding DB changes also needed.
|
||||
- (potential) Leaderless de-duplication of action execution requests.
|
||||
- HA guide.
|
||||
Create HA guide sketching the properties and trade-offs of each different
|
||||
deployment types.
|
||||
- Deployment guide.
|
||||
Create deployment guide for active-active PE replication
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Requires Congress to support distributed deployment (for example with policy
|
||||
engine and datasource drivers on separate DseNodes.) Distributed deployment
|
||||
has been addressed by several implemented blueprints. The following patch is
|
||||
expected to be the final piece required.
|
||||
https://review.openstack.org/#/c/307693/
|
||||
- This spec does not introduce any new code of library dependencies.
|
||||
- In line with OpenStack recommendation
|
||||
(http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html), some
|
||||
reference deployments use open source software outside of OpenStack:
|
||||
- HAProxy: http://www.haproxy.org
|
||||
- Pacemaker: http://clusterlabs.org/wiki/Pacemaker
|
||||
- Corosync: http://corosync.github.io/corosync/
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We propose to add tempest tests for the following scenarios:
|
||||
- Up to (N - 1) PE-nodes killed
|
||||
- Previously killed PE-nodes rejoin.
|
||||
- Kill and restart DSDs-node, possibly at the same time PE-nodes are killed.
|
||||
|
||||
Split brain scenarios can be manually tested.
|
||||
|
||||
|
||||
Documentation impact
|
||||
====================
|
||||
|
||||
Deployment guide to be added for each supported reference deployment. No impact
|
||||
on existing documentation.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
- IRC discussion on major design decisions (#topic HA design):
|
||||
http://eavesdrop.openstack.org/meetings/congressteammeeting/2016/congressteammeeting.2016-06-09-00.04.log.txt
|
||||
- Notes from summit session:
|
||||
https://etherpad.openstack.org/p/newton-congress-availability
|
||||
- OpenStack HA guide:
|
||||
http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html
|
||||
- HAProxy documentation:
|
||||
http://www.haproxy.org/#docs
|
||||
- Pacemaker documentation:
|
||||
|
||||
- directory: http://clusterlabs.org/doc/
|
||||
- Cluster from scratch:
|
||||
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/index.html
|
||||
- Configuration explained:
|
||||
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
|
||||
|
||||
- OCF resource agents:
|
||||
http://www.linux-ha.org/wiki/OCF_Resource_Agents
|
Loading…
Reference in New Issue