746 lines
29 KiB
ReStructuredText
746 lines
29 KiB
ReStructuredText
..
|
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
|
License.
|
|
|
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
|
|
|
=====================================================
|
|
Support high availability, high throughput deployment
|
|
=====================================================
|
|
|
|
Include the URL of your launchpad blueprint:
|
|
|
|
https://blueprints.launchpad.net/congress/+spec/high-availability-design
|
|
|
|
Some applications require Congress to be highly available (HA). Some
|
|
applications require a Congress policy engine to handle a high volume of
|
|
queries (high throughput - HT). This proposal describes how we can support
|
|
several deployment schemes that address several HA and HT requirements.
|
|
|
|
|
|
Problem description
|
|
===================
|
|
|
|
This spec aims to address three main problems:
|
|
|
|
1. Congress is not currently able to provide high query throughput because
|
|
all queries are handled by a single, single-threaded policy engine instance.
|
|
2. Congress is not currently able to failover quickly when its policy engine
|
|
becomes unavailable.
|
|
3. If the policy engine and a push datasource driver both crash, Congress is
|
|
not currently able to restore the latest data state upon restart or
|
|
failover.
|
|
|
|
|
|
Proposed change
|
|
===============
|
|
|
|
Implement the required code changes and create deployment guides for the
|
|
following reference deployments.
|
|
|
|
Warm standby for all Congress components in single process
|
|
----------------------------------------------------------
|
|
|
|
- Downtime: ~1 minute (start a new Congress instance and ingest data from
|
|
scratch)
|
|
- Reliability: action executions may be lost during downtime.
|
|
- Performance considerations: uniprocessing query throughput
|
|
- Code changes: minimal
|
|
|
|
Active-active PE replication, DSDs warm-standby
|
|
----------------------------------------------------
|
|
|
|
Run N instances of Congress policy engine in active-active configuration. One
|
|
datasource driver per physical datasource published data on oslo-messaging to
|
|
all policy engines.
|
|
|
|
::
|
|
|
|
+-------------------------------------+ +--------------+
|
|
| Load Balancer (eg. HAProxy) | <----+ Push client |
|
|
+----+-------------+-------------+----+ +--------------+
|
|
| | |
|
|
PE | PE | PE | all+DSDs node
|
|
+---------+ +---------+ +---------+ +-----------------+
|
|
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
|
| | API | | | | API | | | | API | | | | DSD | | DSD | |
|
|
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
|
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
|
| | PE | | | | PE | | | | PE | | | | DSD | | DSD | |
|
|
| +-----+ | | +-----+ | | +-----+ | | +-----+ +-----+ |
|
|
+---------+ +---------+ +---------+ +--------+--------+
|
|
| | | |
|
|
| | | |
|
|
+--+----------+-------------+--------+--------+
|
|
| |
|
|
| |
|
|
+-------+----+ +------------------------+-----------------+
|
|
| Oslo Msg | | DBs (policy, config, push data, exec log)|
|
|
+------------+ +------------------------------------------+
|
|
|
|
|
|
- Downtime: < 1s for queries, ~2s for reactive enforcement
|
|
- Deployment considerations:
|
|
|
|
- Cluster manager (eg. Pacemaker + Corosync) can be used to manage warm
|
|
standby
|
|
- Does not require global leader election
|
|
- Performance considerations:
|
|
|
|
- Multi-process, multi-node query throughput
|
|
- No redundant data-pulling load on datasources
|
|
- DSDs node separate from PE, allowing high load DSDs to operate more
|
|
smoothly and avoid affecting PE performance.
|
|
- PE nodes are symmetric in configuration, making it easy to load balance
|
|
evenly.
|
|
|
|
- Code changes:
|
|
|
|
- New synchronizer and harness to support two different node types:
|
|
API+PE node and all-DSDs node
|
|
|
|
Details
|
|
###########################################################################
|
|
|
|
- Datasource drivers (DSDs):
|
|
|
|
- One datasource driver per physical datasource.
|
|
- All DSDs run in a single DSE node (process)
|
|
- Push DSDs: optionally persist data in push data DB, so a new snapshot can
|
|
be obtained whenever needed.
|
|
|
|
- Policy engine (PE):
|
|
|
|
- Replicate policy engine in active-active configuration.
|
|
- Policy synchronized across PE instances via Policy DB
|
|
- Every instance subscribes to the same data on oslo-messaging.
|
|
- Reactive enforcement:
|
|
All PE instances initiate reactive policy actions, but each DSD locally
|
|
selects a "leader" to "listen to". The DSD ignores execution requests
|
|
initiated by all other PE instances.
|
|
|
|
- Every PE instance computes the required reactive enforcement actions and
|
|
initiate the corresponding execution requests over oslo-messaging.
|
|
- Each DSD locally picks PE instance as leader (say the first instance the
|
|
DSD hears from in the asymmetric node deployment, or the PE instance on
|
|
the same node as the DSD in a symmetric node deployment) and executes
|
|
only requests from that PE.
|
|
- If heartbeat contact is lost with the leader, the DSD selects a new
|
|
leader.
|
|
- Each PE instance is unaware of whether it is a "leader"
|
|
|
|
- API:
|
|
|
|
- Each node has an active API service
|
|
- Each API service routes requests for PE to its associated intranode PE
|
|
- Requests for any other service(eg. get data source status) are routed to
|
|
DSE2, which will be fielded by some active instance of the service on some
|
|
node
|
|
- Details:
|
|
- in API models, replace every invoke_rpc with a conditional:
|
|
|
|
- if the target is policy engine, target same-node PE
|
|
eg.::
|
|
self.invoke_rpc(
|
|
caller, 'get_row_data', args, server=self.node.pe.rpc_server)
|
|
|
|
- otherwise, invoke_rpc stays as is, routing to DSE2
|
|
eg.::
|
|
self.invoke_rpc(caller, 'get_row_data', args)
|
|
|
|
- Load balancer:
|
|
|
|
- Layer 7 load balancer (e.g. HAProxy) distributes incoming API calls among
|
|
the nodes (each running API service).
|
|
- load balancer optionally configured to use sticky session to pin each API
|
|
caller to a particular node. This configuration avoids the experience of
|
|
going back in time.
|
|
|
|
- External components (load balancer, DBs, and oslo messaging bus) can be made
|
|
highly available using standard solutions (e.g. clustered LB, Galera MySQL
|
|
cluster, HA rabbitMQ)
|
|
|
|
|
|
Dealing with missed actions during failover
|
|
###########################################################################
|
|
|
|
When a leader fails (global or local), it takes time for the failure to be
|
|
detected and a new leader anointed. During the failover, reactive enforcement
|
|
actions expected to be triggered would be missed. Four proposed approaches
|
|
are discussed below.
|
|
|
|
- Tolerate missed actions during failover: for same applications, it may be
|
|
acceptable to miss actions during failover.
|
|
|
|
- Re-execute recent actions after failover
|
|
|
|
- Each PE instance remembers its recent action requests (including the
|
|
requests a follower PE computed but did not send)
|
|
- On failover, the DSD requests the recent action requests from the new
|
|
leader and executes them (within a certain recency window)
|
|
- Duplicate execution expected on failover.
|
|
- Re-execute recent unmatched actions after failover (possible future work)
|
|
|
|
- We can reduce the number of duplicate executions on failover by attempting
|
|
to match a new leader's recent action requests with the already executed
|
|
requests, and only additionally executing those unmatched.
|
|
- DSD logs all recent successfully executed action requests in DB
|
|
- Request matching can be done by a combination of the following information:
|
|
|
|
- the action requested
|
|
- the timestamp of the request
|
|
- the sequence number of the data update that triggered the action
|
|
- the full derivation tree that triggers the action
|
|
- Matching is imperfect, but still helpful
|
|
|
|
- Log and replay data updates (possible future work)
|
|
|
|
- Log every data update from every data source, and let a new leader replay
|
|
the updates where the previous leader left off to generate the needed
|
|
action requests.
|
|
- The logging can be directly supported by transport or by additional DB
|
|
|
|
- kafka naturally supports this model
|
|
- hard to do directly with oslo-messaging + RabbitMQ
|
|
|
|
- Leaderless de-duplication (possible future work)
|
|
|
|
- If a very good matching method is implemented for re-execution for recent
|
|
unmatched actions after failover, it is possible to go one stop further
|
|
and simply operate in this mode full time.
|
|
- Each incoming action request is matched against all recently executed
|
|
action requests.
|
|
|
|
- Discard if matched.
|
|
- Execute if unmatched.
|
|
- Eliminates the need for selecting leader (global or local) and improves
|
|
failover speed
|
|
|
|
We propose to focus first on supporting the first two options
|
|
(deployers' choice). The more complex options may be implemented and supported
|
|
in future work.
|
|
|
|
Alternatives
|
|
------------
|
|
|
|
We first discuss the main decision points before detailing several alternative
|
|
deployments.
|
|
|
|
For active-active replication of the PE, here are the main decision points:
|
|
|
|
A. node configurations
|
|
|
|
- Options:
|
|
|
|
1. single node-type (every node has API+PE+DSDs).
|
|
2. two node-types (API+PE nodes, all-DSDs node). [proposed target]
|
|
3. many node-types (API+PE nodes, all DSDs in separate nodes).
|
|
|
|
- Discussions: The many node-types configuration is most flexible and has
|
|
the best support for high-load DSDs, but it also requires the most work to
|
|
dev and to deploy.
|
|
We propose to target the two node-types configuration because it gives
|
|
reasonable support for high-load DSDs while keeping both the development
|
|
and the deployment complexities low.
|
|
|
|
B. global vs local leader for action execution
|
|
|
|
- Options:
|
|
|
|
1. global leader: Pacemaker anoints a global leader among PE instances;
|
|
only the leader sends action-execution requests.
|
|
2. local leader: every PE instance sends action-execution requests, but
|
|
each receiving DSD locally picks a "leader" to listen to.
|
|
[proposed target]
|
|
|
|
- Discussions: Because there is a single active DSD for a given data source,
|
|
it is a natural spot to locally choose a "leader" among the PE instances
|
|
sending reactive enforcement action execution requests.
|
|
We propose to target the local leader style because it avoids the
|
|
development and deployment complexities associated with global leader
|
|
election.
|
|
Furthermore, because all PE instances perform reactive enforcement and send
|
|
action execution requests, the redundancy opens up the possibility for
|
|
zero disruption to reactive enforcement when a PE intance fails.
|
|
|
|
C. DSD redundancy
|
|
|
|
- Options:
|
|
|
|
1. warm standby: only one set of DSDs running at a given time; backup
|
|
instances ready to launch.
|
|
2. hot standby: multiple instances running, but only one set is active.
|
|
3. active-active: multiple instances active.
|
|
|
|
- Discussions:
|
|
|
|
- For pull DSDs, we propose to target warm standby seems most appropriate
|
|
because warm startup time is low (seconds) relative to frequency of data
|
|
pulls.
|
|
- For push DSDs, warm standby is generally sufficient except for use cases
|
|
that demand sub-second latency even during a failover. Those use cases
|
|
would require active-active replication of the push DSDs. But even with
|
|
active-active replication of push DSDs, other unsolved issues in
|
|
action-execution prevent us from delivering sub-second end-to-end latency
|
|
(push data to triggered action executed) during failover (see leaderless
|
|
de-duplication approach for sub-second action execution failover).
|
|
Since we cannot yet realize the benefit of active-active replication of
|
|
push DSDs, we propose to target a warm-standby deployment, leaving
|
|
active-active replication as potential future work.
|
|
|
|
Active-active PE replication, DSDs hot-standby, all components in one node
|
|
###########################################################################
|
|
|
|
Run N instances of single-process Congress.
|
|
|
|
One instance is selected as leader by Pacemaker. Only the leader has active
|
|
datasource drivers (which pull data, accept push data, and accept RPC calls
|
|
from the API service), but all instances subscribes to and processes data on
|
|
oslo-messaging. Queries are load balanced among instances.
|
|
|
|
::
|
|
|
|
+-----------------------------------------------------------------------+
|
|
| Load Balancer (eg. HAProxy) |
|
|
+----+------------------------+------------------------+----------------+
|
|
| | |
|
|
| leader | follower | follower
|
|
+---------------------+ +---------------------+ +---------------------+
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| | API | |DSD (on) | | | | API | |DSD (off)| | | | API | |DSD (off)| |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| | PE | |DSD (on) | | | | PE | |DSD (off)| | | | PE | |DSD (off)| |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
+---------------------+ +---------------------+ +---------------------+
|
|
| | |
|
|
| | |
|
|
+---------+--------------+-------------------+----+
|
|
| |
|
|
| |
|
|
+--------------+-----------+ +--------------------+---------------------+
|
|
| Oslo Msg | | DBs (policy, config, push data, exec log)|
|
|
+--------------------------+ +------------------------------------------+
|
|
|
|
|
|
|
|
- Downtime: < 1s for queries, ~2s for reactive policy
|
|
- Deployment considerations:
|
|
|
|
- Requires cluster manager (Pacemaker) and cluster messaging (Corosync)
|
|
- Relatively simple Pacemaker deployment because every node is identical
|
|
- Requires leader election (handled by Pacemaker+Corosync)
|
|
- Easy to start new DSD (make API call, all instances sync via DB)
|
|
- Performance considerations:
|
|
|
|
- [Pro] Multi-process query throughput
|
|
- [Pro] No redundant data-pulling load on datasources
|
|
- [Con] If some data source drivers have high demand (say telemetry data),
|
|
performance may suffer when deployed in the same python process as other
|
|
Congress components.
|
|
- [Con] Because the leader has the added load of active DSDs, PE performance
|
|
may be reduced, making it harder to evenly load balance across instances.
|
|
- Code changes:
|
|
|
|
- Add API call to designate a node as leader or follower
|
|
- Custom resource agent that allows Pacemaker to start, stop, promote, and
|
|
demote Congress instances
|
|
|
|
Details
|
|
++++++++++++++
|
|
|
|
- Pull datasource drivers (pull DSD):
|
|
|
|
- One active datasource driver per physical datasource.
|
|
- Only leader node has active DSDs (active polling loop and active
|
|
RPC server)
|
|
- On node failure, new leader node activates DSDs.
|
|
|
|
- Push datasource drivers (push DSD):
|
|
|
|
- One active datasource driver per physical datasource.
|
|
- Only leader node has active DSDs (active RPC server)
|
|
- On node failure, new leader node activates DSDs.
|
|
- Persist data in push data DB, so a new snapshot can be obtained.
|
|
|
|
- Policy engine (PE):
|
|
|
|
- Replicate policy engine in active-active configuration.
|
|
- Policy synchronized across PE instances via Policy DB
|
|
- Every instance subscribes to the same data on oslo-messaging.
|
|
- Reactive enforcement: See later section "Reactive enforcement architecture
|
|
for active-active deployments"
|
|
|
|
- API:
|
|
|
|
- Add new API calls for designating the receiving node as leader or follower.
|
|
The call must complete all tasks before returning (eg. start/stop RPC)
|
|
- Each node has an active API service
|
|
- Each API service routes requests for PE to its associated intranode PE,
|
|
bypassing DSE2.
|
|
- Requests for any other service(eg. get data source status) are routed to
|
|
DSE2, which will be fielded by some active instance of the service on some
|
|
node
|
|
- Details:
|
|
- in API models, replace every invoke_rpc with a conditional:
|
|
|
|
- if the target is policy engine, target same-node PE
|
|
eg.::
|
|
self.invoke_rpc(
|
|
caller, 'get_row_data', args,
|
|
server=self.node.pe.rpc_server)
|
|
|
|
- otherwise, invoke_rpc stays as is, routing to DSE2
|
|
eg.::
|
|
self.invoke_rpc(caller, 'get_row_data', args)
|
|
|
|
- Load balancer:
|
|
|
|
- Layer 7 load balancer (e.g. HAProxy) distributes incoming API calls among
|
|
the nodes (each running API service).
|
|
- load balancer optionally configured to use sticky session to pin each API
|
|
caller to a particular node. This configuration avoids the experience of
|
|
going back in time.
|
|
|
|
- Each DseNode is monitored and managed by a cluster manager (eg. Pacemaker)
|
|
- External components (load balancer, DBs, and oslo messaging bus) can be made
|
|
highly available using standard solutions (e.g. clustered LB, Galera MySQL
|
|
cluster, HA rabbitMQ)
|
|
|
|
- Global leader election with Pacemaker:
|
|
|
|
- A resource agent contains the scripts that tells a Congress instance it is
|
|
a leader or follower.
|
|
- Pacemaker decides which Congress instance to promote to leader (master).
|
|
- Pacemaker promotes (demotes) the appropriate Congress instance to leader
|
|
(follower) via the resource agent.
|
|
- Fencing:
|
|
|
|
- If the leader node stops responding, and a new node is promoted to
|
|
leader, it is possible that the unresponsive node is still doing work
|
|
(eg. listening on oslo-messaging, issuing action requests).
|
|
- It is generally not a catastrophe if for a time there is more than one
|
|
Congress node doing the work of a leader. (Potential effects may include:
|
|
duplicate action requests and redundant data source pulls)
|
|
- Pacemaker can be configured with strict fencing and STONITH for
|
|
deployments that require it. (deployers' choice)
|
|
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_what_is_stonith
|
|
|
|
- In case of network partitions:
|
|
|
|
- Pacemaker can be configured to stop each node that is not part of a
|
|
cluster reaching quorum, or to allow each partition to continue
|
|
operating. (deployers' choice)
|
|
http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-options.html
|
|
|
|
Active-active PE replication, DSDs warm-standby, each DSD in its own node
|
|
###########################################################################
|
|
|
|
Run N instances of Congress policy engine in active-active configuration. One
|
|
datasource driver per physical datasource published data on oslo-messaging to
|
|
all policy engines.
|
|
|
|
::
|
|
|
|
+-------------------------------------+
|
|
| Load Balancer (eg. HAProxy) |
|
|
+----+-------------+-------------+----+
|
|
| | |
|
|
| | |
|
|
+---------+ +---------+ +---------+
|
|
| +-----+ | | +-----+ | | +-----+ |
|
|
| | API | | | | API | | | | API | |
|
|
| +-----+ | | +-----+ | | +-----+ |
|
|
| +-----+ | | +-----+ | | +-----+ |
|
|
| | PE | | | | PE | | | | PE | |
|
|
| +-----+ | | +-----+ | | +-----+ |
|
|
+---------+ +---------+ +---------+
|
|
| | |
|
|
| | |
|
|
+------------------------------------------+-----------------+
|
|
| | | | |
|
|
| | | | |
|
|
+----+-------------+-------------+---+ +-------+--------+ +-----+-----+
|
|
| Oslo Msg | | Push Data DB | | DBs |
|
|
+----+-------------+-------------+---+ ++---------------+ +-----------+
|
|
| | | | (DBs may be combined)
|
|
+----------+ +----------+ +----------+
|
|
| +------+ | | +------+ | | +------+ |
|
|
| | Poll | | | | Poll | | | | Push | |
|
|
| | Drv | | | | Drv | | | | Drv | |
|
|
| | DS 1 | | | | DS 2 | | | | DS 3 | |
|
|
| +------+ | | +------+ | | +------+ |
|
|
+----------+ +----------+ +----------+
|
|
| | |
|
|
+---+--+ +---+--+ +---+--+
|
|
| | | | | |
|
|
| DS 1 | | DS 2 | | DS 3 |
|
|
| | | | | |
|
|
+------+ +------+ +------+
|
|
|
|
- Downtime: < 1s for queries, ~2s for reactive policy
|
|
- Deployment considerations:
|
|
|
|
- Requires cluster manager (Pacemaker) and cluster messaging (Corosync)
|
|
- More complex Pacemaker deployment because there are many different
|
|
kinds of nodes
|
|
- Does not require global leader election (but that's not a big saving if
|
|
we're running Pacemaker+Corosync anyway)
|
|
- Performance considerations:
|
|
|
|
- [Pro] Multi-process query throughput
|
|
- [Pro] No redundant data-pulling load on datasources
|
|
- [Pro] Each DSD can run in its own node, allowing high load DSDs to operate
|
|
more smoothly and avoid affecting PE performance.
|
|
- [Pro] PE nodes are symmetric in configuration, making it easy to load
|
|
balance evenly.
|
|
|
|
- Code changes:
|
|
|
|
- New synchronizer and harness and DB schema to support per node
|
|
configuration
|
|
|
|
Hot standby for all Congress components in single process
|
|
###########################################################################
|
|
|
|
Run N instances of single-process Congress, as proposed in:
|
|
https://blueprints.launchpad.net/congress/+spec/basic-high-availability
|
|
|
|
A floating IP points to the primary instance which handles all queries and
|
|
requests, failing over when primary instance is down.
|
|
All instances ingest and process data to stay up to date.
|
|
|
|
::
|
|
|
|
+---------------+
|
|
| Floating IP | - - - - - - + - - - - - - - - - - - -+
|
|
+----+----------+ | |
|
|
|
|
|
| | |
|
|
+---------------------+ +---------------------+ +---------------------+
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| | API | | DSD | | | | API | | DSD | | | | API | | DSD | |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| | PE | | DSD | | | | PE | | DSD | | | | PE | | DSD | |
|
|
| +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ |
|
|
| +-----------------+ | | +-----------------+ | | +-----------------+ |
|
|
| | Oslo Msg | | | | Oslo Msg | | | | Oslo Msg | |
|
|
| +-----------------+ | | +-----------------+ | | +-----------------+ |
|
|
+---------------------+ +---------------------+ +---------------------+
|
|
| | |
|
|
| | |
|
|
+----------+------------------------+----------------------+------------+
|
|
| Databases |
|
|
+-----------------------------------------------------------------------+
|
|
|
|
|
|
- Downtime: < 1s for queries, ~2s for reactive policy
|
|
- Feature limitations:
|
|
|
|
- Limited support for action execution: each action execution is triggered
|
|
N times)
|
|
- Limited support for push drivers: push updates only primary instance
|
|
(optional DB-sync to non-primary instances)
|
|
|
|
- Deployment considerations:
|
|
|
|
- Very easy to deploy. No need for cluster manager. Just start N independent
|
|
instances of Congress (in-process messaging) and setup floating IP.
|
|
- Performance considerations:
|
|
|
|
- Performance characteristics similar to single-instance Congress
|
|
- [Con] uniprocessing query throughput (optional load balancer can be added
|
|
to balance queries between instances)
|
|
- [Con] Extra load on data sources from replicated data source drivers
|
|
pulling same data N times
|
|
- Code changes:
|
|
|
|
- (optional) DB-sync of pushed data to non-primary instances
|
|
|
|
|
|
Policy
|
|
------
|
|
|
|
Not applicable
|
|
|
|
Policy actions
|
|
--------------
|
|
|
|
Not applicable
|
|
|
|
|
|
Data sources
|
|
------------
|
|
|
|
Not applicable
|
|
|
|
|
|
Data model impact
|
|
-----------------
|
|
|
|
No impact
|
|
|
|
|
|
REST API impact
|
|
---------------
|
|
|
|
No impact
|
|
|
|
|
|
Security impact
|
|
---------------
|
|
|
|
No major security impact identified compared to a non-HA distributed
|
|
deployment.
|
|
|
|
Notifications impact
|
|
--------------------
|
|
|
|
No impact
|
|
|
|
Other end user impact
|
|
---------------------
|
|
|
|
Proposed changes generally transparent to end user. Some exceptions:
|
|
|
|
- Different PE instances may be out-of-sync in their data and policies
|
|
(eventual consistency). The issue is generally made transparent to the end
|
|
user by making each user sticky to a particular PE instance. But if
|
|
a PE instance goes down, the end user reaches a different instance and may
|
|
experience out-of-sync artifacts.
|
|
|
|
Performance impact
|
|
------------------
|
|
|
|
- In single node deployment, there is generally no performance impact.
|
|
- Increased latency due to network communication required by multi-node
|
|
deployment
|
|
- Increased reactive enforcement latency if action executions are persistently
|
|
logged to facilitate smoother failover
|
|
- PE replication can achieve greater query throughput
|
|
|
|
Other deployer impact
|
|
---------------------
|
|
|
|
- New config settings:
|
|
|
|
- set DSE node type to one of the following:
|
|
|
|
- PE+API node
|
|
- a DSDs node
|
|
- all-in-one node (backward compatible default)
|
|
|
|
- set reactive enforcement failover strategy:
|
|
|
|
- do not attempt to recover missed actions (backward compatible default)
|
|
- after failover, repeat recent action requests
|
|
- after failover, repeat recent action requests not matched to logged
|
|
executions
|
|
|
|
- Proposed changes have no impact on existing single-node deployments.
|
|
100% backward compatibility expected.
|
|
- Changes only have effect if deployer chooses to set up a multi-node
|
|
deployment with the appropriate configs.
|
|
|
|
Developer impact
|
|
----------------
|
|
|
|
No major impact identified.
|
|
|
|
|
|
Implementation
|
|
==============
|
|
|
|
Assignee(s)
|
|
-----------
|
|
|
|
Work to be assigned and tracked via launchpad.
|
|
|
|
|
|
Work items
|
|
----------
|
|
|
|
Items with order dependency:
|
|
|
|
1. API routing.
|
|
Change API routing to support active-active PE replication, routing PE-bound
|
|
RPCs to the PE instance on the same node as the receiving API server.
|
|
Changes expected to be confined to congress/api/*
|
|
2. Reactive enforcement.
|
|
Change datasource_driver.py:ExecutionDriver class to handle action execution
|
|
requests from replicated PE-nodes (locally choose leader to follow)
|
|
3. (low priority) Missed actions mitigation.
|
|
|
|
- Implement changes to mitigate missed actions during DSD failover
|
|
- Implement changes to mitigate missed actions during PE failover
|
|
|
|
Items without dependency:
|
|
|
|
- Push data persistence.
|
|
Change datasource_driver.py:PushedDataSourceDriver class to support
|
|
persistence of pushed data. Corresponding DB changes also needed.
|
|
- (potential) Leaderless de-duplication of action execution requests.
|
|
- HA guide.
|
|
Create HA guide sketching the properties and trade-offs of each different
|
|
deployment types.
|
|
- Deployment guide.
|
|
Create deployment guide for active-active PE replication
|
|
|
|
|
|
Dependencies
|
|
============
|
|
|
|
- Requires Congress to support distributed deployment (for example with policy
|
|
engine and datasource drivers on separate DseNodes.) Distributed deployment
|
|
has been addressed by several implemented blueprints. The following patch is
|
|
expected to be the final piece required.
|
|
https://review.openstack.org/#/c/307693/
|
|
- This spec does not introduce any new code of library dependencies.
|
|
- In line with OpenStack recommendation
|
|
(http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html), some
|
|
reference deployments use open source software outside of OpenStack:
|
|
- HAProxy: http://www.haproxy.org
|
|
- Pacemaker: http://clusterlabs.org/wiki/Pacemaker
|
|
- Corosync: http://corosync.github.io/corosync/
|
|
|
|
|
|
Testing
|
|
=======
|
|
|
|
We propose to add tempest tests for the following scenarios:
|
|
- Up to (N - 1) PE-nodes killed
|
|
- Previously killed PE-nodes rejoin.
|
|
- Kill and restart DSDs-node, possibly at the same time PE-nodes are killed.
|
|
|
|
Split brain scenarios can be manually tested.
|
|
|
|
|
|
Documentation impact
|
|
====================
|
|
|
|
Deployment guide to be added for each supported reference deployment. No impact
|
|
on existing documentation.
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
- IRC discussion on major design decisions (#topic HA design):
|
|
http://eavesdrop.openstack.org/meetings/congressteammeeting/2016/congressteammeeting.2016-06-09-00.04.log.txt
|
|
- Notes from summit session:
|
|
https://etherpad.openstack.org/p/newton-congress-availability
|
|
- OpenStack HA guide:
|
|
http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html
|
|
- HAProxy documentation:
|
|
http://www.haproxy.org/#docs
|
|
- Pacemaker documentation:
|
|
|
|
- directory: http://clusterlabs.org/doc/
|
|
- Cluster from scratch:
|
|
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/index.html
|
|
- Configuration explained:
|
|
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
|
|
|
|
- OCF resource agents:
|
|
http://www.linux-ha.org/wiki/OCF_Resource_Agents |