From 4379a8c44fa6300fbc4ab728f0672784f3ff5269 Mon Sep 17 00:00:00 2001
From: Eric K
Date: Wed, 18 May 2016 16:58:14 -0700
Subject: [PATCH] Support HT-HA deployments

Some applications require Congress to be highly available (HA). Some
applications require a Congress policy engine to handle a high volume of
queries (high throughput - HT). This spec describes how we can support
several deployment schemes that address some HA and HT requirements.

blueprint high-availability-design
Change-Id: I0ce2fc3895f022d3b363c9674c6dafbbc7447c75
---
 specs/newton/high-availability-design.rst | 746 ++++++++++++++++++++++
 1 file changed, 746 insertions(+)
 create mode 100644 specs/newton/high-availability-design.rst

diff --git a/specs/newton/high-availability-design.rst b/specs/newton/high-availability-design.rst
new file mode 100644
index 0000000..e3b5903
--- /dev/null
+++ b/specs/newton/high-availability-design.rst
@@ -0,0 +1,746 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Support high availability, high throughput deployment
=====================================================

https://blueprints.launchpad.net/congress/+spec/high-availability-design

Some applications require Congress to be highly available (HA). Some
applications require a Congress policy engine to handle a high volume of
queries (high throughput - HT). This proposal describes how we can support
several deployment schemes that address these HA and HT requirements.


Problem description
===================

This spec aims to address three main problems:

1. Congress is not currently able to provide high query throughput because
   all queries are handled by a single, single-threaded policy engine
   instance.
2. Congress is not currently able to fail over quickly when its policy engine
   becomes unavailable.
3. If the policy engine and a push datasource driver both crash, Congress is
   not currently able to restore the latest data state upon restart or
   failover.


Proposed change
===============

Implement the required code changes and create deployment guides for the
following reference deployments.

Warm standby for all Congress components in single process
----------------------------------------------------------

- Downtime: ~1 minute (start a new Congress instance and ingest data from
  scratch)
- Reliability: action executions may be lost during downtime.
- Performance considerations: uniprocessing query throughput
- Code changes: minimal

Active-active PE replication, DSDs warm-standby
-----------------------------------------------

Run N instances of the Congress policy engine in active-active configuration.
One datasource driver per physical datasource publishes data on
oslo-messaging to all policy engines.
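As a minimal sketch of this fan-out, assuming direct use of oslo.messaging
(the DSE2 bus wraps this machinery; the topic and method names below are
illustrative assumptions, not the actual wire protocol)::

    # Illustrative only: one DSD fanning a table update out to every
    # subscribed PE replica. Topic/method names are hypothetical.
    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_transport(cfg.CONF)
    # fanout=True delivers the cast to every server listening on the
    # topic, which is what keeps N active PE replicas in sync.
    target = messaging.Target(topic='congress.table-updates', fanout=True)
    client = messaging.RPCClient(transport, target)

    def publish_table(table_name, rows):
        # cast, not call: the DSD does not wait for any PE to respond
        client.cast({}, 'handle_table_update', table=table_name, rows=rows)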
::

    +-------------------------------------+      +--------------+
    | Load Balancer (eg. HAProxy)         | <----+ Push client  |
    +----+-------------+-------------+----+      +--------------+
         |             |             |
      PE |          PE |          PE |             all-DSDs node
    +---------+   +---------+   +---------+      +-----------------+
    | +-----+ |   | +-----+ |   | +-----+ |      | +-----+ +-----+ |
    | | API | |   | | API | |   | | API | |      | | DSD | | DSD | |
    | +-----+ |   | +-----+ |   | +-----+ |      | +-----+ +-----+ |
    | +-----+ |   | +-----+ |   | +-----+ |      | +-----+ +-----+ |
    | | PE  | |   | | PE  | |   | | PE  | |      | | DSD | | DSD | |
    | +-----+ |   | +-----+ |   | +-----+ |      | +-----+ +-----+ |
    +---------+   +---------+   +---------+      +--------+--------+
         |             |             |                    |
         |             |             |                    |
         +--+----------+-------------+-------+------------+
            |                                |
            |                                |
    +-------+----+   +-----------------------+------------------+
    |  Oslo Msg  |   | DBs (policy, config, push data, exec log)|
    +------------+   +------------------------------------------+


- Downtime: < 1s for queries, ~2s for reactive enforcement
- Deployment considerations:

  - A cluster manager (e.g. Pacemaker + Corosync) can be used to manage the
    warm standby
  - Does not require global leader election
- Performance considerations:

  - Multi-process, multi-node query throughput
  - No redundant data-pulling load on datasources
  - The DSDs node is separate from the PE nodes, allowing high-load DSDs to
    operate more smoothly without affecting PE performance.
  - PE nodes are symmetric in configuration, making it easy to load balance
    evenly.

- Code changes:

  - New synchronizer and harness to support two different node types:
    API+PE node and all-DSDs node

Details
###########################################################################

- Datasource drivers (DSDs):

  - One datasource driver per physical datasource.
  - All DSDs run in a single DSE node (process)
  - Push DSDs: optionally persist data in the push data DB, so a new
    snapshot can be obtained whenever needed.

- Policy engine (PE):

  - Replicate the policy engine in active-active configuration.
  - Policy is synchronized across PE instances via the Policy DB
  - Every instance subscribes to the same data on oslo-messaging.
  - Reactive enforcement:
    All PE instances initiate reactive policy actions, but each DSD locally
    selects a "leader" to "listen to". The DSD ignores execution requests
    initiated by all other PE instances.

    - Every PE instance computes the required reactive enforcement actions
      and initiates the corresponding execution requests over
      oslo-messaging.
    - Each DSD locally picks a PE instance as leader (say, the first
      instance the DSD hears from in the asymmetric node deployment, or the
      PE instance on the same node as the DSD in a symmetric node
      deployment) and executes only requests from that PE.
    - If heartbeat contact is lost with the leader, the DSD selects a new
      leader.
    - Each PE instance is unaware of whether it is a "leader"

- API:

  - Each node has an active API service
  - Each API service routes requests for the PE to its associated intranode
    PE
  - Requests for any other service (e.g. get data source status) are routed
    over DSE2 and fielded by an active instance of the service on some node
  - Details (see the sketch following this list):

    - in API models, replace every invoke_rpc with a conditional:

      - if the target is the policy engine, target the same-node PE,
        e.g.::

          self.invoke_rpc(
              caller, 'get_row_data', args, server=self.node.pe.rpc_server)

      - otherwise, invoke_rpc stays as is, routing to DSE2, e.g.::

          self.invoke_rpc(caller, 'get_row_data', args)

- Load balancer:

  - A layer 7 load balancer (e.g. HAProxy) distributes incoming API calls
    among the nodes (each running an API service).
  - The load balancer can optionally be configured to use sticky sessions to
    pin each API caller to a particular node. This configuration avoids the
    experience of "going back in time" when consecutive reads hit instances
    whose data states differ.

- External components (load balancer, DBs, and the oslo messaging bus) can be
  made highly available using standard solutions (e.g. clustered LB, Galera
  MySQL cluster, HA RabbitMQ)
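A sketch of the conditional routing described under "Details" above
(``self.node.pe`` and the predicate helper are assumed names, not settled
interfaces)::

    # Sketch of PE-aware RPC routing in an API model; the helper name
    # and the ``self.node.pe`` handle are hypothetical.
    class ApiModelMixin(object):

        def _route_rpc(self, caller, method, args):
            if self._targets_policy_engine(caller):  # hypothetical predicate
                # PE-bound request: pin it to the PE instance on this node
                # so each API service always talks to its co-located engine.
                return self.invoke_rpc(caller, method, args,
                                       server=self.node.pe.rpc_server)
            # Anything else goes out over DSE2 and is fielded by whichever
            # node runs an active instance of the target service.
            return self.invoke_rpc(caller, method, args)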
Dealing with missed actions during failover
###########################################################################

When a leader fails (global or local), it takes time for the failure to be
detected and a new leader anointed. During failover, reactive enforcement
actions that should have been triggered may be missed. Five proposed
approaches are discussed below.

- Tolerate missed actions during failover: for some applications, it may be
  acceptable to miss actions during failover.

- Re-execute recent actions after failover

  - Each PE instance remembers its recent action requests (including the
    requests a follower PE computed but did not send)
  - On failover, the DSD requests the recent action requests from the new
    leader and executes them (within a certain recency window)
  - Duplicate executions are expected on failover.
- Re-execute recent unmatched actions after failover (possible future work)

  - We can reduce the number of duplicate executions on failover by
    attempting to match a new leader's recent action requests with the
    already executed requests, and executing only those left unmatched.
  - The DSD logs all recent successfully executed action requests in the DB
  - Request matching can be done using a combination of the following
    information:

    - the action requested
    - the timestamp of the request
    - the sequence number of the data update that triggered the action
    - the full derivation tree that triggers the action
  - Matching is imperfect, but still helpful

- Log and replay data updates (possible future work)

  - Log every data update from every data source, and let a new leader
    replay the updates where the previous leader left off to generate the
    needed action requests.
  - The logging can be supported directly by the transport or by an
    additional DB

    - Kafka naturally supports this model
    - hard to do directly with oslo-messaging + RabbitMQ

- Leaderless de-duplication (possible future work)

  - If a very good matching method is implemented for re-executing recent
    unmatched actions after failover, it is possible to go one step further
    and simply operate in this mode full time.
  - Each incoming action request is matched against all recently executed
    action requests.

    - Discard if matched.
    - Execute if unmatched.
  - Eliminates the need for selecting a leader (global or local) and
    improves failover speed

We propose to focus first on supporting the first two options
(deployers' choice). The more complex options may be implemented and
supported in future work.
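To make the first two options concrete, the DSD-side combination of local
leader selection and post-failover re-execution might look roughly like this
(a sketch; all names are hypothetical and the de-duplication key is
deliberately crude)::

    # Illustrative sketch: a DSD follows a locally chosen leader PE and,
    # on failover, re-executes the new leader's recent action requests.
    import time

    RECENCY_WINDOW = 10.0  # seconds; a deployer-tunable choice

    class ActionExecutionFollower(object):
        def __init__(self, execute):
            self.leader_id = None   # the PE instance this DSD listens to
            self.executed = {}      # dedup key -> time of execution
            self.execute = execute  # driver-specific action executor

        def handle_action_request(self, pe_id, request):
            if self.leader_id is None:
                self.leader_id = pe_id  # first PE heard from becomes leader
            if pe_id != self.leader_id:
                return                  # ignore requests from follower PEs
            key = (request['action'], request['sequence_number'])
            if key in self.executed:    # crude de-duplication on replay
                return
            self.executed[key] = time.time()
            self.execute(request)

        def on_heartbeat_lost(self, pe_id, live_pe_ids, fetch_recent):
            """Select a new leader and replay its recent requests."""
            if pe_id != self.leader_id:
                return
            self.leader_id = live_pe_ids[0] if live_pe_ids else None
            if self.leader_id is None:
                return
            # Ask the new leader for the requests it computed recently
            # (including ones it suppressed as a follower); duplicates
            # are expected and partially filtered by the key above.
            for request in fetch_recent(self.leader_id, RECENCY_WINDOW):
                self.handle_action_request(self.leader_id, request)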
Alternatives
------------

We first discuss the main decision points before detailing several
alternative deployments.

For active-active replication of the PE, here are the main decision points:

A. node configurations

   - Options:

     1. single node-type (every node has API+PE+DSDs).
     2. two node-types (API+PE nodes, all-DSDs node). [proposed target]
     3. many node-types (API+PE nodes, all DSDs in separate nodes).

   - Discussion: The many node-types configuration is the most flexible and
     has the best support for high-load DSDs, but it also requires the most
     work to develop and to deploy.
     We propose to target the two node-types configuration because it gives
     reasonable support for high-load DSDs while keeping both the
     development and the deployment complexities low.

B. global vs local leader for action execution

   - Options:

     1. global leader: Pacemaker anoints a global leader among PE instances;
        only the leader sends action-execution requests.
     2. local leader: every PE instance sends action-execution requests, but
        each receiving DSD locally picks a "leader" to listen to.
        [proposed target]

   - Discussion: Because there is a single active DSD for a given data
     source, it is a natural spot to locally choose a "leader" among the PE
     instances sending reactive enforcement action-execution requests.
     We propose to target the local leader style because it avoids the
     development and deployment complexities associated with global leader
     election.
     Furthermore, because all PE instances perform reactive enforcement and
     send action-execution requests, the redundancy opens up the possibility
     of zero disruption to reactive enforcement when a PE instance fails.

C. DSD redundancy

   - Options:

     1. warm standby: only one set of DSDs running at a given time; backup
        instances ready to launch.
     2. hot standby: multiple instances running, but only one set is active.
     3. active-active: multiple instances active.

   - Discussion:

     - For pull DSDs, we propose to target warm standby because warm startup
       time is low (seconds) relative to the frequency of data pulls.
     - For push DSDs, warm standby is generally sufficient except for use
       cases that demand sub-second latency even during a failover. Those
       use cases would require active-active replication of the push DSDs.
       But even with active-active replication of push DSDs, other unsolved
       issues in action execution prevent us from delivering sub-second
       end-to-end latency (push data to triggered action executed) during
       failover (see the leaderless de-duplication approach for sub-second
       action-execution failover).
       Since we cannot yet realize the benefit of active-active replication
       of push DSDs, we propose to target a warm-standby deployment, leaving
       active-active replication as potential future work.

Active-active PE replication, DSDs hot-standby, all components in one node
###########################################################################

Run N instances of single-process Congress.

One instance is selected as leader by Pacemaker. Only the leader has active
datasource drivers (which pull data, accept push data, and accept RPC calls
from the API service), but all instances subscribe to and process data on
oslo-messaging. Queries are load balanced among instances.
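The leader/follower switch that the resource agent drives (via the API call
proposed later in this section) might be sketched as follows (method and
attribute names are assumptions, not a settled interface)::

    # Sketch of the leader/follower switch a Pacemaker resource agent
    # could drive through the proposed API call; names are assumed.
    class CongressNode(object):
        def __init__(self, datasource_drivers):
            self.dsds = datasource_drivers
            self.is_leader = False

        def promote(self):
            """Become leader: activate every DSD before returning."""
            for dsd in self.dsds:
                dsd.start_polling()     # pull DSDs: resume polling loop
                dsd.start_rpc_server()  # push DSDs: accept pushes and RPCs
            self.is_leader = True

        def demote(self):
            """Become follower: deactivate DSDs; the PE keeps
            subscribing to data on oslo-messaging regardless."""
            for dsd in self.dsds:
                dsd.stop_rpc_server()
                dsd.stop_polling()
            self.is_leader = False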
::

    +-----------------------------------------------------------------------+
    | Load Balancer (eg. HAProxy)                                           |
    +----+------------------------+------------------------+----------------+
         |                        |                        |
         | leader                 | follower               | follower
    +---------------------+  +---------------------+  +---------------------+
    | +-----+ +---------+ |  | +-----+ +---------+ |  | +-----+ +---------+ |
    | | API | |DSD (on) | |  | | API | |DSD (off)| |  | | API | |DSD (off)| |
    | +-----+ +---------+ |  | +-----+ +---------+ |  | +-----+ +---------+ |
    | +-----+ +---------+ |  | +-----+ +---------+ |  | +-----+ +---------+ |
    | | PE  | |DSD (on) | |  | | PE  | |DSD (off)| |  | | PE  | |DSD (off)| |
    | +-----+ +---------+ |  | +-----+ +---------+ |  | +-----+ +---------+ |
    +---------------------+  +---------------------+  +---------------------+
               |                        |                        |
               |                        |                        |
               +-+----------------------+--------------+---------+
                 |                                      |
                 |                                      |
    +------------+-------------+  +--------------------+---------------------+
    |         Oslo Msg         |  | DBs (policy, config, push data, exec log)|
    +--------------------------+  +------------------------------------------+


- Downtime: < 1s for queries, ~2s for reactive policy
- Deployment considerations:

  - Requires a cluster manager (Pacemaker) and cluster messaging (Corosync)
  - Relatively simple Pacemaker deployment because every node is identical
  - Requires leader election (handled by Pacemaker+Corosync)
  - Easy to start a new DSD (make an API call; all instances sync via DB)
- Performance considerations:

  - [Pro] Multi-process query throughput
  - [Pro] No redundant data-pulling load on datasources
  - [Con] If some datasource drivers have high demand (say, telemetry data),
    performance may suffer when they are deployed in the same Python process
    as other Congress components.
  - [Con] Because the leader has the added load of active DSDs, PE
    performance may be reduced, making it harder to load balance evenly
    across instances.
- Code changes:

  - Add an API call to designate a node as leader or follower
  - Custom resource agent that allows Pacemaker to start, stop, promote, and
    demote Congress instances

Details
++++++++++++++

- Pull datasource drivers (pull DSDs):

  - One active datasource driver per physical datasource.
  - Only the leader node has active DSDs (active polling loop and active
    RPC server)
  - On node failure, the new leader node activates its DSDs.

- Push datasource drivers (push DSDs):

  - One active datasource driver per physical datasource.
  - Only the leader node has active DSDs (active RPC server)
  - On node failure, the new leader node activates its DSDs.
  - Persist data in the push data DB, so a new snapshot can be obtained
    (see the sketch below).
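A minimal sketch of that persistence, assuming SQLAlchemy and a placeholder
schema (table and column names are not settled)::

    # Sketch: persist pushed rows so a newly promoted DSD can serve a
    # snapshot without waiting for another push. Placeholder schema.
    import json
    import sqlalchemy as sa

    metadata = sa.MetaData()
    push_data = sa.Table(
        'push_data', metadata,
        sa.Column('datasource', sa.String(255), primary_key=True),
        sa.Column('tablename', sa.String(255), primary_key=True),
        sa.Column('rows', sa.Text))  # JSON-encoded table snapshot

    def persist_push(conn, datasource, tablename, rows):
        # Replace the stored snapshot with the latest pushed state.
        cond = sa.and_(push_data.c.datasource == datasource,
                       push_data.c.tablename == tablename)
        conn.execute(push_data.delete().where(cond))
        conn.execute(push_data.insert().values(
            datasource=datasource, tablename=tablename,
            rows=json.dumps(rows)))

    def load_snapshot(conn, datasource, tablename):
        cond = sa.and_(push_data.c.datasource == datasource,
                       push_data.c.tablename == tablename)
        row = conn.execute(
            sa.select([push_data.c.rows]).where(cond)).first()
        return json.loads(row[0]) if row else []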
- Policy engine (PE):

  - Replicate the policy engine in active-active configuration.
  - Policy is synchronized across PE instances via the Policy DB
  - Every instance subscribes to the same data on oslo-messaging.
  - Reactive enforcement: see the reactive enforcement details under
    "Active-active PE replication, DSDs warm-standby" above

- API:

  - Add new API calls for designating the receiving node as leader or
    follower. The call must complete all tasks before returning
    (e.g. start/stop RPC)
  - Each node has an active API service
  - Each API service routes requests for the PE to its associated intranode
    PE, bypassing DSE2.
  - Requests for any other service (e.g. get data source status) are routed
    over DSE2 and fielded by an active instance of the service on some node
  - Details:

    - in API models, replace every invoke_rpc with a conditional:

      - if the target is the policy engine, target the same-node PE,
        e.g.::

          self.invoke_rpc(
              caller, 'get_row_data', args,
              server=self.node.pe.rpc_server)

      - otherwise, invoke_rpc stays as is, routing to DSE2, e.g.::

          self.invoke_rpc(caller, 'get_row_data', args)

- Load balancer:

  - A layer 7 load balancer (e.g. HAProxy) distributes incoming API calls
    among the nodes (each running an API service).
  - The load balancer can optionally be configured to use sticky sessions to
    pin each API caller to a particular node, avoiding the experience of
    "going back in time" (an example configuration appears at the end of
    this section).

- Each DseNode is monitored and managed by a cluster manager (e.g. Pacemaker)
- External components (load balancer, DBs, and the oslo messaging bus) can be
  made highly available using standard solutions (e.g. clustered LB, Galera
  MySQL cluster, HA RabbitMQ)

- Global leader election with Pacemaker:

  - A resource agent contains the scripts that tell a Congress instance it
    is a leader or follower.
  - Pacemaker decides which Congress instance to promote to leader (master).
  - Pacemaker promotes (demotes) the appropriate Congress instance to leader
    (follower) via the resource agent.
  - Fencing:

    - If the leader node stops responding and a new node is promoted to
      leader, it is possible that the unresponsive node is still doing work
      (e.g. listening on oslo-messaging, issuing action requests).
    - It is generally not a catastrophe if for a time there is more than one
      Congress node doing the work of a leader. (Potential effects include
      duplicate action requests and redundant data source pulls.)
    - Pacemaker can be configured with strict fencing and STONITH for
      deployments that require it. (deployers' choice)
      http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_what_is_stonith

  - In case of network partitions:

    - Pacemaker can be configured to stop each node that is not part of a
      cluster reaching quorum, or to allow each partition to continue
      operating. (deployers' choice)
      http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-options.html
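For reference, a sticky-session load balancer configuration for three such
nodes might look roughly like the following HAProxy fragment (illustrative
only; the addresses, port, and cookie strategy are deployment choices
outside the scope of this spec)::

    # Illustrative haproxy.cfg fragment: a cookie pins each API caller
    # to one node so consecutive reads do not appear to go back in time.
    frontend congress_api
        bind *:1789
        mode http
        default_backend congress_nodes

    backend congress_nodes
        mode http
        balance roundrobin
        cookie CONGRESS_NODE insert indirect nocache
        server node1 192.0.2.11:1789 check cookie node1
        server node2 192.0.2.12:1789 check cookie node2
        server node3 192.0.2.13:1789 check cookie node3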
Active-active PE replication, DSDs warm-standby, each DSD in its own node
###########################################################################

Run N instances of the Congress policy engine in active-active configuration.
One datasource driver per physical datasource publishes data on
oslo-messaging to all policy engines.

::

    +-------------------------------------+
    | Load Balancer (eg. HAProxy)         |
    +----+-------------+-------------+----+
         |             |             |
         |             |             |
    +---------+   +---------+   +---------+
    | +-----+ |   | +-----+ |   | +-----+ |
    | | API | |   | | API | |   | | API | |
    | +-----+ |   | +-----+ |   | +-----+ |
    | +-----+ |   | +-----+ |   | +-----+ |
    | | PE  | |   | | PE  | |   | | PE  | |
    | +-----+ |   | +-----+ |   | +-----+ |
    +---------+   +---------+   +---------+
         |             |             |
         |             |             |
         +-------------+-------------+------------+-----------------+
         |             |             |            |                 |
         |             |             |            |                 |
    +----+-------------+-------------+---+    +---+------------+  +-+---------+
    |               Oslo Msg             |    | Push Data DB   |  |    DBs    |
    +----+-------------+-------------+---+    +----------------+  +-----------+
         |             |             |            (DBs may be combined)
    +----------+  +----------+  +----------+
    | +------+ |  | +------+ |  | +------+ |
    | | Poll | |  | | Poll | |  | | Push | |
    | | Drv  | |  | | Drv  | |  | | Drv  | |
    | | DS 1 | |  | | DS 2 | |  | | DS 3 | |
    | +------+ |  | +------+ |  | +------+ |
    +----------+  +----------+  +----------+
         |             |             |
      +--+---+      +--+---+      +--+---+
      |      |      |      |      |      |
      | DS 1 |      | DS 2 |      | DS 3 |
      |      |      |      |      |      |
      +------+      +------+      +------+

- Downtime: < 1s for queries, ~2s for reactive policy
- Deployment considerations:

  - Requires a cluster manager (Pacemaker) and cluster messaging (Corosync)
  - More complex Pacemaker deployment because there are many different
    kinds of nodes
  - Does not require global leader election (but that is not a big saving if
    we are running Pacemaker+Corosync anyway)
- Performance considerations:

  - [Pro] Multi-process query throughput
  - [Pro] No redundant data-pulling load on datasources
  - [Pro] Each DSD can run in its own node, allowing high-load DSDs to
    operate more smoothly without affecting PE performance.
  - [Pro] PE nodes are symmetric in configuration, making it easy to load
    balance evenly.

- Code changes:

  - New synchronizer, harness, and DB schema to support per-node
    configuration

Hot standby for all Congress components in single process
###########################################################################

Run N instances of single-process Congress, as proposed in:
https://blueprints.launchpad.net/congress/+spec/basic-high-availability

A floating IP points to the primary instance, which handles all queries and
requests, failing over when the primary instance is down.
All instances ingest and process data to stay up to date.
+ +:: + + +---------------+ + | Floating IP | - - - - - - + - - - - - - - - - - - -+ + +----+----------+ | | + | + | | | + +---------------------+ +---------------------+ +---------------------+ + | +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ | + | | API | | DSD | | | | API | | DSD | | | | API | | DSD | | + | +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ | + | +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ | + | | PE | | DSD | | | | PE | | DSD | | | | PE | | DSD | | + | +-----+ +---------+ | | +-----+ +---------+ | | +-----+ +---------+ | + | +-----------------+ | | +-----------------+ | | +-----------------+ | + | | Oslo Msg | | | | Oslo Msg | | | | Oslo Msg | | + | +-----------------+ | | +-----------------+ | | +-----------------+ | + +---------------------+ +---------------------+ +---------------------+ + | | | + | | | + +----------+------------------------+----------------------+------------+ + | Databases | + +-----------------------------------------------------------------------+ + + +- Downtime: < 1s for queries, ~2s for reactive policy +- Feature limitations: + + - Limited support for action execution: each action execution is triggered + N times) + - Limited support for push drivers: push updates only primary instance + (optional DB-sync to non-primary instances) + +- Deployment considerations: + + - Very easy to deploy. No need for cluster manager. Just start N independent + instances of Congress (in-process messaging) and setup floating IP. +- Performance considerations: + + - Performance characteristics similar to single-instance Congress + - [Con] uniprocessing query throughput (optional load balancer can be added + to balance queries between instances) + - [Con] Extra load on data sources from replicated data source drivers + pulling same data N times +- Code changes: + + - (optional) DB-sync of pushed data to non-primary instances + + +Policy +------ + +Not applicable + +Policy actions +-------------- + +Not applicable + + +Data sources +------------ + +Not applicable + + +Data model impact +----------------- + +No impact + + +REST API impact +--------------- + +No impact + + +Security impact +--------------- + +No major security impact identified compared to a non-HA distributed +deployment. + +Notifications impact +-------------------- + +No impact + +Other end user impact +--------------------- + +Proposed changes generally transparent to end user. Some exceptions: + +- Different PE instances may be out-of-sync in their data and policies + (eventual consistency). The issue is generally made transparent to the end + user by making each user sticky to a particular PE instance. But if + a PE instance goes down, the end user reaches a different instance and may + experience out-of-sync artifacts. + +Performance impact +------------------ + +- In single node deployment, there is generally no performance impact. 
- Increased latency due to the network communication required by a
  multi-node deployment
- Increased reactive enforcement latency if action executions are
  persistently logged to facilitate smoother failover
- PE replication can achieve greater query throughput

Other deployer impact
---------------------

- New config settings:

  - set the DSE node type to one of the following:

    - PE+API node
    - a DSDs node
    - all-in-one node (backward compatible default)

  - set the reactive enforcement failover strategy:

    - do not attempt to recover missed actions (backward compatible default)
    - after failover, repeat recent action requests
    - after failover, repeat recent action requests not matched to logged
      executions

- The proposed changes have no impact on existing single-node deployments;
  100% backward compatibility is expected.
- The changes take effect only if the deployer chooses to set up a
  multi-node deployment with the appropriate configs.
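For concreteness, the new settings might be registered roughly as follows
(a sketch; the option and group names are placeholders, not the final
interface)::

    # Hypothetical oslo.config options; names/defaults are placeholders.
    from oslo_config import cfg

    ha_opts = [
        cfg.StrOpt('dse_node_type',
                   default='all-in-one',  # backward compatible default
                   choices=['pe-api', 'dsds', 'all-in-one'],
                   help='Role of this DSE node in a multi-node '
                        'deployment.'),
        cfg.StrOpt('action_failover_strategy',
                   default='none',        # backward compatible default
                   choices=['none', 'repeat-recent', 'repeat-unmatched'],
                   help='How to mitigate missed action executions during '
                        'PE failover.'),
    ]
    cfg.CONF.register_opts(ha_opts, group='dse')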
Developer impact
----------------

No major impact identified.


Implementation
==============

Assignee(s)
-----------

Work to be assigned and tracked via launchpad.


Work items
----------

Items with order dependency:

1. API routing.
   Change API routing to support active-active PE replication, routing
   PE-bound RPCs to the PE instance on the same node as the receiving API
   server. Changes expected to be confined to congress/api/*
2. Reactive enforcement.
   Change the datasource_driver.py:ExecutionDriver class to handle action
   execution requests from replicated PE nodes (locally choose a leader to
   follow)
3. (low priority) Missed-actions mitigation.

   - Implement changes to mitigate missed actions during DSD failover
   - Implement changes to mitigate missed actions during PE failover

Items without dependency:

- Push data persistence.
  Change the datasource_driver.py:PushedDataSourceDriver class to support
  persistence of pushed data. Corresponding DB changes are also needed.
- (potential) Leaderless de-duplication of action execution requests.
- HA guide.
  Create an HA guide sketching the properties and trade-offs of each
  supported deployment type.
- Deployment guide.
  Create a deployment guide for active-active PE replication.


Dependencies
============

- Requires Congress to support distributed deployment (for example, with the
  policy engine and datasource drivers on separate DseNodes). Distributed
  deployment has been addressed by several implemented blueprints. The
  following patch is expected to be the final piece required:
  https://review.openstack.org/#/c/307693/
- This spec does not introduce any new code or library dependencies.
- In line with the OpenStack recommendation
  (http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html), some
  reference deployments use open source software outside of OpenStack:

  - HAProxy: http://www.haproxy.org
  - Pacemaker: http://clusterlabs.org/wiki/Pacemaker
  - Corosync: http://corosync.github.io/corosync/


Testing
=======

We propose to add tempest tests for the following scenarios:

- Up to (N - 1) PE-nodes killed
- Previously killed PE-nodes rejoin.
- Kill and restart the DSDs-node, possibly at the same time PE-nodes are
  killed.

Split-brain scenarios can be manually tested.


Documentation impact
====================

A deployment guide is to be added for each supported reference deployment.
No impact on existing documentation.


References
==========

- IRC discussion on major design decisions (#topic HA design):
  http://eavesdrop.openstack.org/meetings/congressteammeeting/2016/congressteammeeting.2016-06-09-00.04.log.txt
- Notes from summit session:
  https://etherpad.openstack.org/p/newton-congress-availability
- OpenStack HA guide:
  http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html
- HAProxy documentation:
  http://www.haproxy.org/#docs
- Pacemaker documentation:

  - directory: http://clusterlabs.org/doc/
  - Clusters from Scratch:
    http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/index.html
  - Configuration explained:
    http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html

- OCF resource agents:
  http://www.linux-ha.org/wiki/OCF_Resource_Agents
\ No newline at end of file