From d959e4b1bac76b721c35dbac599852745a467e9f Mon Sep 17 00:00:00 2001
From: dparalen
Date: Fri, 4 Dec 2015 19:49:11 +0100
Subject: [PATCH] High Availability for Ironic Inspector

Introduce redundancy and scalability to the ironic inspector service

Change-Id: I88667decc4d01a125fc840b9efb448fdba5dec08
Co-Authored-By: Dmitry Tantsur
---
 specs/HA_inspector.rst | 841 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 841 insertions(+)
 create mode 100644 specs/HA_inspector.rst

diff --git a/specs/HA_inspector.rst b/specs/HA_inspector.rst
new file mode 100644
index 0000000..c1d828f
--- /dev/null
+++ b/specs/HA_inspector.rst
@@ -0,0 +1,841 @@
+..
+    This work is licensed under a Creative Commons Attribution 3.0 Unported
+    License.
+
+    http://creativecommons.org/licenses/by/3.0/legalcode
+
+=======================================
+ High Availability for Ironic Inspector
+=======================================
+
+Ironic inspector is a service that allows bare metal nodes to be
+introspected dynamically; it currently isn't redundant. The goal of
+this blueprint is to suggest *conceptual changes* to the inspector
+service that would make it redundant while maintaining both the
+current inspection feature set and API.
+
+Problem description
+===================
+
+Inspector is a compound service consisting of the inspector API
+service, the firewall and the DHCP (PXE) service. Currently, all
+three components run as a single instance on a shared host per
+OpenStack deployment. A failure of the host or any of the services
+renders introspection unavailable and prevents the cloud administrator
+from enrolling new hardware or from booting already enrolled bare
+metal nodes. Furthermore, Inspector isn't designed to cope well with
+the amount of hardware required for Ironic bare metal usage at large
+scale. With a site size of 10k bare metal nodes in mind, we aim for
+the inspector to sustain a batch load of a couple of hundred
+introspection/enroll requests interleaved with a couple of minutes of
+silence, while maintaining a couple of thousand firewall blacklist
+items. We refer to this use case as *bare metal to tenant*.
+
+Below we describe the current Inspector service architecture and some
+consequences of an Inspector process instance failure.
+
+Introspection process
+---------------------
+
+Node introspection is a sequence of asynchronous steps, controlled by
+the inspector API service, that take various amounts of time to
+finish. One could describe these steps as states of a transition
+system, advanced by events as follows:
+
+* ``starting`` the initial state; the system is advanced into this
+  state by receiving an introspect API request. Introspection
+  configuration and set-up steps are performed while in this state.
+* ``waiting`` the introspection image is booting on the node. The
+  system advances to this state automatically.
+* ``processing`` the introspection image has booted and collected the
+  necessary information from the node. This information is being
+  processed by plug-ins to validate the node status. The system is
+  advanced to this state upon receiving the ``continue`` REST API
+  request.
+* ``finished`` introspection is done and the node is powered off. The
+  system is advanced to this state automatically.
+
+In case of an API service failure, nodes between the ``starting`` and
+``finished`` states will lose their state and may require manual
+intervention to recover. No more nodes can be processed either,
+because the API service runs as a single instance per deployment.
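+
+For illustration, the states and events above could be modeled as two
+simple enumerations. This is a minimal, non-authoritative sketch (the
+names are illustrative rather than an implementation decision); the
+``error`` state and the complete event set are given by the
+`transition-function`_ defined later in this document:
+
+.. code-block:: python
+
+    import enum
+
+
+    class State(enum.Enum):
+        """Introspection states as described above."""
+
+        STARTING = 'starting'      # introspect API request received
+        WAITING = 'waiting'        # introspection image booting on the node
+        PROCESSING = 'processing'  # "continue" received, plug-ins running
+        FINISHED = 'finished'      # introspection done, node powered off
+        ERROR = 'error'            # timeouts and aborts end up here
+
+
+    class Event(enum.Enum):
+        """Events advancing the introspection transition system."""
+
+        INSPECT = 'inspect'        # introspect API request
+        CONTINUE = 'continue'      # continue REST API request from the node
+        TIMEOUT = 'timeout'        # introspection timed out
+        ABORT = 'abort'            # introspection aborted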
+
+Firewall configuration
+----------------------
+
+To minimize interference with normally deployed nodes, inspector
+deploys temporary firewall rules so that only nodes being inspected
+can access its PXE boot service. It is implemented as a blacklist
+containing MAC addresses of nodes that are kept by the ironic service
+but not by inspector. This approach is required because the MAC
+address of a new node isn't known before it boots for the first time.
+
+Depending on the spot in which the API service fails while the
+firewall and DHCP services are intact, the firewall configuration may
+get out of sync and may lead to interference with normal node booting:
+
+* firewall chain set-up (init phase): Inspector's dnsmasq service is
+  exposed to all nodes
+* firewall synchronization periodic task: new nodes added to Ironic
+  aren't blacklisted
+* node introspection finished: the node won't be blacklisted
+
+On the other hand, no boot interference is expected if all services
+(inspector, firewall and DHCP) run on the same host, as all services
+are lost together. Losing the API service during the clean-up
+periodic task should not matter, as the nodes concerned will be kept
+blacklisted during the service downtime.
+
+DHCP (PXE) service
+------------------
+
+The inspector service doesn't manage the DHCP service directly;
+rather, it just requires that DHCP is properly set up and shares the
+host of the API service and the firewall. We'd nevertheless like to
+briefly describe the consequences of the DHCP service failing.
+
+In case of a DHCP service failure, inspected nodes won't be able to
+boot the introspection ramdisk and will eventually fail to get
+inspected because of a timeout. The nodes may loop retrying to boot,
+depending on their firmware configuration.
+
+A fail-over of DHCP from the active to the back-up host (`dnsmasq
+`_ usually) would manifest as booting nodes under introspection
+timing out, or as nodes already booted (holding an address lease)
+getting into an address conflict with another booting node. There's
+not much to help the former situation besides retrying. To prevent
+the latter from happening, the configuration of the DHCP service for
+the introspection purpose should consider disjoint address pools
+served by the DHCP instances, as recommended in the `IP address
+allocation between servers
+`_
+section of the DHCP Failover Protocol RFC. We also recommend using
+the ``dhcp-sequential-ip`` option in the dnsmasq configuration file to
+avoid conflicts within the address pools. See the `related bug report
+`_ for more details on the issue. Introspection being an
+ephemeral matter, synchronization of the leases between the DHCP
+instances isn't necessary, provided that restarting an introspection
+isn't an issue.
+
+Other Inspector parts
+---------------------
+
+* periodic introspection status clean-up, removing old inspection data
+  and finishing timed-out introspections
+* synchronizing the set of nodes with ironic
+* limiting the node power-on rate with a shared lock and a timeout
+
+Proposed change
+===============
+
+In considering the problem of high availability, we are proposing a
+solution that consists of a distributed, shared-nothing, active-active
+implementation of all services that comprise the ironic inspector.
+From the user's point of view, we suggest serving the API through a
+*load balancer*, such as `HAProxy `_, in order to maintain a
+single entry point for the API service (e.g. a floating IP address).
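+
+The decomposition below assumes that every process instance has access
+to a distributed store, a queue and a DLM. As a rough, non-authoritative
+illustration (Tooz is only proposed for the DLM in the work items below,
+and the backend URL, member id and lock name are made-up examples), each
+inspector process instance could bootstrap its coordination handle as
+follows; the later sketches in this document assume such a
+``coordinator`` object:
+
+.. code-block:: python
+
+    import uuid
+
+    from tooz import coordination
+
+    # each inspector process instance identifies itself with a unique
+    # member id against the (assumed) coordination backend
+    coordinator = coordination.get_coordinator(
+        'zookeeper://127.0.0.1:2181',
+        uuid.uuid4().hex.encode())
+    coordinator.start(start_heart=True)
+
+    # a distributed lock instance, e.g. for the reboot throttle below
+    reboot_lock = coordinator.get_lock(b'ironic-inspector-reboot')
+
+    # ... join parties (groups), watch them, elect leaders, run workers ...
+
+    coordinator.stop()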
+
+HA Node Introspection decomposition
+-----------------------------------
+
+Node introspection being a state transition system, we focus on
+*decentralizing* it. We therefore replicate the current introspection
+state of a particular node to all inspector process instances through
+the distributed store. We suggest that both the automatic
+state-advancing requests and the API state-advancing requests are
+performed asynchronously by independent workers.
+
+HA Worker
+---------
+
+Each inspector process provides a pool of asynchronous workers that
+get state transition requests from a queue. We use separate
+``queue.get`` and ``queue.consume`` calls to avoid losing state
+transition requests due to worker failures. This however introduces
+*at-least-once* delivery semantics for the requests. We therefore
+rely on the `transition-function`_ to handle the request delivery
+gracefully. We suggest two kinds of state-transition handling with
+regard to the at-least-once delivery semantics:
+
+Strict (non-reentrant-task) Transition Specification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Getting a request`_
+* `Calculating new node state`_
+* `Updating node state`_
+* `Executing a task`_
+* `Consuming a request`_
+
+Reentrant Task Transition Specification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Getting a request`_
+* `Calculating new node state`_
+* `Executing a task`_
+* `Updating node state`_
+* `Consuming a request`_
+
+A strict transition protecting a state change may lead to a situation
+in which the recorded introspection state does not correspond to the
+real state of the node --- if a worker partitions right after having
+successfully executed the task but just before consuming the request
+from the queue. As a consequence, the transition request, not having
+been consumed, will be encountered again by (another) worker. One can
+refer to this behavior as a *reentrancy glitch* or *déjà vu*.
+
+Since the goal is to protect the inspected node from going through the
+same task again, we rely on the state transition system to handle this
+situation by navigating to the ``error`` state instead.
+
+Removing a node
+^^^^^^^^^^^^^^^
+
+The `Ironic synchronization periodic task`_ puts node delete requests
+on the queue. Workers perform the following steps to handle them:
+
+* `Getting a request`_
+* worker removes the node from the store
+* `Consuming a request`_
+
+A failure of the store while removing the node isn't a concern here,
+as the periodic task will try again later. It is therefore safe to
+always consume the request here.
+
+Shutting Down HA Inspector Processes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All inspector process instances register a ``SIGTERM`` callback. To
+notify inspector worker threads, the ``SIGTERM`` callback sets the
+``sigterm_flag`` upon signal delivery. The flag is process-local and
+its purpose is to allow inspector processes to perform a
+controlled/graceful shutdown. For this mechanism to work, potentially
+blocking operations (such as ``queue.get``) have to be used with a
+configurable timeout value within the workers. All sleep calls
+throughout the process instance should be interruptible, possibly
+implemented as ``sigterm_flag.wait(sleep_time)`` or similar.
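+
+The following sketch illustrates how a worker could combine the strict
+transition ordering with the graceful-shutdown flag described above. It
+is a simplification rather than the proposed implementation: ``queue``,
+``store`` and ``locks`` stand in for whatever queue, distributed store
+and DLM abstractions are eventually chosen, and the error/retry branches
+are spelled out in the sections below rather than here.
+
+.. code-block:: python
+
+    import threading
+
+    sigterm_flag = threading.Event()          # set by the SIGTERM callback
+
+
+    def worker_loop(queue, store, locks, get_timeout=5):
+        while not sigterm_flag.is_set():
+            request = queue.get(timeout=get_timeout)   # get, don't consume yet
+            if request is None:                        # timed out, poll again
+                continue
+            lock = locks.get(request.node_id)
+            if not lock.acquire(blocking=False):       # node busy; leave the
+                continue                               # request for later
+            try:
+                # strict ordering: calculate and record the new state first,
+                # only then execute the task bound to the transition
+                new_state = store.advance(request.node_id, request.event)
+                follow_up = request.execute(new_state)
+                if follow_up is not None:              # task emitted an event
+                    queue.put(follow_up)
+                queue.consume(request)                 # consume only on success
+            finally:
+                lock.release()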
+ +Getting a request +^^^^^^^^^^^^^^^^^ + +* any worker instance may execute any request the queue contains +* worker gets state transition or node delete request from the queue +* if ``SIGTERM`` flag is set, worker stops +* if ``queue.get`` timed-out (task is ``None``) poll the queue again +* lock the BM node related to the request +* if locking failed worker polls the queue again not consuming the + request + +Calculating new node state +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* worker instantiates a state transition system instance for current + node state +* if instantiating failed (e.g. no such node in the store) worker + performs `Retrying a request`_ +* worker advances the state transition system +* if the state machine is jammed (illegal state transition request) + worker performs `Consuming a request`_ + +Updating node state +^^^^^^^^^^^^^^^^^^^ + +The introspection state is kept in the store, visible to all worker +instances. + +* worker saves node state in the store +* if saving node state in the store failed (such as node has been + removed) worker performs `Retrying a request`_ + +Executing a task +^^^^^^^^^^^^^^^^ + +* worker performs the task bound to the transition request +* if the task result is a transition request worker puts it on the + queue + +Consuming a request +^^^^^^^^^^^^^^^^^^^ + +* worker consumes the state transition request from the queue +* worker releases related node lock +* worker continues from the beginning + +Retrying a request +^^^^^^^^^^^^^^^^^^ + +* worker releases node lock +* worker continues from the beginning not consuming the request to + retry later + +Introspection State-Transition System +------------------------------------- + +Node introspection state is managed by a worker-local instance of a +state transition system. The state transition function is as follows. + +.. compound:: + + .. _transition-function: + + .. table:: Transition function + + +----------------+-----------------------+------------------------------------+ + | State | Event | Target | + +================+=======================+====================================+ + | N/A | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Starting* | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Starting* | S~ | Waiting | + +----------------+-----------------------+------------------------------------+ + | Waiting | S~ | Waiting | + +----------------+-----------------------+------------------------------------+ + | Waiting | Timeout | Error | + +----------------+-----------------------+------------------------------------+ + | Waiting | Abort | Error | + +----------------+-----------------------+------------------------------------+ + | Waiting | Continue! | Processing | + +----------------+-----------------------+------------------------------------+ + | Processing | Continue! | Error | + +----------------+-----------------------+------------------------------------+ + | Processing | F~ | Finished | + +----------------+-----------------------+------------------------------------+ + | Finished+ | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Finished+ | Abort | Error | + +----------------+-----------------------+------------------------------------+ + | Error+ | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + + .. 
table:: Legend
+
+      +------------+-----------------------------+
+      | Expression | Meaning                     |
+      +============+=============================+
+      | State*     | the initial state           |
+      +------------+-----------------------------+
+      | State+     | the terminal/accepting state|
+      +------------+-----------------------------+
+      | State~     | the automatic event         |
+      |            | originating in State        |
+      +------------+-----------------------------+
+      | Event!     | strict/non-reentrant        |
+      |            | transition event            |
+      +------------+-----------------------------+
+
+.. _timer-decomposition:
+
+HA Singleton Periodic task decomposition
+----------------------------------------
+
+The ironic inspector service houses a couple of periodic tasks. At
+any point in time, at most a single "instance" of each periodic task
+flavor should be running, no matter how many process instances there
+are. For this purpose, the processes form a distributed periodic task
+management party.
+
+Process instances register a ``SIGTERM`` callback that, upon signal
+delivery, makes the process instance leave the party and switch the
+``reset_flag``.
+
+The process instances install a watch on the party. Upon party
+shrinkage, the processes reset their periodic task, if they have one
+set, by triggering the ``reset_flag``, and participate in a new
+distributed periodic task management leader election. Party growth
+isn't of concern to the processes.
+
+It is because of this task reset on party shrinkage that a custom
+flag, rather than the ``sigterm_flag``, has to be used to stop the
+periodic task. Otherwise, setting the ``sigterm_flag`` because of a
+party change would stop the whole service.
+
+The leader process executes the periodic task loop. Upon an exception
+or partitioning (see `partitioning-concerns`_), the leader stops by
+flipping the ``sigterm_flag`` so that the whole inspector service
+stops. The periodic task loop is stopped eventually, as it performs
+``reset_flag.wait(period)`` instead of sleeping.
+
+The periodic task management should happen in a separate asynchronous
+thread instance, one per periodic task. Losing the leader due to an
+error (or partitioning) isn't a concern --- a new one will eventually
+be elected and only a couple of periodic task runs will be wasted
+(including those that died together with the leader).
+
+HA Periodic clean-up decomposition
+----------------------------------
+
+Clean-up should be implemented as independent HA singleton periodic
+tasks with a configurable time period, one for each of the
+introspection timeout and ironic synchronization tasks.
+
+Introspection timeout periodic task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To finish introspections that are timing out:
+
+* select nodes for which the introspection is timing out
+* for each such node, put a request to time-out the introspection on
+  the queue for a worker to process
+
+Ironic synchronization periodic task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To remove nodes no longer tracked by Ironic:
+
+* select nodes that are kept by Inspector but not by Ironic
+* for each such node, put a request to delete the node on the queue
+  for a worker to process
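+
+As a rough illustration of the singleton periodic task management
+described above, the party and the leader election could be built on a
+Tooz-style coordination API (Tooz is proposed for the DLM in the work
+items below); the group name, the period and the helper names are
+illustrative only, and error handling is omitted:
+
+.. code-block:: python
+
+    import threading
+
+    from tooz import coordination
+
+    reset_flag = threading.Event()
+    GROUP = b'ironic-inspector-cleanup'        # made-up group name
+
+
+    def manage_periodic_task(coordinator, sigterm_flag):
+        try:
+            coordinator.create_group(GROUP).get()
+        except coordination.GroupAlreadyExist:
+            pass
+        coordinator.join_group(GROUP).get()
+
+        def on_leave(event):                   # party shrank: reset the task
+            reset_flag.set()                   # and take part in re-election
+
+        def on_elected(event):                 # this process became leader
+            reset_flag.clear()
+            threading.Thread(target=cleanup_loop,
+                             args=(sigterm_flag,)).start()
+
+        coordinator.watch_leave_group(GROUP, on_leave)
+        coordinator.watch_elected_as_leader(GROUP, on_elected)
+        while not sigterm_flag.is_set():
+            coordinator.run_watchers()         # fires the callbacks above
+            sigterm_flag.wait(1)
+
+
+    def cleanup_loop(sigterm_flag, period=60):
+        while not (reset_flag.is_set() or sigterm_flag.is_set()):
+            # select timed-out introspections / nodes no longer in ironic
+            # and put the corresponding requests on the queue (not shown)
+            reset_flag.wait(period)            # interruptible sleep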
+
+HA Reboot Throttle Decomposition
+--------------------------------
+
+As a workaround for some hardware, the reboot request rate should be
+limited. For this purpose, a single distributed lock instance should
+be utilized. At any point in time, only a single worker may hold the
+lock while performing the reboot (power-on) task. Upon acquiring the
+lock, the reboot state transition sleeps in an interruptible fashion
+for a configurable quantum of time. If the sleep was indeed
+interrupted, the worker should raise an exception, stopping the reboot
+procedure and the worker itself. This interruption should happen as
+part of the graceful shutdown mechanism. It should be implemented
+utilizing the same ``SIGTERM`` flag/event that workers use to check
+for pending shutdown: ``sigterm_flag.wait(timeout=quantum)``.
+
+Process partitioning isn't a concern here, because all workers sleep
+while holding the lock. Partitioning therefore slows down the reboot
+pace by the amount of time the lock takes to expire. It should be
+possible to disable the reboot throttle altogether through the
+configuration.
+
+HA Firewall decomposition
+-------------------------
+
+The PXE boot environment is configured and active on all inspector
+hosts. The firewall protection of the PXE environment is active on
+all inspector hosts, blocking the hosts' PXE service. At any given
+point in time, at most one inspector host's PXE service is available,
+and it is available to all inspected nodes.
+
+Building blocks
+^^^^^^^^^^^^^^^
+
+The general policy is allow-all, and each node that is not being
+inspected has a block-exception to the general policy. Due to its
+size, the black-list is maintained locally on all inspector hosts,
+pulling items from ironic periodically or asynchronously from a
+pub--sub channel.
+
+Nodes that are being introspected are white-listed in a separate set
+of firewall rules. Nodes that are being discovered for the first time
+fall through the black-list due to the general allow-all black-list
+policy.
+
+Nodes that the HA firewall is supposed to allow to access the PXE
+service are kept in a distributed store or obtained asynchronously
+from a pub--sub channel. Process instance workers add (subtract)
+firewall rules to (from) the distributed store as necessary or
+announce the changes on the pub--sub channels. Firewall rules are
+``(port_ID, port_MAC)`` tuples to be white-/black-listed.
+
+Process instances use custom chains to implement the firewall: the
+white-list chain and the black-list chain. Failing through the
+white-list chain, a packet "proceeds" to the black-list chain.
+Failing through the black-list chain, a packet is allowed to access
+the PXE service port. A node port rule may be present in both the
+white-list and the black-list chain at the same time if the node is
+being introspected.
+
+HA Decomposition
+^^^^^^^^^^^^^^^^
+
+On start-up, the processes poll Ironic to build their black-list
+chains for the first time and set up a *local* periodic Ironic
+black-list synchronization task or set callbacks on the black-list
+pub--sub channel.
+
+Process instances form a distributed firewall management party that
+they watch for changes. Process instances register a ``SIGTERM``
+callback that, upon signal delivery, makes the process instance leave
+the party and reset the firewall, completely blocking their PXE
+service.
+
+Upon party shrinkage, processes reset their firewall white-list chain,
+the *pass* rule in the black-list chain, and the rule set watch
+(should they have one set), and participate in a distributed firewall
+management leader election. Party growth isn't of concern to the
+processes.
+
+The leader process's black-list chain contains the *pass* rule while
+the other processes' black-list chains don't.
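+
+As an illustration of the chain layout described above, the sketch
+below shells out to ``iptables``; the chain names, the interface name
+and the helper are made up, flushing of stale rules and error handling
+are omitted, and this is a sketch rather than the authoritative
+firewall implementation:
+
+.. code-block:: python
+
+    import subprocess
+
+
+    def _iptables(*args):
+        subprocess.check_call(('iptables',) + args)
+
+
+    def rebuild_chains(whitelist=(), blacklist=(), leader=False,
+                       interface='br-inspector', dhcp_port='67'):
+        for chain in ('inspector-white', 'inspector-black'):
+            _iptables('-N', chain)             # assuming they don't exist yet
+        # white-list chain: nodes currently being introspected may PXE boot
+        for mac in whitelist:
+            _iptables('-A', 'inspector-white', '-m', 'mac',
+                      '--mac-source', mac, '-j', 'ACCEPT')
+        _iptables('-A', 'inspector-white', '-j', 'inspector-black')
+        # black-list chain: nodes known to ironic but not being inspected
+        for mac in blacklist:
+            _iptables('-A', 'inspector-black', '-m', 'mac',
+                      '--mac-source', mac, '-j', 'DROP')
+        if leader:
+            # the *pass* rule: newly discovered nodes fall through to here
+            _iptables('-A', 'inspector-black', '-j', 'ACCEPT')
+        # hook the chains in front of the DHCP/PXE port; whatever the chains
+        # do not explicitly ACCEPT or DROP hits the final DROP, so only the
+        # leader's pass rule opens the service to unknown nodes
+        _iptables('-I', 'INPUT', '1', '-i', interface, '-p', 'udp',
+                  '--dport', dhcp_port, '-j', 'DROP')
+        _iptables('-I', 'INPUT', '1', '-i', interface, '-p', 'udp',
+                  '--dport', dhcp_port, '-j', 'inspector-white')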
+
+Having been elected, the leader process builds the white-list and
+registers a watch on the distributed store or a white-list pub--sub
+channel callback in order to keep the white-list firewall chain
+up-to-date. Other process instances don't maintain a white-list
+chain; that chain is empty for them.
+
+Upon any exception (or process instance partitioning), a process
+resets its firewall to completely protect its PXE service.
+
+Notes
+^^^^^
+
+Periodic white-list store polling and the white-list pub--sub channel
+callbacks are complementary, optional facilities to enhance the
+responsiveness of the firewall, and the user may prefer enabling one
+or the other or both simultaneously as necessary. The same holds for
+the black-list Ironic polling and the black-list pub--sub channel
+callbacks.
+
+To assemble the blacklist of MAC addresses, the processes may need to
+poll the ironic service periodically for node information. A
+cache/proxy of this information might optionally be kept to reduce the
+load on Ironic.
+
+The firewall management should be implemented as a separate
+asynchronous thread in each inspector process instance. The firewall
+being lost due to a leader failure isn't a concern --- a new leader
+will eventually be elected. Some nodes being introspected may
+experience a timeout in the waiting state and fail the introspection,
+though.
+
+Periodic Ironic--firewall node synchronization and white-list store
+polling should be implemented as independent threads with a
+configurable time period, ``0<=period<=30s``, ideally
+``0<=period<=15s``, so that the window between introducing a node to
+ironic and blacklisting it in the inspector firewall is kept below the
+user's resolution.
+
+As an optimization, the implementation may consider offloading the MAC
+address rules of node ports from the firewall chains into `IP sets
+`_.
+
+HA HTTP API Decomposition
+-------------------------
+
+We assume a load balancer (HAProxy) shielding the user from the
+inspector service. All the inspector API process instances should
+export the same REST API. Each API request should be handled in a
+separate asynchronous thread instance (as is the case now with the
+`Flask `_ framework). At any point in time, any of the process
+instances may serve any request.
+
+.. _partitioning-concerns:
+
+Partitioning concerns
+---------------------
+
+Upon a connection exception or worker process partitioning, the
+affected entity should retry establishing the connection before
+announcing a failure. The retry count and timeout should be
+configurable for each of the ironic, database, distributed store, lock
+and queue services. The timeout should be interruptible, possibly
+implemented as waiting for the appropriate termination/``SIGTERM``
+flag, e.g. ``sigterm_flag.wait(timeout)``. Should the retrying fail,
+the affected entity breaks the whole worker inspector service by
+setting the flag, to avoid damage to resources --- most of the time,
+other worker service entities would be equally affected by the
+partition anyway. The user may consider restarting the affected
+worker service process instance once the partitioning issue is
+resolved.
+
+Partitioning of HTTP API service instances isn't a concern, as those
+are stateless and accessed through a load balancer.
+
+Alternatives
+------------
+
+HA Worker Decomposition
+^^^^^^^^^^^^^^^^^^^^^^^
+
+We've briefly examined the `TaskFlow
+`_ library as an alternate
+tasking mechanism. Currently, TaskFlow supports only `directed
+acyclic graphs as dependency structure
+`_ between
+particular steps.
Inspector service has to however support restarting +of the introspection for a particular node, bringing loops into the +graph; see `transition-function`_. Moreover TaskFlow does not +`support external event propagating +`_ to a running +flow, such as the ``continue`` call from the bare metal node. Because +of that, the overall state of the introspection of particular node has +to be maintained explicitly if TaskFlow is adopted. TaskFlow, too, +requires tasks to be reentrant/idempotent. + +HA Firewall decomposition +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The firewall facility can be replaced by Neutron once it adopts +`enhancements to subnet DHCP options +`_ and `allows serving DHCP +to unknown hosts `_. We're +keeping Inspector's firewall facility for users that are interested in +stand-alone deployments. + +Data model impact +----------------- + +Queue +^^^^^ + +State transition request item is introduced, it should contain these +attributes (as an oslo.versioned) object: + +* node ID +* transition event + +A clean-up request item is introduced removing a node. Attributes +comprising the request: + +* node ID + +Pub--sub channels +^^^^^^^^^^^^^^^^^ + +Two channels are introduced: firewall white-list and black-list. The +message format is as follows: + +* add/remove +* port ID, MAC address + +Store +^^^^^ + +Node state column is introduced to the node table. + +HTTP API impact +--------------- + +API service is provided by dedicated processes. + +Client (CLI) impact +------------------- + +None planned. + +Performance and scalability impact +---------------------------------- + +We hope this change brings in desired redundancy and scaling for the +inspector service. We however expect the change to have a negative +network utilization impact as the introspection task requires a queue +and a DLM to coordinate. + +The inspector firewall facility requires periodic polling of the +ironic service inventory in each inspector instance. Therefore we +expect increased load on the ironic service. + +Firewall facility leader partitioning causes boot service outage for +the election period. Some nodes may therefore timeout booting. + +Each time the firewall leader updates the hosts firewall node +information is polled from ironic service. This may introduce delays +in firewall availability. If a node being introspected is removed +from the ironic service, the change will not propagate to Inspector +until the introspection finishes. + +Security impact +--------------- + +New services introduced that might require hardening and protection: + +* load balancer +* distributed locking facility +* queue +* pub--sub channels + +Deployer impact +--------------- + +Inspector Service Configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* distributed locking facility, queue, firewall pub--sub channels and + load balancer introduce new configuration options, especially + URLs/hosts and credentials +* worker pool size, integral, ``0`_; primary +* `divius `_ + +Work Items +---------- + +* replace current locking with Tooz DLM +* introduce state machine +* split API service and introduce conductors and queue +* split cleaning into a separate timeout and synchronization handlers + and introduce leader-election to these periodic procedures +* introduce leader-election to the firewall facility +* introduce the pub--sub channels to the firewall facility + +Dependencies +============ + +We require proper inspector `grenade testing +`_ before landing HA so we +avoid breaking users as much as possible. 
+ +Testing +======= + +All work items should be tested as separate patches both with +functional and unit tests as well as upgrade tests with Grenade. + +Having landed all the required work items it should be possible to +test Inspector with focus on redundancy and scaling. + +References +========== + +During the analysis process we considered these blueprints: + +* `Abort introspection + `_ +* `Node States + `_ +* `Node Locking `_ +* `Oslo.messaging at-least-once semantics + `_ + +RFEs: + +* `TaskFlow: flow suspend&continue + `_ +* `TaskFlow: non-DAG flow patterns + `_ +* `HA for Ironic Inspector + `_ +* `Safe queue for Tooz + `_ +* `Watchable store for Tooz + `_ +* `Enhanced Network/Subnet DHCP Options + `_ +* `Neutron DHCP serve unknown hosts + `_ + +Community sources: + +* `DLM options discussion + `_ +* `TaskFlow with external events and Non-DAG flows + `_ +* Joshua Harlow's comment that `Tooz should implement the + at-least-once semantics not Oslo.messaging + `_ + +RFCs: + +* `DHCP Failover Protocol: IP address allocation between servers `_ + +Tools: + +* `IP Sets `_ +* `Dnsmasq `_