From d959e4b1bac76b721c35dbac599852745a467e9f Mon Sep 17 00:00:00 2001
From: dparalen
Date: Fri, 4 Dec 2015 19:49:11 +0100
Subject: [PATCH] High Availability for Ironic Inspector

Introduce redundancy and scalability to the ironic inspector service

Change-Id: I88667decc4d01a125fc840b9efb448fdba5dec08
Co-Authored-By: Dmitry Tantsur
---
 specs/HA_inspector.rst | 841 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 841 insertions(+)
 create mode 100644 specs/HA_inspector.rst

diff --git a/specs/HA_inspector.rst b/specs/HA_inspector.rst
new file mode 100644
index 0000000..c1d828f
--- /dev/null
+++ b/specs/HA_inspector.rst
@@ -0,0 +1,841 @@
+..
+    This work is licensed under a Creative Commons Attribution 3.0 Unported
+    License.
+
+    http://creativecommons.org/licenses/by/3.0/legalcode
+
+=======================================
+ High Availability for Ironic Inspector
+=======================================
+
+Ironic inspector is a service that allows bare metal nodes to be
+introspected dynamically; it currently isn't redundant. The goal of
+this blueprint is to suggest *conceptual changes* to the inspector
+service that would make it redundant while maintaining both the
+current inspection feature set and API.
+
+Problem description
+===================
+
+Inspector is a compound service consisting of the inspector API
+service, the firewall and the DHCP (PXE) service. Currently, all
+three components run as a single instance on a shared host per
+OpenStack deployment. A failure of the host or any of the services
+renders introspection unavailable and prevents the cloud administrator
+from enrolling new hardware or from booting already enrolled bare
+metal nodes. Furthermore, Inspector isn't designed to cope well with
+the amount of hardware required for Ironic bare metal usage at large
+scale. With a site size of 10k bare metal nodes in mind, we aim for
+the inspector to sustain a batch load of a couple of hundred
+introspection/enroll requests interleaved with a couple of minutes of
+silence, while maintaining a couple of thousand firewall blacklist
+items. We refer to this use case as *bare metal to tenant*.
+
+Below we describe the current Inspector service architecture and some
+consequences of an Inspector process instance failure.
+
+Introspection process
+---------------------
+
+Node introspection is a sequence of asynchronous steps, controlled by
+the inspector API service, that take various amounts of time to
+finish. One could describe these steps as states of a transition
+system, advanced by events as follows:
+
+* ``starting`` the initial state; the system is advanced into this
+  state by receiving an introspect API request. Introspection
+  configuration and set-up steps are performed while in this state.
+* ``waiting`` the introspection image is booting on the node. The
+  system advances to this state automatically.
+* ``processing`` the introspection image has booted and collected the
+  necessary information from the node. This information is being
+  processed by plug-ins to validate the node status. The system is
+  advanced to this state upon receiving the ``continue`` REST API
+  request.
+* ``finished`` introspection is done and the node is powered off. The
+  system is advanced to this state automatically.
+
+In case of an API service failure, nodes between the ``starting`` and
+``finished`` states will lose their state and may require manual
+intervention to recover. No more nodes can be processed either,
+because the API service runs as a single instance per deployment.
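+
+For illustration, the states and events above could be modeled as two
+simple enumerations. This is a minimal, non-authoritative sketch (the
+names are illustrative rather than an implementation decision); the
+``error`` state and the complete event set are given by the
+`transition-function`_ defined later in this document:
+
+.. code-block:: python
+
+    import enum
+
+
+    class State(enum.Enum):
+        """Introspection states as described above."""
+
+        STARTING = 'starting'      # introspect API request received
+        WAITING = 'waiting'        # introspection image booting on the node
+        PROCESSING = 'processing'  # "continue" received, plug-ins running
+        FINISHED = 'finished'      # introspection done, node powered off
+        ERROR = 'error'            # timeouts and aborts end up here
+
+
+    class Event(enum.Enum):
+        """Events advancing the introspection transition system."""
+
+        INSPECT = 'inspect'        # introspect API request
+        CONTINUE = 'continue'      # continue REST API request from the node
+        TIMEOUT = 'timeout'        # introspection timed out
+        ABORT = 'abort'            # introspection aborted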
+
+Firewall configuration
+----------------------
+
+To minimize interference with normally deployed nodes, inspector
+deploys temporary firewall rules so that only nodes being inspected
+can access its PXE boot service. It is implemented as a blacklist
+containing MAC addresses of nodes that are kept by the ironic service
+but not by inspector. This approach is required because the MAC
+address of a new node isn't known before it boots for the first time.
+
+Depending on the spot in which the API service fails while the
+firewall and DHCP services are intact, the firewall configuration may
+get out of sync and may lead to interference with normal node booting:
+
+* firewall chain set-up (init phase): Inspector's dnsmasq service is
+  exposed to all nodes
+* firewall synchronization periodic task: new nodes added to Ironic
+  aren't blacklisted
+* node introspection finished: the node won't be blacklisted
+
+On the other hand, no boot interference is expected if all services
+(inspector, firewall and DHCP) run on the same host, as all services
+are lost together. Losing the API service during the clean-up
+periodic task should not matter, as the nodes concerned will be kept
+blacklisted during the service downtime.
+
+DHCP (PXE) service
+------------------
+
+The inspector service doesn't manage the DHCP service directly;
+rather, it just requires that DHCP is properly set up and shares the
+host of the API service and the firewall. We'd nevertheless like to
+briefly describe the consequences of the DHCP service failing.
+
+In case of a DHCP service failure, inspected nodes won't be able to
+boot the introspection ramdisk and will eventually fail to get
+inspected because of a timeout. The nodes may loop retrying to boot,
+depending on their firmware configuration.
+
+A fail-over of DHCP from the active to the back-up host (`dnsmasq
+`_ usually) would manifest as booting nodes under introspection
+timing out, or as nodes already booted (holding an address lease)
+getting into an address conflict with another booting node. There's
+not much to help the former situation besides retrying. To prevent
+the latter from happening, the configuration of the DHCP service for
+the introspection purpose should consider disjoint address pools
+served by the DHCP instances, as recommended in the `IP address
+allocation between servers
+`_
+section of the DHCP Failover Protocol RFC. We also recommend using
+the ``dhcp-sequential-ip`` option in the dnsmasq configuration file to
+avoid conflicts within the address pools. See the `related bug report
+`_ for more details on the issue. Introspection being an
+ephemeral matter, synchronization of the leases between the DHCP
+instances isn't necessary, provided that restarting an introspection
+isn't an issue.
+
+Other Inspector parts
+---------------------
+
+* periodic introspection status clean-up, removing old inspection data
+  and finishing timed-out introspections
+* synchronizing the set of nodes with ironic
+* limiting the node power-on rate with a shared lock and a timeout
+
+Proposed change
+===============
+
+In considering the problem of high availability, we are proposing a
+solution that consists of a distributed, shared-nothing, active-active
+implementation of all services that comprise the ironic inspector.
+From the user's point of view, we suggest serving the API through a
+*load balancer*, such as `HAProxy `_, in order to maintain a
+single entry point for the API service (e.g. a floating IP address).
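+
+The decomposition below assumes that every process instance has access
+to a distributed store, a queue and a DLM. As a rough, non-authoritative
+illustration (Tooz is only proposed for the DLM in the work items below,
+and the backend URL, member id and lock name are made-up examples), each
+inspector process instance could bootstrap its coordination handle as
+follows; the later sketches in this document assume such a
+``coordinator`` object:
+
+.. code-block:: python
+
+    import uuid
+
+    from tooz import coordination
+
+    # each inspector process instance identifies itself with a unique
+    # member id against the (assumed) coordination backend
+    coordinator = coordination.get_coordinator(
+        'zookeeper://127.0.0.1:2181',
+        uuid.uuid4().hex.encode())
+    coordinator.start(start_heart=True)
+
+    # a distributed lock instance, e.g. for the reboot throttle below
+    reboot_lock = coordinator.get_lock(b'ironic-inspector-reboot')
+
+    # ... join parties (groups), watch them, elect leaders, run workers ...
+
+    coordinator.stop()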
+
+HA Node Introspection decomposition
+-----------------------------------
+
+Node introspection being a state transition system, we focus on
+*decentralizing* it. We therefore replicate the current introspection
+state of a particular node to all inspector process instances through
+the distributed store. We suggest that both the automatic
+state-advancing requests and the API state-advancing requests are
+performed asynchronously by independent workers.
+
+HA Worker
+---------
+
+Each inspector process provides a pool of asynchronous workers that
+get state transition requests from a queue. We use separate
+``queue.get`` and ``queue.consume`` calls to avoid losing state
+transition requests due to worker failures. This however introduces
+*at-least-once* delivery semantics for the requests. We therefore
+rely on the `transition-function`_ to handle the request delivery
+gracefully. We suggest two kinds of state-transition handling with
+regard to the at-least-once delivery semantics:
+
+Strict (non-reentrant-task) Transition Specification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Getting a request`_
+* `Calculating new node state`_
+* `Updating node state`_
+* `Executing a task`_
+* `Consuming a request`_
+
+Reentrant Task Transition Specification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Getting a request`_
+* `Calculating new node state`_
+* `Executing a task`_
+* `Updating node state`_
+* `Consuming a request`_
+
+A strict transition protecting a state change may lead to a situation
+in which the recorded introspection state does not correspond to the
+real state of the node --- if a worker partitions right after having
+successfully executed the task but just before consuming the request
+from the queue. As a consequence, the transition request, not having
+been consumed, will be encountered again by (another) worker. One can
+refer to this behavior as a *reentrancy glitch* or *déjà vu*.
+
+Since the goal is to protect the inspected node from going through the
+same task again, we rely on the state transition system to handle this
+situation by navigating to the ``error`` state instead.
+
+Removing a node
+^^^^^^^^^^^^^^^
+
+The `Ironic synchronization periodic task`_ puts node delete requests
+on the queue. Workers perform the following steps to handle them:
+
+* `Getting a request`_
+* worker removes the node from the store
+* `Consuming a request`_
+
+A failure of the store while removing the node isn't a concern here,
+as the periodic task will try again later. It is therefore safe to
+always consume the request here.
+
+Shutting Down HA Inspector Processes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All inspector process instances register a ``SIGTERM`` callback. To
+notify inspector worker threads, the ``SIGTERM`` callback sets the
+``sigterm_flag`` upon signal delivery. The flag is process-local and
+its purpose is to allow inspector processes to perform a
+controlled/graceful shutdown. For this mechanism to work, potentially
+blocking operations (such as ``queue.get``) have to be used with a
+configurable timeout value within the workers. All sleep calls
+throughout the process instance should be interruptible, possibly
+implemented as ``sigterm_flag.wait(sleep_time)`` or similar.
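+
+The following sketch illustrates how a worker could combine the strict
+transition ordering with the graceful-shutdown flag described above. It
+is a simplification rather than the proposed implementation: ``queue``,
+``store`` and ``locks`` stand in for whatever queue, distributed store
+and DLM abstractions are eventually chosen, and the error/retry branches
+are spelled out in the sections below rather than here.
+
+.. code-block:: python
+
+    import threading
+
+    sigterm_flag = threading.Event()          # set by the SIGTERM callback
+
+
+    def worker_loop(queue, store, locks, get_timeout=5):
+        while not sigterm_flag.is_set():
+            request = queue.get(timeout=get_timeout)   # get, don't consume yet
+            if request is None:                        # timed out, poll again
+                continue
+            lock = locks.get(request.node_id)
+            if not lock.acquire(blocking=False):       # node busy; leave the
+                continue                               # request for later
+            try:
+                # strict ordering: calculate and record the new state first,
+                # only then execute the task bound to the transition
+                new_state = store.advance(request.node_id, request.event)
+                follow_up = request.execute(new_state)
+                if follow_up is not None:              # task emitted an event
+                    queue.put(follow_up)
+                queue.consume(request)                 # consume only on success
+            finally:
+                lock.release()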
+ +Getting a request +^^^^^^^^^^^^^^^^^ + +* any worker instance may execute any request the queue contains +* worker gets state transition or node delete request from the queue +* if ``SIGTERM`` flag is set, worker stops +* if ``queue.get`` timed-out (task is ``None``) poll the queue again +* lock the BM node related to the request +* if locking failed worker polls the queue again not consuming the + request + +Calculating new node state +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* worker instantiates a state transition system instance for current + node state +* if instantiating failed (e.g. no such node in the store) worker + performs `Retrying a request`_ +* worker advances the state transition system +* if the state machine is jammed (illegal state transition request) + worker performs `Consuming a request`_ + +Updating node state +^^^^^^^^^^^^^^^^^^^ + +The introspection state is kept in the store, visible to all worker +instances. + +* worker saves node state in the store +* if saving node state in the store failed (such as node has been + removed) worker performs `Retrying a request`_ + +Executing a task +^^^^^^^^^^^^^^^^ + +* worker performs the task bound to the transition request +* if the task result is a transition request worker puts it on the + queue + +Consuming a request +^^^^^^^^^^^^^^^^^^^ + +* worker consumes the state transition request from the queue +* worker releases related node lock +* worker continues from the beginning + +Retrying a request +^^^^^^^^^^^^^^^^^^ + +* worker releases node lock +* worker continues from the beginning not consuming the request to + retry later + +Introspection State-Transition System +------------------------------------- + +Node introspection state is managed by a worker-local instance of a +state transition system. The state transition function is as follows. + +.. compound:: + + .. _transition-function: + + .. table:: Transition function + + +----------------+-----------------------+------------------------------------+ + | State | Event | Target | + +================+=======================+====================================+ + | N/A | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Starting* | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Starting* | S~ | Waiting | + +----------------+-----------------------+------------------------------------+ + | Waiting | S~ | Waiting | + +----------------+-----------------------+------------------------------------+ + | Waiting | Timeout | Error | + +----------------+-----------------------+------------------------------------+ + | Waiting | Abort | Error | + +----------------+-----------------------+------------------------------------+ + | Waiting | Continue! | Processing | + +----------------+-----------------------+------------------------------------+ + | Processing | Continue! | Error | + +----------------+-----------------------+------------------------------------+ + | Processing | F~ | Finished | + +----------------+-----------------------+------------------------------------+ + | Finished+ | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + | Finished+ | Abort | Error | + +----------------+-----------------------+------------------------------------+ + | Error+ | Inspect | Starting | + +----------------+-----------------------+------------------------------------+ + + .. 
table:: Legend
+
+      +------------+-----------------------------+
+      | Expression | Meaning                     |
+      +============+=============================+
+      | State*     | the initial state           |
+      +------------+-----------------------------+
+      | State+     | the terminal/accepting state|
+      +------------+-----------------------------+
+      | State~     | the automatic event         |
+      |            | originating in State        |
+      +------------+-----------------------------+
+      | Event!     | strict/non-reentrant        |
+      |            | transition event            |
+      +------------+-----------------------------+
+
+.. _timer-decomposition:
+
+HA Singleton Periodic task decomposition
+----------------------------------------
+
+The ironic inspector service houses a couple of periodic tasks. At
+any point in time, at most a single "instance" of each periodic task
+flavor should be running, no matter how many process instances there
+are. For this purpose, the processes form a distributed periodic task
+management party.
+
+Process instances register a ``SIGTERM`` callback that, upon signal
+delivery, makes the process instance leave the party and switch the
+``reset_flag``.
+
+The process instances install a watch on the party. Upon party
+shrinkage, the processes reset their periodic task, if they have one
+set, by triggering the ``reset_flag``, and participate in a new
+distributed periodic task management leader election. Party growth
+isn't of concern to the processes.
+
+It is because of this task reset on party shrinkage that a custom
+flag, rather than the ``sigterm_flag``, has to be used to stop the
+periodic task. Otherwise, setting the ``sigterm_flag`` because of a
+party change would stop the whole service.
+
+The leader process executes the periodic task loop. Upon an exception
+or partitioning (see `partitioning-concerns`_), the leader stops by
+flipping the ``sigterm_flag`` so that the whole inspector service
+stops. The periodic task loop is stopped eventually, as it performs
+``reset_flag.wait(period)`` instead of sleeping.
+
+The periodic task management should happen in a separate asynchronous
+thread instance, one per periodic task. Losing the leader due to an
+error (or partitioning) isn't a concern --- a new one will eventually
+be elected and only a couple of periodic task runs will be wasted
+(including those that died together with the leader).
+
+HA Periodic clean-up decomposition
+----------------------------------
+
+Clean-up should be implemented as independent HA singleton periodic
+tasks with a configurable time period, one for each of the
+introspection timeout and ironic synchronization tasks.
+
+Introspection timeout periodic task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To finish introspections that are timing out:
+
+* select nodes for which the introspection is timing out
+* for each such node, put a request to time-out the introspection on
+  the queue for a worker to process
+
+Ironic synchronization periodic task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To remove nodes no longer tracked by Ironic:
+
+* select nodes that are kept by Inspector but not by Ironic
+* for each such node, put a request to delete the node on the queue
+  for a worker to process
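+
+As a rough illustration of the singleton periodic task management
+described above, the party and the leader election could be built on a
+Tooz-style coordination API (Tooz is proposed for the DLM in the work
+items below); the group name, the period and the helper names are
+illustrative only, and error handling is omitted:
+
+.. code-block:: python
+
+    import threading
+
+    from tooz import coordination
+
+    reset_flag = threading.Event()
+    GROUP = b'ironic-inspector-cleanup'        # made-up group name
+
+
+    def manage_periodic_task(coordinator, sigterm_flag):
+        try:
+            coordinator.create_group(GROUP).get()
+        except coordination.GroupAlreadyExist:
+            pass
+        coordinator.join_group(GROUP).get()
+
+        def on_leave(event):                   # party shrank: reset the task
+            reset_flag.set()                   # and take part in re-election
+
+        def on_elected(event):                 # this process became leader
+            reset_flag.clear()
+            threading.Thread(target=cleanup_loop,
+                             args=(sigterm_flag,)).start()
+
+        coordinator.watch_leave_group(GROUP, on_leave)
+        coordinator.watch_elected_as_leader(GROUP, on_elected)
+        while not sigterm_flag.is_set():
+            coordinator.run_watchers()         # fires the callbacks above
+            sigterm_flag.wait(1)
+
+
+    def cleanup_loop(sigterm_flag, period=60):
+        while not (reset_flag.is_set() or sigterm_flag.is_set()):
+            # select timed-out introspections / nodes no longer in ironic
+            # and put the corresponding requests on the queue (not shown)
+            reset_flag.wait(period)            # interruptible sleep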
+
+HA Reboot Throttle Decomposition
+--------------------------------
+
+As a workaround for some hardware, the reboot request rate should be
+limited. For this purpose, a single distributed lock instance should
+be utilized. At any point in time, only a single worker may hold the
+lock while performing the reboot (power-on) task. Upon acquiring the
+lock, the reboot state transition sleeps in an interruptible fashion
+for a configurable quantum of time. If the sleep was indeed
+interrupted, the worker should raise an exception, stopping the reboot
+procedure and the worker itself. This interruption should happen as
+part of the graceful shutdown mechanism. It should be implemented
+utilizing the same ``SIGTERM`` flag/event that workers use to check
+for pending shutdown: ``sigterm_flag.wait(timeout=quantum)``.
+
+Process partitioning isn't a concern here, because all workers sleep
+while holding the lock. Partitioning therefore slows down the reboot
+pace by the amount of time the lock takes to expire. It should be
+possible to disable the reboot throttle altogether through the
+configuration.
+
+HA Firewall decomposition
+-------------------------
+
+The PXE boot environment is configured and active on all inspector
+hosts. The firewall protection of the PXE environment is active on
+all inspector hosts, blocking the hosts' PXE service. At any given
+point in time, at most one inspector host's PXE service is available,
+and it is available to all inspected nodes.
+
+Building blocks
+^^^^^^^^^^^^^^^
+
+The general policy is allow-all, and each node that is not being
+inspected has a block-exception to the general policy. Due to its
+size, the black-list is maintained locally on all inspector hosts,
+pulling items from ironic periodically or asynchronously from a
+pub--sub channel.
+
+Nodes that are being introspected are white-listed in a separate set
+of firewall rules. Nodes that are being discovered for the first time
+fall through the black-list due to the general allow-all black-list
+policy.
+
+Nodes that the HA firewall is supposed to allow to access the PXE
+service are kept in a distributed store or obtained asynchronously
+from a pub--sub channel. Process instance workers add (subtract)
+firewall rules to (from) the distributed store as necessary or
+announce the changes on the pub--sub channels. Firewall rules are
+``(port_ID, port_MAC)`` tuples to be white-/black-listed.
+
+Process instances use custom chains to implement the firewall: the
+white-list chain and the black-list chain. Failing through the
+white-list chain, a packet "proceeds" to the black-list chain.
+Failing through the black-list chain, a packet is allowed to access
+the PXE service port. A node port rule may be present in both the
+white-list and the black-list chain at the same time if the node is
+being introspected.
+
+HA Decomposition
+^^^^^^^^^^^^^^^^
+
+On start-up, the processes poll Ironic to build their black-list
+chains for the first time and set up a *local* periodic Ironic
+black-list synchronization task or set callbacks on the black-list
+pub--sub channel.
+
+Process instances form a distributed firewall management party that
+they watch for changes. Process instances register a ``SIGTERM``
+callback that, upon signal delivery, makes the process instance leave
+the party and reset the firewall, completely blocking their PXE
+service.
+
+Upon party shrinkage, processes reset their firewall white-list chain,
+the *pass* rule in the black-list chain, and the rule set watch
+(should they have one set), and participate in a distributed firewall
+management leader election. Party growth isn't of concern to the
+processes.
+
+The leader process's black-list chain contains the *pass* rule while
+the other processes' black-list chains don't.
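+
+As an illustration of the chain layout described above, the sketch
+below shells out to ``iptables``; the chain names, the interface name
+and the helper are made up, flushing of stale rules and error handling
+are omitted, and this is a sketch rather than the authoritative
+firewall implementation:
+
+.. code-block:: python
+
+    import subprocess
+
+
+    def _iptables(*args):
+        subprocess.check_call(('iptables',) + args)
+
+
+    def rebuild_chains(whitelist=(), blacklist=(), leader=False,
+                       interface='br-inspector', dhcp_port='67'):
+        for chain in ('inspector-white', 'inspector-black'):
+            _iptables('-N', chain)             # assuming they don't exist yet
+        # white-list chain: nodes currently being introspected may PXE boot
+        for mac in whitelist:
+            _iptables('-A', 'inspector-white', '-m', 'mac',
+                      '--mac-source', mac, '-j', 'ACCEPT')
+        _iptables('-A', 'inspector-white', '-j', 'inspector-black')
+        # black-list chain: nodes known to ironic but not being inspected
+        for mac in blacklist:
+            _iptables('-A', 'inspector-black', '-m', 'mac',
+                      '--mac-source', mac, '-j', 'DROP')
+        if leader:
+            # the *pass* rule: newly discovered nodes fall through to here
+            _iptables('-A', 'inspector-black', '-j', 'ACCEPT')
+        # hook the chains in front of the DHCP/PXE port; whatever the chains
+        # do not explicitly ACCEPT or DROP hits the final DROP, so only the
+        # leader's pass rule opens the service to unknown nodes
+        _iptables('-I', 'INPUT', '1', '-i', interface, '-p', 'udp',
+                  '--dport', dhcp_port, '-j', 'DROP')
+        _iptables('-I', 'INPUT', '1', '-i', interface, '-p', 'udp',
+                  '--dport', dhcp_port, '-j', 'inspector-white')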
+
+Having been elected, the leader process builds the white-list and
+registers a watch on the distributed store or a white-list pub--sub
+channel callback in order to keep the white-list firewall chain
+up-to-date. Other process instances don't maintain a white-list
+chain; that chain is empty for them.
+
+Upon any exception (or process instance partitioning), a process
+resets its firewall to completely protect its PXE service.
+
+Notes
+^^^^^
+
+Periodic white-list store polling and the white-list pub--sub channel
+callbacks are complementary, optional facilities to enhance the
+responsiveness of the firewall, and the user may prefer enabling one
+or the other or both simultaneously as necessary. The same holds for
+the black-list Ironic polling and the black-list pub--sub channel
+callbacks.
+
+To assemble the blacklist of MAC addresses, the processes may need to
+poll the ironic service periodically for node information. A
+cache/proxy of this information might optionally be kept to reduce the
+load on Ironic.
+
+The firewall management should be implemented as a separate
+asynchronous thread in each inspector process instance. The firewall
+being lost due to a leader failure isn't a concern --- a new leader
+will eventually be elected. Some nodes being introspected may
+experience a timeout in the waiting state and fail the introspection,
+though.
+
+Periodic Ironic--firewall node synchronization and white-list store
+polling should be implemented as independent threads with a
+configurable time period, ``0<=period<=30s``, ideally
+``0<=period<=15s``, so that the window between introducing a node to
+ironic and blacklisting it in the inspector firewall is kept below the
+user's resolution.
+
+As an optimization, the implementation may consider offloading the MAC
+address rules of node ports from the firewall chains into `IP sets
+`_.
+
+HA HTTP API Decomposition
+-------------------------
+
+We assume a load balancer (HAProxy) shielding the user from the
+inspector service. All the inspector API process instances should
+export the same REST API. Each API request should be handled in a
+separate asynchronous thread instance (as is the case now with the
+`Flask `_ framework). At any point in time, any of the process
+instances may serve any request.
+
+.. _partitioning-concerns:
+
+Partitioning concerns
+---------------------
+
+Upon a connection exception or worker process partitioning, the
+affected entity should retry establishing the connection before
+announcing a failure. The retry count and timeout should be
+configurable for each of the ironic, database, distributed store, lock
+and queue services. The timeout should be interruptible, possibly
+implemented as waiting for the appropriate termination/``SIGTERM``
+flag, e.g. ``sigterm_flag.wait(timeout)``. Should the retrying fail,
+the affected entity breaks the whole worker inspector service by
+setting the flag, to avoid damage to resources --- most of the time,
+other worker service entities would be equally affected by the
+partition anyway. The user may consider restarting the affected
+worker service process instance once the partitioning issue is
+resolved.
+
+Partitioning of HTTP API service instances isn't a concern, as those
+are stateless and accessed through a load balancer.
+
+Alternatives
+------------
+
+HA Worker Decomposition
+^^^^^^^^^^^^^^^^^^^^^^^
+
+We've briefly examined the `TaskFlow
+`_ library as an alternate
+tasking mechanism. Currently, TaskFlow supports only `directed
+acyclic graphs as dependency structure
+`_ between
+particular steps.
Inspector service has to however support restarting +of the introspection for a particular node, bringing loops into the +graph; see `transition-function`_. Moreover TaskFlow does not +`support external event propagating +`_ to a running +flow, such as the ``continue`` call from the bare metal node. Because +of that, the overall state of the introspection of particular node has +to be maintained explicitly if TaskFlow is adopted. TaskFlow, too, +requires tasks to be reentrant/idempotent. + +HA Firewall decomposition +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The firewall facility can be replaced by Neutron once it adopts +`enhancements to subnet DHCP options +`_ and `allows serving DHCP +to unknown hosts `_. We're +keeping Inspector's firewall facility for users that are interested in +stand-alone deployments. + +Data model impact +----------------- + +Queue +^^^^^ + +State transition request item is introduced, it should contain these +attributes (as an oslo.versioned) object: + +* node ID +* transition event + +A clean-up request item is introduced removing a node. Attributes +comprising the request: + +* node ID + +Pub--sub channels +^^^^^^^^^^^^^^^^^ + +Two channels are introduced: firewall white-list and black-list. The +message format is as follows: + +* add/remove +* port ID, MAC address + +Store +^^^^^ + +Node state column is introduced to the node table. + +HTTP API impact +--------------- + +API service is provided by dedicated processes. + +Client (CLI) impact +------------------- + +None planned. + +Performance and scalability impact +---------------------------------- + +We hope this change brings in desired redundancy and scaling for the +inspector service. We however expect the change to have a negative +network utilization impact as the introspection task requires a queue +and a DLM to coordinate. + +The inspector firewall facility requires periodic polling of the +ironic service inventory in each inspector instance. Therefore we +expect increased load on the ironic service. + +Firewall facility leader partitioning causes boot service outage for +the election period. Some nodes may therefore timeout booting. + +Each time the firewall leader updates the hosts firewall node +information is polled from ironic service. This may introduce delays +in firewall availability. If a node being introspected is removed +from the ironic service, the change will not propagate to Inspector +until the introspection finishes. + +Security impact +--------------- + +New services introduced that might require hardening and protection: + +* load balancer +* distributed locking facility +* queue +* pub--sub channels + +Deployer impact +--------------- + +Inspector Service Configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* distributed locking facility, queue, firewall pub--sub channels and + load balancer introduce new configuration options, especially + URLs/hosts and credentials +* worker pool size, integral, ``0`_; primary +* `divius `_ + +Work Items +---------- + +* replace current locking with Tooz DLM +* introduce state machine +* split API service and introduce conductors and queue +* split cleaning into a separate timeout and synchronization handlers + and introduce leader-election to these periodic procedures +* introduce leader-election to the firewall facility +* introduce the pub--sub channels to the firewall facility + +Dependencies +============ + +We require proper inspector `grenade testing +`_ before landing HA so we +avoid breaking users as much as possible. 
+ +Testing +======= + +All work items should be tested as separate patches both with +functional and unit tests as well as upgrade tests with Grenade. + +Having landed all the required work items it should be possible to +test Inspector with focus on redundancy and scaling. + +References +========== + +During the analysis process we considered these blueprints: + +* `Abort introspection + `_ +* `Node States + `_ +* `Node Locking `_ +* `Oslo.messaging at-least-once semantics + `_ + +RFEs: + +* `TaskFlow: flow suspend&continue + `_ +* `TaskFlow: non-DAG flow patterns + `_ +* `HA for Ironic Inspector + `_ +* `Safe queue for Tooz + `_ +* `Watchable store for Tooz + `_ +* `Enhanced Network/Subnet DHCP Options + `_ +* `Neutron DHCP serve unknown hosts + `_ + +Community sources: + +* `DLM options discussion + `_ +* `TaskFlow with external events and Non-DAG flows + `_ +* Joshua Harlow's comment that `Tooz should implement the + at-least-once semantics not Oslo.messaging + `_ + +RFCs: + +* `DHCP Failover Protocol: IP address allocation between servers `_ + +Tools: + +* `IP Sets `_ +* `Dnsmasq `_