High Availability for Ironic Inspector
Introduce redundancy and scalability to the ironic inspector service Change-Id: I88667decc4d01a125fc840b9efb448fdba5dec08 Co-Authored-By: Dmitry Tantsur <divius.inside@gmail.com>
This commit is contained in:
parent
bd92b96fe3
commit
d959e4b1ba
|
@ -0,0 +1,841 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======================================
|
||||
High Availability for Ironic Inspector
|
||||
=======================================
|
||||
|
||||
Ironic inspector is a service that allows bare metal nodes to be
|
||||
introspected dynamically, that currently isn't redundant. The goal of
|
||||
this blueprint is to suggest *conceptual changes* to the inspector
|
||||
service that would make inspector redundant while maintaining both the
|
||||
current inspection feature set and API.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Inspector is a compound service consisting of the inspector API
|
||||
service, the firewall and the DHCP (PXE) service. Currently, all
|
||||
three components run a single instance on a shared host per OpenStack
|
||||
deployment. A failure of the host or any of the services renders
|
||||
introspection unavailable and prevents the cloud administrator from
|
||||
enrolling new hardware or from booting already enrolled baremetal
|
||||
nodes. Furthermore, Inspector isn't designed to cope well with the
|
||||
amount of hardware required for Ironic bare metal usage at large
|
||||
scale. With a site size of 10k bare metal nodes in mind, we aim at
|
||||
the inspector sustaining a batch load of a couple of hundred
|
||||
introspection/enroll requests interleaved with couple of minutes of
|
||||
silence, maintaining a couple of thousand firewall black list items.
|
||||
We refer to this use case as *bare metal to tenant*.
|
||||
|
||||
Below we describe the current Inspector service architecture with some
|
||||
Inspector process instance failure consequences.
|
||||
|
||||
Introspection process
|
||||
---------------------
|
||||
|
||||
Node introspection is a sequence of asynchronous steps, controlled by
|
||||
the inspector API service, that take various amounts of time to
|
||||
finish. One could describe these steps as states of a transition
|
||||
system, advanced by events as follows:
|
||||
|
||||
* ``starting`` the initial state; the system is advanced into this
|
||||
state by receiving an introspect API request. Introspection
|
||||
configuration and set-up steps are performed while in this state.
|
||||
* ``waiting`` introspection image is booting on the node. The system
|
||||
advances to this state automatically.
|
||||
* ``processing`` introspection image has booted and collected
|
||||
necessary information from the node. This information is being
|
||||
processed by plug-ins to validate node status. The system is
|
||||
advanced to this state having received the ``continue`` REST API
|
||||
request.
|
||||
* ``finished`` introspection is done, node powered-off. The system
|
||||
is advanced to this state automatically.
|
||||
|
||||
In case of an API service failure, nodes in-between the ``starting``
|
||||
and ``finished`` state, will lose their state, and may require manual
|
||||
intervention to recover. No more nodes can be processed either
|
||||
because the API service runs in a single instance per deployment.
|
||||
|
||||
Firewall configuration
|
||||
----------------------
|
||||
|
||||
To minimize interference with normally deployed nodes, inspector
|
||||
deploys temporary firewall rules so only nodes being inspected can
|
||||
access its PXE boot service. It is implemented as a blacklist
|
||||
containing MAC addresses of nodes kept by ironic service but not by
|
||||
inspector. This is required because the MAC address isn't known
|
||||
before a node boots for the first time.
|
||||
|
||||
Depending on the spot in which the API service fails while the
|
||||
firewall and DHCP services are intact, firewall configuration may get
|
||||
out of sync and may lead to interference with normal node booting:
|
||||
|
||||
* firewall chain set-up (init phase): Inspector's dnsmasq service is
|
||||
exposed to all nodes
|
||||
* firewall synchronization periodic task: new nodes added to Ironic
|
||||
aren't blacklisted
|
||||
* node introspection finished: the node won't be blacklisted
|
||||
|
||||
On the other hand, no boot interference is expected if running all
|
||||
services (inspector, firewall and DHCP), on the same host, as all
|
||||
service are lost together. Losing the API service during clean-up
|
||||
periodic task, should not matter as the nodes concerned will be kept
|
||||
blacklisted during service downtime.
|
||||
|
||||
DHCP (PXE) service
|
||||
------------------
|
||||
|
||||
Inspector service doesn't manage the DHCP service directly, rather, it
|
||||
just requires DHCP is properly set up and shares the host of the API
|
||||
service and the firewall. We'd anyway like to briefly describe the
|
||||
consequences of the DHCP service failing.
|
||||
|
||||
In case of a DHCP service failure inspected nodes won't be able to
|
||||
boot the introspection ramdisk and eventually fail to get inspected
|
||||
because of a timeout. The nodes may loop retrying to boot depending
|
||||
on their firmware configuration.
|
||||
|
||||
A fail-over of DHCP from active to back-up host (`dnsmasq
|
||||
<http://www.thekelleys.org.uk/dnsmasq/doc.html>`_ usually) would
|
||||
manifest with booting nodes under introspection timing out or nodes
|
||||
already booted (with a lease of an address) getting into an address
|
||||
conflict with another node booting. There's not much to help the
|
||||
former situation besides retrying. To prevent the latter from
|
||||
happening, the configuration of DHCP service for the introspection
|
||||
purpose should consider disjoint address pools served by the DHCP
|
||||
instances such as recommended in `IP address allocation between
|
||||
servers
|
||||
<https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_
|
||||
section of the DHCP Failover Protocol RFC. We also recommend using
|
||||
the ``dhcp-sequential-ip`` in the dnsmasq configuration file to avoid
|
||||
conflicts within the address pools. See `related bug report
|
||||
<https://bugzilla.redhat.com/show_bug.cgi?id=1301659#c20>`_ for more
|
||||
details on the issue. The introspection being an ephemeral matter,
|
||||
synchronization of the leases between the DHCP instances isn't
|
||||
necessary if restarting introspection isn't an issue.
|
||||
|
||||
Other Inspector parts
|
||||
---------------------
|
||||
|
||||
* periodic introspection status clean-up, removing old inspection data
|
||||
and finishing timed-out introspections
|
||||
* synchronizing set of nodes with ironic
|
||||
* limiting node power-on rate with a shared lock and a timeout
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
In considering the problem of high availability, we are proposing a
|
||||
solution that consists of a distributed, shared-nothing, active-active
|
||||
implementation of all services that comprise the ironic inspector.
|
||||
From the user point of view, we suggest API service to serve through a
|
||||
*load balancer*, such as `HAProxy <http://www.haproxy.org/>`_, in
|
||||
order to maintain a single entry point for the API service (e.g.
|
||||
floating IP address).
|
||||
|
||||
HA Node Introspection decomposition
|
||||
-----------------------------------
|
||||
|
||||
Node introspection being a state transition system, we focus on
|
||||
*decentralizing* it. We therefore replicate the current introspection
|
||||
state through the distributed store in all inspector process instances
|
||||
for particular node. We suggest that both the automatic state
|
||||
advancing requests as well as API state advancing requests are
|
||||
performed asynchronously by independent workers.
|
||||
|
||||
HA Worker
|
||||
---------
|
||||
|
||||
Each inspector process provides a pool of asynchronous workers that
|
||||
get state transition requests from a queue. We use separate
|
||||
``queue.get`` and ``queue.consume`` calls to avoid losing state
|
||||
transition requests due to worker failures. This however introduces
|
||||
the *at-least-once* delivery semantics to the requests. We therefore
|
||||
rely on the `transition-function`_ to handle the request delivery
|
||||
gracefully. We suggest two kinds of state-transition handling with
|
||||
regards to the at-least-once delivery semantics:
|
||||
|
||||
Strict (non-reentrant-task) Transition Specification
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* `Getting a request`_
|
||||
* `Calculating new node state`_
|
||||
* `Updating node state`_
|
||||
* `Executing a task`_
|
||||
* `Consuming a request`_
|
||||
|
||||
Reentrant Task Transition Specification
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
* `Getting a request`_
|
||||
* `Calculating new node state`_
|
||||
* `Executing a task`_
|
||||
* `Updating node state`_
|
||||
* `Consuming a request`_
|
||||
|
||||
Strict transition protecting a state change may lead to a situation
|
||||
that the state of introspection is not in correspondence with the node
|
||||
in reality --- if a worker partitions right after having successfully
|
||||
executed the task but just before consuming the request from the
|
||||
queue. As a consequence the transition request not having been
|
||||
consumed will be encountered again with (another) worker. One can
|
||||
refer to this behavior as a *reentrancy glitch or Déjà vu*
|
||||
|
||||
Since the goal is to protect the inspected node from going through the
|
||||
same task again, we rely on the state transition system to handle this
|
||||
situation by navigating to the ``error`` state instead.
|
||||
|
||||
Removing a node
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
`Ironic synchronization periodic task`_ puts node delete requests on
|
||||
the queue. Workers perform following steps to handle:
|
||||
|
||||
* `Getting a request`_
|
||||
* worker removes the node from the store
|
||||
* `Consuming a request`_
|
||||
|
||||
Failure of store removing the node isn't a concern here as the
|
||||
periodic task will try again later. It is therefore safe to always
|
||||
consume the request here.
|
||||
|
||||
Shutting Down HA Inspector Processes
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
All inspector process instances register a ``SIGTERM`` callback. To
|
||||
notify inspector worker threads, the ``SIGTERM`` callback sets the
|
||||
``sigterm_flag`` upon the signal delivery. The flag is process-local
|
||||
and its purpose is to allow inspector processes to perform a
|
||||
controlled/graceful shutdown. For this mechanism to work, potentially
|
||||
blocking operations (such as ``queue.get``) have to be used with a
|
||||
configurable timeout value within the workers. All sleep calls
|
||||
throughout the process instance should be interruptible, possibly
|
||||
implemented as ``sigterm_flag.wait(sleep_time)`` or similar.
|
||||
|
||||
Getting a request
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
* any worker instance may execute any request the queue contains
|
||||
* worker gets state transition or node delete request from the queue
|
||||
* if ``SIGTERM`` flag is set, worker stops
|
||||
* if ``queue.get`` timed-out (task is ``None``) poll the queue again
|
||||
* lock the BM node related to the request
|
||||
* if locking failed worker polls the queue again not consuming the
|
||||
request
|
||||
|
||||
Calculating new node state
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* worker instantiates a state transition system instance for current
|
||||
node state
|
||||
* if instantiating failed (e.g. no such node in the store) worker
|
||||
performs `Retrying a request`_
|
||||
* worker advances the state transition system
|
||||
* if the state machine is jammed (illegal state transition request)
|
||||
worker performs `Consuming a request`_
|
||||
|
||||
Updating node state
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The introspection state is kept in the store, visible to all worker
|
||||
instances.
|
||||
|
||||
* worker saves node state in the store
|
||||
* if saving node state in the store failed (such as node has been
|
||||
removed) worker performs `Retrying a request`_
|
||||
|
||||
Executing a task
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
* worker performs the task bound to the transition request
|
||||
* if the task result is a transition request worker puts it on the
|
||||
queue
|
||||
|
||||
Consuming a request
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* worker consumes the state transition request from the queue
|
||||
* worker releases related node lock
|
||||
* worker continues from the beginning
|
||||
|
||||
Retrying a request
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* worker releases node lock
|
||||
* worker continues from the beginning not consuming the request to
|
||||
retry later
|
||||
|
||||
Introspection State-Transition System
|
||||
-------------------------------------
|
||||
|
||||
Node introspection state is managed by a worker-local instance of a
|
||||
state transition system. The state transition function is as follows.
|
||||
|
||||
.. compound::
|
||||
|
||||
.. _transition-function:
|
||||
|
||||
.. table:: Transition function
|
||||
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| State | Event | Target |
|
||||
+================+=======================+====================================+
|
||||
| N/A | Inspect | Starting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Starting* | Inspect | Starting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Starting* | S~ | Waiting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Waiting | S~ | Waiting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Waiting | Timeout | Error |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Waiting | Abort | Error |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Waiting | Continue! | Processing |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Processing | Continue! | Error |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Processing | F~ | Finished |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Finished+ | Inspect | Starting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Finished+ | Abort | Error |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
| Error+ | Inspect | Starting |
|
||||
+----------------+-----------------------+------------------------------------+
|
||||
|
||||
.. table:: Legend
|
||||
|
||||
+------------+-----------------------------+
|
||||
| Expression | Meaning |
|
||||
+============+=============================+
|
||||
| State* | the initial state |
|
||||
+------------+-----------------------------+
|
||||
| State+ | the terminal/accepting state|
|
||||
+------------+-----------------------------+
|
||||
| State~ | the automatic event |
|
||||
| | originating in State |
|
||||
+------------+-----------------------------+
|
||||
| Event! | strict/non-reentrant |
|
||||
| | transition event |
|
||||
+------------+-----------------------------+
|
||||
|
||||
.. _timer-decomposition:
|
||||
|
||||
HA Singleton Periodic task decomposition
|
||||
----------------------------------------
|
||||
|
||||
Ironic inspector service houses a couple of periodic tasks. At any
|
||||
point, up to a single "instance" of a periodic task flavor should be
|
||||
running, no matter the process instances count. For this purpose, the
|
||||
processes form a periodic task distributed management party.
|
||||
|
||||
Process instances register a ``SIGTERM`` callback that, the signal
|
||||
being delivered, makes the process instance leave the party and switch
|
||||
the ``reset_flag``.
|
||||
|
||||
The process instances install a watch on the party. Upon the party
|
||||
shrinkage, the processes reset their periodic task, if they have one
|
||||
set, triggering the ``reset_flag`` and participate in new distributed
|
||||
periodic task management leader election. Party growth isn't of
|
||||
concern to the processes.
|
||||
|
||||
It's because of the task reset due to the party shrinkage a custom
|
||||
flag has to be used, instead of the ``sigterm_flag``, to stop the
|
||||
periodic task. Otherwise, setting the ``sigterm_flag`` because of the
|
||||
party change would stop the whole service.
|
||||
|
||||
The leader process executes the periodic task loop. Upon exception or
|
||||
partitioning, mind the `partitioning-concerns`_, the leader stops
|
||||
through flipping the ``sigterm_flag`` in order for the inspector
|
||||
service to stop. The periodic task loop is stopped eventually as it
|
||||
performs ``reset_flag.wait(period)`` instead of sleeping.
|
||||
|
||||
The periodic task management should happen in a separate asynchronous
|
||||
thread instance, one per periodic task. Losing leader due to its
|
||||
error (or partitioning) isn't a concern --- a new one will eventually
|
||||
be elected and a couple of periodic task runs will be wasted
|
||||
(including those that died together with the leader).
|
||||
|
||||
HA Periodic clean-up decomposition
|
||||
----------------------------------
|
||||
|
||||
Clean-up should be implemented as independent HA singleton periodic
|
||||
tasks with configurable time period, one for each of the introspection
|
||||
timeout and ironic synchronization tasks.
|
||||
|
||||
Introspection timeout periodic task
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To finish introspections that are timing-out:
|
||||
|
||||
* select nodes for which the introspection is timing out
|
||||
* for each node:
|
||||
* put a request to time-out the introspection on the queue for a
|
||||
worker to process
|
||||
|
||||
Ironic synchronization periodic task
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To remove nodes no longer tracked by Ironic:
|
||||
|
||||
* select nodes that are kept by Inspector but not kept by Ironic
|
||||
* for each node:
|
||||
* put a request to delete the node on the queue for a worker to
|
||||
process
|
||||
|
||||
HA Reboot Throttle Decomposition
|
||||
--------------------------------
|
||||
|
||||
As a workaround for some hardware, reboot request rate should be
|
||||
limited. For this purpose, a single distributed lock instance should
|
||||
be utilized. At any point in time, only a single worker may hold the
|
||||
lock while performing the reboot (power-on) task. Upon acquiring the
|
||||
lock, the reboot state transition sleeps in an interruptible fashion
|
||||
for a configurable quantum of time. If the sleep was indeed
|
||||
interrupted, the worker should raise an exception stopping the reboot
|
||||
procedure and the worker itself. This interruption should happen as
|
||||
part of the graceful shutdown mechanism. This should be implemented
|
||||
utilizing the same ``SIGTERM`` flag/event workers use to check for
|
||||
pending shutdown: ``sigterm_flag.wait(timeout=quantum)``
|
||||
|
||||
Process partitioning isn't a concern here because all workers sleep
|
||||
while holding the lock. Partitioning therefore slows down the reboot
|
||||
pace by the amount of time a lock takes to expire. It should be
|
||||
possible to disable the reboot throttle altogether through the
|
||||
configuration.
|
||||
|
||||
HA Firewall decomposition
|
||||
-------------------------
|
||||
|
||||
The PXE boot environment is configured and active on all inspector
|
||||
hosts. The firewall protection of the PXE environment is active on all
|
||||
inspector hosts, blocking the hosts' PXE service. At any given point
|
||||
in time, at most one inspector host's PXE service is available, and it
|
||||
is available to all inspected nodes.
|
||||
|
||||
Building blocks
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The general policy is allow-all, and each node that is not being
|
||||
inspected has a block-exception to the general policy. Due to its
|
||||
size, the black-list is maintained locally on all inspector hosts,
|
||||
pulling items from ironic periodically or asynchronously from a
|
||||
pub--sub channel.
|
||||
|
||||
Nodes that are being introspected are white-listed in a separate set
|
||||
of firewall rules. Nodes that are being discovered for the first time
|
||||
fall through the black-list due to the general allow-all black-list
|
||||
policy.
|
||||
|
||||
Nodes the HA firewall is supposed to allow access to the PXE service,
|
||||
are kept in a distributed store or obtained asynchronously from a
|
||||
pub--sub channel. Process instance workers add (subtract) firewall
|
||||
rules to (from) the distributed store as necessary or announce the
|
||||
changes on the pub--sub channels. Firewall rules are ``(port_ID,
|
||||
port_MAC)`` tuples to be white-/black-listed.
|
||||
|
||||
Process instances use custom chains to implement the firewall: the
|
||||
white-list chain and the black-list chain. Failing through the
|
||||
white-list chain, a packet "proceeds" to the black-list chain. Failing
|
||||
through the black-list chain, a packet is allowed to access the PXE
|
||||
service port. A node port rule may be present both in the white-list
|
||||
and the black-list chain at the same time if being introspected.
|
||||
|
||||
HA Decomposition
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
Starting, the processes poll Ironic to build their black-list chains
|
||||
for the first time and set up *local* periodic Ironic black-list
|
||||
synchronisation task or set callbacks on the black-list pub--sub
|
||||
channel.
|
||||
|
||||
Process instances form a distributed firewall management party that
|
||||
they watch for changes. Process instances register a ``SIGTERM``
|
||||
callback that, the signal being delivered, makes the process instance
|
||||
leave the party and reset the firewall, completely blocking their PXE
|
||||
service.
|
||||
|
||||
Upon the party shrinkage, processes reset their firewall white-list
|
||||
chain, the *pass* rule in the black-list chain, and the rule set watch
|
||||
(should they have one set) and participate in a distributed firewall
|
||||
management leader election. Party growth isn't of concern to the
|
||||
processes.
|
||||
|
||||
The leader process' black-list chain contains the *pass* rule while
|
||||
other process's black-list chains don't. Having been elected, the
|
||||
leader process builds the white-list and registers a watch on the
|
||||
distributed store or a white-list pub--sub channel callback in order
|
||||
to keep the white-list firewall chain up-to-date. Other process
|
||||
instances don't maintain a white-list chain, that chain is empty for
|
||||
them.
|
||||
|
||||
Upon any exception (or process instance partitioning), a process
|
||||
resets its firewall to completely protect its PXE service.
|
||||
|
||||
Notes
|
||||
^^^^^
|
||||
|
||||
Periodic white-list store polling and the white-list pub--sub channel
|
||||
callbacks are mutually optional facilities to enhance the
|
||||
responsiveness of the firewall, and the user may prefer enabling one
|
||||
or the other or both simultaneously as necessary. The same holds for
|
||||
the black-list Ironic polling and the black-list pub--sub channel
|
||||
callbacks.
|
||||
|
||||
To assemble the blacklist of MAC addresses, the processes may need to
|
||||
poll the ironic service periodically for node information. A
|
||||
cache/proxy of this information might be kept optionally to reduce the
|
||||
load on Ironic.
|
||||
|
||||
The firewall management should be implemented as a separate
|
||||
asynchronous thread in each inspector process instance. Firewall being
|
||||
lost due to the leader failure isn't a concern --- new leader will be
|
||||
eventually elected. Some nodes being introspected may experience a
|
||||
timeout in the waiting state and fail the introspection though.
|
||||
|
||||
Periodic Ironic--firewall node synchronization and white-list store
|
||||
polling should be implemented as independent threads with configurable
|
||||
time period, ``0<=period<=30s``, ideally ``0<=period<=15s`` so the
|
||||
window between introducing a node to ironic and blacklisting it in
|
||||
inspector firewall is kept below user's resolution.
|
||||
|
||||
As an optimization, the implementation may consider offloading the MAC
|
||||
address rules of node ports from firewall chains into `IP sets
|
||||
<http://ipset.netfilter.org/changelog.html>`_
|
||||
|
||||
HA HTTP API Decomposition
|
||||
-------------------------
|
||||
|
||||
We assume a Load Balancer (HAProxy) shielding the user from the
|
||||
inspector service. All the inspector API process instances should
|
||||
export the same REST API. Each API Request should be handled in a
|
||||
separate asynchronous thread instance (as is the case now with the
|
||||
`Flask <https://pypi.python.org/pypi/Flask>`_ framework). At any point
|
||||
in time, any of the process instances may serve any request.
|
||||
|
||||
.. _partitioning-concerns:
|
||||
|
||||
Partitioning concerns
|
||||
---------------------
|
||||
|
||||
Upon connection exception/worker process partitioning, affected entity
|
||||
should retry connection establishing before announcing failure. The
|
||||
retry count and timeout should be configurable for each of the ironic,
|
||||
database, distributed store, lock and queue services. The timeout
|
||||
should be interruptible, possibly implemented as waiting for
|
||||
appropriate termination/``SIGTERM`` flag,
|
||||
e.g. ``sigterm_flag.wait(timeout)``. Should the retrying fail,
|
||||
affected entity breaks the worker inspector service altogether,
|
||||
setting the flag, to avoid damage to resources --- most of the time,
|
||||
other worker service entities would be equally affected by the
|
||||
partition anyway. User may consider restarting affected worker
|
||||
service process instance when the partitioning issue is resolved.
|
||||
|
||||
Partitioning of HTTP API service instances isn't a concern as those
|
||||
are stateless and accessed through a load balancer.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
HA Worker Decomposition
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
We've briefly examined the `TaskFlow
|
||||
<https://wiki.openstack.org/wiki/taskflow>`_ library as alternate
|
||||
tasking mechanism. Currently, TaskFlow does support only `directed
|
||||
acyclic graphs as dependency structure
|
||||
<https://bugs.launchpad.net/taskflow/+bug/1527690>`_ between
|
||||
particular steps. Inspector service has to however support restarting
|
||||
of the introspection for a particular node, bringing loops into the
|
||||
graph; see `transition-function`_. Moreover TaskFlow does not
|
||||
`support external event propagating
|
||||
<https://bugs.launchpad.net/taskflow/+bug/1527678>`_ to a running
|
||||
flow, such as the ``continue`` call from the bare metal node. Because
|
||||
of that, the overall state of the introspection of particular node has
|
||||
to be maintained explicitly if TaskFlow is adopted. TaskFlow, too,
|
||||
requires tasks to be reentrant/idempotent.
|
||||
|
||||
HA Firewall decomposition
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The firewall facility can be replaced by Neutron once it adopts
|
||||
`enhancements to subnet DHCP options
|
||||
<https://review.openstack.org/#/c/247027/>`_ and `allows serving DHCP
|
||||
to unknown hosts <https://review.openstack.org/#/c/255240/>`_. We're
|
||||
keeping Inspector's firewall facility for users that are interested in
|
||||
stand-alone deployments.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
Queue
|
||||
^^^^^
|
||||
|
||||
State transition request item is introduced, it should contain these
|
||||
attributes (as an oslo.versioned) object:
|
||||
|
||||
* node ID
|
||||
* transition event
|
||||
|
||||
A clean-up request item is introduced removing a node. Attributes
|
||||
comprising the request:
|
||||
|
||||
* node ID
|
||||
|
||||
Pub--sub channels
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Two channels are introduced: firewall white-list and black-list. The
|
||||
message format is as follows:
|
||||
|
||||
* add/remove
|
||||
* port ID, MAC address
|
||||
|
||||
Store
|
||||
^^^^^
|
||||
|
||||
Node state column is introduced to the node table.
|
||||
|
||||
HTTP API impact
|
||||
---------------
|
||||
|
||||
API service is provided by dedicated processes.
|
||||
|
||||
Client (CLI) impact
|
||||
-------------------
|
||||
|
||||
None planned.
|
||||
|
||||
Performance and scalability impact
|
||||
----------------------------------
|
||||
|
||||
We hope this change brings in desired redundancy and scaling for the
|
||||
inspector service. We however expect the change to have a negative
|
||||
network utilization impact as the introspection task requires a queue
|
||||
and a DLM to coordinate.
|
||||
|
||||
The inspector firewall facility requires periodic polling of the
|
||||
ironic service inventory in each inspector instance. Therefore we
|
||||
expect increased load on the ironic service.
|
||||
|
||||
Firewall facility leader partitioning causes boot service outage for
|
||||
the election period. Some nodes may therefore timeout booting.
|
||||
|
||||
Each time the firewall leader updates the hosts firewall node
|
||||
information is polled from ironic service. This may introduce delays
|
||||
in firewall availability. If a node being introspected is removed
|
||||
from the ironic service, the change will not propagate to Inspector
|
||||
until the introspection finishes.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
New services introduced that might require hardening and protection:
|
||||
|
||||
* load balancer
|
||||
* distributed locking facility
|
||||
* queue
|
||||
* pub--sub channels
|
||||
|
||||
Deployer impact
|
||||
---------------
|
||||
|
||||
Inspector Service Configuration
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* distributed locking facility, queue, firewall pub--sub channels and
|
||||
load balancer introduce new configuration options, especially
|
||||
URLs/hosts and credentials
|
||||
* worker pool size, integral, ``0<size;
|
||||
size.default==processor.count``
|
||||
* worker ``queue.get(timeout); 0.0s<timeout; timeout.default==3.0s``
|
||||
* clean-up period ``0.0s<period; period.default==30s``
|
||||
* clean-up introspection report expiration threshold ``0.0s<threshold;
|
||||
threshold.default==86400.0s``
|
||||
* clean-up introspection time-out threshold ``0.0s<threshold<=900.0s``
|
||||
* ironic firewall black-list synchronization polling period
|
||||
``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to disable
|
||||
* firewall white-list store watcher polling period
|
||||
``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to
|
||||
disable
|
||||
* bare metal reboot throttle, ``0.0s<=value; value.default==0.0s``
|
||||
disabling this feature altogether
|
||||
* for each of the ironic service, database, distributed locking
|
||||
facility and the queue, a connection retry count and connection
|
||||
retry timeout should be configured
|
||||
* all inspector hosts should share same configuration, save for the
|
||||
update situation
|
||||
|
||||
New services and minimal Topology
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* floating IP address shared by load balancers
|
||||
* load balancers, wired for redundancy
|
||||
* WSGI HTTP API instances (httpd), addressed by load balancers in a
|
||||
round-robin fashion
|
||||
* 3 inspector hosts each running a worker process instance, dnsmasq
|
||||
instance and iptables
|
||||
* distributed synchronization facility hosts, wired for redundancy,
|
||||
accessed by all inspector workers
|
||||
* queue hosts, wired for redundancy, accessed by all API instances and
|
||||
workers
|
||||
* database cluster, wired for redundancy, accessed by all API
|
||||
instances and workers
|
||||
* NTP set up and configured for all the services
|
||||
|
||||
Please note, all inspector hosts require access to the PXE LAN for
|
||||
bare metal nodes to boot.
|
||||
|
||||
Serviceability considerations
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Considering service update, we suggest following procedure to be
|
||||
adopted for each inspector host, one at a time:
|
||||
|
||||
HTTP API services:
|
||||
|
||||
* remove selected host from the load balancer service
|
||||
* stop the HTTP API service on the host
|
||||
* upgrade the service and configuration files
|
||||
* start the HTTP API service on the host
|
||||
* enroll the host to the load balancer service
|
||||
|
||||
Worker services:
|
||||
|
||||
* for each worker host:
|
||||
* stop the worker service instance on the host
|
||||
* update the worker service and configuration files
|
||||
* start the worker service
|
||||
|
||||
Shutting down the inspector worker service may hang for some time due
|
||||
to worker threads executing a long synchronous procedure or waiting in
|
||||
the ``queue.get(timeout)`` method while polling for new task.
|
||||
|
||||
This approach may lead to introspection (task) failures for nodes that
|
||||
are being handled on inspector host under update. Especially changes
|
||||
of the transition function (new states etc) may induce introspection
|
||||
errors. Ideally, the update should therefore happen with no ongoing
|
||||
introspections. Failed node introspections may be restarted.
|
||||
|
||||
A couple of periodic task "instances" may be lost due to the updated
|
||||
leader partitioning each time a host is updated. HA firewall may be
|
||||
lost for the leader election period each time a host is updated,
|
||||
expected delay should be less than 10 seconds so that booting of
|
||||
inspected nodes isn't affected.
|
||||
|
||||
Upgrade from non-HA Inspector Service
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Because the non-HA inspector service is a single-process entity and
|
||||
because the HA services aren't internally backwards compatible with it
|
||||
(to allow taking-over running node inspections), to perform an
|
||||
upgrade, the non-HA service has to be stopped first while no
|
||||
inspections are ongoing. Data migration is necessary before the
|
||||
upgrade. As the new services require the queue and the DLM for their
|
||||
operation those have to be introduced before the upgrade. The worker
|
||||
services have to be started before HTTP API services. Having started,
|
||||
the HTTP API services have to be introduced to the load balancer.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None planned.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
We consider following implementations for the facilities we rely on:
|
||||
|
||||
* load balancer: HAProxy
|
||||
* queue: Oslo messaging
|
||||
* pub--sub firewall channels: Oslo messaging
|
||||
* store: a database service
|
||||
* distributed synchronization facility: Tooz
|
||||
* HTTP API service: WSGI and httpd
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
* `vetrisko <https://launchpad.net/~vetrisko>`_; primary
|
||||
* `divius <https://launchpad.net/~divius>`_
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* replace current locking with Tooz DLM
|
||||
* introduce state machine
|
||||
* split API service and introduce conductors and queue
|
||||
* split cleaning into a separate timeout and synchronization handlers
|
||||
and introduce leader-election to these periodic procedures
|
||||
* introduce leader-election to the firewall facility
|
||||
* introduce the pub--sub channels to the firewall facility
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
We require proper inspector `grenade testing
|
||||
<https://wiki.openstack.org/wiki/Grenade>`_ before landing HA so we
|
||||
avoid breaking users as much as possible.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
All work items should be tested as separate patches both with
|
||||
functional and unit tests as well as upgrade tests with Grenade.
|
||||
|
||||
Having landed all the required work items it should be possible to
|
||||
test Inspector with focus on redundancy and scaling.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
During the analysis process we considered these blueprints:
|
||||
|
||||
* `Abort introspection
|
||||
<https://blueprints.launchpad.net/ironic-inspector/+spec/abort-introspection>`_
|
||||
* `Node States
|
||||
<https://blueprints.launchpad.net/ironic-inspector/+spec/node-states>`_
|
||||
* `Node Locking <https://review.openstack.org/#/c/244750/5>`_
|
||||
* `Oslo.messaging at-least-once semantics
|
||||
<https://review.openstack.org/#/c/256342/>`_
|
||||
|
||||
RFEs:
|
||||
|
||||
* `TaskFlow: flow suspend&continue
|
||||
<https://bugs.launchpad.net/taskflow/+bug/1527678>`_
|
||||
* `TaskFlow: non-DAG flow patterns
|
||||
<https://bugs.launchpad.net/taskflow/+bug/1527690>`_
|
||||
* `HA for Ironic Inspector
|
||||
<https://bugs.launchpad.net/ironic-inspector/+bug/1525218>`_
|
||||
* `Safe queue for Tooz
|
||||
<https://bugs.launchpad.net/python-tooz/+bug/1528490>`_
|
||||
* `Watchable store for Tooz
|
||||
<https://bugs.launchpad.net/python-tooz/+bug/1528495>`_
|
||||
* `Enhanced Network/Subnet DHCP Options
|
||||
<https://review.openstack.org/#/c/247027/>`_
|
||||
* `Neutron DHCP serve unknown hosts
|
||||
<https://review.openstack.org/#/c/255240/>`_
|
||||
|
||||
Community sources:
|
||||
|
||||
* `DLM options discussion
|
||||
<https://etherpad.openstack.org/p/mitaka-cross-project-dlm>`_
|
||||
* `TaskFlow with external events and Non-DAG flows
|
||||
<http://lists.openstack.org/pipermail/openstack-dev/2015-November/080622.html>`_
|
||||
* Joshua Harlow's comment that `Tooz should implement the
|
||||
at-least-once semantics not Oslo.messaging
|
||||
<https://review.openstack.org/#/c/256342/7/specs/mitaka/at-least-once-guarantee.rst@305>`_
|
||||
|
||||
RFCs:
|
||||
|
||||
* `DHCP Failover Protocol: IP address allocation between servers <https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_
|
||||
|
||||
Tools:
|
||||
|
||||
* `IP Sets <http://ipset.netfilter.org/changelog.html>`_
|
||||
* `Dnsmasq <http://www.thekelleys.org.uk/dnsmasq/doc.html>`_
|
Loading…
Reference in New Issue