Zuul v3: use Zookeeper for Nodepool-Zuul protocol

This updates the Nodepool-Zuul protocol to use Zookeeper after the
fashion of the spec to use Zookeeper for Nodepool image builds.

Despite this having more words than the section it replaces, this
is ultimately a much more robust protocol (particularly around
race conditions and edge cases related to Gearman disconnections)
and should be simpler to implement in Nodepool.  Notably, this
removes the last central component of Nodepool allowing it to be
a truly decentralised (and much more fault-tolerant) app.

Change-Id: I354f3eb8de3a8218cea04f5b19bb4b05f4e0cb60
releases. That is to say, it should act more like its name -- a node
pool. It should support the existing model of single-use nodes as
well as long-term nodes that need mediated access.
Nodepool should use ZooKeeper to fulfill node requests from Zuul. A
request should be made using the Zookeeper `priority queue`_ construct
at the path::

  /nodepool/requests/500-123

  node_types: [list of node types]
  requestor: descriptive string of requestor (eg zuul)
  created_time: <unix timestamp>
  state: requested | pending | fulfilled | failed
  state_time: <unix timestamp>
  nodes: [list of node ids]
  declined_by: [list of launchers declining this request]

The name of the request node, "500-123", is composed of the priority
("500") followed by the sequence number ("123"). After creating the
request node, Zuul should read the request node back and set a watch
on it. If the read associated with the watch set indicates that the
request has already been fulfilled, it should proceed to use the
nodes; otherwise, it should wait to be notified by the watch. Note
that special care will need to be taken to re-set watches if the
connection to ZooKeeper is reset. The pattern of reading to test
whether the request is fulfilled, and setting a watch if not, can be
repeated as many times as necessary until the request is fulfilled.
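
A minimal sketch of this exchange from the Zuul side, using the kazoo
Python client; the ZooKeeper address, the JSON encoding of the request
data, and the priority value are illustrative assumptions rather than
part of this spec::

  import json
  import threading
  import time

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  request = {
      'node_types': ['ubuntu-trusty', 'ubuntu-trusty'],
      'requestor': 'zuul',
      'created_time': time.time(),
      'state': 'requested',
      'state_time': time.time(),
      'nodes': [],
      'declined_by': [],
  }

  # Create the request under the priority queue; "500-" is the priority
  # prefix and ZooKeeper appends the sequence number to it.
  path = zk.create('/nodepool/requests/500-',
                   json.dumps(request).encode('utf8'),
                   makepath=True, sequence=True, ephemeral=True)

  fulfilled = threading.Event()

  def watch(event):
      # On every notification, re-read the request; if it is not yet in
      # a final state, this read also re-sets the watch.
      data, _stat = zk.get(path, watch=watch)
      if json.loads(data.decode('utf8'))['state'] in ('fulfilled', 'failed'):
          fulfilled.set()

  # Initial read-and-watch; the same pattern is repeated after any
  # connection reset.
  data, _stat = zk.get(path, watch=watch)
  if json.loads(data.decode('utf8'))['state'] in ('fulfilled', 'failed'):
      fulfilled.set()
  fulfilled.wait()
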
Requests for nodes are in a FIFO queue and will be satisfied in the
order received according to node availability. This should make
demand and allocation calculations much simpler.

A node type is simply a string such as 'trusty', which corresponds to
an entry in the nodepool config file.
The component of Nodepool which will process these requests is known
as a "launcher". A Nodepool system may consiste of multiple launchers
(for instance, one launcher for each cloud provider). Each launcher
will continuously scan the request queue (sorted by request id) and
attempt to process each request in sorted order. A single launcher
may be engaged in satisfying multiple requests simultaneously.
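
A sketch of that scan loop, again using kazoo; the `handle_request`
callable and the connection details are placeholders, not something
this spec defines::

  import json

  from kazoo.client import KazooClient
  from kazoo.exceptions import NoNodeError

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()
  zk.ensure_path('/nodepool/requests')

  def scan_requests(handle_request):
      # Children sort by priority then sequence number; with fixed-width
      # priorities such as "500" a lexicographic sort gives the
      # processing order.
      for request_id in sorted(zk.get_children('/nodepool/requests')):
          path = '/nodepool/requests/' + request_id
          try:
              data, _stat = zk.get(path)
          except NoNodeError:
              continue  # the request vanished between list and read
          request = json.loads(data.decode('utf8'))
          if request['state'] in ('fulfilled', 'failed'):
              continue
          handle_request(request_id, request)
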
When satisfying a request, Nodepool will first obtain a lock on the
request using the Zookeeper `lock construct`_ at the path::

  /nodepool/requests-lock/500-123

It will then attempt to satisfy the request from available nodes, and
failing that, cause new nodes to be created. When multiple nodes are
requested together, nodepool will return nodes within the same AZ of
the same provider.
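
The locking step might look like the following sketch, using kazoo's
Lock recipe with a non-blocking acquire so that a launcher simply
skips a request another launcher is already handling; the identifier
strings are made up for illustration::

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  def try_lock_request(request_id, launcher_id):
      # One lock path per request, parallel to the request node itself.
      lock = zk.Lock('/nodepool/requests-lock/' + request_id, launcher_id)
      # blocking=False: if another launcher holds the lock, move on to
      # the next request instead of waiting here.
      return lock if lock.acquire(blocking=False) else None

  lock = try_lock_request('500-0000000123', 'launcher01-1234-1')
  if lock:
      try:
          pass  # satisfy the request from ready nodes or boot new ones
      finally:
          lock.release()
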
A simple algorithm which does not require that any launcher know about
any other launchers is:

#. Obtain next request
#. If image not available, decline
#. If request > quota, decline
#. If request < quota and request > available nodes (due to current
   usage), begin satisfying the request and do not process further
   requests until satisfied
#. If request < quota and request < available nodes, satisfy the
   request and continue processing further requests
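
Expressed as code, the per-request decision might look like this
sketch; the quota and availability figures come from a hypothetical
provider snapshot, since how they are computed is provider-specific
and not defined here::

  from dataclasses import dataclass, field

  @dataclass
  class ProviderState:
      # Hypothetical snapshot of one provider; not defined by this spec.
      images: set = field(default_factory=set)
      quota: int = 0       # maximum nodes this provider may run
      available: int = 0   # nodes that could be launched right now

  def decide(node_types, provider):
      """Return 'decline', 'satisfy-and-pause', or 'satisfy'."""
      needed = len(node_types)
      if not all(t in provider.images for t in node_types):
          return 'decline'            # image not available here
      if needed > provider.quota:
          return 'decline'            # larger than our total quota
      if needed > provider.available:
          return 'satisfy-and-pause'  # within quota; wait for capacity
      return 'satisfy'                # satisfy and keep processing

  print(decide(['ubuntu-trusty'] * 3,
               ProviderState(images={'ubuntu-trusty'}, quota=10, available=2)))
  # -> satisfy-and-pause
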
Since Nodepool consists of multiple launchers, each of which is only
aware of its own configuration, there is no single component of the
system that can determine if a request is permanently unsatisfiable.
In order to avoid requests remaining in the queue indefinitely, each
launcher will register itself at the path::

  /nodepool/launchers/<hostname>-<pid>-<tid>
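
Registration could be as simple as creating an ephemeral znode, which
disappears automatically when the launcher's ZooKeeper session ends;
the identifier format below simply follows the path shown above::

  import os
  import socket
  import threading

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  launcher_id = '%s-%s-%s' % (socket.gethostname(), os.getpid(),
                              threading.get_ident())
  # Ephemeral: the entry vanishes if this launcher dies, so comparisons
  # against `declined_by` only ever consider launchers that are on-line.
  zk.create('/nodepool/launchers/' + launcher_id,
            ephemeral=True, makepath=True)
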
When a launcher is unable to satisfy a request, it will modify the
request node (while still holding the lock) and add its identifier to
the field `declined_by`. It should then check the contents of this
field and compare it to the current contents of `/nodepool/launchers`.
If all of the currently on-line launchers are represented in
`declined_by` the request should be marked `failed` in the `state`
field. The update of the request node will notify Zuul via the
previously set watch; however, Zuul will check the state and, if the
request is not failed or fulfilled, simply re-set the watch. The
launcher will then release the lock and, if the request is not yet
failed, other launchers will be able to attempt to process the
request. When processing the request queue, the launcher should avoid
obtaining the lock on any request it has already declined (though it
should always perform a check for whether the request should be marked
as failed in case the last launcher went off-line shortly after it
declined the request).
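
A sketch of that decline path, run while the request lock is held; the
JSON encoding and helper shape are the same assumptions as in the
earlier sketches::

  import json

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  def decline_request(request_path, launcher_id):
      # Caller holds the lock on this request.
      data, stat = zk.get(request_path)
      request = json.loads(data.decode('utf8'))

      if launcher_id not in request['declined_by']:
          request['declined_by'].append(launcher_id)

      # If every launcher currently on-line has declined, no one can
      # ever satisfy this request; mark it failed so Zuul's watch fires
      # with a final state.
      online = set(zk.get_children('/nodepool/launchers'))
      if online <= set(request['declined_by']):
          request['state'] = 'failed'

      zk.set(request_path, json.dumps(request).encode('utf8'),
             version=stat.version)
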
Requests should not be marked as failed for transient errors (if a
node destined for a request fails to boot, another node should take
its place). Only in the case where it is impossible for Nodepool to
satisfy a request should it be marked as failed. In that case, Zuul
may report job failure as a result.
If at any point Nodepool detects that the ephemeral request node has
been deleted, it should return any allocated nodes to the pool.
Each node should have a record in Zookeeper at the path::

  /nodepool/nodes/456

  type: ubuntu-trusty
  provider: rax
  region: ord
  az: None
  public_ipv4: <IPv4 address>
  private_ipv4: <IPv4 address>
  public_ipv6: <IPv6 address>
  allocated_to: <request id>
  state: building | testing | ready | in-use | used | hold | deleting
  created_time: <unix timestamp>
  updated_time: <unix timestamp>
  image_id: /nodepool/image/ubuntu-trusty/builds/123/provider/rax/images/456
  launcher: <hostname>-<pid>-<tid>

The node should start in the `building` state and if being created in
response to demand, set `allocated_to` to the id of the node request.
While building, Nodepool should hold a lock on the node at::

  /nodepool/nodes/456/lock
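
A sketch of the launcher side of node creation; letting ZooKeeper
assign the node id via a sequence node, the JSON layout, and the
example provider values are illustrative assumptions::

  import json
  import time

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  node = {
      'type': 'ubuntu-trusty',
      'provider': 'rax',
      'region': 'ord',
      'az': None,
      'allocated_to': '500-0000000123',  # the request this node is for
      'state': 'building',
      'created_time': time.time(),
      'updated_time': time.time(),
      'launcher': 'launcher01-1234-1',
  }

  # Let ZooKeeper assign the node id via a sequence node.
  node_path = zk.create('/nodepool/nodes/',
                        json.dumps(node).encode('utf8'),
                        makepath=True, sequence=True)

  # Hold the node lock for the whole build so other components can tell
  # "being built" apart from "abandoned by a dead launcher".
  lock = zk.Lock(node_path + '/lock', 'launcher01-1234-1')
  lock.acquire()
  try:
      pass  # boot the instance, fill in the IPs, then set state to 'ready'
  finally:
      lock.release()
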
Once complete, the metadata should be updated, the state set to
`ready`, and the lock released. Once all of the nodes in a request
are ready, Nodepool should update the state of the request to
`fulfilled` and release the lock. Zuul, which will have been notified
of the change by the watch it set, should then obtain the lock on each
node in the request and update its state to `in-use`. It should then
delete the request node.
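
The hand-off on the Zuul side might look like this sketch, reusing the
path layout and JSON encoding assumed in the earlier sketches::

  import json

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  def accept_nodes(request_path):
      data, _stat = zk.get(request_path)
      request = json.loads(data.decode('utf8'))
      locks = []
      for node_id in request['nodes']:
          node_path = '/nodepool/nodes/%s' % node_id
          # Lock the node and mark it in-use before discarding the request.
          lock = zk.Lock(node_path + '/lock', 'zuul')
          lock.acquire()
          locks.append(lock)
          node_data, stat = zk.get(node_path)
          node = json.loads(node_data.decode('utf8'))
          node['state'] = 'in-use'
          zk.set(node_path, json.dumps(node).encode('utf8'),
                 version=stat.version)
      # Deleting the request node tells Nodepool the nodes were accepted.
      zk.delete(request_path)
      return locks
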
When Zuul is finished with the nodes, it should set their states to
`used` and release their locks.
Nodepool will then decide whether the nodes should be returned to the
pool, rebuilt, or deleted according to the type of node and current
demand.
This model is much more efficient for multi-node tests, where we will
no longer have to have special multinode labels. Instead the
multinode configuration can be much more ad-hoc and vary per job.
If any Nodepool or Zuul component fails at any point in this process,
it should be possible to determine this and either recover or at least
avoid leaking nodes. Nodepool should periodically examine all of the
nodes and look for the following conditions:

* A node allocated to a request that does not exist where the node is
  in the `ready` state for more than a short period of time (e.g., 300
  seconds). This is a node that was either part of a fulfilled
  request and given to a requestor but the requestor has done nothing
  with it yet, or the request was canceled immediately after being
  fulfilled.

* A node in the `building` or `testing` states without a lock. This
  means the Nodepool launcher handling that node died; it should be
  deleted.

* A node in the `in-use` state without a lock. This means the Zuul
  launcher using the node died.
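
A sketch of that periodic scan; the 300-second grace period comes from
the example above, and the lock check assumes the kazoo Lock recipe,
which keeps an ephemeral contender znode under the node's `lock` path
while the lock is held::

  import json
  import time

  from kazoo.client import KazooClient
  from kazoo.exceptions import NoNodeError

  zk = KazooClient(hosts='zk.example.org:2181')  # assumed ZooKeeper address
  zk.start()

  GRACE = 300  # seconds a ready-but-unclaimed node may linger

  def is_locked(node_path):
      # The Lock recipe keeps an ephemeral contender znode while held.
      try:
          return bool(zk.get_children(node_path + '/lock'))
      except NoNodeError:
          return False

  def scan_for_leaks():
      leaked = []
      for node_id in zk.get_children('/nodepool/nodes'):
          node_path = '/nodepool/nodes/' + node_id
          data, _stat = zk.get(node_path)
          node = json.loads(data.decode('utf8'))
          alloc = node.get('allocated_to')
          request_gone = alloc and not zk.exists('/nodepool/requests/' + alloc)
          if (node['state'] == 'ready' and request_gone
                  and time.time() - node['updated_time'] > GRACE):
              leaked.append(node_id)  # handed off but never used
          elif (node['state'] in ('building', 'testing')
                  and not is_locked(node_path)):
              leaked.append(node_id)  # launcher died mid-build
          elif node['state'] == 'in-use' and not is_locked(node_path):
              leaked.append(node_id)  # the Zuul launcher using it died
      return leaked
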
This should allow the main work of nodepool to be performed by
multiple independent launchers, each of which is capable of processing
the request queue and modifying the pool state as represented in
Zookeeper.
The initial implementation will assume only one launcher is running
for each provider in order to avoid complexities involving quota
spanning across launchers, rate limits, and how to prevent request
starvation in the case of multiple launchers for the same provider
where one is handling a very large request. However, future work may
enable this with more coordination between launchers in zk.
Nodepool should also allow the specification of static inventory of
non-dynamic nodes. These may be nodes that are running on real
hardware, for instance.

.. _lock construct:
   http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks

.. _priority queue:
   https://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_priorityQueues

Zuul
----