Commit Graph

86 Commits

Simon Westphahl e41af7c312
Add job name back to node request data
With the circular dependency refactoring we also removed the job name
from the requestor data in the node request. However, this could
previously be used as part of the dynamic-tags in Nodepool, which might
be useful for billing and cost calculations.

Add back the job name so those use-cases start working again.

Change-Id: Ie3be39819bf84d05a7427cd0e859f485de90835d
2024-03-07 08:02:30 +01:00
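To illustrate the commit above, here is a minimal sketch of the kind of requestor data a node request might carry once the job name is restored; the field names and helper are assumptions for illustration, not Zuul's exact schema.

```python
import json


def build_requestor_data(tenant_name, project_name, job_name, build_set_uuid):
    """Assemble illustrative requestor metadata for a node request.

    The job name is included again so Nodepool dynamic-tags can expose it
    for billing/cost reporting (all key names here are assumed).
    """
    return {
        "zuul_system_id": "00000000-example-uuid",
        "tenant_name": tenant_name,
        "project_name": project_name,
        "job_name": job_name,
        "build_set_uuid": build_set_uuid,
    }


if __name__ == "__main__":
    data = build_requestor_data("example-tenant", "org/project",
                                "unit-tests", "abc123")
    print(json.dumps(data, indent=2))
```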
Zuul 4afe5cfab5 Merge "Fix nodepool stats calculation in Zuul" 2024-02-27 19:08:19 +00:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence; therefore, the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems to delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions, which was added to prepare for this
upgrade, has been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid.  This has been corrected, and
in doing so, has necessitated some changes around quiet dequeuing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
James E. Blair 7262ef7f6f Include job_uuid in NodeRequests
This is part of the circular dependency refactor.  It updates the
NodeRequest object to include the job_uuid in addition to the job_name
(which is temporarily kept for backwards compatibility).  When node
requests are completed, we now look up the job by uuid if supplied.

Change-Id: I57d4ab6c241b03f76f80346b5567600e1692947a
2023-12-20 10:44:04 -08:00
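A rough sketch of the lookup order described above: prefer the job UUID when the completed node request carries one, and fall back to the job name for requests written before the model change. The classes and attribute names are invented for illustration, not Zuul's real model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FrozenJob:
    uuid: str
    name: str


@dataclass
class BuildSet:
    jobs_by_uuid: Dict[str, FrozenJob] = field(default_factory=dict)
    jobs_by_name: Dict[str, FrozenJob] = field(default_factory=dict)


def find_job_for_request(buildset: BuildSet,
                         job_uuid: Optional[str],
                         job_name: str) -> Optional[FrozenJob]:
    # Newer requests carry the uuid; older ones only have the name.
    if job_uuid is not None:
        return buildset.jobs_by_uuid.get(job_uuid)
    return buildset.jobs_by_name.get(job_name)
```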
James E. Blair e5bfebc660 Fix nodepool stats calculation in Zuul
When emitting nodepool stats, Zuul incorrectly assumes the format
of the user_data dict on nodes.  It could be a different format on
nodes that it doesn't own.  It correctly checks this elsewhere,
but was missed in this one spot.

Change-Id: I399047b9ddac6af855392d5df23bfb34a1cfcc56
2023-11-20 06:22:07 -08:00
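The fix amounts to validating the shape of user_data before trusting it; a sketch under assumed field names (not Zuul's actual node schema):

```python
def resources_from_node(node, my_system_id):
    """Return the node's resource dict only if the node is ours and its
    user_data has the shape we expect; otherwise return None."""
    user_data = getattr(node, "user_data", None)
    if not isinstance(user_data, dict):
        return None                                  # some other ZK user's node
    if user_data.get("zuul_system") != my_system_id:
        return None                                  # not owned by this Zuul
    return getattr(node, "resources", None)
```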
Simon Westphahl a4337b1475
Force Nodepool re-election on connection suspended
When the Zookeeper connection is suspended we might miss some events
during the time until the client is able to reconnect. This can lead to
jobs waiting for node requests that are already fulfilled.

To fix this edge-case we force a re-election of the Nodepool event
watcher when the connection is suspended. This fixes the issue with lost
events as the event watcher will re-send nodes-provisioned events for
all ready requests when the election is won.

Change-Id: I69b39bb02481241d584253906922ae74b94060cf
2023-10-23 09:25:59 +02:00
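The general pattern looks roughly like the sketch below: a kazoo session listener notices the suspension, the current leader gives up leadership, and whoever wins the next election re-sends nodes-provisioned events for all ready requests. The election path and the re-send helper are assumptions, not Zuul's implementation.

```python
import threading

from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState


class NodepoolEventWatcher:
    """Toy watcher: re-run the election whenever the ZK session is suspended."""

    def __init__(self, hosts):
        self.lost_session = threading.Event()
        self.client = KazooClient(hosts=hosts)
        self.client.add_listener(self._session_listener)
        self.election = self.client.Election("/example/event-watcher")

    def _session_listener(self, state):
        # Runs in kazoo's connection thread; must not block.
        if state == KazooState.SUSPENDED:
            self.lost_session.set()

    def _as_leader(self):
        self._resend_ready_requests()   # hypothetical: emit nodes-provisioned
        self.lost_session.wait()        # hold leadership until a suspension
        self.lost_session.clear()       # returning re-runs the election

    def _resend_ready_requests(self):
        pass                            # placeholder for the real re-send

    def run_forever(self):
        self.client.start()
        while True:
            self.election.run(self._as_leader)
```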
Simon Westphahl d864d83ade
End node request span when result event is sent
The node request span needs to be ended whenever we add a result event
to the pipeline. Before, we only did that when iterating over the node
requests after we'd won the nodepool election.

Change-Id: I0276d5498b243522540657352a733d663ae71918
2022-10-07 15:29:49 +02:00
Simon Westphahl 937e25432f
Trace node request phase
Since we are mainly interested in the time taken until the request is
failed or fulfilled, we won't create a span for the full lifetime of the
node request.

Change-Id: Ia8d9aaaac3ab4a4791eace2024c1ecb1b9c7a6bd
2022-09-19 11:25:49 +02:00
Benjamin Schanzel eac322d252 Report gross/total tenant resource usage stats
Export a new statsd gauge with the total resources of a tenant.
Currently, we only export resources of in-use nodes. With this, we
additionally report the cumulative resources of all of a tenant's nodes
(i.e. ready, deleting, ...).

This also renames the existing in-use resource stat to distinguish those
clearly.

Change-Id: I76a8c1212c7e9b476782403d52e4e22c030d1371
2022-03-17 14:51:18 +01:00
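A minimal sketch of emitting both gauges with the statsd client; the metric names are illustrative rather than Zuul's exact statsd keys.

```python
import statsd

client = statsd.StatsClient("localhost", 8125)


def emit_tenant_resources(tenant, in_use, total):
    """in_use and total are dicts such as {"cores": 8, "ram": 16384}."""
    for resource, value in in_use.items():
        client.gauge(
            f"zuul.nodepool.resources.in_use.tenant.{tenant}.{resource}", value)
    for resource, value in total.items():
        client.gauge(
            f"zuul.nodepool.resources.total.tenant.{tenant}.{resource}", value)


emit_tenant_resources("example-tenant",
                      in_use={"cores": 8, "ram": 16384},
                      total={"cores": 24, "ram": 49152})
```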
Zuul 88ea050f68 Merge "Add pipeline timing metrics" 2022-02-23 19:29:07 +00:00
Zuul b7fd46e48c Merge "Don't submit empty node requests to Zookeeper" 2022-02-22 14:08:44 +00:00
Simon Westphahl 69c9ec33ae Annotate logs in Nodepool API where possible
Some methods in the Nodepool API did not use the annotated logger that
adds the zuul event id to the log lines.

Change-Id: Iff99b0be5791abb0cc3eac3546f36994b8c6fdfe
2022-02-21 11:28:11 +01:00
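The annotation pattern boils down to a logging.LoggerAdapter that prefixes each message with the event id, roughly as sketched here (not Zuul's actual helper):

```python
import logging


class EventLoggerAdapter(logging.LoggerAdapter):
    """Prefix every message with the Zuul event id (illustrative pattern)."""

    def process(self, msg, kwargs):
        return "[e: %s] %s" % (self.extra["event_id"], msg), kwargs


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("zuul.nodepool")
annotated = EventLoggerAdapter(log, {"event_id": "abc123"})
annotated.info("Submitting node request %s", "200-0000001")
```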
James E. Blair c522bfa460 Add pipeline timing metrics
This adds several metrics for different phases of processing an item
in a pipeline:

* How long we wait for a response from mergers
* How long it takes to get or compute a layout
* How long it takes to freeze jobs
* How long we wait for node requests to complete
* How long we wait for an executor to start running a job
  after the request

And finally, the total amount of time from the original event until
the first job starts.  We already report that at the tenant level;
this change duplicates it as a pipeline-specific metric.

Several of these would also make sense as job metrics, but since they
are mainly intended to diagnose Zuul system performance and not
individual jobs, that would be a waste of storage space due to the
extremely high cardinality.

Additionally, two other timing metrics are added: the cumulative time
spent reading and writing ZKObject data to ZK during pipeline
processing.  These can help determine whether more effort should be
spent optimizing ZK data transfer.

In preparing this change, I noticed that python statsd emits floating
point values for timing.  It's not clear whether this strictly matches
the statsd spec, but since it does emit values with that precision,
I have removed several int() casts in order to maintain the precision
through to the statsd client.

I also noticed a place where we were writing a monotonic timestamp
value in a JSON serialized string to ZK.  I do not believe this value
is currently being used, so there is no further error to correct;
however, we should not use time.monotonic() for values that are
serialized, since the reference clock will be different on different
systems.

Several new attributes are added to the QueueItem and Build classes,
but are done so in a way that is backwards compatible, so no model api
schema upgrade is needed.  The code sites where they are used protect
against the null values which will occur in a mixed-version cluster
(the components will just not emit these stats in those cases).

Change-Id: Iaacbef7fa2ed93bfc398a118c5e8cfbc0a67b846
2022-02-20 16:55:34 -08:00
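A sketch of the reporting pattern under assumed metric names: record a wall-clock start (not time.monotonic(), for anything that might be serialized and read on another host) and emit the elapsed time as a float so no precision is lost to int() casts.

```python
import time

import statsd

client = statsd.StatsClient("localhost", 8125)


def report_phase(metric_name, start_timestamp):
    """Emit the elapsed time for one pipeline phase in milliseconds."""
    elapsed_ms = (time.time() - start_timestamp) * 1000.0
    client.timing(metric_name, elapsed_ms)       # float: keep full precision


start = time.time()                              # wall clock, safe to serialize
# ... e.g. freeze jobs or wait for a node request here ...
report_phase("zuul.tenant.example.pipeline.check.node_request_time", start)
```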
Simon Westphahl e6e4588bb7 Don't submit empty node requests to Zookeeper
We are currently also submitting empty node requests to Zookeeper. As
far as I could see that was only necessary as an intermediate step
towards the scale-out scheduler. Since we now have all state in
Zookeeper, it seems we don't need this anymore.

Since we will no longer receive a nodes provisioned event for the empty
node request, we will set the empty nodeset immediately after requesting
the nodes if the request is fulfilled at this point.

By not submitting empty node requests to Zookeeper, we can also save one
run handler cycle for jobs that don't need any nodes.

Change-Id: I4f9cbc7555591bb8817e3596edf4b9af99efd998
2022-02-16 11:27:08 +01:00
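The short-circuit can be pictured like this; make_node_request and submitNodeRequest are hypothetical stand-ins, not the real API.

```python
def request_nodes(zk, job, nodeset):
    """Return True if the (empty) request is fulfilled immediately."""
    if not nodeset.nodes:                        # job needs no nodes
        return True                              # nothing written to ZooKeeper
    request = make_node_request(job, nodeset)    # hypothetical helper
    zk.submitNodeRequest(request)                # hypothetical ZK call
    return False                                 # fulfilled later via an event
```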
James E. Blair 9e1118615c Don't add node resources to nonexistent tenant
The periodic stats emitter may try to add up node resource usage
for nodes which belong to tenants the scheduler doesn't know about
yet because it's still starting up.  This causes the following
error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 314, in runStats
    self._runStats()
  File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 459, in _runStats
    self.nodepool.emitStatsTotals(self.abide)
  File "/usr/local/lib/python3.8/site-packages/zuul/nodepool.py", line 526, in emitStatsTotals
    self.addResources(resources_by_tenant[tenant_name],
  File "/usr/local/lib/python3.8/site-packages/zuul/nodepool.py", line 65, in addResources
    target[key] += value
TypeError: 'int' object is not subscriptable

First, we shouldn't initialize that dictionary to a defaultdict(int)
because it's actually a dict of dicts.  The (int) was left over from
a previous implementation.

Second, we should just ignore nodes which belong to tenants or projects
we don't know about yet.  If we're supposed to track them, then we will
do so later once configuration is complete.

Change-Id: I552943ef9135041704f9849ef241c68d6b758a8a
2021-09-29 15:07:42 -07:00
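A sketch of the corrected accounting: the per-tenant mapping is a plain dict of dicts rather than defaultdict(int), and nodes for unknown tenants are skipped. Names and data shapes are illustrative.

```python
def add_resources(target, source):
    """Accumulate one node's resource dict into a per-tenant total."""
    for key, value in source.items():
        target[key] = target.get(key, 0) + value


def sum_resources(nodes, known_tenants):
    by_tenant = {}                        # tenant -> {resource: amount}
    for node in nodes:
        tenant = node.get("tenant_name")
        if tenant not in known_tenants:
            continue                      # config not loaded yet; ignore
        add_resources(by_tenant.setdefault(tenant, {}),
                      node.get("resources", {}))
    return by_tenant
```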
Felix Edel 5997228e46 Remove job_name attribute from NodesProvisionedEvent
When moving the NodesProvisionedEvents to ZooKeeper, this was needed to
look up empty NodeRequests in the scheduler, as those requests weren't
stored in ZooKeeper and didn't provide any ID.

Some of the newer changes which are related to how node requests are
handled in Zuul made this attribute obsolete.

Change-Id: I382473bd10150bd47237cecc05ebcd345cf98ba8
2021-09-21 15:47:53 +02:00
James E. Blair cebebf0f42 Fix test race with node allocation
In order to determine if the system is settled, the test framework
checks whether all node requests have been fulfilled (and, implicitly,
that Zuul has seen and processed that fulfillment).  That sequence
is now:

1. Request state in ZK is set to fulfilled by nodepool
2. Scheduler receives watch that request is fulfilled
3. Internal cache is updated to reflect new state
4. NodesProvisionedEvent is added to ZK result event queue

There is a window between 3 and 4 where, to an external observer, the
system looks the same as after 4.  We used to have an internal list of
pending requests in the scheduler which we would not clear until after
4, however that has been removed since it doesn't make sense with
multiple schedulers.

To resolve this, we wrap steps 3 and 4 in a lock so they act as a
critical section, and then in the test framework, we grab that same lock
and check both the internal cache (3) and the result event queue (4) to
determine if we're between 3 and 4 (not settled) or after 4 (settled).

Change-Id: Ib6d0ad826cc4d3fad9d1b59434971f48bab7d23a
2021-09-11 11:02:05 -07:00
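Conceptually, the critical section looks like this sketch (an invented class, not the test framework's real code): steps 3 and 4 happen under one lock, and the settled check takes the same lock, so the in-between window is never observable.

```python
import threading


class RequestTracker:
    """Toy model: fulfillment cache and result-event queue share one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.fulfilled = set()
        self.result_events = []       # events stay queued in this toy model

    def on_request_fulfilled(self, request_id):
        with self.lock:
            self.fulfilled.add(request_id)                              # step 3
            self.result_events.append(("nodes-provisioned", request_id))  # step 4

    def is_settled(self, request_id):
        with self.lock:
            # Either we haven't processed the fulfillment yet, or the
            # result event is already queued; the window between steps 3
            # and 4 is invisible to callers of this method.
            return (request_id not in self.fulfilled or
                    ("nodes-provisioned", request_id) in self.result_events)
```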
Felix Edel 38776452bb Don't use the AnsibleJob in the nodepool client
This change follows up on a few TODOs left by the lock/unlock nodes on
executor change.

When locking the nodes on the executor we used the AnsibleJob as a
replacement for the old build parameter that was provided to the
nodepool client methods as they were originally called by the scheduler.

However, the AnsibleJob class should only be used internally by the
executor, so we now provide all parameters directly to the nodepool
methods.

This also annotates the logger in the updated nodepool client methods
and fixes an outdated method signature in
test_scheduler.TestSemaphore.test_semaphore_zk_error.

Remove two comments about storing timestamps on the build request in
ZooKeeper as this doesn't make much sense. It sounded like a good idea
in the beginning, but with the current solution, the scheduler doesn't
need to care about the build request anymore after it was submitted
(except for canceling/cleanup purposes) and the result data is
self-contained.

Change-Id: I2d1005f69904c6ace8f79523133f382af0024c52
2021-09-10 10:55:01 -07:00
James E. Blair 65cac91e6c Add ZK session-aware elections
This creates a session-aware election class which will set a flag
that indicates it has lost the underlying lock.  We can check this
flag when iterating to make sure that we don't continue to attempt
to operate when we have lost the lock underlying an election.

Some drivers had connection lost handling for the EventReceiverElection
at the driver level.  Those are updated to use the handling at the
election level for consistency as well as brevity.

Change-Id: I776f88d015acdfbf1487a85d8473cd174917e90f
2021-09-10 10:55:00 -07:00
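A simplified sketch of the idea, treating both SUSPENDED and LOST as invalidating for safety (an assumption of this sketch); loops can poll is_still_valid() while they hold the election.

```python
from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState


class SessionAwareElection:
    """Election wrapper that marks itself invalid on session trouble."""

    def __init__(self, client: KazooClient, path: str):
        self._lost = False
        self._election = client.Election(path)
        client.add_listener(self._listener)

    def _listener(self, state):
        # Runs in kazoo's connection thread; just record the problem.
        if state in (KazooState.SUSPENDED, KazooState.LOST):
            self._lost = True

    def run(self, func, *args, **kwargs):
        self._lost = False
        return self._election.run(func, *args, **kwargs)

    def is_still_valid(self):
        return not self._lost
```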
James E. Blair aee6ef6f7f Report nodepool resource stats gauges in scheduler
We currently report nodepool resource usage whenever we use or return
nodes.  This now happens on the executors, and they don't have a
global view of all nodes used.  The schedulers do, and they already
have a periodic stats reporting method.

Shift the reporting of node resource gauges to the scheduler.  To make
this efficient, use a tree cache for nodes.  Because node records
alone don't have enough information to tie them back to a tenant or
project, use the new user_data field on the Node object to store that
info when we mark a node in use.  Also, store the zuul system id on
the node, so that we can ensure we're only reporting nodes that belong
to us.

Update the node list in the REST API to use the cache as well, and
also filter its results by zuul system id and tenant.

Depends-On: https://review.opendev.org/807362
Change-Id: I9d0987b250b8fb54b3b937c86db327d255e54abd
2021-09-10 10:54:59 -07:00
James E. Blair b41f467340 Remove internal nodepool request cache
The internal zuul.nodepool.Nodepool.requests dictionary is used so
the scheduler can keep track of its requests.  Since we will have
multiple schedulers emitting requests, we can't use that any more.
Remove any remaining uses of it.

The NodeRequest uid was only used to index that dictionary (and
was used to persist a request across resubmission).  Since it isn't
needed any more, it is removed.

Change-Id: I7c82485d95979c6c9a246c3dc3954bae3c65ac13
2021-09-10 10:53:47 -07:00
James E. Blair 6dc1178fc3 Don't store node requests/nodesets on queue items
To prepare for queue items moving into ZooKeeper, stop storing the
NodeRequest and NodeSet objects on them.  Instead, reference requests
by ID and consult ZK when necessary, and store only the info about
nodesets that the scheduler needs.  The result is simple dicts
that can easily be serialized.

The deleteNodeRequest method is updated to accept IDs instead of
NodeRequest objects to minimize the number of times we need to
use a full NodeRequest object.

Change-Id: I3587a42eb5a151f41369385e482b7f36b1c41bf6
2021-09-10 08:51:20 -07:00
James E. Blair 514f62ea31 Refactor the checkNodeRequest method
We perform some checks which aren't necessary any more.  This
method is better thought of as a method of getting a nodeset from
a fulfilled node request, so update it accordingly.

Change-Id: I1113820115af68b706b6fe06d6d03cd35ae6b382
2021-09-10 08:46:42 -07:00
James E. Blair dbab353ca3 Remove unnecessary node request cancellation code
The only use of request.canceled that matters at this point is
in emitting stats.  Otherwise, since the canceled flag isn't stored
in ZK (the request is just deleted instead!) there isn't a point
to using it.  Remove those tests.

Change-Id: I82d17f2832ae8fe14cf365b302a454caec5bef3c
2021-09-10 08:46:04 -07:00
James E. Blair 678bc4846c Remove unneeded scheduler.zk_nodepool object
The scheduler has a Nodepool object, and the Nodepool object has
a ZooKeeperNodepool object.  Separately, the scheduler also has a
standalone ZooKeeperNodepool object.  Rather than having a second
zk_nodepool object, just reach into the Nodepool object and use its
zk_nodepool object directly.

This is more important now that ZooKeeperNodepool maintains a
node request cache (and will also maintain a node cache in a future
change).  This means that the scheduler was keeping two in-memory
caches, which is extra work being performed.

Because one of the zk_nodepool objects was being used to generate
nodes provisioned events, and the other was being used to process
them, if their caches weren't in sync, the scheduler could end up
marking node requests as failed when they actually succeeded.

The dual cache issue is why we saw this issue in tests, but the
same issue would be present with multiple schedulers too, so we
also update the getNodeRequest method to make the cache optional.
We bypass the cache where we must be certain we have the most
up-to-date info.

Change-Id: I89242a01f656abce143bfb991670d452deae8b72
2021-09-10 08:05:07 -07:00
James E. Blair bb94937ea3 Wrap nodepool request completed events with election
So that only one scheduler puts nodepool request completed events
in the queue, wrap that with an election.  There is a dedicated
thread to try to win the election, and if it does, it emits
complete events for every completed request (in case we missed
some during the handover).  Other than that, the process stays
the same.

If we encounter a problem putting the event on the ZK queue, we
tell the election thread to re-run the election.

Change-Id: I3dadf5524dc3d931415e20267d36030e945a3000
2021-09-06 15:27:16 -07:00
Felix Edel 4e2985638c Add node request cache to zk nodepool interface
This adds a TreeCache to the ZK nodepool interface; it's nearly
identical to the one on the nodepool side.

Co-Authored-By: James E. Blair <jim@acmegating.com>
Change-Id: Ie972c397cf235d637619d1e40c5e7ff78431ac0d
2021-09-06 15:26:39 -07:00
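A toy version of a node-request cache kept current with kazoo watches is sketched below; the real cache is considerably more careful about races, reconnections, and deserialization.

```python
from kazoo.client import KazooClient


class NodeRequestCache:
    """Keep an in-memory view of node request znodes using kazoo watches."""

    def __init__(self, client: KazooClient, root="/nodepool/requests"):
        self.client = client
        self.root = root
        self.requests = {}          # request id -> raw znode data
        self._watched = set()
        client.ensure_path(root)
        client.ChildrenWatch(root, self._on_children)

    def _on_children(self, children):
        for name in children:
            if name not in self._watched:
                self._watched.add(name)
                self.client.DataWatch(f"{self.root}/{name}",
                                      self._make_data_cb(name))

    def _make_data_cb(self, name):
        def callback(data, stat):
            if data is None:                 # znode deleted
                self.requests.pop(name, None)
                self._watched.discard(name)
                return False                 # deregister this watch
            self.requests[name] = data
        return callback
```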
James E. Blair e225a28fa5 Make node requests persistent
The original Nodepool protocol specified that node requests should
be ephemeral, that way if the requestor crashed before accepting
the nodes, the request would automatically be cleaned up and the
nodes returned.  This doesn't comport with multiple schedulers, as
we will soon expect schedulers to stop and start routinely while
we want the node requests they spawn to persist and be handled by
other schedulers.

Fortunately, Nodepool doesn't really care if the request is
ephemeral or not.  So we'll drop the "ephemeral" flag.

But in the short term, we will be stopping the scheduler and that
will leave orphan node requests.  And even in the long term, we
may have a complete Zuul system shutdown or even a bug which may
leak node requests, so we still need a way of deleting node requests
which don't belong.  To handle that, we add a cleanup routine which
we run immediately on startup and every hour that looks for node
requests created by this Zuul system but that don't correspond to any
queue entries.  We create a new UUID to identify the Zuul system
and store it in ZK (so that if Nodepool has any other users we
don't delete their requests).

We no longer need to resubmit requests on connection loss, so tests
addressing that behavior are removed.

Change-Id: Ie22e99ef71cbe6b31d40c25a21498c1e867ca777
2021-09-03 16:17:15 -07:00
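The cleanup described above might look roughly like this; the ZK helpers and the requestor field are hypothetical stand-ins for whatever the real interface provides.

```python
def cleanup_orphan_requests(zk, system_id, active_request_ids):
    """Delete node requests we created that no queue item references."""
    for request_id in zk.getNodeRequestIds():        # hypothetical helper
        request = zk.getNodeRequest(request_id)      # hypothetical helper
        if request is None:
            continue                                 # already gone
        if request.get("requestor") != system_id:
            continue                                 # another system's request
        if request_id in active_request_ids:
            continue                                 # still referenced by a queue item
        zk.deleteNodeRequest(request_id)             # orphan: clean it up
```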
James E. Blair dbe13ce076 Remove nodeset from NodeRequest
To make things simpler for schedulers to handle node provisioned
events for node requests which they may not have in their local
pipeline state, we need to make the pipeline storage of node requests
simpler.  That starts by removing the nodeset object as an attribute
of the NodeRequest object.  This means that the scheduler can work
with a node request object without relying on having the associated
nodeset.  It also simplifies the ZooKeeper code that deserializes
NodeRequests (as it doesn't have to create fake NodeSet objects too).
And finally, it simplifies what must be stored in the pipeline
and queue item structures, which will also come in handy later.

Two tests designed to verify that the request->nodeset magic
deserialization worked have been removed since they are no longer
applicable.

Change-Id: I70ae083765d5cd9a4fd1afc2442bf22d6c52ba0b
2021-09-02 09:29:44 -07:00
Benjamin Schanzel e577ec90bd Add tenant name on NodeRequests for Nodepool
This change adds the tenant name of the current event's context to
NodeRequests and exposes it as a new field on ZooKeeper.  It prepares
for a tenant-aware Nodepool Launcher for it to enforce per-tenant
resource quota.  In addition, Zuul exposes a new statsd metric
``zuul.nodepool.tenant.<tenant>.current_requests`` that breaks down the
overall current_requests metric per tenant.

The corresponding spec can be found here:
https://review.opendev.org/c/zuul/zuul/+/788481

Change-Id: I6d47431e939aba2c80f30504b7a48c15f9fc8fb7
2021-09-02 09:26:34 -07:00
Simon Westphahl 919c5a3654 Fix wrong variable use when updating resource stats
The code path for updating the nodepool resource stats was still
assuming a full Project instance that we no longer have when requesting
a hold of a node set.

Change-Id: I03a11bc21ae519229fff05b6bff7b9dbb4ae9253
2021-08-20 07:46:03 -07:00
James E. Blair d87a9a8b8f Clear nodeset when re-submitting node requests
We encountered an issue where Zuul:

* submitted a node request
* nodepool fulfilled it
* zuul received the ZK watch and refreshed the NodeRequest object
* zuul submitted the node provisioned event to the event queue
* ZK was disconnected/reconnected
* zuul processed the node provisioned event
* zuul found the node request no longer existed (because it's ephemeral)
* zuul resubmitted the node request

Because the NodeRequest object had the provisioned node information
attached to it, the re-submitted request was created with an
existing 'nodes' list.  Nodepool appended to that list and fulfilled
the new request (which requested 1 but received 2 nodes).  This caused
an exception in Zuul's nodepool request watch callback, which caused
Zuul to ignore that and all future updates to the node request.

To address this, we make a new copy of the nodeset without any allocated
node info when re-submitting a request.

This contains an unrelated change to the event id handling from an earlier
revision; it is kept because it will simplify future changes which eliminate
the node request cache altogether.

Change-Id: I72f5ed7ad53e44d77b37870546daf61b8a4e7e09
2021-08-04 12:20:11 -07:00
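The fix boils down to resubmitting from a copy of the nodeset with all allocation details cleared; the attribute names below are illustrative, not Zuul's model.

```python
import copy


def fresh_nodeset_for_resubmit(nodeset):
    """Return a copy of the nodeset with all allocation details cleared."""
    new = copy.deepcopy(nodeset)
    for node in new.nodes:        # 'nodes', 'id' and 'state' are illustrative
        node.id = None            # forget the previously allocated node
        node.state = "requested"
    return new
```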
James E. Blair 4dabbd9502 Fix race when canceling a node request
When we cancel a node request, we delete the request from ZK.  We
might get the callback from ZK to update the node request object
(due to the delete event) in a separate thread while the first thread
is between the lines where we delete the request and set the internal
flag indicating it was canceled.

That would cause the update callback to think that the request was
externally deleted (not by us) and resubmit it.

To correct this, set the internal canceled flag before performing the
ZK delete.

Change-Id: I1b4771b5840cb168b01939bd8590534ef618d878
2021-07-15 14:00:30 -07:00
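The ordering fix in miniature (helper names are hypothetical): set the canceled flag before the ZK delete so the deletion watch callback can tell our own delete from an external one.

```python
def cancel_request(zk, request):
    request.canceled = True           # must happen before the delete
    zk.deleteNodeRequest(request.id)  # watch fires after this; sees the flag


def on_request_deleted(zk, request):
    if request.canceled:
        return                        # we deleted it ourselves; nothing to do
    resubmit_request(zk, request)     # hypothetical: handle external deletion
```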
Felix Edel 040c5c8032 Move parent provider determination to pipeline manager
Moving the parent provider determination into the pipeline manager
allows us to remove the buildset and job objects from the NodeRequest
constructor. This way we can fully serialize the NodeRequest to
ZooKeeper and restore it without missing important information.

This also has an impact on the NodeRequest's priority property. As this
is only needed to determine the znode path when the NodeRequest is
submitted, we can provide it directly as parameter to the
submitNodeRequest call (and the related update callbacks).

To ensure that NodePool doesn't strip that additional information when
it fulfills the NodeRequest, we use the new "requestor_data" field which
is implemented in [1].

To make this work, we also have to look up the buildset by its UUID from
the active tenants and pipelines when the NodesProvisioned event is
handled in the scheduler. Something similar was already done for
handling the other result events as well.

[1]: https://review.opendev.org/c/zuul/nodepool/+/798746/

Depends-On: https://review.opendev.org/c/zuul/nodepool/+/798746/
Change-Id: Id794643dcf26b0565499d20adba99d3b0518fdf1
2021-07-08 13:27:08 -07:00
Felix Edel fee46c25bc Lock/unlock nodes on executor server
Currently, the nodes are locked in the scheduler/pipeline manager before
the actual build is created in the executor client. When the nodes are
locked, the corresponding NodeRequest is also deleted.

With this change, the executor will lock the nodes directly before
starting the build and unlock them when the build is completed.

To keep the order of events intact, the nodepool.acceptNodes() method is
split up into two:
    1. nodepool.acceptNodeRequest() does most of the old acceptNodes()
       method except for locking the nodes and deleting the node
       request. It is called on the scheduler side when the
       NodesProvisionedEvent is handled (which is also where
       acceptNodes() was previously called).
    2. nodepool.acceptNodes() is now called on the executor side when
       the job is started. It locks the nodes and deletes the node
       request in ZooKeeper.

Finally, it's also necessary to move the autohold processing to the
executor, as this requires a lock on the node. To allow processing of
autoholds, the executor now also determines the build attempts and sets
the RETRY_LIMIT result if necessary.

Change-Id: I7392ce47e84dcfb8079c16e34e0ed2062ebf4136
2021-07-01 05:46:02 +00:00
Zuul bd1a669cc8 Merge "statsd: decrement resources gauge for held node" 2021-05-28 17:36:47 +00:00
Zuul 7e802df42d Merge "Remove use of item's layout in Nodepool API" 2021-05-12 07:21:33 +00:00
Clark Boylan f2982dc152 Check if statsd is set before using it
We don't require a statsd config, which means we must check that the
statsd objects are valid before using them to send data. Do this in two
places that were missed.

Change-Id: Ifda150d5305ea0cadf2865cdb691263e32476b94
2021-05-11 10:40:32 -07:00
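The guard is as simple as it sounds; a sketch with an assumed metric name:

```python
def emit_current_requests(statsd_client, tenant, count):
    # statsd is optional, so never assume the client exists.
    if not statsd_client:
        return
    statsd_client.gauge(
        f"zuul.nodepool.tenant.{tenant}.current_requests", count)
```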
Simon Westphahl 336e48d824 Remove use of item's layout in Nodepool API
The useNodeSet() method was still using an item's layout to get the
tenant name. Since the layout might be set to None during a re-enqueue
(see Id7cef4f1fa222b1491418ea2449687964fcfb361) we need to get the
tenant name via the pipeline instead.

Change-Id: I3835b5082681930b962cecf7fe6edcf2a211465a
2021-05-11 14:19:51 +02:00
Felix Edel ba7f81be2d Provide statsd client to Nodepool and make scheduler optional
To lock/unlock the nodes directly in the executor server, we have to
make the Nodepool API work without a scheduler instance.

To keep the stats emitting intact, we provide a statsd client directly
to the Nodepool instance.

This leaves only one place where the scheduler is used in the Nodepool
class, which is the onNodesProvisioned() callback.
This callback won't be necessary anymore when the nodes are locked on
the executor and thus this function call and the scheduler parameter
itself can be removed.

Change-Id: I3f3e4bfff08e244f68a9be7c6a4efcc194a23332
2021-04-30 12:12:28 +02:00
Tristan Cacqueray da60b252a7 statsd: decrement resources gauge for held node
This change fixes an issue where held nodes are never subtracted
from the resources gauge, resulting in an ever-increasing resource
usage metric.

Change-Id: Id87fcf95a8224492f0335dbb357977865f4fd45f
2021-04-26 14:56:09 +00:00
Jan Kubovy d518e56208 Prepare Zookeeper for scale-out scheduler
This change is a common root for other
Zookeeper-related changes regarding the
scale-out scheduler. Zookeeper becoming
a central component requires increasing
"maxClientCnxns".

Since the ZooKeeper class is expected to grow
significantly (ZooKeeper is becoming a central part
of Zuul), a split of the ZooKeeper class (zk.py) into
a zk module is done here to avoid the current god-class.

Also the zookeeper log is copied to the "zuul_output_dir".

Change-Id: I714c06052b5e17269a6964892ad53b48cf65db19
Story: 2007192
2021-02-15 14:44:18 +01:00
Tobias Henkel 4205740b67
Fix memleak on zk session loss
When the scheduler loses its zk session it resubmits all lost node
requests as new ones. However, it didn't stop the watch for the old one,
which remains registered in Kazoo. The watch contains the
NodeRequest object since it's bound to the callback. Thus by leaking
the Watch we also leak the NodeRequest, the attached BuildSet,
QueueItem and finally the Tenant and Layout as well.

This can be fixed by stopping the watch in this case.

Change-Id: I3b05ec92816ab5eb06ad40dfad85ddfebfbf2cc4
2020-09-11 09:23:21 +02:00
Tristan Cacqueray e85fb93d1d Store a list of held nodes per held build in hold request
Instead of storing a flat list of nodes per hold request, this
change updates the request nodes attribute to become a list of
dictionaries, each with the build uuid and the held node list.

Change-Id: I9e50e7ccadc58fb80d5e80d9f5aac70eb7501a36
2019-10-24 13:39:16 -04:00
David Shrewsbury 9f5743366d Auto-delete expired autohold requests
When a request is created with a node expiration, set a request
expiration for 24 hours after the nodes expire.

Change-Id: I0fbf59eb00d047e5b066d2f7347b77a48f8fb0e7
2019-09-18 10:09:08 -04:00
David Shrewsbury 2c1c9ae662 Record held node IDs with autohold request
These node IDs will be output with the 'zuul autohold-info' command.

Change-Id: I8f52d2b87b3bec6d3b8ecc2f69507049d905cad5
2019-09-16 10:48:41 -04:00
David Shrewsbury 716ac1f2e1 Store autohold requests in zookeeper
Storing autohold requests in ZooKeeper, rather than in-memory,
allows us to remember requests across restarts, and is a necessity
for future work to scale out the scheduler.

Future changes to build on this will allow us to store held node
information with the change for easy node identification, and to
delete any held nodes for a request using the zuul CLI.

A new 'zuul autohold-delete' command is added since hold requests
are no longer automatically deleted.

This makes the autohold API:
   zuul autohold: Create a new hold request
   zuul autohold-list: List current hold requests
   zuul autohold-delete: Delete a hold request

Change-Id: I6130175d1dc7d6c8ce8667f9b14ae9377737d280
2019-09-16 08:47:53 -04:00
Tobias Henkel 6931703536
Annotate logs around finished builds
We should annotate the logs around finished builds with event ids.

Change-Id: I44ba4219f6d602aeab1f0d5829dfcb107341cf6d
2019-05-30 19:21:31 +02:00
Tobias Henkel 6f3bcdd6b6
Annotate builds with event id
It's useful to be able to trace an event through the system including
the builds.

Change-Id: If852cbe8aecc4cf346dccc1b8fc34272c8ff483d
2019-05-30 19:18:00 +02:00
Tobias Henkel e90fe41bfe Report tenant and project specific resource usage stats
We currently lack the means to support resource accounting of tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes we can emit statsd statistics per tenant and per
project.

The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
  Gauge with the currently used resources by tenant

* zuul.nodepool.resources.project.{project}.{resource}.current
  Gauge with the currently used resources by project

* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
  Counter with the summed usage by tenant. e.g. cpu seconds

* zuul.nodepool.resources.project.{project}.{resource}.counter
  Counter with the summed usage by project. e.g. cpu seconds

Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
2019-05-29 04:10:08 +00:00
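A sketch of how such gauges and counters could be emitted with the statsd client, using the metric names listed above; treating the counter increment as resource-seconds and the sampling interval are assumptions of this sketch.

```python
import statsd

client = statsd.StatsClient("localhost", 8125)


def report_usage(tenant, project, resources, seconds_in_use):
    """resources is a dict such as {"cores": 8, "ram": 16384}."""
    for resource, amount in resources.items():
        client.gauge(
            f"zuul.nodepool.resources.tenant.{tenant}.{resource}.current",
            amount)
        client.gauge(
            f"zuul.nodepool.resources.project.{project}.{resource}.current",
            amount)
        # Accumulate usage over time, e.g. cpu seconds.
        client.incr(
            f"zuul.nodepool.resources.tenant.{tenant}.{resource}.counter",
            int(amount * seconds_in_use))
        client.incr(
            f"zuul.nodepool.resources.project.{project}.{resource}.counter",
            int(amount * seconds_in_use))


report_usage("example-tenant", "org/project",
             {"cores": 8, "ram": 16384}, seconds_in_use=60)
```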