Commit Graph

86 Commits

Simon Westphahl e41af7c312
Add job name back to node request data
With the circular dependency refactoring we also removed the job name
from the requestor data in the node request. However, this could
previously be used as part of the dynamic-tags in Nodepool, which might
be useful for billing and cost calculations.

Add back the job name so those use-cases start working again.

Change-Id: Ie3be39819bf84d05a7427cd0e859f485de90835d
2024-03-07 08:02:30 +01:00
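To illustrate the commit above, here is a minimal sketch of the kind of requestor data a node request might carry once the job name is restored; the field names and helper are assumptions for illustration, not Zuul's exact schema.

```python
import json


def build_requestor_data(tenant_name, project_name, job_name, build_set_uuid):
    """Assemble illustrative requestor metadata for a node request.

    The job name is included again so Nodepool dynamic-tags can expose it
    for billing/cost reporting (all key names here are assumed).
    """
    return {
        "zuul_system_id": "00000000-example-uuid",
        "tenant_name": tenant_name,
        "project_name": project_name,
        "job_name": job_name,
        "build_set_uuid": build_set_uuid,
    }


if __name__ == "__main__":
    data = build_requestor_data("example-tenant", "org/project",
                                "unit-tests", "abc123")
    print(json.dumps(data, indent=2))
```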
Zuul 4afe5cfab5 Merge "Fix nodepool stats calculation in Zuul" 2024-02-27 19:08:19 +00:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence; therefore, the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems to delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions, which was added to prepare for this
upgrade, has been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid.  This has been corrected, and
in doing so, has necessitated some changes around quiet dequeuing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
James E. Blair 7262ef7f6f Include job_uuid in NodeRequests
This is part of the circular dependency refactor.  It updates the
NodeRequest object to include the job_uuid in addition to the job_name
(which is temporarily kept for backwards compatibility).  When node
requests are completed, we now look up the job by uuid if supplied.

Change-Id: I57d4ab6c241b03f76f80346b5567600e1692947a
2023-12-20 10:44:04 -08:00
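A rough sketch of the lookup order described above: prefer the job UUID when the completed node request carries one, and fall back to the job name for requests written before the model change. The classes and attribute names are invented for illustration, not Zuul's real model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FrozenJob:
    uuid: str
    name: str


@dataclass
class BuildSet:
    jobs_by_uuid: Dict[str, FrozenJob] = field(default_factory=dict)
    jobs_by_name: Dict[str, FrozenJob] = field(default_factory=dict)


def find_job_for_request(buildset: BuildSet,
                         job_uuid: Optional[str],
                         job_name: str) -> Optional[FrozenJob]:
    # Newer requests carry the uuid; older ones only have the name.
    if job_uuid is not None:
        return buildset.jobs_by_uuid.get(job_uuid)
    return buildset.jobs_by_name.get(job_name)
```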
James E. Blair e5bfebc660 Fix nodepool stats calculation in Zuul
When emitting nodepool stats, Zuul incorrectly assumes the format
of the user_data dict on nodes.  It could be a different format on
nodes that it doesn't own.  It correctly checks this elsewhere,
but was missed in this one spot.

Change-Id: I399047b9ddac6af855392d5df23bfb34a1cfcc56
2023-11-20 06:22:07 -08:00
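The fix amounts to validating the shape of user_data before trusting it; a sketch under assumed field names (not Zuul's actual node schema):

```python
def resources_from_node(node, my_system_id):
    """Return the node's resource dict only if the node is ours and its
    user_data has the shape we expect; otherwise return None."""
    user_data = getattr(node, "user_data", None)
    if not isinstance(user_data, dict):
        return None                                  # some other ZK user's node
    if user_data.get("zuul_system") != my_system_id:
        return None                                  # not owned by this Zuul
    return getattr(node, "resources", None)
```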
Simon Westphahl a4337b1475
Force Nodepool re-election on connection suspended
When the Zookeeper connection is suspended we might miss some events
during the time until the client is able to reconnect. This can lead to
jobs waiting for node requests that are already fulfilled.

To fix this edge-case we force a re-election of the Nodepool event
watcher when the connection is suspended. This fixes the issue with lost
events as the event watcher will re-send nodes-provisioned events for
all ready requests when the election is won.

Change-Id: I69b39bb02481241d584253906922ae74b94060cf
2023-10-23 09:25:59 +02:00
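The general pattern looks roughly like the sketch below: a kazoo session listener notices the suspension, the current leader gives up leadership, and whoever wins the next election re-sends nodes-provisioned events for all ready requests. The election path and the re-send helper are assumptions, not Zuul's implementation.

```python
import threading

from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState


class NodepoolEventWatcher:
    """Toy watcher: re-run the election whenever the ZK session is suspended."""

    def __init__(self, hosts):
        self.lost_session = threading.Event()
        self.client = KazooClient(hosts=hosts)
        self.client.add_listener(self._session_listener)
        self.election = self.client.Election("/example/event-watcher")

    def _session_listener(self, state):
        # Runs in kazoo's connection thread; must not block.
        if state == KazooState.SUSPENDED:
            self.lost_session.set()

    def _as_leader(self):
        self._resend_ready_requests()   # hypothetical: emit nodes-provisioned
        self.lost_session.wait()        # hold leadership until a suspension
        self.lost_session.clear()       # returning re-runs the election

    def _resend_ready_requests(self):
        pass                            # placeholder for the real re-send

    def run_forever(self):
        self.client.start()
        while True:
            self.election.run(self._as_leader)
```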
Simon Westphahl d864d83ade
End node request span when result event is sent
The node request span needs to be ended whenever we add a result event
to the pipeline. Before, we only did that when iterating over the node
requests after we'd won the nodepool election.

Change-Id: I0276d5498b243522540657352a733d663ae71918
2022-10-07 15:29:49 +02:00
Simon Westphahl 937e25432f
Trace node request phase
Since we are mainly interested in the time taken until the request is
failed or fulfilled, we won't create a span for the full lifetime of the
node request.

Change-Id: Ia8d9aaaac3ab4a4791eace2024c1ecb1b9c7a6bd
2022-09-19 11:25:49 +02:00
Benjamin Schanzel eac322d252 Report gross/total tenant resource usage stats
Export a new statsd gauge with the total resources of a tenant.
Currently, we only export resources of in-use nodes. With this, we
additionally report the cumulative resources of all of a tenant's nodes
(i.e. ready, deleting, ...).

This also renames the existing in-use resource stat to distinguish those
clearly.

Change-Id: I76a8c1212c7e9b476782403d52e4e22c030d1371
2022-03-17 14:51:18 +01:00
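A minimal sketch of emitting both gauges with the statsd client; the metric names are illustrative rather than Zuul's exact statsd keys.

```python
import statsd

client = statsd.StatsClient("localhost", 8125)


def emit_tenant_resources(tenant, in_use, total):
    """in_use and total are dicts such as {"cores": 8, "ram": 16384}."""
    for resource, value in in_use.items():
        client.gauge(
            f"zuul.nodepool.resources.in_use.tenant.{tenant}.{resource}", value)
    for resource, value in total.items():
        client.gauge(
            f"zuul.nodepool.resources.total.tenant.{tenant}.{resource}", value)


emit_tenant_resources("example-tenant",
                      in_use={"cores": 8, "ram": 16384},
                      total={"cores": 24, "ram": 49152})
```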
Zuul 88ea050f68 Merge "Add pipeline timing metrics" 2022-02-23 19:29:07 +00:00
Zuul b7fd46e48c Merge "Don't submit empty node requests to Zookeeper" 2022-02-22 14:08:44 +00:00
Simon Westphahl 69c9ec33ae Annotate logs in Nodepool API where possible
Some methods in the Nodepool API did not use the annotated logger that
adds the zuul event id to the log lines.

Change-Id: Iff99b0be5791abb0cc3eac3546f36994b8c6fdfe
2022-02-21 11:28:11 +01:00
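The annotation pattern boils down to a logging.LoggerAdapter that prefixes each message with the event id, roughly as sketched here (not Zuul's actual helper):

```python
import logging


class EventLoggerAdapter(logging.LoggerAdapter):
    """Prefix every message with the Zuul event id (illustrative pattern)."""

    def process(self, msg, kwargs):
        return "[e: %s] %s" % (self.extra["event_id"], msg), kwargs


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("zuul.nodepool")
annotated = EventLoggerAdapter(log, {"event_id": "abc123"})
annotated.info("Submitting node request %s", "200-0000001")
```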
James E. Blair c522bfa460 Add pipeline timing metrics
This adds several metrics for different phases of processing an item
in a pipeline:

* How long we wait for a response from mergers
* How long it takes to get or compute a layout
* How long it takes to freeze jobs
* How long we wait for node requests to complete
* How long we wait for an executor to start running a job
  after the request

And finally, the total amount of time from the original event until
the first job starts.  We already report that at the tenant level;
this change duplicates it as a pipeline-specific metric.

Several of these would also make sense as job metrics, but since they
are mainly intended to diagnose Zuul system performance and not
individual jobs, that would be a waste of storage space due to the
extremely high cardinality.

Additionally, two other timing metrics are added: the cumulative time
spent reading and writing ZKObject data to ZK during pipeline
processing.  These can help determine whether more effort should be
spent optimizing ZK data transfer.

In preparing this change, I noticed that python statsd emits floating
point values for timing.  It's not clear whether this strictly matches
the statsd spec, but since it does emit values with that precision,
I have removed several int() casts in order to maintain the precision
through to the statsd client.

I also noticed a place where we were writing a monotonic timestamp
value in a JSON serialized string to ZK.  I do not believe this value
is currently being used, so there is no further error to correct;
however, we should not use time.monotonic() for values that are
serialized, since the reference clock will be different on different
systems.

Several new attributes are added to the QueueItem and Build classes,
but are done so in a way that is backwards compatible, so no model api
schema upgrade is needed.  The code sites where they are used protect
against the null values which will occur in a mixed-version cluster
(the components will just not emit these stats in those cases).

Change-Id: Iaacbef7fa2ed93bfc398a118c5e8cfbc0a67b846
2022-02-20 16:55:34 -08:00
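A sketch of the reporting pattern under assumed metric names: record a wall-clock start (not time.monotonic(), for anything that might be serialized and read on another host) and emit the elapsed time as a float so no precision is lost to int() casts.

```python
import time

import statsd

client = statsd.StatsClient("localhost", 8125)


def report_phase(metric_name, start_timestamp):
    """Emit the elapsed time for one pipeline phase in milliseconds."""
    elapsed_ms = (time.time() - start_timestamp) * 1000.0
    client.timing(metric_name, elapsed_ms)       # float: keep full precision


start = time.time()                              # wall clock, safe to serialize
# ... e.g. freeze jobs or wait for a node request here ...
report_phase("zuul.tenant.example.pipeline.check.node_request_time", start)
```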
Simon Westphahl e6e4588bb7 Don't submit empty node requests to Zookeeper
We are currently also submitting empty node requests to Zookeeper. As
far as I could see that was only necessary as an intermediate step
towards the scale-out scheduler. Since we now have all state in
Zookeeper, it seems we don't need this anymore.

Since we will no longer receive a nodes provisioned event for the empty
node request, we will set the empty nodeset immediately after requesting
the nodes if the request is fulfilled at this point.

By not submitting empty node requests to Zookeeper, we can also save one
run handler cycle for jobs that don't need any nodes.

Change-Id: I4f9cbc7555591bb8817e3596edf4b9af99efd998
2022-02-16 11:27:08 +01:00
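The short-circuit can be pictured like this; make_node_request and submitNodeRequest are hypothetical stand-ins, not the real API.

```python
def request_nodes(zk, job, nodeset):
    """Return True if the (empty) request is fulfilled immediately."""
    if not nodeset.nodes:                        # job needs no nodes
        return True                              # nothing written to ZooKeeper
    request = make_node_request(job, nodeset)    # hypothetical helper
    zk.submitNodeRequest(request)                # hypothetical ZK call
    return False                                 # fulfilled later via an event
```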
James E. Blair 9e1118615c Don't add node resources to nonexistent tenant
The periodic stats emitter may try to add up node resource usage
for nodes which belong to tenants the scheduler doesn't know about
yet because it's still starting up.  This causes the following
error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 314, in runStats
    self._runStats()
  File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 459, in _runStats
    self.nodepool.emitStatsTotals(self.abide)
  File "/usr/local/lib/python3.8/site-packages/zuul/nodepool.py", line 526, in emitStatsTotals
    self.addResources(resources_by_tenant[tenant_name],
  File "/usr/local/lib/python3.8/site-packages/zuul/nodepool.py", line 65, in addResources
    target[key] += value
TypeError: 'int' object is not subscriptable

First, we shouldn't initialize that dictionary to a defaultdict(int)
because it's actually a dict of dicts.  The (int) was left over from
a previous implementation.

Second, we should just ignore nodes which belong to tenants or projects
we don't know about yet.  If we're supposed to track them, then we will
do so later once configuration is complete.

Change-Id: I552943ef9135041704f9849ef241c68d6b758a8a
2021-09-29 15:07:42 -07:00
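A sketch of the corrected accounting: the per-tenant mapping is a plain dict of dicts rather than defaultdict(int), and nodes for unknown tenants are skipped. Names and data shapes are illustrative.

```python
def add_resources(target, source):
    """Accumulate one node's resource dict into a per-tenant total."""
    for key, value in source.items():
        target[key] = target.get(key, 0) + value


def sum_resources(nodes, known_tenants):
    by_tenant = {}                        # tenant -> {resource: amount}
    for node in nodes:
        tenant = node.get("tenant_name")
        if tenant not in known_tenants:
            continue                      # config not loaded yet; ignore
        add_resources(by_tenant.setdefault(tenant, {}),
                      node.get("resources", {}))
    return by_tenant
```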
Felix Edel 5997228e46 Remove job_name attribute from NodesProvisionedEvent
When moving the NodesProvisionedEvents to ZooKeeper, this was needed to
look up empty NodeRequests in the scheduler, as those requests weren't
stored in ZooKeeper and didn't provide any ID.

Some of the newer changes which are related to how node requests are
handled in Zuul made this attribute obsolete.

Change-Id: I382473bd10150bd47237cecc05ebcd345cf98ba8
2021-09-21 15:47:53 +02:00
James E. Blair cebebf0f42 Fix test race with node allocation
In order to determine if the system is settled, the test framework
checks whether all node requests have been fulfilled (and, implicitly,
that Zuul has seen and processed that fulfillment).  That sequence
is now:

1. Request state in ZK is set to fulfilled by nodepool
2. Scheduler receives watch that request is fulfilled
3. Internal cache is updated to reflect new state
4. NodesProvisionedEvent is added to ZK result event queue

There is a window between 3 and 4 where, to an external observer, the
system looks the same as after 4.  We used to have an internal list of
pending requests in the scheduler which we would not clear until after
4, however that has been removed since it doesn't make sense with
multiple schedulers.

To resolve this, we wrap steps 3 and 4 in a lock so they act as a
critical section, and then in the test framework, we grab that same lock
and check both the internal cache (3) and the result event queue (4) to
determine if we're between 3 and 4 (not settled) or after 4 (settled).

Change-Id: Ib6d0ad826cc4d3fad9d1b59434971f48bab7d23a
2021-09-11 11:02:05 -07:00
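Conceptually, the critical section looks like this sketch (an invented class, not the test framework's real code): steps 3 and 4 happen under one lock, and the settled check takes the same lock, so the in-between window is never observable.

```python
import threading


class RequestTracker:
    """Toy model: fulfillment cache and result-event queue share one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.fulfilled = set()
        self.result_events = []       # events stay queued in this toy model

    def on_request_fulfilled(self, request_id):
        with self.lock:
            self.fulfilled.add(request_id)                              # step 3
            self.result_events.append(("nodes-provisioned", request_id))  # step 4

    def is_settled(self, request_id):
        with self.lock:
            # Either we haven't processed the fulfillment yet, or the
            # result event is already queued; the window between steps 3
            # and 4 is invisible to callers of this method.
            return (request_id not in self.fulfilled or
                    ("nodes-provisioned", request_id) in self.result_events)
```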
Felix Edel 38776452bb Don't use the AnsibleJob in the nodepool client
This change follows up on a few TODOs left by the lock/unlock nodes on
executor change.

When locking the nodes on the executor we used the AnsibleJob as a
replacement for the old build parameter that was provided to the
nodepool client methods as they were originally called by the scheduler.

However, the AnsibleJob class should only be used internally by the
executor, so we now provide all parameters directly to the nodepool
methods.

This also annotates the logger in the updated nodepool client methods
and fixes an outdated method signature in
test_scheduler.TestSemaphore.test_semaphore_zk_error.

Remove two comments about storing timestamps on the build request in
ZooKeeper as this doesn't make much sense. It sounded like a good idea
in the beginning, but with the current solution, the scheduler doesn't
need to care about the build request anymore after it was submitted
(except for canceling/cleanup purposes) and the result data is
self-contained.

Change-Id: I2d1005f69904c6ace8f79523133f382af0024c52
2021-09-10 10:55:01 -07:00
James E. Blair 65cac91e6c Add ZK session-aware elections
This creates a session-aware election class which will set a flag
that indicates it has lost the underlying lock.  We can check this
flag when iterating to make sure that we don't continue to attempt
to operate when we have lost the lock underlying an election.

Some drivers had connection lost handling for the EventReceiverElection
at the driver level.  Those are updated to use the handling at the
election level for consistency as well as brevity.

Change-Id: I776f88d015acdfbf1487a85d8473cd174917e90f
2021-09-10 10:55:00 -07:00
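A simplified sketch of the idea, treating both SUSPENDED and LOST as invalidating for safety (an assumption of this sketch); loops can poll is_still_valid() while they hold the election.

```python
from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState


class SessionAwareElection:
    """Election wrapper that marks itself invalid on session trouble."""

    def __init__(self, client: KazooClient, path: str):
        self._lost = False
        self._election = client.Election(path)
        client.add_listener(self._listener)

    def _listener(self, state):
        # Runs in kazoo's connection thread; just record the problem.
        if state in (KazooState.SUSPENDED, KazooState.LOST):
            self._lost = True

    def run(self, func, *args, **kwargs):
        self._lost = False
        return self._election.run(func, *args, **kwargs)

    def is_still_valid(self):
        return not self._lost
```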
James E. Blair aee6ef6f7f Report nodepool resource stats gauges in scheduler
We currently report nodepool resource usage whenever we use or return
nodes.  This now happens on the executors, and they don't have a
global view of all nodes used.  The schedulers do, and they already
have a periodic stats reporting method.

Shift the reporting of node resource gauges to the scheduler.  To make
this efficient, use a tree cache for nodes.  Because node records
alone don't have enough information to tie them back to a tenant or
project, use the new user_data field on the Node object to store that
info when we mark a node in use.  Also, store the zuul system id on
the node, so that we can ensure we're only reporting nodes that belong
to us.

Update the node list in the REST API to use the cache as well, and
also filter its results by zuul system id and tenant.

Depends-On: https://review.opendev.org/807362
Change-Id: I9d0987b250b8fb54b3b937c86db327d255e54abd
2021-09-10 10:54:59 -07:00
James E. Blair b41f467340 Remove internal nodepool request cache
The internal zuul.nodepool.Nodepool.requests dictionary is used so
the scheduler can keep track of its requests.  Since we will have
multiple schedulers emitting requests, we can't use that any more.
Remove any remaining uses of it.

The NodeRequest uid was only used to index that dictionary (and
was used to persist a request across resubmission).  Since it isn't
needed any more, it is removed.

Change-Id: I7c82485d95979c6c9a246c3dc3954bae3c65ac13
2021-09-10 10:53:47 -07:00
James E. Blair 6dc1178fc3 Don't store node requests/nodesets on queue items
To prepare for queue items moving into ZooKeeper, stop storing the
NodeRequest and NodeSet objects on them.  Instead, reference requests
by ID and consult ZK when necessary, and store only the info about
nodesets that the scheduler needs.  The result is simple dicts
that can easily be serialized.

The deleteNodeRequest method is updated to accept IDs instead of
NodeRequest objects to minimize the number of times we need to
use a full NodeRequest object.

Change-Id: I3587a42eb5a151f41369385e482b7f36b1c41bf6
2021-09-10 08:51:20 -07:00
James E. Blair 514f62ea31 Refactor the checkNodeRequest method
We perform some checks which aren't necessary any more.  This
method is better thought of as a method of getting a nodeset from
a fulfilled node request, so update it accordingly.

Change-Id: I1113820115af68b706b6fe06d6d03cd35ae6b382
2021-09-10 08:46:42 -07:00
James E. Blair dbab353ca3 Remove unnecessary node request cancellation code
The only use of request.canceled that matters at this point is
in emitting stats.  Otherwise, since the canceled flag isn't stored
in ZK (the request is just deleted instead!) there isn't a point
to using it.  Remove those tests.

Change-Id: I82d17f2832ae8fe14cf365b302a454caec5bef3c
2021-09-10 08:46:04 -07:00
James E. Blair 678bc4846c Remove unneeded scheduler.zk_nodepool object
The scheduler has a Nodepool object, and the Nodepool object has
a ZooKeeperNodepool object.  Separately, the scheduler also has a
standalone ZooKeeperNodepool object.  Rather than having a second
zk_nodepool object, just reach into the Nodepool object and use its
zk_nodepool object directly.

This is more important now that ZooKeeperNodepool maintains a
node request cache (and will also maintain a node cache in a future
change).  This means that the scheduler was keeping two in-memory
caches, which is extra work being performed.

Because one of the zk_nodepool objects was being used to generate
nodes provisioned events, and the other was being used to process
them, if their caches weren't in sync, the scheduler could end up
marking node requests as failed when they actually succeeded.

The dual cache issue is why we saw this issue in tests, but the
same issue would be present with multiple schedulers too, so we
also update the getNodeRequest method to make the cache optional.
We bypass the cache where we must be certain we have the most
up-to-date info.

Change-Id: I89242a01f656abce143bfb991670d452deae8b72
2021-09-10 08:05:07 -07:00
James E. Blair bb94937ea3 Wrap nodepool request completed events with election
So that only one scheduler puts nodepool request completed events
in the queue, wrap that with an election.  There is a dedicated
thread to try to win the election, and if it does, it emits
complete events for every completed request (in case we missed
some during the handover).  Other than that, the process stays
the same.

If we encounter a problem putting the event on the ZK queue, we
tell the election thread to re-run the election.

Change-Id: I3dadf5524dc3d931415e20267d36030e945a3000
2021-09-06 15:27:16 -07:00
Felix Edel 4e2985638c Add node request cache to zk nodepool interface
This adds a TreeCache to the ZK nodepool interface; it's nearly
identical to the one on the nodepool side.

Co-Authored-By: James E. Blair <jim@acmegating.com>
Change-Id: Ie972c397cf235d637619d1e40c5e7ff78431ac0d
2021-09-06 15:26:39 -07:00
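A toy version of a node-request cache kept current with kazoo watches is sketched below; the real cache is considerably more careful about races, reconnections, and deserialization.

```python
from kazoo.client import KazooClient


class NodeRequestCache:
    """Keep an in-memory view of node request znodes using kazoo watches."""

    def __init__(self, client: KazooClient, root="/nodepool/requests"):
        self.client = client
        self.root = root
        self.requests = {}          # request id -> raw znode data
        self._watched = set()
        client.ensure_path(root)
        client.ChildrenWatch(root, self._on_children)

    def _on_children(self, children):
        for name in children:
            if name not in self._watched:
                self._watched.add(name)
                self.client.DataWatch(f"{self.root}/{name}",
                                      self._make_data_cb(name))

    def _make_data_cb(self, name):
        def callback(data, stat):
            if data is None:                 # znode deleted
                self.requests.pop(name, None)
                self._watched.discard(name)
                return False                 # deregister this watch
            self.requests[name] = data
        return callback
```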
James E. Blair e225a28fa5 Make node requests persistent
The original Nodepool protocol specified that node requests should
be ephemeral, that way if the requestor crashed before accepting
the nodes, the request would automatically be cleaned up and the
nodes returned.  This doesn't comport with multiple schedulers, as
we will soon expect schedulers to stop and start routinely while
we want the node requests they spawn to persist and be handled by
other schedulers.

Fortunately, Nodepool doesn't really care if the request is
ephemeral or not.  So we'll drop the "ephemeral" flag.

But in the short term, we will be stopping the scheduler and that
will leave orphan node requests.  And even in the long term, we
may have a complete Zuul system shutdown or even a bug which may
leak node requests, so we still need a way of deleting node requests
which don't belong.  To handle that, we add a cleanup routine which
we run immediately on startup and every hour that looks for node
requests created by this Zuul system but that don't correspond to any
queue entries.  We create a new UUID to identify the Zuul system
and store it in ZK (so that if Nodepool has any other users we
don't delete their requests).

We no longer need to resubmit requests on connection loss, so tests
addressing that behavior are removed.

Change-Id: Ie22e99ef71cbe6b31d40c25a21498c1e867ca777
2021-09-03 16:17:15 -07:00
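The cleanup described above might look roughly like this; the ZK helpers and the requestor field are hypothetical stand-ins for whatever the real interface provides.

```python
def cleanup_orphan_requests(zk, system_id, active_request_ids):
    """Delete node requests we created that no queue item references."""
    for request_id in zk.getNodeRequestIds():        # hypothetical helper
        request = zk.getNodeRequest(request_id)      # hypothetical helper
        if request is None:
            continue                                 # already gone
        if request.get("requestor") != system_id:
            continue                                 # another system's request
        if request_id in active_request_ids:
            continue                                 # still referenced by a queue item
        zk.deleteNodeRequest(request_id)             # orphan: clean it up
```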
James E. Blair dbe13ce076 Remove nodeset from NodeRequest
To make things simpler for schedulers to handle node provisioned
events for node requests which they may not have in their local
pipeline state, we need to make the pipeline storage of node requests
simpler.  That starts by removing the nodeset object as an attribute
of the NodeRequest object.  This means that the scheduler can work
with a node request object without relying on having the associated
nodeset.  It also simplifies the ZooKeeper code that deserializes
NodeRequests (as it doesn't have to create fake NodeSet objects too).
And finally, it simplifies what must be stored in the pipeline
and queue item structures, which will also come in handy later.

Two tests designed to verify that the request->nodeset magic
deserialization worked have been removed since they are no longer
applicable.

Change-Id: I70ae083765d5cd9a4fd1afc2442bf22d6c52ba0b
2021-09-02 09:29:44 -07:00
Benjamin Schanzel e577ec90bd Add tenant name on NodeRequests for Nodepool
This change adds the tenant name of the current event's context to
NodeRequests and exposes it as a new field on ZooKeeper.  It prepares
for a tenant-aware Nodepool Launcher for it to enforce per-tenant
resource quota.  In addition, Zuul exposes a new statsd metric
``zuul.nodepool.tenant.<tenant>.current_requests`` that breaks down the
overall current_requests metric per tenant.

The corresponding spec can be found here:
https://review.opendev.org/c/zuul/zuul/+/788481

Change-Id: I6d47431e939aba2c80f30504b7a48c15f9fc8fb7
2021-09-02 09:26:34 -07:00
Simon Westphahl 919c5a3654 Fix wrong variable use when updating resource stats
The code path for updating the nodepool resource stats was still
assuming a full Project instance that we no longer have when requesting
a hold of a node set.

Change-Id: I03a11bc21ae519229fff05b6bff7b9dbb4ae9253
2021-08-20 07:46:03 -07:00
James E. Blair d87a9a8b8f Clear nodeset when re-submitting node requests
We encountered an issue where Zuul:

* submitted a node request
* nodepool fulfilled it
* zuul received the ZK watch and refreshed the NodeRequest object
* zuul submitted the node provisioned event to the event queue
* ZK was disconnected/reconnected
* zuul processed the node provisioned event
* zuul found the node request no longer existed (because it's ephemeral)
* zuul resubmitted the node request

Because the NodeRequest object had the provisioned node information
attached to it, the re-submitted request was created with an
existing 'nodes' list.  Nodepool appended to that list and fulfilled
the new request (which requested 1 but received 2 nodes).  This caused
an exception in Zuul's nodepool request watch callback, which caused
Zuul to ignore that and all future updates to the node request.

To address this, we make a new copy of the nodeset without any allocated
node info when re-submitting a request.

This contains an unrelated change to the event id handling from an earlier
revision; it is kept because it will simplify future changes which eliminate
the node request cache altogether.

Change-Id: I72f5ed7ad53e44d77b37870546daf61b8a4e7e09
2021-08-04 12:20:11 -07:00
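The fix boils down to resubmitting from a copy of the nodeset with all allocation details cleared; the attribute names below are illustrative, not Zuul's model.

```python
import copy


def fresh_nodeset_for_resubmit(nodeset):
    """Return a copy of the nodeset with all allocation details cleared."""
    new = copy.deepcopy(nodeset)
    for node in new.nodes:        # 'nodes', 'id' and 'state' are illustrative
        node.id = None            # forget the previously allocated node
        node.state = "requested"
    return new
```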
James E. Blair 4dabbd9502 Fix race when canceling a node request
When we cancel a node request, we delete the request from ZK.  We
might get the callback from ZK to update the node request object
(due to the delete event) in a separate thread while the first thread
is between the lines where we delete the request and set the internal
flag indicating it was canceled.

That would cause the update callback to think that the request was
externally deleted (not by us) and resubmit it.

To correct this, set the internal canceled flag before performing the
ZK delete.

Change-Id: I1b4771b5840cb168b01939bd8590534ef618d878
2021-07-15 14:00:30 -07:00
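The ordering fix in miniature (helper names are hypothetical): set the canceled flag before the ZK delete so the deletion watch callback can tell our own delete from an external one.

```python
def cancel_request(zk, request):
    request.canceled = True           # must happen before the delete
    zk.deleteNodeRequest(request.id)  # watch fires after this; sees the flag


def on_request_deleted(zk, request):
    if request.canceled:
        return                        # we deleted it ourselves; nothing to do
    resubmit_request(zk, request)     # hypothetical: handle external deletion
```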
Felix Edel 040c5c8032 Move parent provider determination to pipeline manager
Moving the parent provider determination into the pipeline manager
allows us to remove the buildset and job objects from the NodeRequest
constructor. This way we can fully serialize the NodeRequest to
ZooKeeper and restore it without missing important information.

This also has an impact on the NodeRequest's priority property. As this
is only needed to determine the znode path when the NodeRequest is
submitted, we can provide it directly as parameter to the
submitNodeRequest call (and the related update callbacks).

To ensure that NodePool doesn't strip that additional information when
it fulfills the NodeRequest, we use the new "requestor_data" field which
is implemented in [1].

To make this work, we also have to look up the buildset by its UUID from
the active tenants and pipelines when the NodesProvisioned event is
handled in the scheduler. Something similar was already done for
handling the other result events as well.

[1]: https://review.opendev.org/c/zuul/nodepool/+/798746/

Depends-On: https://review.opendev.org/c/zuul/nodepool/+/798746/
Change-Id: Id794643dcf26b0565499d20adba99d3b0518fdf1
2021-07-08 13:27:08 -07:00
Felix Edel fee46c25bc Lock/unlock nodes on executor server
Currently, the nodes are locked in the scheduler/pipeline manager before
the actual build is created in the executor client. When the nodes are
locked, the corresponding NodeRequest is also deleted.

With this change, the executor will lock the nodes directly before
starting the build and unlock them when the build is completed.

To keep the order of events intact, the nodepool.acceptNodes() method is
split up into two:
    1. nodepool.acceptNodeRequest() does most of the old acceptNodes()
       method except for locking the nodes and deleting the node
       request. It is called on the scheduler side when the
       NodesProvisionedEvent is handled (which is also where
       acceptNodes() was previously called).
    2. nodepool.acceptNodes() is now called on the executor side when
       the job is started. It locks the nodes and deletes the node
       request in ZooKeeper.

Finally, it's also necessary to move the autohold processing to the
executor, as this requires a lock on the node. To allow processing of
autoholds, the executor now also determines the build attempts and sets
the RETRY_LIMIT result if necessary.

Change-Id: I7392ce47e84dcfb8079c16e34e0ed2062ebf4136
2021-07-01 05:46:02 +00:00
Zuul bd1a669cc8 Merge "statsd: decrement resources gauge for held node" 2021-05-28 17:36:47 +00:00
Zuul 7e802df42d Merge "Remove use of item's layout in Nodepool API" 2021-05-12 07:21:33 +00:00
Clark Boylan f2982dc152 Check if statsd is set before using it
We don't require a statsd config, which means we must check that the
statsd objects are valid before using them to send data. Do this in two
places that were missed.

Change-Id: Ifda150d5305ea0cadf2865cdb691263e32476b94
2021-05-11 10:40:32 -07:00
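The guard is as simple as it sounds; a sketch with an assumed metric name:

```python
def emit_current_requests(statsd_client, tenant, count):
    # statsd is optional, so never assume the client exists.
    if not statsd_client:
        return
    statsd_client.gauge(
        f"zuul.nodepool.tenant.{tenant}.current_requests", count)
```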
Simon Westphahl 336e48d824 Remove use of item's layout in Nodepool API
The useNodeSet() method was still using an item's layout to get the
tenant name. Since the layout might be set to None during a re-enqueue
(see Id7cef4f1fa222b1491418ea2449687964fcfb361) we need to get the
tenant name via the pipeline instead.

Change-Id: I3835b5082681930b962cecf7fe6edcf2a211465a
2021-05-11 14:19:51 +02:00
Felix Edel ba7f81be2d Provide statsd client to Nodepool and make scheduler optional
To lock/unlock the nodes directly in the executor server, we have to
make the Nodepool API work without a scheduler instance.

To keep the stats emitting intact, we provide a statsd client directly
to the Nodepool instance.

This leaves only one place where the scheduler is used in the Nodepool
class, which is the onNodesProvisioned() callback.
This callback won't be necessary anymore when the nodes are locked on
the executor and thus this function call and the scheduler parameter
itself can be removed.

Change-Id: I3f3e4bfff08e244f68a9be7c6a4efcc194a23332
2021-04-30 12:12:28 +02:00
Tristan Cacqueray da60b252a7 statsd: decrement resources gauge for held node
This change fixes an issue where held nodes are never subtracted
from the resources gauge, resulting in an ever-increasing resource
usage metric.

Change-Id: Id87fcf95a8224492f0335dbb357977865f4fd45f
2021-04-26 14:56:09 +00:00
Jan Kubovy d518e56208 Prepare Zookeeper for scale-out scheduler
This change is a common root for other
Zookeeper-related changes regarding the
scale-out scheduler. Zookeeper becoming
a central component requires increasing
"maxClientCnxns".

Since the ZooKeeper class is expected to grow
significantly (ZooKeeper is becoming a central part
of Zuul), a split of the ZooKeeper class (zk.py) into
a zk module is done here to avoid the current god-class.

Also the zookeeper log is copied to the "zuul_output_dir".

Change-Id: I714c06052b5e17269a6964892ad53b48cf65db19
Story: 2007192
2021-02-15 14:44:18 +01:00
Tobias Henkel 4205740b67
Fix memleak on zk session loss
When the scheduler loses its zk session it resubmits all lost node
requests as new ones. However, it didn't stop the watch for the old one,
which remains registered in Kazoo. The watch contains the
NodeRequest object since it's bound to the callback. Thus by leaking
the Watch we also leak the NodeRequest, the attached BuildSet,
QueueItem and finally the Tenant and Layout as well.

This can be fixed by stopping the watch in this case.

Change-Id: I3b05ec92816ab5eb06ad40dfad85ddfebfbf2cc4
2020-09-11 09:23:21 +02:00
Tristan Cacqueray e85fb93d1d Store a list of held nodes per held build in hold request
Instead of storing a flat list of nodes per hold request, this
change updates the request nodes attribute to become a list of
dictionaries, each with the build uuid and the held node list.

Change-Id: I9e50e7ccadc58fb80d5e80d9f5aac70eb7501a36
2019-10-24 13:39:16 -04:00
David Shrewsbury 9f5743366d Auto-delete expired autohold requests
When a request is created with a node expiration, set a request
expiration for 24 hours after the nodes expire.

Change-Id: I0fbf59eb00d047e5b066d2f7347b77a48f8fb0e7
2019-09-18 10:09:08 -04:00
David Shrewsbury 2c1c9ae662 Record held node IDs with autohold request
These node IDs will be output with the 'zuul autohold-info' command.

Change-Id: I8f52d2b87b3bec6d3b8ecc2f69507049d905cad5
2019-09-16 10:48:41 -04:00
David Shrewsbury 716ac1f2e1 Store autohold requests in zookeeper
Storing autohold requests in ZooKeeper, rather than in-memory,
allows us to remember requests across restarts, and is a necessity
for future work to scale out the scheduler.

Future changes to build on this will allow us to store held node
information with the change for easy node identification, and to
delete any held nodes for a request using the zuul CLI.

A new 'zuul autohold-delete' command is added since hold requests
are no longer automatically deleted.

This makes the autohold API:
   zuul autohold: Create a new hold request
   zuul autohold-list: List current hold requests
   zuul autohold-delete: Delete a hold request

Change-Id: I6130175d1dc7d6c8ce8667f9b14ae9377737d280
2019-09-16 08:47:53 -04:00
Tobias Henkel 6931703536
Annotate logs around finished builds
We should annotate the logs around finished builds with event ids.

Change-Id: I44ba4219f6d602aeab1f0d5829dfcb107341cf6d
2019-05-30 19:21:31 +02:00
Tobias Henkel 6f3bcdd6b6
Annotate builds with event id
It's useful to be able to trace an event through the system including
the builds.

Change-Id: If852cbe8aecc4cf346dccc1b8fc34272c8ff483d
2019-05-30 19:18:00 +02:00
Tobias Henkel e90fe41bfe Report tenant and project specific resource usage stats
We currently lack the means to support resource accounting of tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes we can emit statsd statistics per tenant and per
project.

The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
  Gauge with the currently used resources by tenant

* zuul.nodepool.resources.project.{project}.{resource}.current
  Gauge with the currently used resources by project

* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
  Counter with the summed usage by tenant. e.g. cpu seconds

* zuul.nodepool.resources.project.{project}.{resource}.counter
  Counter with the summed usage by project. e.g. cpu seconds

Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
2019-05-29 04:10:08 +00:00
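A sketch of how such gauges and counters could be emitted with the statsd client, using the metric names listed above; treating the counter increment as resource-seconds and the sampling interval are assumptions of this sketch.

```python
import statsd

client = statsd.StatsClient("localhost", 8125)


def report_usage(tenant, project, resources, seconds_in_use):
    """resources is a dict such as {"cores": 8, "ram": 16384}."""
    for resource, amount in resources.items():
        client.gauge(
            f"zuul.nodepool.resources.tenant.{tenant}.{resource}.current",
            amount)
        client.gauge(
            f"zuul.nodepool.resources.project.{project}.{resource}.current",
            amount)
        # Accumulate usage over time, e.g. cpu seconds.
        client.incr(
            f"zuul.nodepool.resources.tenant.{tenant}.{resource}.counter",
            int(amount * seconds_in_use))
        client.incr(
            f"zuul.nodepool.resources.project.{project}.{resource}.counter",
            int(amount * seconds_in_use))


report_usage("example-tenant", "org/project",
             {"cores": 8, "ram": 16384}, seconds_in_use=60)
```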