Commit Graph

775 Commits

Author SHA1 Message Date
Zuul 3b19ca9cb3 Merge "Add zuul_unreachable ansible host group" 2024-03-25 18:26:14 +00:00
James E. Blair 632839804c Add a zuul.buildset_refs variable
This adds information about the changes associated with a
circular dependency queue item.  Currently the bundle_id can be
used to identify which of the items in zuul.items is related to
the current dependency cycle.  That variable is deprecated, so
zuul.buildset_refs can be used to replace that functionality.

Since it repeats some of the information at the top level (eg
zuul.change, zuul.project, etc), the code is refactored so they
can share the dictionary construction.  That is also used by
zuul.items.  This results in a few extra fields in zuul.items
now, such as the change message, but that is relatively
inconsequential, so is not called out in the release notes.

The src_dir is similarly included in all of these places.  In
writing this change it was discovered that
zuul.items.project.src_dir always used the golang scheme, but
zuul.project.src_dir used the correct per-job workspace scheme.
This has been corrected so that they both use the per-job scheme
now.

A significant reorganization of the job variable documentation is
included.  Previously we had a section with additional variables
for each item type, but since many of these are duplicated at the
top level, in the item list, and now in the refs list, that
structure became difficult to work with.  Instead, the
documentation for each of these sections now exhaustively lists
all of the possible variables.  This makes for some repetition,
but it also means that a user can see at a glance what variables
are available, and we can deep-link to each one.  To address the
variation between different item types, the variables that mutate
based on item type now contain a definition list indicating what
types they are valid for and their respective meanings.

Change-Id: Iab8f99d4c4f40c44d630120c458539060cc725b5
2024-03-22 06:41:36 -07:00
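A minimal sketch of consuming the new variable, assuming the job variables have been dumped to a JSON file named zuul_vars.json and that each entry in zuul.buildset_refs carries change/project data as described above (the key layout is an assumption, not taken from Zuul's documentation):

    import json

    # Hypothetical dump of the job variables.
    with open("zuul_vars.json") as f:
        zuul = json.load(f)["zuul"]

    for ref in zuul.get("buildset_refs", []):
        project = ref.get("project", {})
        print(project.get("canonical_name"), ref.get("change"),
              project.get("src_dir"))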
Joshua Watt d5dcb7eb35 Report topic to jobs as zuul.topic
Reports the change topic to jobs as an ansible variable. This can be
useful for jobs that either want to name artifact output based on a
topic, or enforce that a topic is set using a zuul job.

Change-Id: I678404523d228947541160554623bf4066a729c4
2024-03-08 11:30:45 -07:00
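For example, a job script might derive an artifact name from the topic, or fail when no topic is set; this is only a sketch, and the use of the zuul.build UUID in the name is an illustrative choice:

    # Sketch: zuul_vars is the dict of Zuul job variables available to
    # the job (e.g. parsed from a dumped copy of the job variables).
    def artifact_name(zuul_vars: dict) -> str:
        topic = zuul_vars.get("topic")
        if not topic:
            raise RuntimeError("this job requires the change to have a topic")
        return f"{topic}-{zuul_vars['build']}.tar.gz"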
Simon Westphahl ae258ac2e0
Always delay response to first build request
In Ic52f3d82ed12fd45922d9ec9d61e8ded43507837 we changed the build
worker to delay answering requests only after we've successfully locked
and started a job.

That change also implicitly removed the initial delay before the first
build request. In practice this turned out to be a mistake, as
low-latency executors were able to respond to requests much more quickly
than executors with a higher ZK latency. The problem is reinforced by
the fact that there are usually only a few build requests in each
iteration.

The effect is that builds are no longer evenly distributed among
executors. To fix this problem we'll add back the initial delay before
responding to the first build request.

Change-Id: I91f2ce87ccc2c5b6a854cf7937ebe65521e72d66
2024-03-04 12:07:12 +01:00
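A simplified sketch of the accept loop described above; the helper, the delay formula, and the two-second cap are illustrative rather than Zuul's actual implementation:

    import random
    import time

    def try_lock_and_start(request, running_builds):
        # Hypothetical stand-in for locking a ZK build request and
        # starting the corresponding job.
        running_builds.append(request)
        return True

    def accept_loop(requests, running_builds, max_delay=2.0):
        # Initial delay so low-latency executors do not win every request
        # in a small batch.
        time.sleep(random.uniform(0, max_delay))
        for request in requests:
            if try_lock_and_start(request, running_builds):
                # Back off in proportion to current load, but only after
                # a successful start (see the related change below).
                time.sleep(min(0.1 * len(running_builds), max_delay))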
James E. Blair 4421a87806 Add zuul_unreachable ansible host group
This will allow users to write post-run playbooks that skip
running certain tasks on unreachable hosts.

Change-Id: I04106ad0222bcd8073ed6655a8e4ed77f43881f8
2024-02-27 13:57:07 -08:00
James E. Blair 8dd4011aa0 Monitor and report executor inode usage
This adds inodes to the hdd executor sensor and reports usage
to statsd as well.

Change-Id: Ifd9a63cfc7682f6679322e39809be69abca6827e
2024-02-19 11:20:57 -08:00
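A sketch of the underlying measurement using os.statvfs; the path and any statsd metric name are assumptions:

    import os

    def inode_usage(path):
        # Used/total inodes for the filesystem backing the given path.
        st = os.statvfs(path)
        used = st.f_files - st.f_ffree
        return used, st.f_files

    used, total = inode_usage("/tmp")
    pct = 100.0 * used / total if total else 0.0
    print(f"inodes: {used}/{total} ({pct:.1f}% used)")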
James E. Blair 922a6b53ed Make executor sensors slightly more efficient
Rather than checking all of the sensors to see if they are okay,
then collecting all the data again for stats purposes, do both
at the same time.

Change-Id: Ia974a7d013057880171fd1695a1d17169d093410
2024-02-19 09:04:41 -08:00
Zuul dc7c896af2 Merge "Replace Ansible 6 with Ansible 9" 2024-02-16 18:02:54 +00:00
Zuul 8eabd3c465 Merge "Add backoff delay only after successful job start" 2024-02-16 13:20:36 +00:00
James E. Blair 5a8e373c3b Replace Ansible 6 with Ansible 9
Ansible 6 is EOL and Ansible 9 is available.  Remove 6 and add 9.

This is usually done in two changes, but this time it's in one
since we can just rotate the 6 around to make it a 9.

command.py has been updated for ansible 9.

Change-Id: I537667f66ba321d057b6637aa4885e48c8b96f04
2024-02-15 16:20:45 -08:00
Simon Westphahl 8a7607acd5
Add backoff delay only after successful job start
When there is a burst of new jobs, executors could be slowed down by the
backoff delay in accepting those jobs. The backoff delay is calculated
based on the number of currently running jobs and is supposed to
distribute jobs more evenly among executors.

Previously this delay was added for every processed build request,
regardless of whether the job could be started. With this change we
only add the backoff delay when the previous request was handled
successfully.

By moving the sleep from before to after handling a request, we also
reduce the time between getting the request from ZK and trying to lock
it. This can increase the chance of successfully locking a request.

This change also adds a missing return when a build request could be
locked, but the request was already processed by another executor.

Change-Id: Ic52f3d82ed12fd45922d9ec9d61e8ded43507837
2024-02-15 17:46:21 +01:00
James E. Blair fa274fcf25 Remove most model_api backwards-compat handling
Since we require deleting the state (except config cache) for the
circular dependency upgrade, we can now remove any zookeeper
backwards compatibility handling not in the config cache.  This
change does so.  It leaves upgrade handling for two attributes
in the branch cache (even though they are old) because it is
plausible that even systems that have been upgrading regularly
may have non-upgraded data in the branch cache for infrequently
used branches, and the cost for retaining the upgrade handling
is small.

Change-Id: I881578159b06c8c7b38a9fa0980aa0626dddad31
2024-02-09 07:39:55 -08:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions which were added to prepare for this
upgrade have been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid.  This has been corrected, and
doing so necessitated some changes around quiet dequeuing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
Simon Westphahl 904143e003
Stop and join the build worker before ZK client
As the `BaseMergeServer` also disconnects the ZK client, we need to
first stop and join the build worker in the executor before calling the
parent class's join.

This should prevent ZK connection lost messages from showing up during
executor shutdown.

Change-Id: I6866cfa39ee533ed7d06ab39ebfb9ad46d6e5377
2024-01-25 16:35:34 +01:00
James E. Blair 663931cba3 Include job_uuid in BuildRequests/Events
This is part of the circular dependency refactor.

It updates BuildRequests to include the job uuid (similar to the
previous change with NodeRequests).  We also use the job uuid when
sending BuildResultEvents.

Change-Id: If7a55138d0cb5a358c62e8cd97ee322087b09a6b
2024-01-08 08:54:56 -08:00
James E. Blair 9201f9ee28 Store builds on buildset by uuid
This is part of the circular dependency refactor.

This updates the buildset object in memory (and zk) to store builds
indexed by frozen job uuid rather than job name.  This also updates
several related fields, as well as temporary dictionaries, to do the same.

This will allow us, in the future, to have more than one job/build
in a buildset with the same name (for different changes/refs).

Change-Id: I70865ec8d70fb9105633f0d03ba7c7e3e6cd147d
2023-12-12 11:58:21 -08:00
James E. Blair 071c48c5ae Freeze job dependencies with job graph
This is part of the circular dependency refactor.

Update the job graph to record job dependencies when it is frozen,
and store these dependencies by uuid.  This means our dependency
graph points to actual frozen jobs rather than mere job names.

This is a prerequisite to being able to disambiguate dependencies
later when a queue item supports multiple jobs with the same name.

The behavior where we would try to unwind an addition to the job
graph if it failed is removed.  This was originally written with the
idea that we would try to run as many jobs as possible if there was
a config error.  That was pre-zuul-v3 behavior.  For a long time now,
in all cases where we actually encounter an error adding to the job
graph, we bail and report that to the user.  No longer handling that
case simplifies the code somewhat and makes it more future-proof
(while complicating one of the tests that relied on that behavior
as a shortcut).

This attempts to handle upgrades by emulating the old behavior
if a job graph was created on an older model version.  Since it
relies on frozen job uuids, it also attempts to handle the case
where a frozenjob does not have a uuid (which is a very recent
model change and likely to end up in the same upgrade for some
users) by emulating the old behavior.

Change-Id: I0070a07fcb5af950651404fa8ae66ea18c6ca006
2023-12-06 16:41:18 -08:00
James E. Blair 033470e8b3 Fix repo state restore for zuul role tag override
When a repo that is being used for a zuul role has override-checkout
set to a tag, the job would fail because we did not reconstruct the
tag in our zuul-role checkout; we only did that for branches.

This fixes the repo state restore for any type of ref.

There is an untested code path where a zuul role repo is checked
out to a tag using override-checkout.  Add a test for that (and
also the same for a branch, for good measure).

Change-Id: I36f142cd3c4e7d0b930318dddd7276f3635cc3a2
2023-11-30 10:06:03 -08:00
Zuul 90746cb286 Merge "Retry lingering deduplicated builds" 2023-11-20 17:06:05 +00:00
James E. Blair dd60903a95 Retry lingering deduplicated builds
We intend to handle the case where two queue items in check share
a deduplicated build and one of them completes and is dequeued while
the other continues.  To handle this, we avoid removing any queue
items (and therefore their associated builds) from ZK if their builds
show up in any currently-enqueued queue item.  However, we don't
actually have a mechanism to load a build from ZK in that situation
if it isn't in a queue item that is currently enqueued.

Adding such a mechanism is complex and risky, whereas the circular
dependency refactoring effort currently in progress will address this
issue in a comprehensive manner.

This change addresses the issue by detecting the situation at the
point where we would try to launch the build (since we failed to
restore it from ZK) and instead of raising an exception as we currently
do, we tell the scheduler to retry the build.  This results in the
buildset not actually taking advantage of the potential deduplication,
but it does at least provide a working build result to the user in
the form of a brand new build.

Change-Id: I86c159c82b858e67433bdaa1e479471b21ea8b86
2023-11-16 14:59:31 -08:00
Zuul a509e25f87 Merge "Send job parent data + artifacts via build request" 2023-11-16 07:15:20 +00:00
Simon Westphahl 68d7a99cee
Send job parent data + artifacts via build request
With job parents that supply data, we might end up updating the
(secret) parent data and artifacts of a job multiple times, in addition
to storing duplicate data, as most of this information is part of the
parent's build result.

Instead we will collect the parent data and artifacts before scheduling
a build request and send them as part of the request parameters.

If those parameters are part of the build request the executor will use
them, otherwise it falls back on using the data from the job for
backward compatibility.

This change affects the behavior of job deduplication in that input data
from parent jobs is no longer considered when deciding if a job can be
deduplicated or not.

Change-Id: Ic4a85a57983d38f033cf63947a3b276c1ecc70dc
2023-11-15 07:24:52 +01:00
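A sketch of the fallback described above; the parameter key and the frozen-job attribute names are assumptions:

    def resolve_parent_data(build_params: dict, frozen_job):
        # Prefer the data shipped with the build request; fall back to
        # the job-side data for requests from older schedulers.
        if "parent_data" in build_params:
            return (build_params["parent_data"],
                    build_params.get("artifacts", []))
        # Hypothetical attributes standing in for the job-side data.
        return frozen_job.parent_data, frozen_job.artifacts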
James E. Blair 518194af1d Thin the workspace repo clone
When the executor clones a repo from the cache to the workspace,
it performs a lot of unnecessary work:

* It checks out HEAD and prepares a workspace which we will
  immediately change.
* It copies all of the branch refs, half of which we will immediately
  delete, and in some configurations (exclude_unprotected_branches)
  we will immediately delete most of the rest.  Deleting refs with
  gitpython is much more expensive than creating them.

This change updates the initial clone to do none of those, instead
relying on the repo state restoration to take care of that for us.

Change-Id: Ie8846c48ccd6255953f46640f5559bb41491d425
2023-11-10 06:19:47 -08:00
James E. Blair 07e28cca57 Add tracing spans for build repo setup
To aid analysis of the build startup phase, add opentelemetry spans
to the start of builds around repo operations (update, checkout,
merge, etc).

Change-Id: I61df8e63f6401deb98b9842299a23d0c2c2f142a
2023-11-08 14:05:47 -08:00
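A sketch of the kind of instrumentation added, using the standard opentelemetry API; the span and attribute names are illustrative:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def setup_build_repos(repos):
        # Wrap the overall repo setup and each per-repo operation in a
        # span so slow clones, checkouts, or merges stand out in traces.
        with tracer.start_as_current_span("BuildRepoSetup"):
            for repo in repos:
                with tracer.start_as_current_span(
                        "RepoCheckout", attributes={"repo": repo}):
                    pass  # update/checkout/merge would happen here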
James E. Blair e55748ba69 Make Ansible variable freezing more efficient
We currently iterate over every job/host/etc variable in the freeze
playbook.  The reason is that if any value in any variable is
Undefined according to jinja, the Ansible combine filter throws
an error.  What we want to do in Zuul is merge any variable we can,
but if any is undefined, we skip it.  Thus, the process of combining
the variables one at a time in a task and ignoring errors.

This process can be slow, especially if we start with a large
amount of data in one of the early variables.  The combine filter
needs to reprocess the large data repeatedly for each additional
variable.

To improve the process, we create a new action plugin, "zuul_freeze"
which takes a list of variables we want to freeze, then templates
them one at a time and stores the result in a cacheable fact.  This
is the essence of what we were trying to accomplish with the combine
filter.

Change-Id: Ie41f404762daa1b1a5ae47f6ec1aa1954ad36a39
2023-09-14 14:00:45 -07:00
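The essence of the approach, sketched with plain jinja2 rather than the real action plugin or Ansible's templar:

    import jinja2

    def freeze_vars(variables: dict) -> dict:
        # Render each variable on its own; anything that references an
        # undefined name is skipped instead of failing the whole merge.
        env = jinja2.Environment(undefined=jinja2.StrictUndefined)
        frozen = {}
        for name, value in variables.items():
            try:
                frozen[name] = env.from_string(str(value)).render(**variables)
            except jinja2.UndefinedError:
                continue
        return frozen

    print(freeze_vars({"a": "{{ b }}", "b": "hello", "c": "{{ missing }}"}))
    # -> {'a': 'hello', 'b': 'hello'}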
James E. Blair 1b042ba4ab Add job failure output detection regexes
This allows users to trigger the new early failure detection by
matching regexes in the streaming job output.

For example, if a unit test job outputs something sufficiently
unique on failure, one could write a regex that matches that and
triggers the early failure detection before the playbook completes.

For hour-long unit test jobs, this could save a considerable amount
of time.

Note that this adds the google-re2 library to the Ansible venvs.  It
has manylinux wheels available, so is easy to install with
zuul-manage-ansible.  In Zuul itself, we use the fb-re2 library which
requires compilation and is therefore more difficult to use with
zuul-manage-ansible.  Presumably using fb-re2 to validate the syntax
and then later actually using google-re2 to run the regexes is
sufficient.  We may want to switch Zuul to use google-re2 later for
consistency.

Change-Id: Ifc9454767385de4c96e6da6d6f41bcb936aa24cd
2023-08-21 16:41:21 -07:00
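An illustrative sketch of matching streamed output lines against user-supplied failure patterns; the patterns shown are examples, and the standard re module stands in here for the RE2 bindings mentioned above:

    import re

    # Example patterns a user might configure for a unit-test job.
    failure_regexes = [re.compile(p) for p in (
        r"^FAILED \(failures=\d+\)",
        r"RuntimeError: fatal test harness error",
    )]

    def line_indicates_failure(line: str) -> bool:
        return any(r.search(line) for r in failure_regexes)

    print(line_indicates_failure("FAILED (failures=3)"))  # True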
James E. Blair f04d80912e Fix issue with early failure and blocks
Ansible tasks that fail within block tasks call the failure
callback, which means that they triggered Zuul early failure
detection even if they were later rescued.  To avoid this,
ignore block tasks for purposes of early failure detection.

Also make early failure detection "sticky".  This helps
uncover errors like this (the "remote" tests uncover this
particular failure if the result is made sticky), and also
ensures consistent behavior in dependent pipelines.

Change-Id: I505667678e7384386819b5389036e4fb4f108afd
2023-08-21 16:41:08 -07:00
Simon Westphahl 3b011296e6 Keep task stdout/stderr separate in result object
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.

This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.

The new behavior is configurable and off by default to retain backward
compatibility.

Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
2023-08-17 16:22:41 -07:00
Simon Westphahl d1d886bc03
Report error details on Ansible failure
In case of a retry there might be no logs available to help the user
understand the reason for a failure. To improve this, we now include
the details of the failure as part of the build result.

Change-Id: Ib9fdbdec5d783a347d1b6e5ce6510d50acfe1286
2023-08-07 10:13:16 +02:00
Zuul 6c0ffe565f Merge "Report early failure from Ansible task failures" 2023-07-29 18:28:08 +00:00
Zuul 1500788011 Merge "Refactor will_retry" 2023-07-29 18:14:44 +00:00
Zuul c05ae202b9 Merge "Add Zuul job variable to indicate a job will retry" 2023-07-29 18:13:41 +00:00
Zuul af15ee197b Merge "Add commit_id to zuul vars" 2023-07-27 19:41:28 +00:00
James E. Blair 2ecf2d62cc Refactor will_retry
This moves some variables around to try to make the code have more
of a direct correspondence with the zuul_will_retry Ansible variable.

The logic is just a little too complex to have a 1:1 match between
a python variable and Ansible.  But this refactor adds a "should_retry"
which is pretty close to "zuul_will_retry" with the caveat that
will_retry is should_retry iff we have not hit the retry limit.

Change-Id: Ic869e3befc5763ffcd4a97c406224f7bb46acf13
2023-07-27 07:08:26 -07:00
Simon Westphahl 7bba28a32f
Add Zuul job variable to indicate a job will retry
This change adds a variable to post and cleanup playbooks in order to
determine if a job will be retried due to a failure in one of the
earlier playbooks.

This variable might be useful for only performing certain actions (e.g.
interacting with a remote system) when the job result is final and there
won't be any further attempts.

Change-Id: If7f4488d4a59b1544795401bdc243978fea9ca86
2023-07-27 13:13:55 +02:00
James E. Blair 60a8dfd451 Add Ansible 8
This is the currently supported version of Ansible.  Since 7 is out
of support, let's skip it.

Change-Id: I1d13c23189dce7fd9db291ee03a452089b92a421
2023-07-19 15:46:48 -07:00
James E. Blair 985aefc70c Add commit_id to zuul vars
We added the commit_id to the MQTT reporter a while ago so that
external systems which identify changes by commit id (the Gerrit
current patchset hexsha, or the Github PR HEAD, etc) can track
Zuul work.  To enable the same sort of interfacing with external
systems in Zuul jobs, the variable is added to the set of zuul
variables.

To make it easier for use in multiple pipelines, the variable is
set to oldrev or newrev in the non-change case.

To make it easier to confirm driver parity, a new Gerrit driver
test class is added with the intention that it should parallel
the other driver test classes over time.

Change-Id: I2826349ee9ce696c8f9de8c23276094123292196
2023-07-19 15:16:17 -07:00
James E. Blair 1170b91bd8 Report early failure from Ansible task failures
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail.  This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.

We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host).  If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unnecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result.  Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.

Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
2023-06-29 13:40:34 -07:00
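A sketch of the callback-plugin side of this; v2_runner_on_failed is the standard Ansible callback hook, while the notification itself is only a placeholder here:

    from ansible.plugins.callback import CallbackBase

    class EarlyFailureCallback(CallbackBase):
        # Sketch: remember that a task failed so the executor can tell
        # the scheduler the build will fail before the playbook ends.
        def __init__(self):
            super().__init__()
            self.early_failure = False

        def v2_runner_on_failed(self, result, ignore_errors=False):
            if not ignore_errors:
                self.early_failure = True
                # ...notify the executor here (placeholder)...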
James E. Blair 1c92165ab7 List process ids in bwrap namespace
If the kernel kills a process due to an out of memory error, it
can be difficult to track the process back to the build that triggered
it.  The kernel error just gives us a PID, but we don't know any of
the Ansible process ids.  Further, since they are in bwrap, Ansible
only knows its namespaced pid rather than the host pid, so we can't
simply output it in one of our callback plugins.

To aid in debugging, output all of the process ids within a namespace
right at the start of an ansible-playbook execution.  At this time,
it is certain that the Ansible process will have started, and it is
very likely that it is still running.  That should provide a way
to map from an OOM message back to an Ansible process id.

(Note that Ansible forks and this is unlikely to catch any forked
processes, so we will only see the main Ansible process id.  Typically
this is what the kernel should elect to kill, but if it does not,
we may need a further change to repeat this process each time Ansible
forks.  Since that is more costly, let's see if we can avoid it.)

Change-Id: I9f262c3a3c5410427b0fb301cb4f1697b033ba2f
2023-06-28 13:31:06 -07:00
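A sketch of the idea (not Zuul's exact implementation): read /proc inside the namespace and log each PID with its command line:

    import os

    def list_pids():
        # Log every PID visible in the current /proc together with its
        # command line, so an OOM-kill message can be mapped back to it.
        for entry in sorted(os.listdir("/proc")):
            if not entry.isdigit():
                continue
            try:
                with open(f"/proc/{entry}/cmdline", "rb") as f:
                    cmd = f.read().replace(b"\0", b" ").decode(errors="replace")
            except OSError:
                continue
            print(entry, cmd.strip())

    list_pids()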
Zuul 0bd76048d1 Merge "Correctly complete build on transient errors" 2023-05-22 11:10:13 +00:00
Simon Westphahl 515e6ab8f7
Correctly complete build on transient errors
Use the `completeBuild()` method for reporting the result on transient
errors during repo update instead of the old `sendWorkComplete()` API.

2023-04-28 12:29:49,173 ERROR zuul.AnsibleJob: [e: ...] [build: ...] Exception while executing job
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 1144, in do_execute
    self._execute()
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 1345, in _execute
    self.job.sendWorkComplete(
AttributeError: 'FrozenJob' object has no attribute 'sendWorkComplete'

Change-Id: I05be093f80f015463f727f55154e16202821e961
2023-05-02 10:52:42 +02:00
Simon Westphahl b8be20c5db
Expose max. attempts to the job as a Zuul variable
This allows a job to determine if the current execution is the last
attempt.

Change-Id: I989db9f924aab997496851ce99ad71bbacf2379e
2023-04-14 08:13:35 +02:00
Zuul 8c3c0c2251 Merge "Retry jobs on transient IO errors on repo update" 2023-03-22 21:52:16 +00:00
Zuul c5de8cb54b Merge "Expose nodepool slot attribute" 2023-03-15 15:56:42 +00:00
Simon Westphahl 408c066786
Retry jobs on transient IO errors on repo update
We are occasionally seeing different types of IO errors when updating
repos on an executor. Currently those exceptions will abort the build
and result in an error being reported.

Since those errors are usually transient and point to some
infrastructure problem we should retry those builds instead.

We'll catch all IOErrors which includes request related exceptions from
the "requests" Python package. See:
https://github.com/psf/requests/blob/main/requests/exceptions.py

Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 3609, in _innerUpdateLoop
    self.merger.updateRepo(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 994, in updateRepo
    repo = self.getRepo(connection_name, project_name,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 966, in getRepo
    url = source.getGitUrl(project)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubsource.py", line 154, in getGitUrl
    return self.connection.getGitUrl(project)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubconnection.py", line 1744, in getGitUrl
    self._github_client_manager.get_installation_key(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubconnection.py", line 1126, in get_installation_key
    response = github.session.post(url, headers=headers, json=None)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 635, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/github3/session.py", line 171, in request
    response = super().request(*args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/cachecontrol/adapter.py", line 53, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
  File "/opt/zuul/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError:
HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded
with url: /api/v3/app/installations/123/access_tokens (Caused by
NewConnectionError('<urllib3.connection.HTTPSConnection object at
0x7f44f6136ef0>: Failed to establish a new connection: [Errno -3]
Temporary failure in name resolution'))

Change-Id: I4e07e945c88b9ba61f83131076fbf7b9768a61f9
2023-02-24 15:00:48 +01:00
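A sketch of the classification described above (the requests exceptions derive from IOError, so one except clause covers them); the names and return values are illustrative:

    def update_with_retry(merger, connection_name, project_name):
        # Transient infrastructure errors become a retried build rather
        # than an error report.
        try:
            merger.updateRepo(connection_name, project_name)
            return "updated"
        except IOError:
            return "retry"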
James E. Blair 99d39545a6 Add an !unsafe change_message variable
In I9628e2770dda120b269612e28bb6217036942b8e we switched zuul.change from
a plain string tagged with !unsafe to base64 encoded and no !unsafe tag.
The idea was to make the inventory file parseable by external tools while
avoiding accidental interpolation of the commit message by Ansible.

That doesn't work in all cases -- it's not hard to construct a scenario
where after base64 decoding the message any further processing by Ansible
causes it to undergo interpolation.  Moreover, since then we have made
many changes to how we deal with variables; notably, the inventory.yaml
is no longer actually used by Zuul's Ansible -- it is now there only
for human and downstream processing.  We call it the "debug inventory".
The actual inventory is much more complex and in some cases has lots of
!unsafe tags in it.

Given all that, it now seems like the most straightforward way to deal
with this is to tag the message variable as !unsafe when passing it to
Zuul's Ansible, but render it as plain text in the inventory.yaml.

To address backwards compatibility, this is done in a new variable called
zuul.change_message.  Since that's a more descriptive variable anyway,
we will just keep that one in the future and drop the current base64-
encoded zuul.message variable.

Change-Id: Iea86de15e722bc271c1bf0540db2c9efb032500c
2023-02-09 09:07:53 -08:00
Clark Boylan 045bb270c0 Fix ResourceWarnings in the executor server
We have two open() calls in the executor server without matching
close() calls. Fix this by wrapping both in a with context manager.

Change-Id: I0e51c2b22ea1540484851f98749e648728b26406
2023-02-07 16:17:14 -08:00
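The pattern used for the fix, for reference; a with block closes the file even on error paths, which silences the ResourceWarning:

    with open("/proc/self/status") as f:
        data = f.read()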
Clark Boylan 753bebbb2c Cleanup some Python ResourceWarnings in the test suite
First, we add support for PYTHONTRACEMALLOC being passed through
nox to easily enable tracebacks on emitted ResourceWarnings. We set this
value to 0 by default, as enabling it slows things down and requires
more memory, but it is very useful locally when debugging specific
ResourceWarnings to set `PYTHONTRACEMALLOC=5` in order to correct these
issues.

With that in place we identify and correct two classes of
ResourceWarnings.

First up is the executor server not closing its statsd socket when
stopping the executor server. Address that by closing the socket if
statsd is enabled and setting the statsd attribute to None to prevent
anything else from using it later.

Second is a test only issue where we don't close the fake Gerrit,
Gitlab, or Web Proxy Fixture server's HTTP socket; we only shut down
the server. Add a close call to the server after it is shut down to correct
this.

There are potentially other ResourceWarnings to be found and cleaned up,
but sifting through the noise will be easier as we eliminate these more
widespread warnings.

Change-Id: Iddabe79be1c8557e300dde21a6b34e57b04c48e0
2023-02-06 10:30:18 -08:00
Zuul 35c68169bd Merge "Fix deduplication exceptions in pipeline processing" 2022-12-20 01:48:36 +00:00
Felix Edel 137557c559 Abort job if cleanup playbook timed out
We've investigated an issue where a job was stuck on the executor
because it wasn't aborted properly. The job was cancelled by the
scheduler, but the cleanup playbook on the executor ran into a timeout.
This caused another abort via the WatchDog.

The problem is that the abort function doesn't do anything if the
cleanup playbook is running [1]. Most probably this covers the case
that we don't want to abort the cleanup playbook after a normal job
cancellation.

However, this doesn't differentiate whether the abort was caused by the run
of the cleanup playbook itself, resulting in a build that's hanging
indefinitely on the executor.

To fix this, we now differentiate whether the abort was caused by a stop()
call [2] or if it was caused by a timeout. In case of a timeout, we kill
the running process.

Add a test case to validate the changed behaviour. Without the fix, the
test case runs indefinitely because the cleanup playbook won't be
aborted even after the test times out (during the test cleanup).

[1]: 4d555ca675/zuul/executor/server.py (L2688)
[2]: 4d555ca675/zuul/executor/server.py (L1064)

Change-Id: I979f55b52da3b7a237ac826dfa8f3007e8679932
2022-12-12 16:49:39 +01:00