Commit Graph

775 Commits

Author SHA1 Message Date
Zuul 3b19ca9cb3 Merge "Add zuul_unreachable ansible host group" 2024-03-25 18:26:14 +00:00
James E. Blair 632839804c Add a zuul.buildset_refs variable
This adds information about the changes associated with a
circular dependency queue item.  Currently the bundle_id can be
used to identify which of the items in zuul.items is related to
the current dependency cycle.  That variable is deprecated, so
zuul.buildset_refs can be used to replace that functionality.

Since it repeats some of the information at the top level (eg
zuul.change, zuul.project, etc), the code is refactored so they
can share the dictionary construction.  That is also used by
zuul.items.  This results in a few extra fields in zuul.items
now, such as the change message, but that is relatively
inconsequential, so is not called out in the release notes.

The src_dir is similarly included in all of these places.  In
writing this change it was discovered that
zuul.items.project.src_dir always used the golang scheme, but
zuul.project.src_dir used the correct per-job workspace scheme.
This has been corrected so that they both use the per-job scheme
now.

A significant reorganization of the job variable documentation is
included.  Previously we had a section with additional variables
for each item type, but since many of these are duplicated at the
top level, in the item list, and now in the refs list, that
structure became difficult to work with.  Instead, the
documentation for each of these sections now exhaustively lists
all of the possible variables.  This makes for some repetition,
but it also means that a user can see at a glance what variables
are available, and we can deep-link to each one.  To address the
variation between different item types, the variables that mutate
based on item type now contain a definition list indicating what
types they are valid for and their respective meanings.

Change-Id: Iab8f99d4c4f40c44d630120c458539060cc725b5
2024-03-22 06:41:36 -07:00
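A minimal sketch of consuming the new variable, assuming the job variables have been dumped to a JSON file named zuul_vars.json and that each entry in zuul.buildset_refs carries change/project data as described above (the key layout is an assumption, not taken from Zuul's documentation):

    import json

    # Hypothetical dump of the job variables.
    with open("zuul_vars.json") as f:
        zuul = json.load(f)["zuul"]

    for ref in zuul.get("buildset_refs", []):
        project = ref.get("project", {})
        print(project.get("canonical_name"), ref.get("change"),
              project.get("src_dir"))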
Joshua Watt d5dcb7eb35 Report topic to jobs as zuul.topic
Reports the change topic to jobs as an ansible variable. This can be
useful for jobs that either want to name artifact output based on a
topic, or enforce that a topic is set using a zuul job.

Change-Id: I678404523d228947541160554623bf4066a729c4
2024-03-08 11:30:45 -07:00
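For example, a job script might derive an artifact name from the topic, or fail when no topic is set; this is only a sketch, and the use of the zuul.build UUID in the name is an illustrative choice:

    # Sketch: zuul_vars is the dict of Zuul job variables available to
    # the job (e.g. parsed from a dumped copy of the job variables).
    def artifact_name(zuul_vars: dict) -> str:
        topic = zuul_vars.get("topic")
        if not topic:
            raise RuntimeError("this job requires the change to have a topic")
        return f"{topic}-{zuul_vars['build']}.tar.gz"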
Simon Westphahl ae258ac2e0
Always delay response to first build request
In Ic52f3d82ed12fd45922d9ec9d61e8ded43507837 we changed the build
worker to delay answering requests only after we've successfully locked
and started a job.

That change also implicitly removed the initial delay before the first
build request. In practice this turned out to be a mistake, as
low-latency executors were able to respond to requests much more quickly
than executors with a higher ZK latency. The problem is reinforced by
the fact that there are usually only a few build requests in each
iteration.

The effect is that builds are no longer evenly distributed among
executors. To fix this problem we'll add back the initial delay before
responding to the first build request.

Change-Id: I91f2ce87ccc2c5b6a854cf7937ebe65521e72d66
2024-03-04 12:07:12 +01:00
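A simplified sketch of the accept loop described above; the helper, the delay formula, and the two-second cap are illustrative rather than Zuul's actual implementation:

    import random
    import time

    def try_lock_and_start(request, running_builds):
        # Hypothetical stand-in for locking a ZK build request and
        # starting the corresponding job.
        running_builds.append(request)
        return True

    def accept_loop(requests, running_builds, max_delay=2.0):
        # Initial delay so low-latency executors do not win every request
        # in a small batch.
        time.sleep(random.uniform(0, max_delay))
        for request in requests:
            if try_lock_and_start(request, running_builds):
                # Back off in proportion to current load, but only after
                # a successful start (see the related change below).
                time.sleep(min(0.1 * len(running_builds), max_delay))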
James E. Blair 4421a87806 Add zuul_unreachable ansible host group
This will allow users to write post-run playbooks that skip
running certain tasks on unreachable hosts.

Change-Id: I04106ad0222bcd8073ed6655a8e4ed77f43881f8
2024-02-27 13:57:07 -08:00
James E. Blair 8dd4011aa0 Monitor and report executor inode usage
This adds inodes to the hdd executor sensor and reports usage
to statsd as well.

Change-Id: Ifd9a63cfc7682f6679322e39809be69abca6827e
2024-02-19 11:20:57 -08:00
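A sketch of the underlying measurement using os.statvfs; the path and any statsd metric name are assumptions:

    import os

    def inode_usage(path):
        # Used/total inodes for the filesystem backing the given path.
        st = os.statvfs(path)
        used = st.f_files - st.f_ffree
        return used, st.f_files

    used, total = inode_usage("/tmp")
    pct = 100.0 * used / total if total else 0.0
    print(f"inodes: {used}/{total} ({pct:.1f}% used)")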
James E. Blair 922a6b53ed Make executor sensors slightly more efficient
Rather than checking all of the sensors to see if they are okay,
then collecting all the data again for stats purposes, do both
at the same time.

Change-Id: Ia974a7d013057880171fd1695a1d17169d093410
2024-02-19 09:04:41 -08:00
Zuul dc7c896af2 Merge "Replace Ansible 6 with Ansible 9" 2024-02-16 18:02:54 +00:00
Zuul 8eabd3c465 Merge "Add backoff delay only after successful job start" 2024-02-16 13:20:36 +00:00
James E. Blair 5a8e373c3b Replace Ansible 6 with Ansible 9
Ansible 6 is EOL and Ansible 9 is available.  Remove 6 and add 9.

This is usually done in two changes, but this time it's in one
since we can just rotate the 6 around to make it a 9.

command.py has been updated for ansible 9.

Change-Id: I537667f66ba321d057b6637aa4885e48c8b96f04
2024-02-15 16:20:45 -08:00
Simon Westphahl 8a7607acd5
Add backoff delay only after successful job start
When there is a burst of new jobs, executors could be slowed down by the
backoff delay in accepting those jobs. The backoff delay is calculated
based on the number of currently running jobs and is supposed to
distribute jobs more evenly among executors.

Previously this delay was added for every processed build request,
regardless of whether the job could be started. With this change we
only add the backoff delay when the previous request was handled
successfully.

By moving the sleep from before to after handling a request, we also
reduce the time between getting the request from ZK and trying to lock
it. This can increase the chance of successfully locking a request.

This change also adds a missing return when a build request could be
locked, but the request was already processed by another executor.

Change-Id: Ic52f3d82ed12fd45922d9ec9d61e8ded43507837
2024-02-15 17:46:21 +01:00
James E. Blair fa274fcf25 Remove most model_api backwards-compat handling
Since we require deleting the state (except config cache) for the
circular dependency upgrade, we can now remove any zookeeper
backwards compatibility handling not in the config cache.  This
change does so.  It leaves upgrade handling for two attributes
in the branch cache (even though they are old) because it is
plausible that even systems that have been upgrading regularly
may have non-upgraded data in the branch cache for infrequently
used branches, and the cost for retaining the upgrade handling
is small.

Change-Id: I881578159b06c8c7b38a9fa0980aa0626dddad31
2024-02-09 07:39:55 -08:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions which were added to prepare for this
upgrade have been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid.  This has been corrected, and
doing so necessitated some changes around quiet dequeuing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
Simon Westphahl 904143e003
Stop and join the build worker before ZK client
As the `BaseMergeServer` also disconnects the ZK client, we need to
first stop and join the build worker in the executor before calling the
parent class's join.

This should prevent ZK connection lost messages from showing up during
executor shutdown.

Change-Id: I6866cfa39ee533ed7d06ab39ebfb9ad46d6e5377
2024-01-25 16:35:34 +01:00
James E. Blair 663931cba3 Include job_uuid in BuildRequests/Events
This is part of the circular dependency refactor.

It updates BuildRequests to include the job uuid (similar to the
previous change with NodeRequests).  We also use the job uuid when
sending BuildResultEvents.

Change-Id: If7a55138d0cb5a358c62e8cd97ee322087b09a6b
2024-01-08 08:54:56 -08:00
James E. Blair 9201f9ee28 Store builds on buildset by uuid
This is part of the circular dependency refactor.

This updates the buildset object in memory (and zk) to store builds
indexed by frozen job uuid rather than job name.  This also updates
several related fields, as well as temporary dictionaries, to do the same.

This will allow us, in the future, to have more than one job/build
in a buildset with the same name (for different changes/refs).

Change-Id: I70865ec8d70fb9105633f0d03ba7c7e3e6cd147d
2023-12-12 11:58:21 -08:00
James E. Blair 071c48c5ae Freeze job dependencies with job graph
This is part of the circular dependency refactor.

Update the job graph to record job dependencies when it is frozen,
and store these dependencies by uuid.  This means our dependency
graph points to actual frozen jobs rather than mere job names.

This is a prerequisite to being able to disambiguate dependencies
later when a queue item supports multiple jobs with the same name.

The behavior where we would try to unwind an addition to the job
graph if it failed is removed.  This was originally written with the
idea that we would try to run as many jobs as possible if there was
a config error.  That was pre-zuul-v3 behavior.  For a long time now,
in all cases where we actually encounter an error adding to the job
graph, we bail and report that to the user.  No longer handling that
case simplifies the code somewhat and makes it more future-proof
(while complicating one of the tests that relied on that behavior
as a shortcut).

This attempts to handle upgrades by emulating the old behavior
if a job graph was created on an older model version.  Since it
relies on frozen job uuids, it also attempts to handle the case
where a frozenjob does not have a uuid (which is a very recent
model change and likely to end up in the same upgrade for some
users) by emulating the old behavior.

Change-Id: I0070a07fcb5af950651404fa8ae66ea18c6ca006
2023-12-06 16:41:18 -08:00
James E. Blair 033470e8b3 Fix repo state restore for zuul role tag override
When a repo that is being used for a zuul role has override-checkout
set to a tag, the job would fail because we did not reconstruct the
tag in our zuul-role checkout; we only did that for branches.

This fixes the repo state restore for any type of ref.

There is an untested code path where a zuul role repo is checked
out to a tag using override-checkout.  Add a test for that (and
also the same for a branch, for good measure).

Change-Id: I36f142cd3c4e7d0b930318dddd7276f3635cc3a2
2023-11-30 10:06:03 -08:00
Zuul 90746cb286 Merge "Retry lingering deduplicated builds" 2023-11-20 17:06:05 +00:00
James E. Blair dd60903a95 Retry lingering deduplicated builds
We intend to handle the case where two queue items in check share
a deduplicated build and one of them completes and is dequeued while
the other continues.  To handle this, we avoid removing any queue
items (and therefore their associated builds) from ZK if their builds
show up in any currently-enqueued queue item.  However, we don't
actually have a mechanism to load a build from ZK in that situation
if it isn't in a queue item that is currently enqueued.

Adding such a mechanism is complex and risky, whereas the circular
dependency refactoring effort currently in progress will address this
issue in a comprehensive manner.

This change addresses the issue by detecting the situation at the
point where we would try to launch the build (since we failed to
restore it from ZK) and instead of raising an exception as we currently
do, we tell the scheduler to retry the build.  This results in the
buildset not actually taking advantage of the potential deduplication,
but it does at least provide a working build result to the user in
the form of a brand new build.

Change-Id: I86c159c82b858e67433bdaa1e479471b21ea8b86
2023-11-16 14:59:31 -08:00
Zuul a509e25f87 Merge "Send job parent data + artifacts via build request" 2023-11-16 07:15:20 +00:00
Simon Westphahl 68d7a99cee
Send job parent data + artifacts via build request
With job parents that supply data, we might end up updating the
(secret) parent data and artifacts of a job multiple times, in addition
to storing duplicate data, as most of this information is part of the
parent's build result.

Instead we will collect the parent data and artifacts before scheduling
a build request and send them as part of the request parameters.

If those parameters are part of the build request the executor will use
them, otherwise it falls back on using the data from the job for
backward compatibility.

This change affects the behavior of job deduplication in that input data
from parent jobs is no longer considered when deciding if a job can be
deduplicated or not.

Change-Id: Ic4a85a57983d38f033cf63947a3b276c1ecc70dc
2023-11-15 07:24:52 +01:00
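A sketch of the fallback described above; the parameter key and the frozen-job attribute names are assumptions:

    def resolve_parent_data(build_params: dict, frozen_job):
        # Prefer the data shipped with the build request; fall back to
        # the job-side data for requests from older schedulers.
        if "parent_data" in build_params:
            return (build_params["parent_data"],
                    build_params.get("artifacts", []))
        # Hypothetical attributes standing in for the job-side data.
        return frozen_job.parent_data, frozen_job.artifacts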
James E. Blair 518194af1d Thin the workspace repo clone
When the executor clones a repo from the cache to the workspace,
it performs a lot of unnecessary work:

* It checks out HEAD and prepares a workspace which we will
  immediately change.
* It copies all of the branch refs, half of which we will immediately
  delete, and in some configurations (exclude_unprotected_branches)
  we will immediately delete most of the rest.  Deleting refs with
  gitpython is much more expensive than creating them.

This change updates the initial clone to do none of those, instead
relying on the repo state restoration to take care of that for us.

Change-Id: Ie8846c48ccd6255953f46640f5559bb41491d425
2023-11-10 06:19:47 -08:00
James E. Blair 07e28cca57 Add tracing spans for build repo setup
To aid analysis of the build startup phase, add opentelemetry spans
to the start of builds around repo operations (update, checkout,
merge, etc).

Change-Id: I61df8e63f6401deb98b9842299a23d0c2c2f142a
2023-11-08 14:05:47 -08:00
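A sketch of the kind of instrumentation added, using the standard opentelemetry API; the span and attribute names are illustrative:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def setup_build_repos(repos):
        # Wrap the overall repo setup and each per-repo operation in a
        # span so slow clones, checkouts, or merges stand out in traces.
        with tracer.start_as_current_span("BuildRepoSetup"):
            for repo in repos:
                with tracer.start_as_current_span(
                        "RepoCheckout", attributes={"repo": repo}):
                    pass  # update/checkout/merge would happen here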
James E. Blair e55748ba69 Make Ansible variable freezing more efficient
We currently iterate over every job/host/etc variable in the freeze
playbook.  The reason is that if any value in any variable is
Undefined according to jinja, the Ansible combine filter throws
an error.  What we want to do in Zuul is merge any variable we can,
but if any is undefined, we skip it.  Thus, the process of combining
the variables one at a time in a task and ignoring errors.

This process can be slow, especially if we start with a large
amount of data in one of the early variables.  The combine filter
needs to reprocess the large data repeatedly for each additional
variable.

To improve the process, we create a new action plugin, "zuul_freeze"
which takes a list of variables we want to freeze, then templates
them one at a time and stores the result in a cacheable fact.  This
is the essence of what we were trying to accomplish with the combine
filter.

Change-Id: Ie41f404762daa1b1a5ae47f6ec1aa1954ad36a39
2023-09-14 14:00:45 -07:00
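The essence of the approach, sketched with plain jinja2 rather than the real action plugin or Ansible's templar:

    import jinja2

    def freeze_vars(variables: dict) -> dict:
        # Render each variable on its own; anything that references an
        # undefined name is skipped instead of failing the whole merge.
        env = jinja2.Environment(undefined=jinja2.StrictUndefined)
        frozen = {}
        for name, value in variables.items():
            try:
                frozen[name] = env.from_string(str(value)).render(**variables)
            except jinja2.UndefinedError:
                continue
        return frozen

    print(freeze_vars({"a": "{{ b }}", "b": "hello", "c": "{{ missing }}"}))
    # -> {'a': 'hello', 'b': 'hello'}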
James E. Blair 1b042ba4ab Add job failure output detection regexes
This allows users to trigger the new early failure detection by
matching regexes in the streaming job output.

For example, if a unit test job outputs something sufficiently
unique on failure, one could write a regex that matches that and
triggers the early failure detection before the playbook completes.

For hour-long unit test jobs, this could save a considerable amount
of time.

Note that this adds the google-re2 library to the Ansible venvs.  It
has manylinux wheels available, so is easy to install with
zuul-manage-ansible.  In Zuul itself, we use the fb-re2 library which
requires compilation and is therefore more difficult to use with
zuul-manage-ansible.  Presumably using fb-re2 to validate the syntax
and then later actually using google-re2 to run the regexes is
sufficient.  We may want to switch Zuul to use google-re2 later for
consistency.

Change-Id: Ifc9454767385de4c96e6da6d6f41bcb936aa24cd
2023-08-21 16:41:21 -07:00
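An illustrative sketch of matching streamed output lines against user-supplied failure patterns; the patterns shown are examples, and the standard re module stands in here for the RE2 bindings mentioned above:

    import re

    # Example patterns a user might configure for a unit-test job.
    failure_regexes = [re.compile(p) for p in (
        r"^FAILED \(failures=\d+\)",
        r"RuntimeError: fatal test harness error",
    )]

    def line_indicates_failure(line: str) -> bool:
        return any(r.search(line) for r in failure_regexes)

    print(line_indicates_failure("FAILED (failures=3)"))  # True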
James E. Blair f04d80912e Fix issue with early failure and blocks
Ansible tasks that fail within block tasks call the failure
callback, which means that they triggered Zuul early failure
detection even if they were later rescued.  To avoid this,
ignore block tasks for purposes of early failure detection.

Also make early failure detection "sticky".  This helps
uncover errors like this (the "remote" tests uncover this
particular failure if the result is made sticky), and also
ensures consistent behavior in dependent pipelines.

Change-Id: I505667678e7384386819b5389036e4fb4f108afd
2023-08-21 16:41:08 -07:00
Simon Westphahl 3b011296e6 Keep task stdout/stderr separate in result object
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.

This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.

The new behavior is configurable and off by default to retain backward
compatibility.

Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
2023-08-17 16:22:41 -07:00
Simon Westphahl d1d886bc03
Report error details on Ansible failure
In case of a retry there might be no logs available to help the user
understand the reason for a failure. To improve this, we now include
the details of the failure as part of the build result.

Change-Id: Ib9fdbdec5d783a347d1b6e5ce6510d50acfe1286
2023-08-07 10:13:16 +02:00
Zuul 6c0ffe565f Merge "Report early failure from Ansible task failures" 2023-07-29 18:28:08 +00:00
Zuul 1500788011 Merge "Refactor will_retry" 2023-07-29 18:14:44 +00:00
Zuul c05ae202b9 Merge "Add Zuul job variable to indicate a job will retry" 2023-07-29 18:13:41 +00:00
Zuul af15ee197b Merge "Add commit_id to zuul vars" 2023-07-27 19:41:28 +00:00
James E. Blair 2ecf2d62cc Refactor will_retry
This moves some variables around to try to make the code have more
of a direct correspondence with the zuul_will_retry Ansible variable.

The logic is just a little too complex to have a 1:1 match between
a python variable and Ansible.  But this refactor adds a "should_retry"
which is pretty close to "zuul_will_retry" with the caveat that
will_retry is should_retry iff we have not hit the retry limit.

Change-Id: Ic869e3befc5763ffcd4a97c406224f7bb46acf13
2023-07-27 07:08:26 -07:00
Simon Westphahl 7bba28a32f
Add Zuul job variable to indicate a job will retry
This change adds a variable to post and cleanup playbooks in order to
determine if a job will be retried due to a failure in one of the
earlier playbooks.

This variable might be useful for only performing certain actions (e.g.
interacting with a remote system) when the job result is final and there
won't be any further attempts.

Change-Id: If7f4488d4a59b1544795401bdc243978fea9ca86
2023-07-27 13:13:55 +02:00
James E. Blair 60a8dfd451 Add Ansible 8
This is the currently supported version of Ansible.  Since 7 is out
of support, let's skip it.

Change-Id: I1d13c23189dce7fd9db291ee03a452089b92a421
2023-07-19 15:46:48 -07:00
James E. Blair 985aefc70c Add commit_id to zuul vars
We added the commit_id to the MQTT reporter a while ago so that
external systems which identify changes by commit id (the Gerrit
current patchset hexsha, or the Github PR HEAD, etc) can track
Zuul work.  To enable the same sort of interfacing with external
systems in Zuul jobs, the variable is added to the set of zuul
variables.

To make it easier for use in multiple pipelines, the variable is
set to oldrev or newrev in the non-change case.

To make it easier to confirm driver parity, a new Gerrit driver
test class is added with the intention that it should parallel
the other driver test classes over time.

Change-Id: I2826349ee9ce696c8f9de8c23276094123292196
2023-07-19 15:16:17 -07:00
James E. Blair 1170b91bd8 Report early failure from Ansible task failures
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail.  This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.

We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host).  If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unnecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result.  Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.

Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
2023-06-29 13:40:34 -07:00
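A sketch of the callback-plugin side of this; v2_runner_on_failed is the standard Ansible callback hook, while the notification itself is only a placeholder here:

    from ansible.plugins.callback import CallbackBase

    class EarlyFailureCallback(CallbackBase):
        # Sketch: remember that a task failed so the executor can tell
        # the scheduler the build will fail before the playbook ends.
        def __init__(self):
            super().__init__()
            self.early_failure = False

        def v2_runner_on_failed(self, result, ignore_errors=False):
            if not ignore_errors:
                self.early_failure = True
                # ...notify the executor here (placeholder)...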
James E. Blair 1c92165ab7 List process ids in bwrap namespace
If the kernel kills a process due to an out of memory error, it
can be difficult to track the process back to the build that triggered
it.  The kernel error just gives us a PID, but we don't know any of
the Ansible process ids.  Further, since they are in bwrap, Ansible
only knows its namespaced pid rather than the host pid, so we can't
simply output it in one of our callback plugins.

To aid in debugging, output all of the process ids within a namespace
right at the start of an ansible-playbook execution.  At this time,
it is certain that the Ansible process will have started, and it is
very likely that it is still running.  That should provide a way
to map from an OOM message back to an Ansible process id.

(Note that Ansible forks and this is unlikely to catch any forked
processes, so we will only see the main Ansible process id.  Typically
this is what the kernel should elect to kill, but if it does not,
we may need a further change to repeat this process each time Ansible
forks.  Since that is more costly, let's see if we can avoid it.)

Change-Id: I9f262c3a3c5410427b0fb301cb4f1697b033ba2f
2023-06-28 13:31:06 -07:00
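A sketch of the idea (not Zuul's exact implementation): read /proc inside the namespace and log each PID with its command line:

    import os

    def list_pids():
        # Log every PID visible in the current /proc together with its
        # command line, so an OOM-kill message can be mapped back to it.
        for entry in sorted(os.listdir("/proc")):
            if not entry.isdigit():
                continue
            try:
                with open(f"/proc/{entry}/cmdline", "rb") as f:
                    cmd = f.read().replace(b"\0", b" ").decode(errors="replace")
            except OSError:
                continue
            print(entry, cmd.strip())

    list_pids()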
Zuul 0bd76048d1 Merge "Correctly complete build on transient errors" 2023-05-22 11:10:13 +00:00
Simon Westphahl 515e6ab8f7
Correctly complete build on transient errors
Use the `completeBuild()` method for reporting the result on transient
errors during repo update instead of the old `sendWorkComplete()` API.

2023-04-28 12:29:49,173 ERROR zuul.AnsibleJob: [e: ...] [build: ...] Exception while executing job
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 1144, in do_execute
    self._execute()
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 1345, in _execute
    self.job.sendWorkComplete(
AttributeError: 'FrozenJob' object has no attribute 'sendWorkComplete'

Change-Id: I05be093f80f015463f727f55154e16202821e961
2023-05-02 10:52:42 +02:00
Simon Westphahl b8be20c5db
Expose max. attempts to the job as a Zuul variable
This allows a job to determine if the current execution is the last
attempt.

Change-Id: I989db9f924aab997496851ce99ad71bbacf2379e
2023-04-14 08:13:35 +02:00
Zuul 8c3c0c2251 Merge "Retry jobs on transient IO errors on repo update" 2023-03-22 21:52:16 +00:00
Zuul c5de8cb54b Merge "Expose nodepool slot attribute" 2023-03-15 15:56:42 +00:00
Simon Westphahl 408c066786
Retry jobs on transient IO errors on repo update
We are occasionally seeing different types of IO errors when updating
repos on an executor. Currently those exceptions will abort the build
and result in an error being reported.

Since those errors are usually transient and point to some
infrastructure problem we should retry those builds instead.

We'll catch all IOErrors which includes request related exceptions from
the "requests" Python package. See:
https://github.com/psf/requests/blob/main/requests/exceptions.py

Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/executor/server.py", line 3609, in _innerUpdateLoop
    self.merger.updateRepo(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 994, in updateRepo
    repo = self.getRepo(connection_name, project_name,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 966, in getRepo
    url = source.getGitUrl(project)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubsource.py", line 154, in getGitUrl
    return self.connection.getGitUrl(project)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubconnection.py", line 1744, in getGitUrl
    self._github_client_manager.get_installation_key(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/github/githubconnection.py", line 1126, in get_installation_key
    response = github.session.post(url, headers=headers, json=None)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 635, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/github3/session.py", line 171, in request
    response = super().request(*args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/cachecontrol/adapter.py", line 53, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
  File "/opt/zuul/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError:
HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded
with url: /api/v3/app/installations/123/access_tokens (Caused by
NewConnectionError('<urllib3.connection.HTTPSConnection object at
0x7f44f6136ef0>: Failed to establish a new connection: [Errno -3]
Temporary failure in name resolution'))

Change-Id: I4e07e945c88b9ba61f83131076fbf7b9768a61f9
2023-02-24 15:00:48 +01:00
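A sketch of the classification described above (the requests exceptions derive from IOError, so one except clause covers them); the names and return values are illustrative:

    def update_with_retry(merger, connection_name, project_name):
        # Transient infrastructure errors become a retried build rather
        # than an error report.
        try:
            merger.updateRepo(connection_name, project_name)
            return "updated"
        except IOError:
            return "retry"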
James E. Blair 99d39545a6 Add an !unsafe change_message variable
In I9628e2770dda120b269612e28bb6217036942b8e we switched zuul.change from
a plain string tagged with !unsafe to base64 encoded and no !unsafe tag.
The idea was to make the inventory file parseable by external tools while
avoiding accidental interpolation of the commit message by Ansible.

That doesn't work in all cases -- it's not hard to construct a scenario
where after base64 decoding the message any further processing by Ansible
causes it to undergo interpolation.  Moreover, since then we have made
many changes to how we deal with variables; notably, the inventory.yaml
is no longer actually used by Zuul's Ansible -- it is now there only
for human and downstream processing.  We call it the "debug inventory".
The actual inventory is much more complex and in some cases has lots of
!unsafe tags in it.

Given all that, it now seems like the most straightforward way to deal
with this is to tag the message variable as !unsafe when passing it to
Zuul's Ansible, but render it as plain text in the inventory.yaml.

To address backwards compatibility, this is done in a new variable called
zuul.change_message.  Since that's a more descriptive variable anyway,
we will just keep that one in the future and drop the current base64-
encoded zuul.message variable.

Change-Id: Iea86de15e722bc271c1bf0540db2c9efb032500c
2023-02-09 09:07:53 -08:00
Clark Boylan 045bb270c0 Fix ResourceWarnings in the executor server
We have two open() calls in the executor server without matching
close() calls. Fix this by wrapping both in a with context manager.

Change-Id: I0e51c2b22ea1540484851f98749e648728b26406
2023-02-07 16:17:14 -08:00
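The pattern used for the fix, for reference; a with block closes the file even on error paths, which silences the ResourceWarning:

    with open("/proc/self/status") as f:
        data = f.read()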
Clark Boylan 753bebbb2c Cleanup some Python ResourceWarnings in the test suite
First, we add support for PYTHONTRACEMALLOC being passed through
nox to easily enable tracebacks on emitted ResourceWarnings. We set this
value to 0 by default, as enabling it slows things down and requires
more memory, but it is very useful locally when debugging specific
ResourceWarnings to set `PYTHONTRACEMALLOC=5` in order to correct these
issues.

With that in place we identify and correct two classes of
ResourceWarnings.

First up is the executor server not closing its statsd socket when
stopping the executor server. Address that by closing the socket if
statsd is enabled and setting the statsd attribute to None to prevent
anything else from using it later.

Second is a test only issue where we don't close the fake Gerrit,
Gitlab, or Web Proxy Fixture server's HTTP socket; we only shut down
the server. Add a close call to the server after it is shut down to correct
this.

There are potentially other ResourceWarnings to be found and cleaned up,
but sifting through the noise will be easier as we eliminate these more
widespread warnings.

Change-Id: Iddabe79be1c8557e300dde21a6b34e57b04c48e0
2023-02-06 10:30:18 -08:00
Zuul 35c68169bd Merge "Fix deduplication exceptions in pipeline processing" 2022-12-20 01:48:36 +00:00
Felix Edel 137557c559 Abort job if cleanup playbook timed out
We've investigated an issue where a job was stuck on the executor
because it wasn't aborted properly. The job was cancelled by the
scheduler, but the cleanup playbook on the executor ran into a timeout.
This caused another abort via the WatchDog.

The problem is that the abort function doesn't do anything if the
cleanup playbook is running [1]. Most probably this covers the case
that we don't want to abort the cleanup playbook after a normal job
cancellation.

However, this doesn't differentiate whether the abort was caused by the run
of the cleanup playbook itself, resulting in a build that's hanging
indefinitely on the executor.

To fix this, we now differentiate whether the abort was caused by a stop()
call [2] or if it was caused by a timeout. In case of a timeout, we kill
the running process.

Add a test case to validate the changed behaviour. Without the fix, the
test case runs indefinitely because the cleanup playbook won't be
aborted even after the test times out (during the test cleanup).

[1]: 4d555ca675/zuul/executor/server.py (L2688)
[2]: 4d555ca675/zuul/executor/server.py (L1064)

Change-Id: I979f55b52da3b7a237ac826dfa8f3007e8679932
2022-12-12 16:49:39 +01:00