To avoid confusion with nodepool-launcher, we've decided to rename
zuul-launcher to zuul-executor.
Change-Id: I7d03cf0f0093400f4ba2e4beb1c92694224a3e8c
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Currently our post playbook timeout value is hardcoded to 10 minutes; for
the majority of our jobs this is okay. However, when projects need to
transfer a lot of data (e.g. kolla's 2.6GB tarballs) zuul will abort the
post playbook.
For zuulv3, we should properly expose this value to be configured per
job, but for today just bump our timeout to 30 minutes.
Change-Id: I12dcbfe60bb1d59c3af8a13f49f04e3b68ff7197
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
In I6cae11c1e89f6ccc78cb5bfaf61ef78e846e87be, we attempted to fix
an error where long-running workers never reset their watchdog
timeout flag, meaning that once a job timed out, all further jobs
on that worker timed out. That change cleared the flag each time
ansible ran. However, that flag is also used in conjunction with
the abort flag to determine whether a failed or null result should
be sent back to Zuul (a null result will cause a job to be
rescheduled). By clearing the flag before, say, a post playbook
we would lose the information that the abort was due to a timeout
rather than a direct abort request, and return the null result to
Zuul. This means all jobs that timeout would be relaunched.
Instead of clearing the flag before each ansible run, clear it once
at the start of the job launch. This means it will be set for any
ansible timeout. That should be fine for both the aborted job check
as well as the new "timed out" log message.
The typo this change corrects indicates this was the intended logic.
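A minimal sketch (with illustrative names, not the actual launcher code) of the flag lifecycle described above: the timeout flag is cleared once per job launch rather than before each ansible run, so an abort caused by a timeout remains distinguishable from a direct abort request when the final result is computed.

```python
class JobWorker:
    """Illustrative model of the timeout/abort flag interaction."""

    def __init__(self):
        self._watchdog_timed_out = False
        self._aborted = False

    def launch(self, playbooks):
        # Reset once at the start of the job launch, so that any
        # ansible run that times out leaves the flag set for the
        # final result check.
        self._watchdog_timed_out = False
        self._aborted = False
        for playbook in playbooks:
            self.run_ansible(playbook)

    def run_ansible(self, playbook):
        # The watchdog would set _watchdog_timed_out and _aborted on
        # timeout; elided in this sketch.
        pass

    def result(self):
        # Only a direct abort request returns a null result (which
        # causes Zuul to reschedule); a timeout is a plain failure.
        if self._aborted and not self._watchdog_timed_out:
            return None
        if self._watchdog_timed_out:
            return 'FAILURE'
        return 'SUCCESS'
```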
Change-Id: Ie31409a7706b6cf4d7ce858b4d5f0c00e4ee31da
The watchdog timeout emits an operator log, but no end-user visible
message. Add some text to the error message if we do time out.
Change-Id: I38fed8e020a966362ee708025ab5bc9aa5995c68
For the long-lived worker, the flag never gets reset, which means that
every job that runs after a job that times out will show as failed for
no good reason.
Change-Id: I6cae11c1e89f6ccc78cb5bfaf61ef78e846e87be
There is a bug (https://github.com/ansible/ansible/issues/18281) in the
ansible synchronize module that causes any retry attempt at
synchronizing to fail because the paths get munged resulting in invalid
paths. Unfortunately this also means that the error message we get is
not for the first failed sync attempt but for the last, making it hard
to debug why things failed in the first place.
Address this by not attempting to retry until ansible is fixed. This way
we get accurate error messages more quickly (as we don't retry over and
over and generate a bad error message at the end).
Change-Id: I545c44b11f37576edc8768a3ed78962ff870995f
The logic to rsync files into AFS is very complex, requiring
an rsync command for each of the pseudo-build-roots that are
produced by our docs jobs. Rather than try to do this in ansible
YAML, move it into an ansible module where it is much simpler.
Change-Id: I4cab8003442734ed48c67e09ea8407ec69303d87
The custom command module used in order to collect job output was
also being used by the pre and post playbooks. This meant that
instead of going to the ansible log file, the rsync output would
end up in /tmp/console.html on the zuul launcher.
To correct this, create separate library directories for use by
the pre and post playbooks which will contain all of the modules
except the custom command module. Write separate ansible.cfg files
for them, and instruct ansible-playbook to use those config files.
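A hypothetical sketch of generating per-phase ansible.cfg files, each pointing at a different library directory; the function and directory names here are illustrative, not the actual launcher code.

```python
import configparser
import os


def write_ansible_cfg(jobdir, phase, include_custom_command):
    """Write an ansible.cfg whose library path includes the custom
    command module only when requested (i.e. for the main playbook,
    not the pre/post playbooks)."""
    library = os.path.join(
        jobdir,
        'library' if include_custom_command else 'library-no-command')
    config = configparser.ConfigParser()
    config['defaults'] = {
        'library': library,
        'log_path': os.path.join(jobdir, '%s-ansible.log' % phase),
    }
    path = os.path.join(jobdir, 'ansible-%s.cfg' % phase)
    with open(path, 'w') as f:
        config.write(f)
    return path
```

Pointing ansible-playbook at the right file is then a matter of setting ANSIBLE_CONFIG (or -i equivalents) per invocation.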
Change-Id: I5eb6bcc48bcaa6b056af1af7da93f29408f9db41
Add the Ansible-standard rsync output format option to rsync, and
also output the filter file to the logs to aid in debugging.
Change-Id: I68daf93ee7f5d501e51ec90d201830a18c6e5a47
While trying to follow a failed post-playbook in the gate, it was
harder than desirable to determine which task was failing. Add names to
the tasks so that we can track what is going on.
Change-Id: I35fd7ad75c82f6a82fc8d12b7fd48860c1ab10f1
We still need to set up our timeout-var environment variable,
otherwise devstack-gate will fail to read BUILD_TIMEOUT and default
jobs to 120 minute timeouts.
Change-Id: Ieccba55eaab83074a409efdbb928b4a4fdfdecf7
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
For the generated shell scripts, which are named using UUID4, prepend a
sequence count to them so that the ordering of the scripts is easy to
tell when looking in '_zuul_ansible/scripts/'. Keep the uuid to
avoid potential collisions in /tmp.
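A minimal sketch of the naming scheme: keep the uuid for collision-safety in /tmp, but prefix a zero-padded sequence counter so a directory listing sorts the scripts in execution order (the helper name is illustrative).

```python
import uuid


def make_script_name(sequence):
    # Zero-pad the counter so lexical sort order matches execution
    # order; the uuid suffix keeps names unique across jobs.
    return '%02d-%s.sh' % (sequence, uuid.uuid4().hex)
```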
Change-Id: Id80bf5139ba1ce12c62945421d49c5e3cd8e2f48
For the generated shell scripts in ansiblelaunchserver.py, have them
be generated in numerical order. For example 01.sh, 02.sh, etc. This
will allow us to tell the ordering of the scripts when looking in
'_zuul_ansible/scripts/'.
Change-Id: Iba6231242a58a23549c92aa32620d498e05886f8
The find command that collected the marker files is expected
to print paths with a leading '/' (see later commands which
grep for '^/') but this was omitted. This would cause all jobs
which published to the root (whether they had any content in
the root directory or were simply only intended to publish to a
subdir of the root) to conflict with each other.
Also, correct a missing fully-qualified path.
Change-Id: I6030c2b101026ff8e72cf4043e1d1b4fbffc5dcb
It seems that Jenkins does this. At least with FTP. We don't have
any leading / on AFS targets, but do the same there for symmetry.
Change-Id: Icb7451c0f3f5fa62c8a15fc621fd30f2df166c96
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
From the manual:
Enabling pipelining reduces the number of SSH operations required to
execute a module on the remote server, by executing many ansible
modules without actual file transfer. This can result in a very
significant performance improvement when enabled, however when using
“sudo:” operations you must first disable ‘requiretty’ in
/etc/sudoers on all managed hosts.
Basically, in local testing there is a speed improvement. However, I
believe the better reason to enable this is to reduce the number of
SSH transactions we perform on our workers. In doing this we reduce
our potential for SSH connection issues.
However, it also appears that async operations do not use this setting,
simply because of how async works.
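A small sketch of what enabling this looks like when the launcher generates its ansible.cfg; the section and option names follow Ansible's documented ssh_connection settings, while the helper itself is illustrative.

```python
import configparser


def enable_pipelining(config):
    # Equivalent to writing:
    #   [ssh_connection]
    #   pipelining = True
    # in the generated ansible.cfg.
    if not config.has_section('ssh_connection'):
        config.add_section('ssh_connection')
    config.set('ssh_connection', 'pipelining', 'True')
    return config
```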
Change-Id: Ib224fbf1fed19be3ce7db4da0c466e3d11acc365
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
In order to get to the point where playbooks that people write for tests
are playbooks that they could conceivably also use outside of the zuul
context, we need to remove the need for zuul-specific things in the main
playbook.
Add a pre-playbook that runs before the playbook and runs the things
that are not tied to current JJB content - namely setting up the logger
and prepping directories.
Move the SUCCESS/FAILURE message to the post-playbook.
Extract the injected variables into a variables file and add a
-e@vars.yaml option to the playbook invocation. This provides variables
in a known namespace. Obviously there is still an exercise in how a user
might write a playbook that wants to consume those variables in some
way.
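A hedged sketch of building the ansible-playbook invocation with the extracted variables file; only the -e@ form comes from the change description, the function and paths are illustrative.

```python
def build_playbook_cmd(playbook, vars_path):
    # Pass the extracted variables file to the playbook run via
    # ansible-playbook's -e@<file> syntax.
    return ['ansible-playbook', playbook, '-e@%s' % vars_path]
```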
Change-Id: Ie5ec6ec65a03ceea9afc3ac59df73cb28f5ca4dd
The async module is complex, and we're only using it to handle the
running cumulative timeout. However, we still fall back on the watchdog
timeout from time to time. Make things simpler by just having that be
how we time things out.
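A minimal sketch of a watchdog of the kind described: a timer that fires an abort callback when the cumulative job timeout expires, replacing the per-task async timeout (names are illustrative, not the launcher's actual class).

```python
import threading


class Watchdog:
    """Fire a callback if stop() is not called within the timeout."""

    def __init__(self, timeout, on_timeout):
        self.timed_out = False
        self._on_timeout = on_timeout
        self._timer = threading.Timer(timeout, self._expire)

    def _expire(self):
        # Record that the abort was caused by a timeout before
        # invoking the abort callback.
        self.timed_out = True
        self._on_timeout()

    def start(self):
        self._timer.start()

    def stop(self):
        self._timer.cancel()
```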
Change-Id: Ie51de4a135d953c4ad9dcb773d27b3c54ca8829b
Now that we're using the command module, just do inline script content
to make debugging/reading easier.
Change-Id: Ia63f77fd41a03b4662c26f9d0f3b70d1e6a8b5d3
Having a modified command module with the zuul_runner logic allows us to
use normal command and shell entries in the playbooks. (shell is just a
wrapper around command)
At this moment in time it's an invasive fork of the run_command method
on AnsibleModule. That's not optimal for long term, but should get us
closer to being able to discuss appropriate hook points with upstream
ansible.
Use environment task parameter instead of parameters
ansible has a structure for passing in environment variables which we
can use. We did not use it before due to a behavior in ansible from
pre-2.2 that set LANG settings in the environment in a way that caused
us to need to clean things in zuul_runner. The module_set_locale
variable defaults to False in 2.2, but to True in 2.1 (which was the
regression). Set the config value explicitly just to be sure.
Change-Id: Iae4769f923ecf74462e1fe43168ea93ff1c61d6e
In the next patch, we're going to change the body of zuul_runner. But,
in order to render that diff well, do the rename in this patch.
Change-Id: I3727f506cae5da561948869bd8f8daaf42e4dc0d
This contains several fixes:
* Support remove-prefix. This is used by the FTP publisher we are
replacing.
* Fix sed expressions. They were missing a '/'.
* Make the target directory before rsync. Rsync requires the target
root directory exist before running. Elsewhere we solved that by
encoding the mkdir into the remote rsync command. Since we are
running locally here, just run 'mkdir -p' before running rsync.
However, it must be done with the keytab, so include it in the
k5start command (so that we do not need to run k5start twice).
* Include the 'user' in the site definition as the principal for
k5start.
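A hedged sketch of composing the single k5start invocation described above, with the mkdir and rsync chained inside one shell command so the keytab is used only once. The exact k5start flags, principal, and paths are illustrative assumptions, not taken from the change itself.

```python
def build_k5start_cmd(keytab, principal, source, target):
    # Chain 'mkdir -p' and rsync inside one shell so both run under
    # the credentials obtained by a single k5start invocation.
    # NOTE: the k5start flag usage here is an assumption for
    # illustration; consult the k5start man page for real usage.
    inner = 'mkdir -p %s && rsync -a %s/ %s/' % (target, source, target)
    return ['k5start', '-t', '-f', keytab, principal, '--',
            '/bin/sh', '-c', inner]
```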
Change-Id: I69c263a35e732b9a21d411bd30215945783d1023
Rather than requiring the launcher to be run with k5start,
run k5start only during the specific rsync command where it is
required.
Change-Id: I1d8258c4b13d21c96072d1a03c3a3472b0d878d5
This is an extension to JJB that works only in zuul-launcher, not
Jenkins. It allows copying the results of a build into AFS.
It actually isn't really AFS specific at all, other than it
checks that the destination path is under /afs. Otherwise, it
behaves as a local copy on the launcher itself.
It also contains the logic needed to publish OpenStack's
documentation builds, which can appear as subdirectories of other
builds.
Change-Id: Icda75266219d2d7167e80aaad8e290443cfdbadc
We are seeing intermittent failures in zuul trying to talk to the node
which look like the 10s ssh negotiation timing out. Extremely
busy test nodes that are using their entire network bandwidth to pull
packages may take longer than this.
Try to reduce this by bumping the timeout.
Change-Id: Ic4ec2ea3c8b77cb308fb1a85514d831acf6c4b67
Jobs no longer launch using this code. Revert so we can debug the
issue.
This reverts commit b6341fbe63.
Change-Id: Ie8076e3e162e3f223367321d8f57ccb48a0f57f6
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Run ssh-keygen on the known_hosts file to extract the ssh_host_key. We
do this to help debug the scenario when the remote node's
identification has changed:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle
attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
51:82:00:1c:7e:6f:ac:ac:de:f1:53:08:1c:7d:55:68.
Please contact your system administrator.
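The lookup itself can be sketched as building an ssh-keygen -F command, which prints the recorded key for a host from a given known_hosts file; the helper name and paths are illustrative.

```python
def build_hostkey_cmd(known_hosts, host):
    # 'ssh-keygen -F <host> -f <file>' looks up and prints the key
    # recorded for that host in the given known_hosts file, which is
    # what we log for debugging identification changes.
    return ['ssh-keygen', '-F', host, '-f', known_hosts]
```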
Change-Id: Ica41c80db91e7b08dbc34516b3812da4148c36e3
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
We sometimes see errors rsyncing data from the node or to the
log server. Since these are all rsync commands, they are safe
to retry. Attempt all post playbook rsyncs up to 3 times with
a 30 second delay between each attempt.
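The retry policy reads, in generic form, like the sketch below; the actual change expresses this as ansible retries/delay task parameters, so this function is purely illustrative.

```python
import time


def run_with_retries(func, attempts=3, delay=30, sleep=time.sleep):
    """Call func, retrying up to `attempts` times with a fixed
    delay between attempts; re-raise the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```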
Change-Id: I329e1f1f31d53d82799e3485a912b76e2249d03f
This is a noop change, which removes the hardcoded node IP address
from our playbook. This is a step forward to allow users to re-run our
playbooks in an effort to reproduce problems locally.
Change-Id: I3d3b979fb9bfffce1ea1466403a277e6f6e146cc
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Because we are using the private MASS_DO gearman operation to
register functions, the gear.Worker does not know what functions
are registered and therefore the routine which automatically
re-registers functions after a gear server disconnect was not
effective. Correct this by also storing the function list when
sending MASS_DO. This will result in the worker actually sending
CAN_DO packets rather than MASS_DO in the case of a reconnect,
but at least it will be correct, if not efficient.
This error would cause existing nodes attached to zuul launchers
to be unable to run jobs after a zuul (geard) restart.
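A hypothetical sketch of the fix: remember every function name sent via the bulk MASS_DO registration so the normal reconnect path, which replays CAN_DO per function, still works. The class and method names are illustrative and not the gear library's API.

```python
class Worker:
    """Illustrative model of function registration bookkeeping."""

    def __init__(self):
        self.functions = set()

    def send_can_do(self, name):
        # Record and register a single function (CAN_DO packet
        # sending elided in this sketch).
        self.functions.add(name)

    def send_mass_do(self, names):
        # Also record the functions registered in bulk, so a
        # reconnect can re-register them.
        self.functions.update(names)

    def reconnect(self):
        # After a geard restart, replay registration function by
        # function: correct, if less efficient than MASS_DO.
        for name in sorted(self.functions):
            self.send_can_do(name)
```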
Change-Id: I60804355a8b3a3cfb79a12dd6e6f0e219fe50c31
When we use 'delegate_to' to run commands locally, the 'remote'
side of the Ansible connection is the local host. When running
these tasks it will write to the 'remote_tmp' directory, which
is actually the local ~/.ansible/tmp directory. We also set
'keep_remote_files' to true in order to avoid a race condition
with 'async' on the actual remote hosts, but in this case, these
two options in combination end up meaning 'keep some files in
the local ~/.ansible/tmp directory indefinitely' which is not
good for our long-running launchers.
Instead, set 'remote_tmp' to a subdirectory of the jobdir so that
when used in the local context, it will be cleaned up at the end
of the run. In the remote context, it will end up in a similarly
randomly named directory under /tmp on the worker. Ansible will
create that directory. This has the side benefit of removing the
Ansible running the job further from potential uses of Ansible
within the job (which may continue to use ~/.ansible by default).
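A sketch of pointing Ansible's remote_tmp at a subdirectory of the jobdir so that local (delegate_to) runs leave their files somewhere removed with the job; the helper and paths are illustrative, while the [defaults] remote_tmp option follows Ansible's documented configuration.

```python
import configparser
import os


def set_remote_tmp(config, jobdir):
    # Equivalent to writing:
    #   [defaults]
    #   remote_tmp = <jobdir>/.ansible/tmp
    # in the generated ansible.cfg.
    if not config.has_section('defaults'):
        config.add_section('defaults')
    config.set('defaults', 'remote_tmp',
               os.path.join(jobdir, '.ansible', 'tmp'))
    return config
```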
Change-Id: I70475d5844cbd66bf670566f992fdec263d271a5