Commit Graph

183 Commits

Author SHA1 Message Date
Paul Belanger 174a8274d0 Rename zuul-launcher to zuul-executor
To avoid confusion with nodepool-launcher, we've decided to rename
zuul-launcher to zuul-executor.

Change-Id: I7d03cf0f0093400f4ba2e4beb1c92694224a3e8c
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-03-15 12:21:24 -04:00
Joshua Hesketh 25695cbb51 Merge branch 'master' into feature/zuulv3
Change-Id: I37a3c5d4f12917b111b7eb624f8b68689687ebc4
2017-03-06 09:40:04 -08:00
Paul Belanger 08de693416 Bump post playbook timeout to 30mins
Currently our post playbook timeout value is hardcoded to 10mins; for
the majority of our jobs this is okay. However, when projects need to
transfer a lot of data (kolla 2.6gb tarballs) zuul will abort the post
playbook.

For zuulv3, we should properly expose this value to be configured per
job, but today just bump our timeout to 30mins.

Change-Id: I12dcbfe60bb1d59c3af8a13f49f04e3b68ff7197
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-02-03 09:01:07 -05:00
James E. Blair 7f7ddbdfa0 Fix watchdog timeout fix
In I6cae11c1e89f6ccc78cb5bfaf61ef78e846e87be, we attempted to fix
an error where long-running workers never reset their watchdog
timeout flag, meaning that once a job timed out, all further jobs
on that worker timed out.  That change cleared the flag each time
ansible ran.  However, that flag is also used in conjunction with
the abort flag to determine whether a failed or null result should
be sent back to Zuul (a null result will cause a job to be
rescheduled).  By clearing the flag before, say, a post playbook
we would lose the information that the abort was due to a timeout
rather than a direct abort request, and return the null result to
Zuul.  This means all jobs that time out would be relaunched.

Instead of clearing the flag before each ansible run, clear it once
at the start of the job launch.  This means it will be set for any
ansible timeout.  That should be fine for both the aborted job check
as well as the new "timed out" log message.

The typo this change corrects indicates this was the intended logic.

Change-Id: Ie31409a7706b6cf4d7ce858b4d5f0c00e4ee31da
2016-12-14 10:12:15 -08:00
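The flag handling described in 7f7ddbdfa0 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the launcher's actual code: the class `JobWorker`, its attribute names, the result strings and the three demo playbook functions are all hypothetical. It shows only the idea from the message: clear the flags once per job launch, and use the timeout flag together with the abort flag to choose between a failed and a null result.

```python
class JobWorker:
    """Hypothetical long-lived worker; all names are illustrative only."""

    def __init__(self):
        self.aborted = False    # set by a direct abort request or by the watchdog
        self.timed_out = False  # set only by the watchdog

    def watchdog_fired(self):
        # A watchdog timeout aborts the run and records the reason.
        self.timed_out = True
        self.aborted = True

    def launch(self, playbooks):
        # Clear the flags once at the start of the job launch, not before
        # each ansible run, so a timeout in an early playbook is still
        # visible after the post playbook has run.
        self.aborted = False
        self.timed_out = False
        ok = True
        for run in playbooks:
            if not run(self):
                ok = False
        if self.aborted and not self.timed_out:
            return None  # direct abort: null result, job gets rescheduled
        return "SUCCESS" if ok else "FAILURE"


# Demo playbooks (hypothetical stand-ins for ansible runs).
def passing_playbook(worker):
    return True


def timing_out_playbook(worker):
    worker.watchdog_fired()
    return False


def aborted_playbook(worker):
    worker.aborted = True
    return False
```

With this shape, a job whose main playbook times out reports a failure even though a post playbook runs afterwards, and the next job on the same worker starts with clean flags.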
Monty Taylor cef224d162 Add a log message when ansible times out
The watchdog timeout emits an operator log, but no end-user visible
message. Add some text to the error message if we do time out.

Change-Id: I38fed8e020a966362ee708025ab5bc9aa5995c68
2016-12-14 12:03:12 -06:00
Joshua Hesketh 52846ffb3d Add note about redundant file
Change-Id: I45be20233dab35e58eb8a77309df184fe4415e9d
2016-12-09 15:42:23 +11:00
Monty Taylor 4afdd8a89d Add reset of watchdog timeout flag
For the long lived worker, the flag never gets reset, which means that
every job that runs after a job that times out will show as failed for
no good reason.

Change-Id: I6cae11c1e89f6ccc78cb5bfaf61ef78e846e87be
2016-12-07 10:26:04 -06:00
Clark Boylan 63a595bae3 Don't retry when using synchronize module
There is a bug (https://github.com/ansible/ansible/issues/18281) in the
ansible synchronize module that causes any retry attempt at
synchronizing to fail because the paths get munged resulting in invalid
paths. Unfortunately this also means that the error message we get is
not for the first failed sync attempt but for the last, making it hard to
debug why things failed in the first place.

Address this by not attempting to retry until ansible is fixed. This way
we get accurate error messages more quickly (as we don't retry over and
over and generate a bad error message at the end).

Change-Id: I545c44b11f37576edc8768a3ed78962ff870995f
2016-11-16 11:49:08 -08:00
James E. Blair bafbc5b328 Ansible launcher: move AFS publisher into a module
The logic to rsync files into AFS is very complex, requiring
an rsync command for each of the pseudo-build-roots that are
produced by our docs jobs.  Rather than try to do this in ansible
YAML, move it into an ansible module where it is much simpler.

Change-Id: I4cab8003442734ed48c67e09ea8407ec69303d87
2016-11-07 14:32:52 -08:00
James E. Blair 38ce39fe58 Use separate library directories for pre and post
The custom command module used in order to collect job output was
also being used by the pre and post playbooks.  This meant that
instead of going to the ansible log file, the rsync output would
end up in /tmp/console.html on the zuul launcher.

To correct this, create separate library directories for use by
the pre and post playbooks which will contain all of the modules
except the custom command module.  Write separate ansible.cfg files
for them, and instruct ansible-playbook to use those config files.

Change-Id: I5eb6bcc48bcaa6b056af1af7da93f29408f9db41
2016-11-01 08:43:06 -07:00
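The approach in 38ce39fe58 can be sketched as below. This is an illustrative sketch only: the function names and the `command.py` file name are assumptions, and the real launcher may lay out its directories differently. The `library` setting in the `[defaults]` section and the `ANSIBLE_CONFIG` environment variable are standard Ansible mechanisms for choosing a module directory and a config file.

```python
import os
import shutil


def make_prepost_library(library_dir, dest_dir, excluded=("command.py",)):
    """Copy every module except the excluded ones into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for name in sorted(os.listdir(library_dir)):
        if name not in excluded:
            shutil.copy(os.path.join(library_dir, name), dest_dir)
    return dest_dir


def write_cfg(path, library_dir):
    """Write a minimal ansible.cfg that points at one library directory."""
    with open(path, "w") as f:
        f.write("[defaults]\nlibrary = {}\n".format(library_dir))
    return path


def playbook_env(cfg_path):
    """Environment for an ansible-playbook run that uses cfg_path."""
    env = dict(os.environ)
    env["ANSIBLE_CONFIG"] = cfg_path
    return env
```

The pre and post playbooks would then be started with the environment from `playbook_env`, so they never load the custom command module and their output goes to the regular ansible log.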
James E. Blair b2d99edf67 Add extra debugging for AFS rsync
Add the Ansible-standard rsync output format option to rsync, and
also output the filter file to the logs to aid in debugging.

Change-Id: I68daf93ee7f5d501e51ec90d201830a18c6e5a47
2016-11-01 07:52:42 -07:00
Monty Taylor a126e32d30 Add names to post-playbook tasks for debugging
While trying to follow a failed post-playbook in the gate, it became
harder than desirable to follow which task was failing. Add names to the
tasks so that we can track which thing is going on.

Change-Id: I35fd7ad75c82f6a82fc8d12b7fd48860c1ab10f1
2016-11-01 07:35:31 -05:00
James E. Blair 90b13ca096 Ansible launcher: remove keep_remote_files
This option was overriding pipelining=True.

Change-Id: Icfb281513e33d2390414a5dffc8c9f433d7e24d7
2016-10-20 10:05:31 -07:00
Paul Belanger 7aaf5d2f76 Add back timeout_var logic
We still need to set up our timeout-var environment variable,
otherwise devstack gate will fail to read BUILD_TIMEOUT and default
jobs to 120min timeouts.

Change-Id: Ieccba55eaab83074a409efdbb928b4a4fdfdecf7
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-10-20 06:12:33 -04:00
John L. Villalovos 5a84a7647e Ansible launcher: use sequence-uuid in shell scripts
For the generated shell scripts which are named using UUID4, prepend a
sequence count to them to easily be able to tell the ordering of the
scripts when looking in '_zuul_ansible/scripts/'.  Keep the uuid to
avoid potential collisions in /tmp.

Change-Id: Id80bf5139ba1ce12c62945421d49c5e3cd8e2f48
2016-10-19 10:11:33 -07:00
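A possible naming scheme for 5a84a7647e is sketched below. The exact file name format is an assumption (the message only says a sequence count is prepended to the UUID4-based name), and `script_namer` is a hypothetical helper, not a function from ansiblelaunchserver.py.

```python
import itertools
import uuid


def script_namer(width=2):
    """Return a function that yields names like '01-<uuid4 hex>.sh'."""
    counter = itertools.count(1)

    def next_name():
        seq = next(counter)
        # Zero-padded sequence first so a directory listing sorts in run
        # order; the uuid keeps names unique if they end up in /tmp.
        return f"{seq:0{width}d}-{uuid.uuid4().hex}.sh"

    return next_name
```

Calling the returned function repeatedly gives names whose lexical order matches the order in which the scripts were generated, as long as the count stays within the padding width.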
John L. Villalovos 3f21d4061d Generate shell scripts as a sequence
For the generated shell scripts in ansiblelaunchserver.py, have them
be generated in numerical order. For example 01.sh, 02.sh, etc. This
will allow us to tell the ordering of the scripts when looking in
'_zuul_ansible/scripts/'

Change-Id: Iba6231242a58a23549c92aa32620d498e05886f8
2016-10-19 11:17:41 -05:00
Monty Taylor a4c892d6b4 Revert "Put script string in directly instead of in files"
This reverts commit a192814194.

Change-Id: Idd17e474d3ac8842855cb47f74d5ba7c331a074e
2016-10-19 10:54:07 -05:00
Jenkins b27b7e1c5d Merge "Enable pipelining for ansible-playbook" 2016-10-19 13:55:01 +00:00
Jenkins 380103662e Merge "Split playbook into vars, pre-playbook and playbook" 2016-10-19 13:54:55 +00:00
Jenkins 9555cafb98 Merge "Stop running commands with async" 2016-10-19 13:54:48 +00:00
Jenkins 192c027adb Merge "Put script string in directly instead of in files" 2016-10-19 13:54:41 +00:00
Jenkins 4746b64086 Merge "Use command module instead of zuul_runner" 2016-10-19 13:54:37 +00:00
Jenkins 523f4458c4 Merge "Rename zuul_runner to command" 2016-10-19 13:51:43 +00:00
James E. Blair 226cdd4706 Ansible launcher: Fix afs publisher root detection
The find command that collected the marker files is expected
to print paths with a leading '/' (see later commands which
grep for '^/') but this was omitted.  This would cause all jobs
which published to the root (whether they had any content in
the root directory or were simply only intended to publish to a
subdir of the root) to conflict with each other.

Also, correct a missing fully-qualified path.

Change-Id: I6030c2b101026ff8e72cf4043e1d1b4fbffc5dcb
2016-10-03 10:50:31 -07:00
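The root detection issue in 226cdd4706 comes down to emitting marker paths relative to the build root with a leading '/'. A Python equivalent of that idea is sketched below; the marker file name `.root-marker` and the function name are hypothetical, and the real publisher uses a find command rather than this code.

```python
import os


def find_marker_paths(root, marker=".root-marker"):
    """List marker files under root as paths with a leading '/'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if marker in filenames:
            rel = os.path.relpath(os.path.join(dirpath, marker), root)
            # Later steps match on '^/', so every path must start with '/'.
            found.append("/" + rel.replace(os.sep, "/"))
    return sorted(found)
```

A marker directly in the root comes out as `/.root-marker`, which lets later steps tell a root-level publish apart from a publish into a subdirectory.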
Paul Belanger fd97be44d9 Strip leading / from afs targets
It seems that Jenkins does this.  At least with FTP.  We don't have
any leading / on AFS targets, but do the same there for symmetry.

Change-Id: Icb7451c0f3f5fa62c8a15fc621fd30f2df166c96
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-09-30 19:58:24 -04:00
Paul Belanger 843b4df30d Enable pipelining for ansible-playbook
From the manual:

  Enabling pipelining reduces the number of SSH operations required to
  execute a module on the remote server, by executing many ansible
  modules without actual file transfer. This can result in a very
  significant performance improvement when enabled, however when using
  “sudo:” operations you must first disable ‘requiretty’ in
  /etc/sudoers on all managed hosts.

In local testing, there is a speed improvement.  However, I
believe the better reason to enable this is to reduce the number of
SSH transactions we perform on our workers. In doing this we reduce
our chances of SSH connection issues.

However, it also appears async operations do not use this setting
simply because of how async works.

Change-Id: Ib224fbf1fed19be3ce7db4da0c466e3d11acc365
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-09-29 18:35:30 -05:00
Monty Taylor 6767fceba6 Split playbook into vars, pre-playbook and playbook
In order to get to the point where playbooks that people write for tests
are playbooks that they could conceivably also use outside of the zuul
context, we need to remove the need for zuul-specific things in the main
playbook.

Add a pre-playbook that runs before the playbook and runs the things
that are not tied to current JJB content - namely setting up the logger
and prepping directories.

Move the SUCCESS/FAILURE message to the post-playbook.

Extract the injected variables into a variables file and add a
-e@vars.yaml option to the playbook invocation. This provides variables
in a known namespace. Obviously there is still an exercise in how a user
might write a playbook that wants to consume those variables in some
way.

Change-Id: Ie5ec6ec65a03ceea9afc3ac59df73cb28f5ca4dd
2016-09-29 18:35:30 -05:00
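The variables file from 6767fceba6 could be produced and passed along as sketched here. The helper names and the `zuul` namespace key are assumptions made for illustration; only the `-e@vars.yaml` form comes from the message. JSON is written instead of hand-built YAML because JSON documents are also valid YAML, which avoids a third-party dependency in this sketch.

```python
import json
import os


def write_vars_file(jobdir, variables, namespace="zuul"):
    """Write injected variables under one namespace key in vars.yaml."""
    path = os.path.join(jobdir, "vars.yaml")
    with open(path, "w") as f:
        # JSON output; readable by YAML parsers as well.
        json.dump({namespace: variables}, f, indent=2)
    return path


def playbook_command(playbook, vars_path):
    """Build the argv for ansible-playbook with an extra-vars file."""
    return ["ansible-playbook", "-e@" + vars_path, playbook]
```

Keeping the variables in a separate file means the main playbook itself needs no zuul-specific content, which is the goal stated in the message.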
Monty Taylor f166784a28 Stop running commands with async
The async module is complex, and we're only using it to handle the
running cumulative timeout. However, we still fall back on the watchdog
timeout from time to time. Make things simpler by just having that be
how we time things out.

Change-Id: Ie51de4a135d953c4ad9dcb773d27b3c54ca8829b
2016-09-29 18:34:59 -05:00
Monty Taylor a192814194 Put script string in directly instead of in files
Now that we're using the command module, just do inline script content
to make debugging/reading easier.

Change-Id: Ia63f77fd41a03b4662c26f9d0f3b70d1e6a8b5d3
2016-09-29 18:19:53 -05:00
Monty Taylor d1ddd284b8 Use command module instead of zuul_runner
Having a modified command module with the zuul_runner logic allows us to
use normal command and shell entries in the playbooks. (shell is just a
wrapper around command)

At this moment in time it's an invasive fork of the run_command method
on AnsibleModule. That's not optimal for long term, but should get us
closer to being able to discuss appropriate hook points with upstream
ansible.

Use environment task parameter instead of parameters

ansible has a structure for passing in environment variables which we
can use. We did not use it before due to a behavior in ansible from
pre-2.2 that set LANG settings in the environment in a way that caused
us to need to clean things in zuul_runner. The module_set_locale
variable defaults to False in 2.2, but to True in 2.1 (which was the
regression). Set the config value explicitly just to be sure.

Change-Id: Iae4769f923ecf74462e1fe43168ea93ff1c61d6e
2016-09-29 18:17:56 -05:00
Monty Taylor 331c3de4a7 Rename zuul_runner to command
In the next patch, we're going to change the body of zuul_runner. But,
in order to render that diff well, do the rename in this patch.

Change-Id: I3727f506cae5da561948869bd8f8daaf42e4dc0d
2016-09-29 18:17:56 -05:00
James E. Blair 5b9b2bdf02 Ansible launcher: fix afs publisher
This contains several fixes:

* Support remove-prefix.  This is used by the FTP publisher we are
  replacing.
* Fix sed expressions.  They were missing a '/'.
* Make the target directory before rsync.  Rsync requires the target
  root directory exist before running.  Elsewhere we solved that by
  encoding the mkdir into the remote rsync command.  Since we are
  running locally here, just run 'mkdir -p' before running rsync.
  However, it must be done with the keytab, so include it in the
  k5start command (so that we do not need to run k5start twice).
* Include the 'user' in the site definition as the principal for
  k5start.

Change-Id: I69c263a35e732b9a21d411bd30215945783d1023
2016-09-29 10:08:23 -07:00
Jenkins 2e077db0b8 Merge "Ansible launcher: format ipv6 urls [correctly]" 2016-09-15 17:13:33 +00:00
James E. Blair c02dd818a6 Ansible launcher: format ipv6 urls [correctly]
Change-Id: Ib6464498a6a030cbfa89c65dcf27dd98d21c1cfa
2016-09-14 16:13:52 -07:00
James E. Blair 583fdc3d7e Ansible launcher: run k5start in playbook
Rather than requiring the launcher to be run with k5start,
run k5start only during the specific rsync command where it is
required.

Change-Id: I1d8258c4b13d21c96072d1a03c3a3472b0d878d5
2016-09-14 16:13:19 -07:00
James E. Blair 50408bc955 Ansible launcher: add AFS publisher
This is an extension to JJB that works only in zuul-launcher, not
Jenkins.  It allows copying the results of a build into AFS.
It actually isn't really AFS specific at all, other than it
checks that the destination path is under /afs.  Otherwise, it
behaves as a local copy on the launcher itself.

It also contains the logic needed to publish OpenStack's
documentation builds, which can appear as subdirectories of other
builds.

Change-Id: Icda75266219d2d7167e80aaad8e290443cfdbadc
2016-09-14 16:05:00 -07:00
Jenkins aba3258b4e Merge "Use {{ ansible_host }} for ssh-keyscan" 2016-09-02 22:00:52 +00:00
Sean Dague fa17628a45 bump timeout on ssh commands to 30s
We are seeing intermittent failures in zuul trying to talk to the node
which look like they are the 10s ssh negotiation failing. Extremely
busy test nodes that are using their entire network bandwidth to pull
packages may take longer than this.

Try to reduce this by bumping the timeout.

Change-Id: Ic4ec2ea3c8b77cb308fb1a85514d831acf6c4b67
2016-09-01 09:26:57 -04:00
Jenkins 80fe50f484 Merge "Ansible launcher: re-register functions after disconnect" 2016-08-31 23:21:17 +00:00
Jenkins a63f4e7bbe Merge "Revert "Make job registration with labels optional"" 2016-08-30 18:50:20 +00:00
Paul Belanger 30f2b29874 Revert "Store ssh_host_key of remote node"
Jobs no longer launch using this code. Revert so we can debug the
issue.

This reverts commit b6341fbe63.

Change-Id: Ie8076e3e162e3f223367321d8f57ccb48a0f57f6
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-08-25 00:43:17 -04:00
Paul Belanger b6341fbe63 Store ssh_host_key of remote node
Run ssh-keygen on the known_host file to extract the ssh_host_key.  We
do this to help debug the scenario when the remote node's
identification has changed:

  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle
  attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the RSA key sent by the remote host is
  51:82:00:1c:7e:6f:ac:ac:de:f1:53:08:1c:7d:55:68.
  Please contact your system administrator.

Change-Id: Ica41c80db91e7b08dbc34516b3812da4148c36e3
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-08-24 15:15:32 -04:00
James E. Blair 17262fd512 Ansible launcher: retry publisher sync tasks
We sometimes see errors rsyncing data from the node or to the
log server.  Since these are all rsync commands, they are safe
to retry.  Attempt all post playbook rsyncs up to 3 times with
a 30 second delay between each attempt.

Change-Id: I329e1f1f31d53d82799e3485a912b76e2249d03f
2016-08-10 07:57:11 -07:00
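The retry policy from 17262fd512 (up to 3 attempts, 30 seconds apart) is shown below in plain Python for illustration. In the launcher the retries are applied to playbook tasks, so this is not the actual implementation; `run_with_retries` and its parameters are hypothetical.

```python
import time


def run_with_retries(task, attempts=3, delay=30, sleep=time.sleep):
    """Call task() until it succeeds, at most `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            # Wait between attempts, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Retrying like this is only reasonable because the operation is safe to repeat, which the message notes is the case for the rsync commands.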
Paul Belanger 6662f85042 Use {{ ansible_host }} for ssh-keyscan
This is a noop change, which removes the hardcoded node IP address
from our playbook. This is a step forward to allow users to re-run our
playbooks in an effort to reproduce problems locally.

Change-Id: I3d3b979fb9bfffce1ea1466403a277e6f6e146cc
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-08-08 18:53:45 -04:00
James E. Blair 2959855544 Ansible launcher: re-register functions after disconnect
Because we are using the private MASS_DO gearman operation to
register functions, the gear.Worker does not know what functions
are registered and therefore the routine which automatically
re-registers functions after a gear server disconnect was not
effective.  Correct this by also storing the function list when
sending MASS_DO.  This will result in the worker actually sending
CAN_DO packets rather than MASS_DO in the case of a reconnect,
but at least it will be correct, if not efficient.

This error would cause existing nodes attached to zuul launchers
to be unable to run jobs after a zuul (geard) restart.

Change-Id: I60804355a8b3a3cfb79a12dd6e6f0e219fe50c31
2016-08-03 15:14:43 -07:00
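The fix in 2959855544 amounts to remembering the function list at the moment MASS_DO is sent, so a reconnect has something to replay. The sketch below models that with a stand-in class; `FunctionRegistry` and its `send` callback are hypothetical and do not reflect the gear library's real API.

```python
class FunctionRegistry:
    """Hypothetical stand-in for a gearman worker's registration state."""

    def __init__(self, send):
        self.send = send      # callable(packet_type, payload)
        self.functions = []   # what the worker believes is registered

    def mass_register(self, names):
        # Store the list as well as sending it; without this the
        # reconnect path has nothing to re-register.
        self.functions = list(names)
        self.send("MASS_DO", list(names))

    def on_reconnect(self):
        # Re-register one function at a time: correct, if not efficient.
        for name in self.functions:
            self.send("CAN_DO", name)
```

After a server restart, `on_reconnect` replays every stored function as an individual CAN_DO packet, matching the behaviour the message describes.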
James E. Blair 176431ec14 Ansible launcher: set remote_tmp
When we use 'delegate_to' to run commands locally, the 'remote'
side of the Ansible connection is the local host.  When running
these tasks it will write to the 'remote_tmp' directory, which
is actually the local ~/.ansible/tmp directory.  We also set
'keep_remote_files' to true in order to avoid a race condition
with 'async' on the actual remote hosts, but in this case, these
two options in combination end up meaning 'keep some files in
the local ~/.ansible/tmp directory indefinitely' which is not
good for our long-running launchers.

Instead, set 'remote_tmp' to a subdirectory of the jobdir so that
when used in the local context, it will be cleaned up at the end
of the run.  In the remote context, it will end up in a similarly
randomly named directory under /tmp on the worker.  Ansible will
create that directory.  This has the side benefit of further
separating the Ansible running the job from potential uses of Ansible
within the job (which may continue to use ~/.ansible by default).

Change-Id: I70475d5844cbd66bf670566f992fdec263d271a5
2016-07-25 08:11:24 -07:00
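The configuration described in 176431ec14 might be rendered as sketched here. `remote_tmp` and `keep_remote_files` are real settings in the `[defaults]` section of ansible.cfg; the function name and the `.ansible/tmp` subdirectory under the jobdir are assumptions, since the message only says "a subdirectory of the jobdir".

```python
import configparser
import io
import os


def render_ansible_cfg(jobdir):
    """Return ansible.cfg text that keeps temp files inside the jobdir."""
    cfg = configparser.ConfigParser()
    cfg["defaults"] = {
        # Hypothetical subdirectory name; it is removed with the jobdir.
        "remote_tmp": os.path.join(jobdir, ".ansible", "tmp"),
        # Still enabled at the time of this commit.
        "keep_remote_files": "True",
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()
```

Because the temp path lives under the per-job directory, files kept by `keep_remote_files` on the launcher side disappear when the jobdir is cleaned up at the end of the run.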
Jenkins c709b6adb7 Merge "Ansible launcher: Use port 19885 for console streaming" 2016-07-21 20:37:06 +00:00
James E. Blair ff6dd45cbc Ansible launcher: Use port 19885 for console streaming
Thanks Jay!

Change-Id: Ie67bbba02dbf61a481f66001de3e0dede9448316
Closes-Bug: 1590139
2016-07-21 11:43:02 -07:00
James E. Blair bb2e9dbcfb Revert "Make job registration with labels optional"
This reverts commit aad4917fce.

We ended up not using this.

Change-Id: I17d37627528ece1880d05e372099e9d1158e1fec
2016-07-20 09:51:48 -07:00
Jenkins bc58ea3412 Merge "Make job registration with labels optional" 2016-07-19 21:06:49 +00:00