Fix watchdog timeout fix

In I6cae11c1e89f6ccc78cb5bfaf61ef78e846e87be, we attempted to fix
an error where long-running workers never reset their watchdog
timeout flag, meaning that once a job timed out, all further jobs
on that worker timed out.  That change cleared the flag each time
ansible ran.  However, that flag is also used in conjunction with
the abort flag to determine whether a failed or null result should
be sent back to Zuul (a null result will cause a job to be
rescheduled).  By clearing the flag before, say, a post playbook,
we would lose the information that the abort was due to a timeout
rather than a direct abort request, and return the null result to
Zuul.  This means all jobs that time out would be relaunched.
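
The decision described above can be sketched roughly as follows.  This
is a minimal illustration, not the actual NodeWorker code: the helper
name result_for_zuul and the job_status argument are hypothetical, and
only the two flags come from this change.

```python
def result_for_zuul(aborted_job, watchdog_timeout, job_status):
    """Hypothetical helper: pick the result reported back to Zuul."""
    if aborted_job and not watchdog_timeout:
        # Direct abort request: a null result lets Zuul reschedule the job.
        return None
    # A timeout (or an ordinary run) reports the real status instead.
    return job_status
```

With the timeout flag cleared too early, a timed-out job looks like the
first branch and is handed back to Zuul as a null result.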

Instead of clearing the flag before each ansible run, clear it once
at the start of the job launch.  This means it will stay set for the
rest of the job after any ansible timeout.  That should be fine for
both the aborted job check and the new "timed out" log message.
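
A toy model of the two reset strategies, assuming a job that runs a
main playbook followed by a post playbook.  SketchWorker, _run_ansible
and the returned status strings are hypothetical stand-ins; only the
flag names mirror the real worker.

```python
class SketchWorker:
    """Toy stand-in for a long-running worker; not the real NodeWorker."""

    def __init__(self, reset_before_each_run):
        self.reset_before_each_run = reset_before_each_run
        self._aborted_job = False
        self._watchdog_timeout = False

    def _run_ansible(self, times_out):
        if self.reset_before_each_run:
            # Behavior of the earlier change: clear the flag on every run.
            self._watchdog_timeout = False
        if times_out:
            # A watchdog timeout aborts the job and records the reason.
            self._watchdog_timeout = True
            self._aborted_job = True

    def launch(self, main_times_out):
        # Behavior of this change: clear both flags once per job launch.
        self._aborted_job = False
        self._watchdog_timeout = False
        self._run_ansible(times_out=main_times_out)  # main playbook
        self._run_ansible(times_out=False)           # post playbook
        if self._aborted_job and not self._watchdog_timeout:
            return None  # null result: Zuul would relaunch the job
        return "FAILURE" if self._aborted_job else "SUCCESS"
```

In this model, SketchWorker(True).launch(True) returns None because the
post playbook run wipes the timeout flag, while
SketchWorker(False).launch(True) returns "FAILURE" and a following
launch(False) on the same worker starts with clean flags.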

The typo this change corrects (_watchog_timeout instead of
_watchdog_timeout) indicates this was the intended logic.

Change-Id: Ie31409a7706b6cf4d7ce858b4d5f0c00e4ee31da
James E. Blair 2016-12-14 10:12:15 -08:00
parent cef224d162
commit 7f7ddbdfa0
1 changed file with 1 addition and 7 deletions

@@ -815,7 +815,7 @@ class NodeWorker(object):
 result = None
 self._sent_complete_event = False
 self._aborted_job = False
-self._watchog_timeout = False
+self._watchdog_timeout = False
 try:
 self.sendStartEvent(job_name, args)
@@ -1424,8 +1424,6 @@ class NodeWorker(object):
 preexec_fn=os.setsid,
 env=env_copy,
 )
-# Reset timeout flag
-self._watchdog_timeout = False
 ret = None
 watchdog = Watchdog(ANSIBLE_DEFAULT_PRE_TIMEOUT,
 self._ansibleTimeout,
@@ -1467,8 +1465,6 @@ class NodeWorker(object):
 preexec_fn=os.setsid,
 env=env_copy,
 )
-# Reset timeout flag
-self._watchdog_timeout = False
 ret = None
 watchdog = Watchdog(timeout + ANSIBLE_WATCHDOG_GRACE,
 self._ansibleTimeout,
@@ -1522,8 +1518,6 @@ class NodeWorker(object):
 preexec_fn=os.setsid,
 env=env_copy,
 )
-# Reset timeout flag
-self._watchdog_timeout = False
 ret = None
 watchdog = Watchdog(ANSIBLE_DEFAULT_POST_TIMEOUT,
 self._ansibleTimeout,