From ef313d1f8eb4bccbf91e8621a417651602444972 Mon Sep 17 00:00:00 2001
From: Dan Smith
Date: Mon, 4 Apr 2022 12:40:09 -0700
Subject: [PATCH] Add generic guidance on rechecks

Per recent discussions, this attempts to provide some general
guidance about responsible use of rechecks that we can link to when
"encouraging" people to exhibit the desired behavior.

Change-Id: Iadfc32f79423b54e5933788686791c42c8d616fd
---
 doc/source/testing.rst | 87 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/doc/source/testing.rst b/doc/source/testing.rst
index 48a9fc0..9601a51 100644
--- a/doc/source/testing.rst
+++ b/doc/source/testing.rst
@@ -99,8 +99,91 @@ For more information on the automated testing infrastructure itself,
 including how to configure and use it, see the `OpenDev Manual
 <https://docs.opendev.org/opendev/infra-manual/latest/>`_.
 
-Test Failures
-=============
+How to Handle Test Failures
+===========================
+
+If Zuul reports a test failure on a patch, the first step should be
+to identify what went wrong. You will be tempted to just recheck the
+patch to see if it fails again, but please **DO NOT DO THAT.** CI
+test resources are very scarce (and becoming more so all the time),
+so please be extremely sparing when asking the system to re-run
+tests.
+
+.. note:: Please do not **EVER** simply ``recheck`` without a
+          reason. Even if that reason is "I don't know", please
+          indicate that you at least *attempted* to determine the
+          reason for the failure.
+
+Before you request a recheck, it is important that you adhere to the
+following guidelines:
+
+#. First, examine the logs of the jobs that failed. Look for the
+   reason the job failed, be it failed tests or a setup failure such
+   as a failed devstack run or a job timeout. You should always
+   begin this process suspecting that the failure is a result of the
+   proposed patch itself, but with an eye to the problem being
+   unrelated. Try to determine the most obvious cause for the
+   failure, and do not ignore failures in multiple voting jobs.
+#. If the failure is likely caused by the proposed patch, try
+   whenever possible to reproduce the failure locally. This will
+   allow you to revise the change and re-submit with a higher
+   likelihood of subsequently getting a passing run.
+#. If the failure appears to be totally unrelated to the patch at
+   hand, look for some indication of what went wrong. Only after you
+   have done this should you ask Zuul to re-run the tests. To do
+   this, comment on the patch with the ``recheck`` command and a
+   reason. Examples of this are:
+
+   ``recheck nova timed out waiting for glance``
+
+   ``recheck glance lost connection to mysql``
+
+   ``recheck cinder failed to detach volume``
+
+#. The gold standard for recheck commands is ``recheck bug #XXXXXXX``,
+   which directly references a known problem that is being worked
+   on. Doing this helps add heat to that bug and enables stats
+   tracking, so the community knows which bugs are blocking the most
+   people in the CI system.
+#. In some cases, it may be entirely unclear why something failed. In
+   this case, you may need to recheck with a reason of "Not sure what
+   failed, rechecking to get another data point."
+#. If a recheck results in a similar failure on the subsequent run,
+   it is best to reach out (via the mailing list or IRC) to the
+   project team responsible for the service you think is failing and
+   ask whether the issue is known and being worked on. A patch for
+   the problem may already be proposed but not merged, in which case
+   you can use ``Depends-On`` to move forward.
+#. Especially if the same failure occurs more than once and is not
+   yet reported, it is highly recommended that you open a bug against
+   the project (or projects) affected and use that for your recheck.
+
+Suggestions For Determining Causes of Failure
+---------------------------------------------
+
+This is more art than science, but here are some ideas:
+
+- First, examine the ``job-output.txt`` file to see if the job failed
+  while running tests, or earlier while setup was running.
+- If it looks like a test failure, the ``testr_results.html`` file is
+  usually very helpful for looking at individual failures.
+- If a test failed, try to identify which services are being used in
+  that test. Quickly skim the logs for those services, looking for
+  **ERROR** lines and especially tracebacks that line up with the
+  test failure. For example, if a test failed to attach a volume to
+  a compute instance, it would be good to look at the ``n-api``,
+  ``n-cpu``, ``c-api``, and ``c-vol`` logs, as Nova and Cinder are
+  both involved in that process.
+- Test failures in tempest-based jobs generally print out resource
+  IDs, such as instance or volume UUIDs. Use these to search the
+  relevant logs for errors and warnings related to a resource that
+  was involved in the test failure (a sketch of such a scan follows
+  this list).
+- Looking at the timestamps of test failures can also help locate
+  relevant lines in the service logs.
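+
+As a concrete illustration of the last two suggestions, here is a
+minimal sketch that scans locally downloaded service logs for lines
+that mention a failing resource's UUID or contain ERROR. The script
+name and invocation here are hypothetical examples, not part of the
+CI system itself:
+
+.. code-block:: python
+
+   # log_scan.py: print lines that mention a resource ID or ERROR,
+   # with file name and line number, for correlating with the
+   # timestamps of the test failure.
+   import sys
+
+   def scan(path, resource_id):
+       with open(path, errors="replace") as f:
+           for num, line in enumerate(f, 1):
+               if resource_id in line or "ERROR" in line:
+                   print(f"{path}:{num}: {line.rstrip()}")
+
+   if __name__ == "__main__":
+       # Usage: python3 log_scan.py <resource-uuid> <log> [<log> ...]
+       uuid, *logs = sys.argv[1:]
+       for log in logs:
+           scan(log, uuid)
+
+Matching lines can then be compared against the failing test's
+timestamps to narrow down which service reported the problem first.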
+
+
+Automatic Test Failure Identification
+=====================================
 
 OpenStack project integration tests have logs from running services
 automatically uploaded to a logstash-based processing system. An