diff --git a/doc/source/testing.rst b/doc/source/testing.rst
index 48a9fc0..9601a51 100644
--- a/doc/source/testing.rst
+++ b/doc/source/testing.rst
@@ -99,8 +99,91 @@
 For more information on the automated testing infrastructure itself,
 including how to configure and use it, see the
 `OpenDev Manual `_.

-Test Failures
-=============
+How to Handle Test Failures
+===========================
+
+If Zuul reports a test failure on a patch, the first step should be
+to identify what went wrong. You may be tempted to simply recheck the
+patch to see if it fails again, but please **DO NOT DO THAT.** CI test
+resources are very scarce (and becoming more so all the time), so
+please be extremely sparing when asking the system to re-run tests.
+
+.. note:: Please do not **EVER** simply ``recheck`` without a
+          reason. Even if that reason is "I don't know", please
+          indicate that you at least *attempted* to determine the
+          reason for the failure.
+
+It is important that, before you request a recheck, you adhere to the
+following guidelines:
+
+#. First, examine the logs of the jobs that failed. Look for the
+   reason the job failed, be it failed tests, a setup failure such as
+   a failed devstack run, or a job timeout. You should always begin
+   this process suspecting that the failure is a result of the
+   proposed patch itself, but with an eye to the possibility that the
+   problem is unrelated. Try to determine the most obvious cause of
+   the failure, and do not ignore failures in multiple voting jobs.
+#. If the failure is likely caused by the proposed patch, try
+   whenever possible to reproduce the failure locally. This will
+   allow you to revise the change and resubmit it with a higher
+   likelihood of getting a passing run.
+#. If the failure appears to be totally unrelated to the patch at
+   hand, look for some indication of what went wrong. Only after you
+   have done this should you ask Zuul to re-run the tests. To do
+   this, comment on the patch with the ``recheck`` command and a
+   reason. Examples of this are:
+
+   ``recheck nova timed out waiting for glance``
+
+   ``recheck glance lost connection to mysql``
+
+   ``recheck cinder failed to detach volume``
+
+#. The gold standard for recheck commands is ``recheck bug #XXXXXXX``,
+   which directly references a known problem that is being worked
+   on. Doing this helps add heat to that bug and enables stats
+   tracking, so the community knows which bugs are blocking the most
+   people in the CI system.
+#. In some cases, it may be entirely unclear why something failed. In
+   that case, you may need to recheck with a reason of "Not sure what
+   failed, rechecking to get another data point."
+#. If a recheck results in a similar failure on the subsequent run,
+   it is best to reach out (via the mailing list or IRC) to the
+   project team responsible for the service you think is failing and
+   ask whether the issue is known and being worked on. It may be that
+   a fix for the problem has been proposed but not yet merged, in
+   which case you can add a ``Depends-On`` to your patch to move
+   forward (see the example after this list).
+#. Especially if the same failure occurs more than once and is not
+   yet reported, it is highly recommended that you open a bug against
+   the affected project (or projects) and use that bug for your
+   recheck.
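+
+As a rough illustration (the review URL below is only a placeholder,
+not a real change), such a ``Depends-On`` footer in the commit message
+of your patch might look like:
+
+   ``Depends-On: https://review.opendev.org/c/openstack/example/+/123456``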
+
+Suggestions For Determining Causes of Failure
+---------------------------------------------
+
+This is more art than science, but here are some ideas:
+
+- First, examine the ``job-output.txt`` file to see whether the job
+  failed while running tests or earlier, while setup was running.
+- If it looks like a test failure, the ``testr_results.html`` file is
+  usually very helpful for looking at individual failures.
+- If a test failed, try to identify which services are being used in
+  that test. Quickly skim the logs for those services, looking for
+  **ERROR** lines and especially tracebacks that seem to line up with
+  the test failure. For example, if the test failed while attaching a
+  volume to an instance, it would be good to look at the ``n-api``,
+  ``n-cpu``, ``c-api``, and ``c-vol`` logs, as Nova and Cinder are
+  both involved in that process.
+- Test failures in tempest-based jobs generally print out resource
+  IDs, such as instance or volume UUIDs. Use these to search the
+  relevant logs for errors and warnings related to a resource that
+  was involved in the test failure.
+- Looking at the timestamps of test failures can also help locate
+  relevant lines in the service logs.
+
+
+Automatic Test Failure Identification
+=====================================

 OpenStack project integration tests have logs from running services
 automatically uploaded to a logstash-based processing system. An