Merge "clean up readme"

This commit is contained in:
Jenkins 2014-08-28 01:59:16 +00:00 committed by Gerrit Code Review
commit ce81aa0dbf
1 changed files with 45 additions and 44 deletions

View File

@ -1,6 +1,6 @@
===============================
===============
elastic-recheck
===============================
===============
"Use ElasticSearch to classify OpenStack gate failures"
@ -8,27 +8,28 @@ elastic-recheck
Idea
----
Identifying the specific bug that is causing a transient error in the gate
is very hard. Just identifying which tempest test failed is not enough
because a single bug can potentially cause multiple tempest tests to fail.
If we can find a fingerprint for a specific bug using logs, then we can use
ElasticSearch to automatically detect any occurrences of the bug.
Identifying the specific bug that is causing a transient error in the gate is
difficult. Just identifying which tempest test failed is not enough because a
single tempest test can fail due to any number of underlying bugs. If we can
find a fingerprint for a specific bug using logs, then we can use ElasticSearch
to automatically detect any occurrences of the bug.
Using these fingerprints elastic-recheck can:
* Search ElasticSearch for all occurrences of a bug.
* Identify bug trends such as: when it started, is the bug fixed, is it
getting worse, etc.
* Identify bug trends such as: when it started, is the bug fixed, is it getting
worse, etc.
* Classify bug failures in real time and report back to gerrit if we find a
match, so a patch author knows why the test failed.
queries/
--------
All queries are stored in separate yaml files in a queries directory
at the top of the elastic-recheck code base. The format of these files
is ######.yaml (where ###### is the launchpad bug number), the yaml should have
a ``query`` keyword which is the query text for elastic search.
All queries are stored in separate yaml files in a queries directory at the top
of the elastic-recheck code base. The format of these files is ######.yaml
(where ###### is the launchpad bug number), the yaml should have a ``query``
keyword which is the query text for elastic search.
Guidelines for good queries:
@ -36,60 +37,60 @@ Guidelines for good queries:
filename query is typically better than a console one, as that's matching a
deep failure versus a surface symptom.
- Queries should not return any hits for successful jobs, this is a
sign the query isn't specific enough. A rule of thumb is > 10% success hits
probably means this isn't good enough.
- Queries should not return any hits for successful jobs, this is a sign the
query isn't specific enough. A rule of thumb is > 10% success hits probably
means this isn't good enough.
- If it's impossible to build a query to target a bug, consider patching the
upstream program to be explicit when it fails in a particular way.
- Use the 'tags' field rather than the 'filename' field for filtering. This is
primarily because of grenade jobs where the same log file shows up in the
'old' and 'new' side of the grenade job. For example, tags:"screen-n-cpu.txt"
will query in logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The
tags:"console" filter is also used to query in console.html as well as
tempest and devstack logs.
'old' and 'new' side of the grenade job. For example,
``tags:"screen-n-cpu.txt"`` will query in ``logs/old/screen-n-cpu.txt`` and
``logs/new/screen-n-cpu.txt``. The ``tags:"console"`` filter is also used to
query in ``console.html`` as well as tempest and devstack logs.
- Avoid the use of wildcards in queries since they can put an undue burden on
the query engine. A common case where wildcards are used and shouldn't be are
in querying against a specific set of build_name fields,
e.g. gate-nova-python26 and gate-nova-python27.
Rather than use build_name:gate-nova-python*, list the jobs with an OR, e.g.:
::
in querying against a specific set of ``build_name`` fields, e.g.
``gate-nova-python26`` and ``gate-nova-python27``. Rather than use
``build_name:gate-nova-python*``, list the jobs with an ``OR``. For example::
(build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
In order to support rapidly added queries, it's considered socially
acceptable to +A changes that only add 1 new bug query, and to even
self approve those changes by core reviewers.
In order to support rapidly added queries, it's considered socially acceptable
to approve changes that only add 1 new bug query, and to even self approve
those changes by core reviewers.
Adding Bug Signatures
---------------------
Most transient bugs seen in gate are not bugs in tempest associated
with a specific tempest test failure, but rather some sort of issue
further down the stack that can cause many tempest tests to fail.
Most transient bugs seen in gate are not bugs in tempest associated with a
specific tempest test failure, but rather some sort of issue further down the
stack that can cause many tempest tests to fail.
#. Given a transient bug that is seen during the gate, go through the
logs (logs.openstack.org) and try to find a log that is associated
with the failure. The closer to the root cause the better.
#. Given a transient bug that is seen during the gate, go through `the logs
<http://logs.openstack.org/>`_ and try to find a log that is associated with
the failure. The closer to the root cause the better.
Note that queries can only be written against INFO level and higher log
messages. This is by design to not overwhelm the search cluster.
#. Go to logstash.openstack.org and create an elastic search query to
find the log message from step 1. To see the possible fields to
search on click on an entry. Lucene query syntax is available at
http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
#. Go to `logstash.openstack.org <http://logstash.openstack.org/>`_ and create
an elastic search query to find the log message from step 1. To see the
possible fields to search on click on an entry. Lucene query syntax is
available at `lucene.apache.org
<http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>`_.
#. Add a comment to the bug with the query you identified and a link to
the logstash url for that query search.
#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml`` and push
the patch up for review.
https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
#. Tag your commit with a ``Related-Bug`` tag in the footer, or add a comment
to the bug with the query you identified and a link to the logstash URL for
that query search.
#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml``
(All queries can be found on `git.openstack.org
<https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries>`_)
and push the patch up for review.
Future Work
------------