clean up readme

Change-Id: I6c52fd6a75ca1b6bcbb97624afd6c62eb074f03c
Dolph Mathews 2014-08-18 11:21:05 -05:00 committed by Joe Gordon
parent bff0cbd899
commit eb9126decd
1 changed files with 45 additions and 44 deletions

@@ -1,6 +1,6 @@
-===============================
+===============
 elastic-recheck
-===============================
+===============
 "Use ElasticSearch to classify OpenStack gate failures"
@@ -8,27 +8,28 @@ elastic-recheck
 Idea
 ----
-Identifying the specific bug that is causing a transient error in the gate
-is very hard. Just identifying which tempest test failed is not enough
-because a single bug can potentially cause multiple tempest tests to fail.
-If we can find a fingerprint for a specific bug using logs, then we can use
-ElasticSearch to automatically detect any occurrences of the bug.
+Identifying the specific bug that is causing a transient error in the gate is
+difficult. Just identifying which tempest test failed is not enough because a
+single tempest test can fail due to any number of underlying bugs. If we can
+find a fingerprint for a specific bug using logs, then we can use ElasticSearch
+to automatically detect any occurrences of the bug.
 Using these fingerprints elastic-recheck can:
 * Search ElasticSearch for all occurrences of a bug.
-* Identify bug trends such as: when it started, is the bug fixed, is it
-  getting worse, etc.
+* Identify bug trends such as: when it started, is the bug fixed, is it getting
+  worse, etc.
 * Classify bug failures in real time and report back to gerrit if we find a
   match, so a patch author knows why the test failed.
 queries/
 --------
-All queries are stored in separate yaml files in a queries directory
-at the top of the elastic-recheck code base. The format of these files
-is ######.yaml (where ###### is the launchpad bug number), the yaml should have
-a ``query`` keyword which is the query text for elastic search.
+All queries are stored in separate yaml files in a queries directory at the top
+of the elastic-recheck code base. The format of these files is ######.yaml
+(where ###### is the launchpad bug number), the yaml should have a ``query``
+keyword which is the query text for elastic search.
 Guidelines for good queries:
@@ -36,60 +37,60 @@ Guidelines for good queries:
   filename query is typically better than a console one, as that's matching a
   deep failure versus a surface symptom.
-- Queries should not return any hits for successful jobs, this is a
-  sign the query isn't specific enough. A rule of thumb is > 10% success hits
-  probably means this isn't good enough.
+- Queries should not return any hits for successful jobs, this is a sign the
+  query isn't specific enough. A rule of thumb is > 10% success hits probably
+  means this isn't good enough.
 - If it's impossible to build a query to target a bug, consider patching the
   upstream program to be explicit when it fails in a particular way.
 - Use the 'tags' field rather than the 'filename' field for filtering. This is
   primarily because of grenade jobs where the same log file shows up in the
-  'old' and 'new' side of the grenade job. For example, tags:"screen-n-cpu.txt"
-  will query in logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The
-  tags:"console" filter is also used to query in console.html as well as
-  tempest and devstack logs.
+  'old' and 'new' side of the grenade job. For example,
+  ``tags:"screen-n-cpu.txt"`` will query in ``logs/old/screen-n-cpu.txt`` and
+  ``logs/new/screen-n-cpu.txt``. The ``tags:"console"`` filter is also used to
+  query in ``console.html`` as well as tempest and devstack logs.
 - Avoid the use of wildcards in queries since they can put an undue burden on
   the query engine. A common case where wildcards are used and shouldn't be are
-  in querying against a specific set of build_name fields,
-  e.g. gate-nova-python26 and gate-nova-python27.
-  Rather than use build_name:gate-nova-python*, list the jobs with an OR, e.g.:
-  ::
+  in querying against a specific set of ``build_name`` fields, e.g.
+  ``gate-nova-python26`` and ``gate-nova-python27``. Rather than use
+  ``build_name:gate-nova-python*``, list the jobs with an ``OR``. For example::
     (build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
-In order to support rapidly added queries, it's considered socially
-acceptable to +A changes that only add 1 new bug query, and to even
-self approve those changes by core reviewers.
+In order to support rapidly added queries, it's considered socially acceptable
+to approve changes that only add 1 new bug query, and to even self approve
+those changes by core reviewers.
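As an illustration of the wildcard guideline (this sketch is not part of the change itself), the explicit ``OR`` filter could be generated from a list of job names. The helper name ``build_name_filter`` is hypothetical and does not exist in elastic-recheck; it only shows one way to avoid writing ``build_name:gate-nova-python*`` by hand.

```python
def build_name_filter(job_names):
    """Return a Lucene-style filter that names each job explicitly.

    Hypothetical helper for illustration only; it is not part of
    elastic-recheck.
    """
    if not job_names:
        raise ValueError("at least one job name is required")
    # Quote every job name so it is matched as an exact phrase
    # rather than expanded as a wildcard.
    clauses = ['build_name:"%s"' % name for name in job_names]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " OR ".join(clauses) + ")"


print(build_name_filter(["gate-nova-python26", "gate-nova-python27"]))
```

For the two jobs named in the guideline, this prints the same filter that the README gives as its example.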
 Adding Bug Signatures
 ---------------------
-Most transient bugs seen in gate are not bugs in tempest associated
-with a specific tempest test failure, but rather some sort of issue
-further down the stack that can cause many tempest tests to fail.
+Most transient bugs seen in gate are not bugs in tempest associated with a
+specific tempest test failure, but rather some sort of issue further down the
+stack that can cause many tempest tests to fail.
-#. Given a transient bug that is seen during the gate, go through the
-   logs (logs.openstack.org) and try to find a log that is associated
-   with the failure. The closer to the root cause the better.
+#. Given a transient bug that is seen during the gate, go through `the logs
+   <http://logs.openstack.org/>`_ and try to find a log that is associated with
+   the failure. The closer to the root cause the better.
    Note that queries can only be written against INFO level and higher log
    messages. This is by design to not overwhelm the search cluster.
-#. Go to logstash.openstack.org and create an elastic search query to
-   find the log message from step 1. To see the possible fields to
-   search on click on an entry. Lucene query syntax is available at
-   http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
+#. Go to `logstash.openstack.org <http://logstash.openstack.org/>`_ and create
+   an elastic search query to find the log message from step 1. To see the
+   possible fields to search on click on an entry. Lucene query syntax is
+   available at `lucene.apache.org
+   <http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>`_.
-#. Add a comment to the bug with the query you identified and a link to
-   the logstash url for that query search.
-#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml`` and push
-   the patch up for review.
-   https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
+#. Tag your commit with a ``Related-Bug`` tag in the footer, or add a comment
+   to the bug with the query you identified and a link to the logstash URL for
+   that query search.
+#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml``
+   (All queries can be found on `git.openstack.org
+   <https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries>`_)
+   and push the patch up for review.
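The last step of the list can be sketched as follows (an illustration, not part of the change itself). The README only states that the file is named after the launchpad bug number and contains a ``query`` keyword. The folded-block YAML layout, the helper names, the bug number and the query text below are all assumptions chosen for the example.

```python
def query_file_path(bug_number, queries_dir="queries"):
    """Return the path queries/BUGNUMBER.yaml for a launchpad bug number.

    Hypothetical helper; the directory default is an assumption.
    """
    bug = str(bug_number)
    if not bug.isdigit():
        raise ValueError("bug number must be numeric: %r" % (bug_number,))
    return "%s/%s.yaml" % (queries_dir, bug)


def render_query_yaml(query_lines):
    """Render a ``query`` keyword as a YAML folded block scalar.

    The folded style ("query: >") is an assumed layout; any YAML that
    provides a ``query`` keyword satisfies the README's description.
    """
    if not query_lines:
        raise ValueError("the query must have at least one line")
    body = "".join("  %s\n" % line.strip() for line in query_lines)
    return "query: >\n" + body


# Placeholder bug number and query text, for illustration only.
path = query_file_path(1234567)
content = render_query_yaml([
    'message:"Example error text" AND',
    'tags:"screen-n-cpu.txt"',
])
print(path)
print(content, end="")
```

Writing ``content`` to ``path`` and pushing the resulting patch for review would then complete the step. The query uses the ``tags`` field rather than ``filename``, in line with the guidelines for good queries.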
 Future Work
 ------------