Cleanup and clarify README

* Clarify what elastic-recheck does. Initially the primary goal was to
  report back to gerrit, but we are using elastic-recheck for much more
  than that now, so update docs to reflect that.
* Remove section on dependencies; it was only needed before we had a
  proper requirements file.
* Remove references to adding the resolved_at option, as that is now
  implemented.

Change-Id: I6aa55bdf02174f13d86ad3309f5ad53110dc647d
Joe Gordon 2013-12-13 17:37:58 +01:00
parent 487e64f2fd
commit c507fa1892
1 changed files with 16 additions and 18 deletions

@@ -2,35 +2,40 @@
elastic-recheck
===============================
"Classify tempest-devstack failures using ElasticSearch"
"Use ElasticSearch to classify OpenStack gate failures"
* Open Source Software: Apache license
* Documentation: http://docs.openstack.org/developer/elastic-recheck
Idea
----
When a tempest job failure is detected, by monitoring gerrit (using
gerritlib), a collection of logstash queries will be run on the failed
job to detect what the bug was.
Identifying the specific bug that is causing a transient error in the gate
is very hard. Just identifying which tempest test failed is not enough
because a single bug can potentially cause multiple tempest tests to fail.
If we can find a fingerprint for a specific bug using logs, then we can use
ElasticSearch to automatically detect any occurrences of the bug.
Eventually this can be tied into the rechecker tool and launchpad
Using these fingerprints elastic-recheck can:
* Search ElasticSearch for all occurrences of a bug.
* Identify bug trends such as: when it started, is the bug fixed, is it
getting worse, etc.
* Classify bug failures in real time and report back to gerrit if we find a
match, so a patch author knows why the test failed.
queries/
--------
All queries are stored in separate yaml files in a queries directory
at the top of the elastic_recheck code base. The format of these files
is ######.yaml (where ###### is the bug number), the yaml should have
at the top of the elastic-recheck code base. The format of these files
is ######.yaml (where ###### is the launchpad bug number), the yaml should have
a ``query`` keyword which is the query text for elastic search.
Guidelines for good queries
- After a bug is resolved and has no more hits in elasticsearch, we
should flag it with a resolved_at keyword. This will let us keep
some memory of past bugs, and see if they come back. (Note: this is
a forward looking statement, sorting out resolved_at will come in
the future)
some memory of past bugs, and see if they come back.
- Queries should get as close as possible to fingerprinting the root cause
- Queries should not return any hits for successful jobs, this is a
sign the query isn't specific enough
@@ -69,14 +74,7 @@ Future Work
- Add debug mode flag
- Expand gating testing
- Cleanup and document code better
- Sort out resolved_at stamping to remove active bugs
- Add ability to check if any resolved bugs return
- Move away from polling ElasticSearch to discover if its ready or not
- Add nightly job to propose a patch to remove bug queries that return
no hits -- Bug hasn't been seen in 2 weeks and must be closed
- implement resolved_at in loader
Main Dependencies
------------------
- gerritlib
- pyelasticsearch
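
The queries/ layout and the query guidelines described in the README text above can be illustrated with a small sketch. This is not code from elastic-recheck: the helper names (`bug_number`, `classify`, `too_broad`), the bug numbers, and the log lines are all hypothetical, and plain substring matching stands in for a real ElasticSearch query. It only shows the idea of mapping a `######.yaml` file name to a launchpad bug number, matching fingerprints against a failed job's logs, and flagging fingerprints that also hit successful jobs.

```python
import re
from pathlib import Path

# Hypothetical pattern for the "######.yaml" naming convention.
BUG_FILE = re.compile(r"^(\d+)\.yaml$")


def bug_number(path):
    """Return the bug number encoded in a query file name, or None."""
    match = BUG_FILE.match(Path(path).name)
    return int(match.group(1)) if match else None


def classify(log_lines, fingerprints):
    """Return the sorted bug numbers whose fingerprint occurs in any log line.

    Assumption: a fingerprint is a plain substring here; the real tool
    would run a query against ElasticSearch instead.
    """
    return sorted(
        bug
        for bug, needle in fingerprints.items()
        if any(needle in line for line in log_lines)
    )


def too_broad(fingerprints, successful_job_logs):
    """Return bugs whose fingerprint also matches a successful job.

    Per the guidelines, any hit here suggests the query is not specific enough.
    """
    return classify(successful_job_logs, fingerprints)


# Made-up fingerprints and log lines, for illustration only.
fingerprints = {
    1000001: "Timed out waiting for server to become ACTIVE",
    1000002: "Connection",
}
failed_log = ["ERROR Timed out waiting for server to become ACTIVE"]
passing_log = ["INFO Connection established"]

print(classify(failed_log, fingerprints))
print(too_broad(fingerprints, passing_log))
```

In this sketch the second fingerprint is deliberately too loose, so `too_broad` reports it; a tighter fingerprint would match only failing jobs.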