Clean up and clarify README

* Clarify what elastic-recheck does. Initially the primary goal was to report
  back to gerrit, but we are using elastic-recheck for much more than that now,
  so update the docs to reflect that.
* Remove the section on dependencies; it was used before we had a proper
  requirements file.
* Remove references to adding the resolved_at option, as that is now
  implemented.

Change-Id: I6aa55bdf02174f13d86ad3309f5ad53110dc647d
parent 487e64f2fd
commit c507fa1892
34
README.rst
@@ -2,35 +2,40 @@
elastic-recheck
===============================

"Classify tempest-devstack failures using ElasticSearch"
"Use ElasticSearch to classify OpenStack gate failures"

* Open Source Software: Apache license
* Documentation: http://docs.openstack.org/developer/elastic-recheck

Idea
----
When a tempest job failure is detected, by monitoring gerrit (using
gerritlib), a collection of logstash queries will be run on the failed
job to detect what the bug was.
Identifying the specific bug that is causing a transient error in the gate
is very hard. Just identifying which tempest test failed is not enough
because a single bug can potentially cause multiple tempest tests to fail.
If we can find a fingerprint for a specific bug using logs, then we can use
ElasticSearch to automatically detect any occurrences of the bug.

Eventually this can be tied into the rechecker tool and launchpad
Using these fingerprints elastic-recheck can:

* Search ElasticSearch for all occurrences of a bug.
* Identify bug trends such as: when it started, is the bug fixed, is it
getting worse, etc.
* Classify bug failures in real time and report back to gerrit if we find a
match, so a patch author knows why the test failed.

queries/
--------

All queries are stored in separate yaml files in a queries directory
at the top of the elastic_recheck code base. The format of these files
is ######.yaml (where ###### is the bug number), the yaml should have
at the top of the elastic-recheck code base. The format of these files
is ######.yaml (where ###### is the launchpad bug number), the yaml should have
a ``query`` keyword which is the query text for elastic search.

Guidelines for good queries

- After a bug is resolved and has no more hits in elasticsearch, we
should flag it with a resolved_at keyword. This will let us keep
some memory of past bugs, and see if they come back. (Note: this is
a forward looking statement, sorting out resolved_at will come in
the future)
some memory of past bugs, and see if they come back.
- Queries should get as close as possible to fingerprinting the root cause
- Queries should not return any hits for successful jobs, this is a
sign the query isn't specific enough
@@ -69,14 +74,7 @@ Future Work
- Add debug mode flag
- Expand gating testing
- Cleanup and document code better
- Sort out resolved_at stamping to remove active bugs
- Add ability to check if any resolved bugs return
- Move away from polling ElasticSearch to discover if its ready or not
- Add nightly job to propose a patch to remove bug queries that return
no hits -- Bug hasn't been seen in 2 weeks and must be closed
- implement resolved_at in loader


Main Dependencies
------------------
- gerritlib
- pyelasticsearch