===============
elastic-recheck
===============

"Use ElasticSearch to classify OpenStack gate failures"

* Open Source Software: Apache license

Idea
----

Identifying the specific bug that is causing a transient error in the gate is
difficult. Just identifying which tempest test failed is not enough because a
single tempest test can fail due to any number of underlying bugs. If we can
find a fingerprint for a specific bug using logs, then we can use ElasticSearch
to automatically detect any occurrences of the bug.

Using these fingerprints elastic-recheck can:

* Search ElasticSearch for all occurrences of a bug.
* Identify bug trends, such as when a bug started, whether it has been fixed,
  and whether it is getting worse.
* Classify bug failures in real time and report back to Gerrit if we find a
  match, so a patch author knows why the test failed.

queries/
--------

All queries are stored in separate yaml files in a queries directory at the top
of the elastic-recheck code base. Each file is named ``######.yaml`` (where
``######`` is the Launchpad bug number) and must contain a ``query`` keyword
whose value is the ElasticSearch query text.
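
For example, a minimal query file might look like the following; the file name
and log message here are hypothetical, while the tag is one used elsewhere in
this document::

    # queries/1234567.yaml -- hypothetical bug number
    query: >-
      message:"Timed out waiting for thing" AND tags:"screen-n-cpu.txt"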

Guidelines for good queries:

- Queries should get as close as possible to fingerprinting the root cause. A
  screen log query (e.g. ``tags:"screen-n-net.txt"``) is typically better than
  a console one (``tags:"console"``), as that's matching a deep failure versus
  a surface symptom.

- Queries should not return any hits for successful jobs; if they do, the
  query isn't specific enough. A rule of thumb is that more than 10% of hits
  coming from successful runs probably means the query isn't good enough (see
  the sanity check after this list).

- If it's impossible to build a query to target a bug, consider patching the
  upstream program to be explicit when it fails in a particular way.

- Use the 'tags' field rather than the 'filename' field for filtering. This is
  primarily because of grenade jobs, where the same log file shows up on both
  the 'old' and 'new' sides of the job. For example,
  ``tags:"screen-n-cpu.txt"`` will query in ``logs/old/screen-n-cpu.txt`` and
  ``logs/new/screen-n-cpu.txt``. The ``tags:"console"`` filter is also used to
  query in ``console.html`` as well as tempest and devstack logs.

- Avoid the use of wildcards in queries since they can put an undue burden on
  the query engine. A common case where wildcards are used but shouldn't be is
  querying against a specific set of ``build_name`` fields, e.g.
  ``gate-nova-python26`` and ``gate-nova-python27``. Rather than use
  ``build_name:gate-nova-python*``, list the jobs with an ``OR``. For example::

   (build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
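
As an informal sanity check for the "no hits on successful jobs" guideline
above, you can rerun a candidate query with a ``build_status:"SUCCESS"``
filter appended and confirm it returns few or no hits (the message text here
is hypothetical)::

    (message:"Timed out waiting for thing") AND build_status:"SUCCESS"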

When adding queries you can optionally suppress the creation of graphs
and notifications by adding ``suppress-graph: true`` or
``suppress-notification: true`` to the yaml file.  These can be used to make
sure expected failures don't show up on the unclassified page.
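
For instance, a query file for an expected, persistent failure (hypothetical
message text) could suppress both::

    query: >-
      message:"Example persistent infra failure" AND tags:"console"
    suppress-graph: true
    suppress-notification: true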

If the only signature available is overly broad and additional logging can't
reasonably produce a better one, you can also filter the results of a query
based on the test_ids that failed for the run being checked. To do this, add a
``test_ids`` keyword to the query file with a list of the test_ids to verify
failed. Each test_id should exclude any attrs, i.e. the list appended to the
test_id between '[]', such as 'smoke', 'slow', or any service tags (for
example, ``test_update_server_name[smoke]`` would be listed as
``test_update_server_name``); this is how subunit-trace prints test ids by
default if you're using it. If the query has a hit for the run being checked
and any of the listed test_ids failed during that run, it is reported as a
match. Since this filtering leverages subunit2sql, which only receives tempest
test results from the gate pipeline, the technique will only work on tempest
or grenade jobs in the gate queue. For more information, refer to the `infra
subunit2sql documentation`_. For example, if your query yaml file looked
like::

    query: >-
      message:"ExceptionA"
    test_ids:
      - tempest.api.compute.servers.test_servers.test_update_server_name
      - tempest.api.compute.servers.test_servers_negative.test_server_set_empty_name

this will only match the bug if the logstash query had a hit for the run and
either test_update_server_name or test_server_set_empty_name failed during the
run.

.. _infra subunit2sql documentation: http://docs.openstack.org/infra/system-config/logstash.html#subunit2sql

To support rapidly adding queries, it's considered socially acceptable for
core reviewers to approve, and even self-approve, changes that only add a
single new bug query.


Adding Bug Signatures
---------------------

Most transient bugs seen in the gate are not bugs in tempest associated with a
specific tempest test failure, but rather some sort of issue further down the
stack that can cause many tempest tests to fail.

#. Given a transient bug that is seen during the gate, go through `the logs
   <http://logs.openstack.org/>`_ and try to find a log that is associated with
   the failure. The closer to the root cause the better.

   - Note that queries can only be written against INFO level and higher log
     messages. This is by design to not overwhelm the search cluster.
   - Since non-voting jobs are not allowed in the gate queue and e-r is
     primarily used for tracking bugs in the gate queue, it doesn't spend time
     tracking race failures in non-voting jobs, which are considered unstable
     by definition (they don't vote).

     - There is, however, a special 'allow-nonvoting' key that can be added
       to a query yaml file to allow tracking non-voting job bug failures in
       the graph. They won't show up in the bot though (IRC or Gerrit
       comments).

#. Go to `logstash.openstack.org <http://logstash.openstack.org/>`_ and create
   an ElasticSearch query to find the log message from step 1. To see the
   possible fields to search on, click on an entry. Lucene query syntax is
   documented at `lucene.apache.org
   <http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>`_.

#. Tag your commit with a ``Related-Bug`` tag in the footer, or add a comment
   to the bug with the query you identified and a link to the logstash URL for
   that query search.

   Putting the logstash query link in the bug report is also valuable in the
   case of rare failures that fall outside the window of how far back log
   results are stored. In such cases the bug might be marked as Incomplete
   and the e-r query could be removed, only for the failure to re-surface
   later. If a link to the query is in the bug report someone can easily
   track when it started showing up again.

#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml``
   (all queries can be found on `git.openstack.org
   <https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries>`_)
   and push the patch up for review. A worked example follows this list.
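
As a worked example of steps 3 and 4, the commit adding
``queries/1234567.yaml`` (a hypothetical bug number) might have a commit
message like::

    Add query for hypothetical n-cpu timeout bug

    The fingerprint matches the timeout message described in the bug
    report and returns no hits for successful runs.

    Related-Bug: #1234567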

You can also help classify `Unclassified failed jobs
<http://status.openstack.org/elastic-recheck/data/uncategorized.html>`_, which
is an aggregation of all failed voting gate jobs that don't currently have
elastic-recheck fingerprints.


Removing Bug Signatures
-----------------------

Old queries which are no longer hitting in logstash and are associated with
fixed or incomplete bugs are routinely deleted. This keeps the load on the
ElasticSearch engine as low as possible when checking a job failure. If a bug
marked as Incomplete does show up again, the bug should be re-opened with a
link to the failure and the e-r query should be restored.

Queries that have ``suppress-graph: true`` in them generally should not be
removed; they track persistent infra issues that are not going away.

Steps:

#. Go to the `All Pipelines <http://status.openstack.org/elastic-recheck/index.html>`_ page.
#. Look for anything that is grayed out at the bottom, which means it has not
   had any hits in 10 days.
#. From those, look for the ones whose Launchpad status is
   Fixed/Incomplete/Invalid/Won't Fix - those are candidates for removal.

.. note::

  Sometimes bugs are still New/Confirmed/Triaged/In Progress but have
  not had any hits in over 10 days. Those bugs should be re-assessed to see
  whether they are now actually fixed or incomplete/invalid; if so, mark them
  as such and then remove the related query.


Running Queries Locally
-----------------------

You can execute an individual query locally and analyze the search results::

    $ elastic-recheck-query queries/1331274.yaml
    total hits: 133
    build_status
      100% FAILURE
    build_name
      48% check-grenade-dsvm
      15% check-grenade-dsvm-partial-ncpu
      13% gate-grenade-dsvm
      9% check-grenade-dsvm-icehouse
      9% check-grenade-dsvm-partial-ncpu-icehouse
    build_branch
      95% master
      4% stable/icehouse
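
To run every query in one pass, a minimal shell loop (assuming a POSIX shell,
run from the top of the code base) works::

    $ for f in queries/*.yaml; do elastic-recheck-query "$f"; done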

Notes
-----

* The HTML generation produces links that work with Kibana3's
  logstash.json dashboard. If you want the links on these generated pages to
  work properly, you will need to host a Kibana3 instance with that dashboard.
* View the OpenStack ElasticSearch `cluster health here`_.

Future Work
-----------

- Move config files into a separate directory
- Make unit tests robust
- Add debug mode flag
- Expand gate testing
- Clean up and document code better
- Add ability to check if any resolved bugs return
- Move away from polling ElasticSearch to discover if it's ready or not
- Add a nightly job to propose a patch removing bug queries that return
  no hits -- i.e. the bug hasn't been seen in 2 weeks and should be closed


.. _cluster health here: http://logstash.openstack.org/elasticsearch/_cluster/health?pretty=true