The above exception [1] occurs, for example [2], when elasticsearch returns
data in which zuul_executor is a list of more than one executor.
This is what l#58 is able to sort:
[(12.5, '5'), (12.5, '4'), (12.5, '3'), (25.0, '6'), (18.75, '2'), (6.25, '13'), (12.5, '1')]
This is when it throws an exception:
[(8.13953488372093, 'ze06.opendev.org'),
(12.790697674418604, 'ze10.opendev.org'),
(5.813953488372093, 'ze05.opendev.org'),
(8.13953488372093, 'ze01.opendev.org'),
(16.27906976744186, 'ze04.opendev.org'),
(4.651162790697675, 'ze03.opendev.org'),
(3.488372093023256, 'ze02.opendev.org'),
(4.651162790697675, 'ze08.opendev.org'),
(12.790697674418604, 'ze09.opendev.org'),
(20.930232558139537, 'ze12.opendev.org'),
(1.1627906976744187, 'ze11.opendev.org'),
(1.1627906976744187, ['ze12.opendev.org', 'ze11.opendev.org'])]
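A minimal sketch of the failure, using a shortened copy of the data above. When two tuples tie on the percentage, Python compares the second elements, and a str cannot be ordered against a list. Sorting on the percentage alone is one possible workaround and not necessarily what this change does; the helper names are hypothetical.

```python
rows = [
    (12.79, "ze10.opendev.org"),
    (1.16, "ze11.opendev.org"),
    (1.16, ["ze12.opendev.org", "ze11.opendev.org"]),
]


def plain_sort_fails(data):
    """Return True if sorting the tuples directly raises TypeError."""
    try:
        sorted(data)
    except TypeError:
        # The tie on 1.16 forces a str-versus-list comparison.
        return True
    return False


def sort_by_percentage(data):
    """Hypothetical workaround: order on the percentage only."""
    return sorted(data, key=lambda row: row[0], reverse=True)
```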
[1] https://0050cb9fd8118437e3e0-3c2a18acb5109e625907972e3aa6a592.ssl.cf5.rackcdn.com/790065/7/check/openstack-tox-py38/4968a73/tox/test_results/1449136.yaml.log
[2] https://review.opendev.org/c/openstack/tripleo-ci-health-queries/+/787569/6/output/elastic-recheck/1449136.yaml
Change-Id: Ie559d5764d9f68420119a7f9608389f0745a9c02
This should render the need to use wrappers obsolete as all
file writing operations are now atomic, assuring that we either
write the entire file or fail.
That is important as we do not want to end up serving partial files
with the web server.
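A common way to get this all-or-nothing behavior is to write to a temporary file in the same directory and rename it over the target. The sketch below assumes that approach; atomic_write is a hypothetical helper and may differ from what this change implements.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```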
Change-Id: I696e2474b557e6b5fea707a198f32cea721cc150
Refactors configuration loading in order to simplify it and to
allow overriding defaults using environment variables.
This behavior is similar to other tools like pip or ansible, which
can load any configurable option from env.
This step eases migration towards containerized use, where we do not
want to keep any secrets inside containers and we may want to
avoid volume mounting, especially when testing.
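A rough sketch of the idea, assuming a flat set of defaults and an environment variable prefix. The option names, default values and the ER_ prefix are made up for illustration and are not taken from the actual configuration code.

```python
import os

# Hypothetical defaults; the real option names may differ.
DEFAULTS = {
    "es_url": "http://localhost:9200",
    "gerrit_user": "",
}


def load_config(env=None, prefix="ER_"):
    """Return the defaults, overridden by any PREFIX + OPTION variable in env."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    for key in DEFAULTS:
        value = env.get(prefix + key.upper())
        if value is not None:
            config[key] = value
    return config
```

With this shape, a container can be configured with something like ER_GERRIT_USER=bot without mounting a config file.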
Change-Id: I0d3a9f19b0ba8d1604d0ca63db01296a3219fb47
Replaces the static implementation that received a password with a member
function that can make use of the config object.
Change-Id: If9617b6db73eb49c5193f098d45e357a267529dd
Switches queries testing to use of pytest which provides the following:
- test generator for each query (parametrize)
- ability to test a single query test
- generate html report with test results, making it easier to investigate
failures.
- parallel executions
- minor bugfix for an issue which prevented queries from running with py38,
as the config parser accepts only strings (None being invalid).
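The config parser issue in the last item can be reproduced with the standard library alone. The sketch below shows configparser rejecting None and one way around it; the section and option names are hypothetical, and converting None to an empty string is an assumption rather than the fix applied here.

```python
import configparser


def set_option(parser, section, option, value):
    """Store value as a string, mapping None to an empty string (assumed convention)."""
    parser.set(section, option, "" if value is None else str(value))


parser = configparser.ConfigParser()
parser.add_section("gerrit")

try:
    # On Python 3, ConfigParser.set() only accepts string values.
    parser.set("gerrit", "key", None)
    none_rejected = False
except TypeError:
    none_rejected = True

set_option(parser, "gerrit", "key", None)
```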
Change-Id: I982c694a5160a9ecfd117d177d30b911cfe53425
- Drop py27 as it is out of support
- Enable py38 testing, already the default python on several distros.
- Remove six as a dependency as it is no longer needed for pure py3
Change-Id: I1e825073abc6cd55aa2fdc363358f2701152c57b
- Upgraded hacking(flake8)
- Added more modern tox linters environment (pep8 alias)
- Temporarily added skips for broken newer rules
- Fixed a few basic rule violations
- Moved flake8 config to setup.cfg (tox.ini is not recommended)
Change-Id: I75b3ce5d2ce965a9dc5bdfaa49b2aacd8f0195ad
The json file outputs of e-r are loaded by web browsers in order to
render our graphs. These json files are actually quite large, and part of
the reason is that we pretty print them with 4 space indents and they
are deeply nested. Stop pretty printing (humans can pass the files
through a filter if necessary) in order to reduce the size of these
files and make browsers happier (less time spent downloading).
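To illustrate the size difference, here is a small comparison using the standard json module. The sample data is made up and only mimics nested graph output; passing compact separators is an extra saving that this change may or may not use.

```python
import json

# Made-up nested data standing in for the real graph output.
data = {"buckets": [{"bug": 1449136, "hits": [[0, 1], [1, 3], [2, 0]]}]}

pretty = json.dumps(data, indent=4)
compact = json.dumps(data, separators=(",", ":"))

print(len(pretty), len(compact))
```

Both strings decode to the same object, so browsers see identical data.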
Change-Id: I19dedc2994169932eb0e90b6cdea3856637f5ef0
Getting elasticsearch data for bug 1708704 is failing
in the check queue with:
pyelasticsearch.exceptions.ElasticHttpError: \
(500, 'ArrayIndexOutOfBoundsException[null]')
This might have to do with the size of the resulting
messages from the hits on the tripleo and kolla jobs,
I'm not sure.
What's clear though is the graph generation is blowing
up in the check queue on that bug but not the gate queue,
maybe due to a smaller result set, so this adds some
error handling in the graph generation for when a specific
bug query fails so it does not halt the entire build of the
graph.
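The error handling can be sketched roughly as follows. build_graph_data and run_query are hypothetical stand-ins for the real graph code, and catching Exception broadly is an assumption; the actual change may catch only the pyelasticsearch error.

```python
import logging

LOG = logging.getLogger(__name__)


def build_graph_data(bugs, run_query):
    """Collect results per bug, skipping any bug whose query raises."""
    results = {}
    for bug in bugs:
        try:
            results[bug] = run_query(bug)
        except Exception:
            # Log and move on so one bad query does not halt the build.
            LOG.exception("Failed to query bug %s, skipping it", bug)
    return results
```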
Change-Id: Ibe18c9cccc421a6549a18148f1a2ce3c1e4339d4
The elastic-recheck-tox-queries job is failing because
there is a query on an os-brick bug and the os-brick
project in launchpad is not part of the openstack project
group. This change simply hard-codes it since we know os-brick
is part of openstack.
Change-Id: Ia05c009226f88da427ec6ad9724410cd6ebed859
Story: 2006736
Task: 37197
If a bug is invalid in a project then we should probably
consider its query for removal in the cleanup command.
For example, bug 1663529 and bug 1828244 were both marked
Invalid and had no hits but weren't processed by the
cleanup command.
Change-Id: I7bac9fc169601c86a26565e9fa5b3d72c362a8fc
This automates the process to remove old queries
for fixed bugs. It's a bit conservative to start
so it doesn't check for open reviews nor does it
filter out affected projects with non-Fix* status
on the bug. It can be made more robust once we're
confident in how it works and play with it on the
open queries.
Change-Id: Iaaf17892804453b99a846be27457c88e5a8f8a55
As of the great renaming of 2019 we need to update the
openstack gerrit URL default to review.opendev.org.
Change-Id: I2e3f7e7fb03be0deba0c95995265376dbce3c5b6
Story: #2005498
Task: #30599
The playbook location changed and is no longer in project-config
so just make the query generic on the sub-directory rather than
the git repo.
Change-Id: I8532c193992adef0e996a3f42e9e84f491000c32
The chances that we have no failures at all, or a 100%
categorization rate, are probably zero. So if a group
yields no failures, it probably means the default
ALL_FAILS_QUERY is broken, which can easily
happen:
I208675c2258b6c635925c7b9ea9fae5afd000565
This logs a warning if a group yields no failures
based on the default ALL_FAILS_QUERY.
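A minimal sketch of such a check, assuming the failure count per group is already known. The function name and message wording are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def check_group_failures(group, fail_count):
    """Warn and return False when a group has no failures at all."""
    if fail_count == 0:
        LOG.warning(
            "No failures found for group '%s'; "
            "the default ALL_FAILS_QUERY may be broken", group)
        return False
    return True
```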
Change-Id: Ib2c12b1fc276389297cf4ac15775e6b2da828fdd
Monty updated the post-ssh.yaml playbook in project config to do other
post tasks and renamed it to post.yaml as a result. Change
If01bdd7b7656b1a9ebaa5d5d7d021f82093db8ac has all the details.
We need to accommodate that in e-r by updating the all fails query
to look for post.yaml instead of post-ssh.yaml. Note
that there are no query matches for post-ssh.yaml so we don't need an
interim period of matching both.
Change-Id: I208675c2258b6c635925c7b9ea9fae5afd000565
Previously we gave every event a 20 minute timeout. This meant that we
could eventually rollover on the day and start querying against current
indexes for data in older indexes. If this happens every query would
fail because we are looking in the wrong index. Every query failing
means we run the 20 minute timeout every time.
All of this snowballs until we are never able to check whether events
are indexed.
Address this by using the gerrit eventCreatedOn timestamp to determine
when our timeout is hit. We will timeout 20 minutes from that timestamp
regardless of how long interim processing has taken us. This should over
longer periods of time ensure we query the current index for current
events.
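The deadline logic can be sketched as below, assuming eventCreatedOn is a Unix timestamp in seconds. The function names are hypothetical.

```python
import time

TIMEOUT_SECONDS = 20 * 60


def deadline_for(event_created_on):
    """Absolute time at which we give up on an event."""
    return event_created_on + TIMEOUT_SECONDS


def timed_out(event_created_on, now=None):
    """True once 20 minutes have passed since the event was created."""
    now = time.time() if now is None else now
    return now >= deadline_for(event_created_on)
```

Because the deadline is anchored to the event rather than to when processing started, a backlog of slow events cannot push the timeout further and further out.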
Change-Id: Ic9ed7fefae37d2668de5d89e0d06b8326eadfbb9
The first condition does not hit because the message with
the path to post-ssh.yaml was not matching without the .yaml
suffix on the path. Presumably the unescaped . was interfering
with the query.
Change-Id: I293e9c6fa215cb3f8638763895fccb4bfcf3c235
Before the patch, the daemon was not waiting for all service logs to
upload to logstash, just console and grenade logs. It means that queries
that targeted service log files sometimes missed a legit match.
Change-Id: I96ae09c1be8f1b12117bcfc635589e7c149d5df2
There were two problems with the all fails query as sorted out by
manually running the query in kibana. First the query didn't properly
group the two sides of the job log ending query. They were separated by
an OR and were meant to be grouped together as one clause in the query.
Second zuul now requires the .yaml suffix on playbook names so the query
looking for the post ssh playbook needs to end with .yaml.
Change-Id: I951b2824fe6934eca667d1b14f8caf63428da89a
Updating uncategorized failures is currently failing on a query parse
error in elasticsearch. This appears to be due to unbalanced parens in
the new all fails query. Rebalance the parens by removing the extra
leading paren.
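A simple balance check like the following could catch this kind of mistake before the query reaches elasticsearch. It is only a sketch: it ignores parens inside quoted strings, and the query text used with it would be the caller's own.

```python
def parens_balanced(query):
    """Return True if every '(' in query has a matching ')'."""
    depth = 0
    for char in query:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared before its opening one.
                return False
    return depth == 0
```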
Change-Id: I05626c563a9a053e396782c54dae4c6fa7d6e269
This commit ensures elastic-recheck is able to support zuul v2 and v3
simultaneously:
- Add message queries based on v3 completion messages
- Include job-output.txt where console.html was expected
Change-Id: If3d7990d892a9698a112d9ff1dd5160998a4efe6
Depends-On: I7e34206d7968bf128e140468b9a222ecbce3a8f1
When gerrit is running slow we get 502 responses
back which kills the graph builder. We can retry
these requests from the client to keep going. Generally
a single retry fixes it.
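The retry can be sketched generically as below. BadGateway and with_retries are hypothetical names; the real client would raise its own HTTP error type for a 502.

```python
import time


class BadGateway(Exception):
    """Hypothetical stand-in for the client's 502 error."""


def with_retries(func, retries=1, delay=0.0, retry_on=(BadGateway,)):
    """Call func, retrying up to `retries` times on the given exceptions."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                # Out of retries, let the error propagate.
                raise
            time.sleep(delay)
```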
Change-Id: I745d7c9b80ab8861972193d82c037df76af69e06
To avoid confusion, switch everything to use jobs_re for recheckwatch
config.
Change-Id: I1a84db6ec346a32f38e00560c1b322e7d377d434
Needed-By: I1e2369225c9bd83296684af0dd9ea0514d9098c4
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
13.3 minutes doesn't seem to be adequate anymore for getting the
console logs from elasticsearch, so this change increases the timeout
to 20 minutes.
Change-Id: I77f9d79833e23f2b9cda3622832d4315ea574f4a
Every time we create a FailEvent for a failed job
gerrit comment event we're reconstructing a Config
object unnecessarily, we can just pass the config in
from the Stream object to the FailEvent object.
Change-Id: Ibd85a4f0e813bc9bfff69de8f4f42951face88e4
This should allow elastic-recheck to capture queries in projects like
neutron where the previous regex had not been working for quite some time.
(In neutron gate, full job is called
gate-tempest-dsvm-neutron-full-ubuntu-xenial; there are some jobs that
don't even have 'tempest' in their names that should still participate
in the elastic recheck, like grenade jobs, or rally; all of them have
'dsvm' part though).
Speaking of the regex, it should probably also be applied to
separate job names before classifying them. But I'll leave that for a
follow-up.
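To illustrate matching on the 'dsvm' part, here is a sketch with a hypothetical pattern. The exact regex in this change is not reproduced here, and apart from the neutron full job named above, the job names used to exercise it are made up.

```python
import re

# Hypothetical pattern: match any job with a 'dsvm' component.
DSVM_RE = re.compile(r"(^|-)dsvm(-|$)")


def is_tracked(job_name):
    """Return True if the job name contains 'dsvm' as a dash-separated part."""
    return bool(DSVM_RE.search(job_name))
```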
Change-Id: If98951d13ba82833444ef4ffbb7c6be179126f2b
String interpolation should be delayed to be handled by the logging code,
rather than being done at the point of the logging call.
Ref: http://docs.openstack.org/developer/oslo.i18n/guidelines.html#log-translation
For example:
# WRONG
LOG.info(_LI('some message: variable=%s') % variable)
# RIGHT
LOG.info(_LI('some message: variable=%s'), variable)
Change-Id: I44b85cbf9f4b27d1fee2c1465029fca8cde4f87e
When elasticsearch indexing is behind and the day has
rolled over to a new day, the latest
index is not yet available for use. Exclude it from searches
until it is ready in order to avoid the ElasticHttpNotFoundError.
Add unit tests for this case as well as for when multiple days
are specified for the search.
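The index selection can be sketched as follows. The logstash-%Y.%m.%d naming and the `available` set are assumptions for illustration; the real code would ask elasticsearch which indexes exist.

```python
from datetime import datetime, timedelta


def search_indexes(now, days, available, fmt="logstash-%Y.%m.%d"):
    """List daily index names, newest first, dropping today's if it is missing."""
    names = [(now - timedelta(days=offset)).strftime(fmt)
             for offset in range(days)]
    if names and names[0] not in available:
        # The newest index has not been created yet; skip it for now.
        names = names[1:]
    return names
```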
Change-Id: Ifd27d1ab21bebcb63b48ea164f425c4a2ac8759c