Replaces the static implementation that received a password with a member
function that can make use of the config object.
Change-Id: If9617b6db73eb49c5193f098d45e357a267529dd
- Upgraded hacking (flake8)
- Added a more modern tox linters environment (pep8 alias)
- Temporarily added skips for broken newer rules
- Fixed a few basic rule violations
- Moved flake8 config to setup.cfg (tox.ini is not recommended)
Change-Id: I75b3ce5d2ce965a9dc5bdfaa49b2aacd8f0195ad
Previously we gave every event a 20 minute timeout. This meant that we
could eventually roll over to a new day and start querying current
indexes for data that lives in older indexes. If this happens every
query fails because we are looking in the wrong index, and every
failing query runs the full 20 minute timeout.
All this results in snowballing: we are never able to check if events
are indexed.
Address this by using the gerrit eventCreatedOn timestamp to determine
when our timeout is hit. We will time out 20 minutes from that
timestamp regardless of how long interim processing has taken us. Over
longer periods of time this should ensure we query the current index
for current events.
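A minimal sketch of the deadline check described above, assuming
eventCreatedOn is a Unix timestamp (function and argument names are
hypothetical, not the actual code):

```python
import time

TIMEOUT = 20 * 60  # twenty minutes, matching the old per-event timeout


def deadline_exceeded(event_created_on, now=None):
    """Return True once 20 minutes have passed since the gerrit
    eventCreatedOn timestamp, regardless of how long any interim
    processing has taken."""
    if now is None:
        now = time.time()
    return now > event_created_on + TIMEOUT
```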
Change-Id: Ic9ed7fefae37d2668de5d89e0d06b8326eadfbb9
Before this patch, the daemon was not waiting for all service logs to
upload to logstash, just the console and grenade logs. This meant that
queries targeting service log files sometimes missed a legitimate match.
Change-Id: I96ae09c1be8f1b12117bcfc635589e7c149d5df2
This commit ensures elastic-recheck is able to support zuul v2 and v3
simultaneously:
- Add message queries based on v3 completion messages
- Include job-output.txt where console.html was expected
Change-Id: If3d7990d892a9698a112d9ff1dd5160998a4efe6
Depends-On: I7e34206d7968bf128e140468b9a222ecbce3a8f1
13.3 minutes doesn't seem to be adequate anymore for getting the
console logs from elasticsearch, so this change increases the timeout
to 20 minutes.
Change-Id: I77f9d79833e23f2b9cda3622832d4315ea574f4a
Every time we create a FailEvent for a failed-job
gerrit comment event we're reconstructing a Config
object unnecessarily. We can just pass the config in
from the Stream object to the FailEvent object.
Change-Id: Ibd85a4f0e813bc9bfff69de8f4f42951face88e4
Refactor to use a config class to hold all the
params needed so that they can be more easily
overridden and reused across all the
elastic-recheck tools.
In addition, use the new class to make the
jobs_regex and ci_username configurable.
Change-Id: Ic6f115a6882494bf4c087ded4d7cafa557765c28
Graph counts were looking at all history instead of just 10 days
as intended. Update the search to only look at the most recent 10
days.
Change-Id: I9495888a818986b3ac187bac7fd65fbcad6135a3
This makes debugging code gone wrong a bit simpler.
Also fix the other __str__ functions to use __repr__ as well, making it
consistent that objects which want representations implement __repr__
and not __str__.
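The convention looks roughly like this (a sketch with a hypothetical
class, not the actual code):

```python
class FailEvent(object):
    """Objects that want a representation implement __repr__;
    __str__ just delegates to it."""

    def __init__(self, change, rev):
        self.change = change
        self.rev = rev

    def __repr__(self):
        return '<FailEvent change: %s, rev: %s>' % (self.change, self.rev)

    def __str__(self):
        return repr(self)
```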
Change-Id: I6913da8f3ef6a4632d5f1c9d6ed26a38cdcd5e73
Elastic recheck is about failures; all queries should only include
voting changes. We do this by explicitly adding voting:1 to every
query loaded by the query builder.
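A sketch of adding the voting filter to a loaded lucene query string
(hypothetical helper name, not the actual implementation):

```python
def make_voting_query(query):
    """Append voting:1 to a query string so results only include
    voting jobs."""
    return '(%s) AND voting:1' % query
```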
Change-Id: I4bd4827f72d85bf69bf501be2f5744e71de35a3c
pyelasticsearch>1.0 defaults the port to 9200 but logstash.o.o/es
is on port 80, so update the defaults in code and config samples.
Change-Id: Ibb85cd29e1cbc3ff448aa8470854fe0f8bede260
Currently it is not possible to point to a different database or
elastic search engine. Make these configurable by using the
same configuration file used by bot.py.
Also add a logstash url so that it can be configured separately
from elastic search url.
Change-Id: I77e4215765e32c34b67c38e37e5764c6c0e45c84
This commit adds options to the elastic recheck bot configuration
file. This enables users to specify how to connect to an elasticsearch
server and a subunit2sql database, but things will still default to
using the openstack-infra servers to prevent breaking the running
service.
Change-Id: I10db1a568cc01e137e5f4d8a8814b17201c4c438
This commit adds a new field, test_ids, to the query yaml. It is a
list of test_ids that will be used to query the subunit2sql db to
verify that at least one of them failed for the failed build uuid.
Change-Id: If3668709e3294b5d6bf9e1f082396fbc39c08512
* Similar to suppress-graph
There are some gate failures that are expected and are real errors (such
as global-requirements mismatches in requirements jobs).
suppress-notifications allows us to classify these failures and remove
them from the unclassified page while not telling developers to recheck.
This can be used along with suppress-graph.
Change-Id: I6d905ba65e66e799a65598f8a5d5c3dd684feb8c
I252ae31e7a4cb919e3c98c35591147cc96cfc3cc added the pipeline name to the
zuul gerrit comments. Update the string matching here to work with new
comment format.
Change-Id: I7c09b8f40d594733309660ed76647886653e53ec
This records the current time when the data is constructed, the date
of the last valid-looking piece of data in elasticsearch, and how far
behind we seem to be on indexing. The json payload is adjusted so it
can take additional metadata to support displaying this on the ER
page.
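A sketch of the extra metadata (field names hypothetical, not the
actual payload schema):

```python
import time


def pipeline_metadata(last_indexed_ts):
    """Build the metadata dict: when the data was generated, the
    timestamp of the last valid-looking hit in elasticsearch, and
    how far behind indexing appears to be, in seconds."""
    now = time.time()
    return {
        'generated_at': now,
        'last_indexed': last_indexed_ts,
        'behind': max(0, now - last_indexed_ts),
    }
```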
Change-Id: I0068ca0bbe72943d5d92dea704659ed865fea198
The bug_urls_map method is actually returning a list so just sort the
list and fix the tests that are racing due to random hashseed issues
with the dict.
This also updates the docstring which was incorrect before.
Related-Bug: #1348818
Change-Id: I13ca69b3e685083d4ced2b054e0d42a440259854
This forces whitespace between message parts so that, for example, a
URL at the end of the 'unrecognized' part won't get joined with the
first word of the 'footer'.
This change also fixes a hidden (I guess) bug that would have
produced an UnboundLocalError if FailedEvent.get_all_bugs() returns
None.
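The join itself can be sketched like this (hypothetical helper, not
the actual code):

```python
def join_message(parts):
    """Join non-empty comment sections with a single space so, e.g.,
    a URL at the end of one part is not glued to the first word of
    the next."""
    return ' '.join(p.strip() for p in parts if p and p.strip())
```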
Change-Id: I3a44db0b7018c49f87702d900961ea7119081b12
Instead of having the messages inline, we should define them in the
yaml file so that changing the UX for the bot reporting isn't a
code change.
Depends-On: I9208123a4cb3be02c272cd8a6eba460f4130a960
Change-Id: I8fdb07f9964f616addba6e8f25e5bd9de27d077a
It turns out that we broke indexing of grenade logs entirely. This
will at least give us some warning by looking for them in jobs.
Change-Id: Ic6023b9c2cf64ac57eb023a7c6d60c2d1d731550
Elastic Recheck is really 2 things: real time searching and bulk
offline categorization. While the bulk categorization needs to look
over the entire dataset, the real time portion is deadline oriented,
so it only cares about the last hour's worth of data. As such we
really don't need to search *all* the indexes in ES, only the most
recent one (and possibly the one before that if we are near
rotation).
Implement this via a recent= parameter for our search feature. If set
to true then we specify the most recent logstash index. If it turns
out that we're within an hour of rotation, also search the one before
that.
Adjust all the queries the bot uses to be recent=True. This will
hopefully reduce the load generated by the bot on the ES cluster.
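The index selection can be sketched as follows, assuming daily
logstash-YYYY.MM.DD indexes rotated at UTC midnight (function name and
index naming are assumptions, not the actual code):

```python
import datetime


def recent_indexes(now=None):
    """Return today's logstash index and, if we are within an hour
    of the daily UTC rotation, yesterday's index as well."""
    if now is None:
        now = datetime.datetime.utcnow()
    indexes = ['logstash-%s' % now.strftime('%Y.%m.%d')]
    if now.hour == 0:  # within the first hour after rotation
        yesterday = now - datetime.timedelta(days=1)
        indexes.append('logstash-%s' % yesterday.strftime('%Y.%m.%d'))
    return indexes
```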
Change-Id: I0dfc295dd9b381acb67f192174edd6fdde06f24c
Because of the oslotest join, ironic is now in our main gate and is
causing actual main gate failures. Treat it as such for triage
purposes.
Change-Id: Ib43130c3a0eb970dfda79ec422439340ac36bd5d
We shouldn't be reporting back to users why a non-voting job is failing.
Non-voting jobs are non-voting because they are unstable, so we don't
want folks running recheck on a bug for a non-voting job.
Update the unit tests to cover this case.
Change-Id: I61f4e7bb28235d2974f3dcf70187437c80f918d3
If there is an unclassified failure in the check queue, we want to make
it clear to the user so they will investigate the error, as it's most
likely a valid failure. Also don't include recheck instructions when
there is an unclassified failure, as they shouldn't be running a
recheck in that case.
With us now classifying many failures from non-voting jobs, it is common
to see classified failures and no mention of the job that legitimately
failed.
Partial revert of I52044afb4f3a1bf3f22ba4c0e8d38d76271ffc00
Change-Id: I6b471b9ab9c7f36eeed93993ea086bbc9daa56b0
Recently the elasticsearch schema was updated to include a
build_short_uuid field which has indexed the first 7 chars of the
build_uuid. This field is useful because it allows e-r to filter on that
field instead of searching on build_uuid.
Update e-r to filter on build_short_uuid which should make queries much
more performant. As part of this change replace variables named
short_build_uuid with build_short_uuid for consistency with the
elasticsearch schema.
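The derivation of the short uuid and the filter it enables can be
sketched as (helper names hypothetical, not the actual code):

```python
def build_short_uuid(build_uuid):
    """First 7 chars of the build_uuid, matching the indexed
    build_short_uuid field in the elasticsearch schema."""
    return build_uuid[:7]


def add_build_filter(query, build_uuid):
    """Filter a query on build_short_uuid instead of searching the
    full build_uuid, which is much cheaper for ES."""
    return '%s AND build_short_uuid:%s' % (query, build_short_uuid(build_uuid))
```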
Change-Id: Iae5323f3f5d2fd01f2c69f78b9403baf5ebafe85
We really don't care about check failures for classification,
because those might just be terrible code, which we get a lot of.
So only report unclassified tests to the user on gate failures.
Now with extra tests for this behavior!
Change-Id: I52044afb4f3a1bf3f22ba4c0e8d38d76271ffc00
Use the correct time.sleep argument when sleeping. Also replace a
post-for-loop if check with an else clause to make the code more
readable.
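The for/else pattern looks roughly like this (a sketch with
hypothetical names; the else block runs only when the loop finishes
without hitting break):

```python
import time


def wait_for_file(fetch, retries=3, delay=2):
    """Retry fetch() a few times, sleeping between attempts; the
    else clause replaces a post-loop flag check."""
    for _ in range(retries):
        result = fetch()
        if result is not None:
            break
        time.sleep(delay)  # sleep takes seconds as its argument
    else:
        result = None  # every attempt failed (loop never broke)
    return result
```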
Change-Id: Icdfb41d1436abe930e4f45243ff6fe378ba3f91b
Point users to status.openstack.org/elastic-recheck to find further
information and links on the bugs they hit.
Change-Id: I9e6a70151d4f94c574b2eae55ff8ba0172189d7a
A few bugs have crept into elastic-recheck causing it to fail. This
patch fixes them.
* an update to gerritlib broke FailedEvent.rev and change, since both of
these should always be numbers cast to ints
* We appear to be missing files occasionally; add better logging for
this (also simplify the Exception classes)
* Remove the last usage of skip_resolved (removed in a previous patch)
Change-Id: Ifc180989832be152e08a4873e62857a899835484
This reverts commit e75b996e60.
Change is being reverted because we can't actually use a static LOG
object if we expect setup_logging to do the right thing at runtime.
Otherwise Python logging will load logging objects at import time
using the static LOG object before setup_logging can run.
Conflicts:
elastic_recheck/bot.py
elastic_recheck/elasticRecheck.py
Change-Id: I582c7e9c9b3c2ccab6a695bfba00a61f7c0a04a9
Add in neutron, glance, and n-net logs as required files when
appropriate. This will help ensure that we don't miss a pattern
because we searched before the log was in the system.
Change-Id: Ia8f2cdedfc9964f1d9589fda253174e972fcc770
Instead of just listing which bugs were seen in an entire gerrit event
(multiple jenkins/zuul jobs), list which bugs were seen in which job.
If one of the jobs has an unrecognized error, don't display the comment
about running recheck; just list which bugs were seen on which jobs
(and which has an unrecognized error).
Change-Id: I55b2eb8f0efe43ab22540294150d4bc9f5885510
We are starting to track a decent amount of data per zuul/jenkins job,
so track data in an object instead of assorted variables and
dictionaries. For example bugs are now tracked by job and not
gerrit event. Now, we can support reporting which bug caused which
specific job to fail. This also does some assorted object related
cleanups. This consists of internal changes only, a future patch will
make the gerrit and irc comments take advantage of this.
Change-Id: I2116cd0e10b45617a8d572b27f1672f695fa91d0