When ElasticSearch indexing is behind and the date has rolled over
to a new day, the latest index is not yet available for use. Exclude
it from searches until it is ready in order to avoid the
ElasticHttpNotFoundError.
Add unit tests for this case as well as for when multiple days
are specified for the search.
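A minimal sketch of the idea, not the code from this change: build the candidate index names for the requested number of days and keep only those that already exist. The function name, the `existing` argument, and the `logstash-YYYY.MM.DD` naming pattern are assumptions made for illustration.

```python
import datetime


def searchable_indexes(days, existing, today=None, fmt="logstash-%Y.%m.%d"):
    """Return index names for the last `days` days, skipping missing ones.

    `existing` is any container of index names known to exist; how that
    set is obtained from the cluster is left out of this sketch.
    """
    if today is None:
        today = datetime.datetime.now(datetime.timezone.utc).date()
    names = [
        (today - datetime.timedelta(days=n)).strftime(fmt) for n in range(days)
    ]
    # Drop indexes that have not been created yet (e.g. today's index
    # when indexing is running behind).
    return [name for name in names if name in existing]
```

With three days requested and today's index missing, only the two older indexes would be searched.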
Change-Id: Ifd27d1ab21bebcb63b48ea164f425c4a2ac8759c
The e-r graph and uncategorized fails jobs are both
currently failing because we're getting back hits
with empty data for the 'timestamp' attribute because
the value is too large, e.g.:
"[FIELDDATA] Data too large, data for [@timestamp] would "
"be larger than limit of [22469214208/20.9gb]]"
To work around this for now we check to see if the hit
item list has anything in it before returning the
data for the facet results.
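The guard described above could look roughly like this. It is a sketch under assumptions: hits are modelled as plain dicts whose '@timestamp' value is a list, and the function name is hypothetical.

```python
def facet_timestamps(hits):
    """Collect one timestamp per hit, skipping hits with no timestamp data."""
    values = []
    for hit in hits:
        stamps = hit.get("@timestamp") or []
        if not stamps:
            # The field came back empty (or missing); skip it rather than
            # indexing into an empty list.
            continue
        values.append(stamps[0])
    return values
```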
We should follow this up by logging errors on hits that
have bad data and we should remove those indexes.
Change-Id: Icf19af6580632ef52a55d3fb4bed3bced140024a
Closes-Bug: #1630355
This code doesn't work at all. Bring it back to life.
Also accept inputs from a config file.
Closes-Bug: #1526921
Change-Id: I8f45dc9d42f7547f9d849686739b9a641c176814
Since we weren't autodoc'ing the modules during the docs build, the
module index link was broken.
This generates the module docs (but hides them from the main top-level
table of contents) so they can be accessed via the 'Module Index' link.
Also cleans up some docs issues so that warnerrors=True works during
sphinx-build.
Closes-Bug: #1472642
Change-Id: I5a3a16d1e81b12237452d5a3a3f7f0cc42618e88
Elastic Recheck is really 2 things, real time searching, and bulk
offline categorization. While the bulk categorization needs to look
over the entire dataset, the real time portion is really deadline
oriented, so it only cares about the last hour's worth of data. As such
we really don't need to search *all* the indexes in ES, but only
the most recent one (and possibly the one before that if we are near
rotation).
Implement this via a recent= parameter for our search feature. If set
to true then we specify the most recent logstash index. If it turns
out that we're within an hour of rotation, also search the one before
that.
Adjust all the queries the bot uses to be recent=True. This will
hopefully reduce the load generated by the bot on the ES cluster.
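As an illustration of the index selection, here is a sketch that assumes daily indexes named `logstash-YYYY.MM.DD` which rotate at midnight UTC. Both the naming pattern and the function name are assumptions, not taken from this change.

```python
import datetime


def recent_indexes(now, fmt="logstash-%Y.%m.%d"):
    """Index names a recent=True search would cover at UTC time `now`."""
    indexes = [now.strftime(fmt)]
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if now - midnight < datetime.timedelta(hours=1):
        # Within an hour of rotation: the last hour of data still spans
        # the previous day's index, so search that one as well.
        indexes.append((now - datetime.timedelta(days=1)).strftime(fmt))
    return indexes
```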
Change-Id: I0dfc295dd9b381acb67f192174edd6fdde06f24c
Use dateutil to be more flexible in parsing timestamps. A recent
upgrade to ElasticSearch changed the timestamp format to use '+00:00' to
note the timezone instead of 'Z'.
Co-Authored-By: Joe Gordon <joe.gordon0@gmail.com>
Change-Id: I11f441ba3bf7ba46c55921352fcc87eb5d1ce3ae
The new ElasticSearch uses the +00:00 notation instead of 'Z' to signify
the timezone. Since we have both old and new data this change is
backward compatible.
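A sketch of one backward compatible way to accept both notations using only the standard library. The function name and the exact timestamp layout (fractional seconds present) are assumptions for illustration, and, like the patch, it treats the timezone as always UTC.

```python
import datetime


def parse_timestamp(value):
    """Parse a timestamp ending in 'Z' or '+00:00' into a naive UTC datetime."""
    for suffix in ("+00:00", "Z"):
        if value.endswith(suffix):
            # Strip the timezone marker; both forms mean UTC here.
            value = value[: -len(suffix)]
            break
    return datetime.datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
```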
Change-Id: Iaccb6a21b6929826e08f3adfc0b601e4a90fa4d5
Note: this patch assumes the timezone is always +00:00
Our ElasticSearch cluster previously used '@message', but we have since
moved over to using just 'message'. The rest of the uses of '@message'
were removed in I6fb0aa87a291660df879282e9a7851bbb27e9ac2
Change-Id: I2b5d0f176deddb1b1ab9e831395c3216e927d8bf
We are parsing at microsecond resolution, but the previous
floor methodology was only zeroing out seconds, not also
microseconds. This caused bucket alignment issues and broke the
graphs page.
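To show why the microseconds matter, here is a small sketch (the function name is hypothetical): two timestamps in the same minute only land in the same bucket if both seconds and microseconds are zeroed.

```python
import datetime


def floor_timestamp(ts):
    """Zero out both seconds and microseconds so bucket keys line up."""
    return ts.replace(second=0, microsecond=0)


a = datetime.datetime(2013, 12, 1, 10, 5, 30, 250000)
b = datetime.datetime(2013, 12, 1, 10, 5, 59, 999)
# Zeroing only the seconds leaves the microseconds behind, so the two
# values still differ and end up in different buckets.
seconds_only_match = a.replace(second=0) == b.replace(second=0)
full_floor_match = floor_timestamp(a) == floor_timestamp(b)
```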
Change-Id: I688bb4bc9ef9fee2167dd2e94a25f060d4025afd
We need to support different histogram resolutions. This adds a
new parameter which is the number of seconds to bucket on.
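Bucketing on a number of seconds can be sketched as flooring the seconds-since-epoch value to a multiple of the resolution. The function and parameter names below are illustrative assumptions, not the names used in this change.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def bucket_start(ts, res=3600):
    """Floor a naive UTC datetime to the start of its `res`-second bucket."""
    seconds = int((ts - EPOCH).total_seconds())
    # Subtract the remainder to land on a bucket boundary; int() above
    # already discarded the microseconds.
    return EPOCH + datetime.timedelta(seconds=seconds - seconds % res)
```

Passing `res=600` would give ten minute buckets, and the default keeps the existing one hour behaviour.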
Change-Id: If839c238f93a07b17240c8774e826f3217d447ef
Bring histogram facets (currently hard coded to 1h buckets) into
the FacetSet module for "timestamp" keys from elastic search. This
then enables us to gut a bunch of code from graph.py and do all
the calculations with facet counts instead.
At the same time, make all the graphs run for a full 2 weeks of
data, so they are comparable to each other visually (the sliding
window start time was less useful in seeing how the graphs
compared).
Change-Id: I971d52b5de514d0607bd8217837aed3895472d05
This is an implementation of facets, client side, with elastic
search results. This will let us get rid of a bunch of the
uniquify code in the graph and check_success scripts, and make
it simpler to analyze by other dimensions in web console additions.
Also make Hit implement __getitem__ for easier dynamic access of
contents. Useful for programmatically accessing tags.
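Client side faceting can be sketched as grouping hits into nested dicts, one level per key. This is an illustration only: hits are modelled as plain dicts, and the function name and the field names in the usage note are assumptions.

```python
from collections import defaultdict


def facet(hits, keys):
    """Group hits into nested dicts, one nesting level per key in `keys`."""
    if not keys:
        return list(hits)
    groups = defaultdict(list)
    for hit in hits:
        # Relies on item access (hit[key]), which is what __getitem__
        # provides on a Hit-like object.
        groups[hit[keys[0]]].append(hit)
    return {value: facet(group, keys[1:]) for value, group in groups.items()}
```

For example, `facet(hits, ["build_status", "build_uuid"])` would return a dict keyed by status whose values are dicts keyed by uuid, each holding the matching hits.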
Change-Id: Ib63ff887eb82cff0ba00109471ee48d210fda571
It turns out I was spending way too much time making ResultSet act
like a list, when I could have just made it inherit from list and
been done with it. It manages to remove code and work just the same.
In addition, make the __repr__ for Hit be more meaningful by using
pprint. Makes print debugging of all the datastructures actually
work like you expect.
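A stripped-down sketch of the approach, not the actual classes: the constructor only assumes the standard ElasticSearch search response layout, where the hits live under `results["hits"]["hits"]`.

```python
import pprint


class Hit:
    def __init__(self, hit):
        self._hit = hit

    def __repr__(self):
        # Pretty-print the raw structure so print debugging is readable.
        return pprint.pformat(self._hit)


class ResultSet(list):
    """A list of Hit objects; len(), indexing and iteration come for free."""

    def __init__(self, results):
        super().__init__(Hit(h) for h in results["hits"]["hits"])
```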
Change-Id: Ie104d4bfc06a0875f8da85121742c053b642f8f9
* elastic_recheck/results.py(Hit): Simple but subtle typographical
error in branching conditionals caused at_attr to be searched for in
_source even if attr was already present there. Brown bag fix.
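The intended lookup order can be sketched as below. This is an illustration of the precedence only, not the code from results.py: the plain name wins, and the '@' prefixed name is consulted only when the plain name is absent.

```python
class Hit:
    def __init__(self, hit):
        self._source = hit["_source"]

    def __getattr__(self, attr):
        if attr.startswith("_"):
            # Avoid recursing if _source itself has not been set yet.
            raise AttributeError(attr)
        at_attr = "@" + attr
        if attr in self._source:
            return self._source[attr]
        elif at_attr in self._source:
            # Fallback for older data that used '@'-prefixed field names.
            return self._source[at_attr]
        raise AttributeError(attr)
```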
Change-Id: I730f7b7c74a9d772edd0bf483f0089523cb5f6e8
In an attempt at long term simplification of the source tree, this
is the beginning of a ResultSet and Hit object type. The ResultSet
is constructed from the ElasticSearch returned json structure, and
it builds hits internally.
ResultSet is an iterator, and indexable, so that you can easily loop
through them. Both ResultSet and Hit objects have dynamic attributes
to make accessing the deep data structures easier (and without having
to make everything explicit), and also handling the multiline collapse
correctly.
A basic set of tests is included, as well as sample json dumps for all
the current bugs in the system for additional unit testing. Fortunately
this includes bugs which have hits, and those that don't.
In order to use ResultSet we need to pass everything through
our own SearchEngine object, so we get results back as expected.
We also need to teach ResultSet about facets, as those get used
when attempting to find specific files.
Lastly, we need __len__ implementation for ResultSet to support
the wait loop correctly.
ResultSet lets us simplify a bit of the code in elasticRecheck,
so port it over.
There is a short term fix in the test_classifier test to get us
working here until real stub data can be applied.
Change-Id: I7b0d47a8802dcf6e6c052f137b5f9494b1b99501