Using bulk metrics for the log counters greatly reduces the likelihood
of blocking the Heka pipeline. Instead of injecting (x services
* y levels) metric messages, the filter injects only one big message.
This change also updates the configuration of the metric_collector
service to deserialize the bulk metric to support alarms on log
counters.
Change-Id: Icb71fd6faa4191795c0470ecc24aeafd25794f42
Closes-Bug: #1643280
The Lua sandboxes aren't used for now. To give them a try, one could use
the configuration stored in contrib/ceilometer.toml.
blueprint: ceilometer-stacklight-integration
Co-Authored-By: Igor Degtiarov <idegtiarov@mirantis.com>
Co-Authored-By: Ilya Tyaptin <ityaptin@mirantis.com>
Change-Id: I7634dd0ee4f3200d1a82ab26feafa54a8ac74e51
This patch declares 'type' as a local variable to avoid the termination of
the Heka monitoring collector filter due to an attempt to call global
'type' (a nil value).
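The failure mode can be reproduced outside Heka. The following is a minimal sketch, assuming the error came from assigning an unset field to an undeclared variable named 'type'; the functions buggy() and fixed() are hypothetical and not taken from the plugin:

```lua
-- Hypothetical reproduction; function names are illustrative only.
local builtin_type = type   -- keep a handle on the built-in function

-- Buggy variant: 'type' is not declared local, so the assignment
-- overwrites the global built-in (with nil when the field is unset).
local function buggy(fields)
    type = fields.type
    return type
end

-- Fixed variant: the local declaration shadows the built-in inside this
-- function only and leaves the global untouched.
local function fixed(fields)
    local type = fields.type
    return type
end

fixed({})
local ok_after_fixed = pcall(function() return type(1) end)

buggy({})
local ok_after_buggy = pcall(function() return type(1) end)

type = builtin_type   -- restore the global for the rest of the program
```

After fixed() the built-in is still callable, whereas after buggy() any later call to type() raises an error.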
Change-Id: I8358e8a38f6db70e2058fd276263b4825746ed17
Closes-Bug: #1579796
HTTP metrics are now statistics aggregated every 10 seconds.
A new metric, openstack_<service>_response_times, is emitted with these
values:
- min
- max
- sum
- count
- percentile
The previous metric (openstack_<service>_responses) is therefore removed.
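The aggregation over one interval can be sketched as follows. This is an illustration only: the function name aggregate(), the nearest-rank percentile method and the percentile rank passed by the caller are assumptions, since the commit does not state which percentile is computed or how.

```lua
-- Sketch only: 'aggregate' and the percentile method are assumptions.
-- samples: list of response times collected during one interval.
-- percent: percentile rank to report (for example 90).
local function aggregate(samples, percent)
    local n = #samples
    if n == 0 then
        return nil
    end
    -- Work on a sorted copy so the caller's list is left untouched.
    local sorted = {}
    for i = 1, n do
        sorted[i] = samples[i]
    end
    table.sort(sorted)
    local sum = 0
    for i = 1, n do
        sum = sum + sorted[i]
    end
    -- Nearest-rank percentile: the smallest sample such that at least
    -- percent% of the samples are less than or equal to it.
    local rank = math.max(1, math.ceil(percent / 100 * n))
    return {
        min = sorted[1],
        max = sorted[n],
        sum = sum,
        count = n,
        percentile = sorted[rank],
    }
end

-- Response times in milliseconds for one 10-second interval.
local stats = aggregate({120, 50, 300, 80, 200}, 90)
```

Emitting sum and count rather than an average lets a consumer recompute the mean over longer periods without losing accuracy.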
Implements-blueprint: aggregated-http-metrics
Change-Id: I48e92df6f4baa7be942ad138b7f23c3d15f5a24e
This change resets the table that holds data for the Heka monitoring
filter. Otherwise the table may grow without bound and the sandbox will
eventually be killed by Heka.
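The pattern is roughly the following. This is a sketch with hypothetical names (collect, flush, emit), not the filter's actual code:

```lua
-- Hypothetical names; 'emit' stands in for whatever injection function
-- the filter uses.
local datapoints = {}

-- Called for every incoming message: buffer the data point.
local function collect(name, value)
    datapoints[#datapoints + 1] = {name = name, value = value}
end

-- Called periodically: emit everything, then start from an empty table
-- so that memory use stays bounded between two flushes.
local function flush(emit)
    for _, point in ipairs(datapoints) do
        emit(point)
    end
    datapoints = {}
end

local emitted = 0
collect('foo', 1)
collect('bar', 2)
flush(function(point) emitted = emitted + 1 end)
```

Without the reset in flush(), every interval would re-emit the old points and the buffer would only ever grow.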
Change-Id: If8c07944e42700d913831b500466b33831a41482
Partial-Bug: #1545743
It appears that Nagios cannot ingest output which is larger than 1024
bytes so this change makes sure that the Nagios encoder complies with
this requirement.
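A minimal sketch of such a guard, assuming the encoder simply truncates the text; the function name truncate_output() and the '...' marker are illustrative and not taken from the encoder:

```lua
local NAGIOS_MAX_OUTPUT = 1024   -- limit stated above, in bytes

-- Return 'text' unchanged when it fits, otherwise cut it so that the
-- result, including the marker, is exactly 'limit' bytes long.
-- Note: the cut is byte-based and may split a multi-byte character.
local function truncate_output(text, limit)
    limit = limit or NAGIOS_MAX_OUTPUT
    if #text <= limit then
        return text
    end
    local marker = '...'   -- assumed marker, signals that text was cut
    return string.sub(text, 1, limit - #marker) .. marker
end

local short = truncate_output('OK')
local long = truncate_output(string.rep('x', 5000))
```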
Change-Id: I22c7186f0dc6edabe8c3372a8c06197b276a9d4d
Closes-Bug: #1517917
This change moves some functions related to table manipulation from the
lma_utils module to a dedicated module named table_utils.
Change-Id: I2263088d70ef7e9bc617e982a32f2bd26f714af0
A Lua sandbox raises an exception when it tries to inject a message
larger than the configured output_limit value (default: 63KiB). The
same applies to the cjson library when trying to encode a Lua structure
resulting in a string larger than the same limit.
This change adds safe_* versions of the inject_message(),
inject_payload() and cjson.encode() functions. It also modifies the
existing Lua plugins to use the safe versions instead.
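The idea behind the safe_* functions can be sketched with pcall(). The helper make_safe() and the stand-in limited_encode() below are hypothetical; the real wrappers call inject_message(), inject_payload() and cjson.encode(), which are not available outside the sandbox:

```lua
-- Wrap a function that may raise: the wrapper returns the result on
-- success, or nil plus the error message on failure.
local function make_safe(fn)
    return function(...)
        local ok, result = pcall(fn, ...)
        if ok then
            return result
        end
        return nil, result
    end
end

-- Stand-in for a sandbox function that enforces the output limit.
local OUTPUT_LIMIT = 63 * 1024
local function limited_encode(str)
    if #str > OUTPUT_LIMIT then
        error('output_limit exceeded')
    end
    return str
end

local safe_encode = make_safe(limited_encode)
local small = safe_encode('{"a":1}')
local big, err = safe_encode(string.rep('x', OUTPUT_LIMIT + 1))
```

A plugin can then log or drop the oversized message instead of being terminated by the uncaught error.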
Change-Id: I7351783e51efa046d483921cb79e14279178a13a
Closes-Bug: #1504141
This change modifies the implementation of the GSE filters. The main
differences are:
- level-1 dependencies now define the members of a cluster, and the
  status of a cluster is defined by the highest severity among all
  members.
- level-2 dependencies are now known as 'hints'. They define
  relationships between clusters (e.g., Nova depends on Keystone) but
  have no influence on the status of a cluster.
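The "highest severity wins" rule can be sketched as follows. The severity ranking, the fallback to 'unknown' and the function name cluster_status() are assumptions made for the example; hints are deliberately absent because they do not affect the result:

```lua
-- Assumed ranking, from least to most severe.
local SEVERITY_RANK = {
    okay = 0, warning = 1, unknown = 2, critical = 3, down = 4,
}

-- member_statuses maps a member name to its status string.
local function cluster_status(member_statuses)
    if next(member_statuses) == nil then
        return 'unknown'   -- assumption: no members means no information
    end
    local worst = 'okay'
    for _, status in pairs(member_statuses) do
        if SEVERITY_RANK[status] == nil then
            status = 'unknown'   -- unrecognised values are not trusted
        end
        if SEVERITY_RANK[status] > SEVERITY_RANK[worst] then
            worst = status
        end
    end
    return worst
end

local nova = cluster_status({
    ['nova-api'] = 'okay',
    ['nova-scheduler'] = 'critical',
    ['nova-compute'] = 'warning',
})
```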
Change-Id: I58bd79463de78b04b9bad92d02e3fb0da4bacdf4
This patch provides Lua libraries to evaluate metrics against thresholds.
The AFD evaluates a list of alarms, with an alarm defined like the
following:
    name: 'fs-warning'
    description: 'Filesystem usage'
    severity: 'warning'
    trigger:
      logical_operator: 'or'
      rules:
        - metric: fs_space_percent_free
          fields:
            fs: '*'
          relational_operator: '<'
          threshold: 5
          window: 60
          period: 1
          function: avg
where:
- *name* is required and must be unique,
- *description* is required,
- *severity* is one of 'okay', 'warning', 'critical', 'down', 'unknown',
- *logical_operator* is optional (can be 'or' or 'and', default 'or'),
- *metric*, *relational_operator*, *threshold*, *window* and *function*
  are required,
- *fields* is optional.
The AFD evaluates alarms in the specified order and stops evaluation at
the first triggered alarm.
This implementation doesn't fully support the specification. The
current limitations are:
- the supported aggregation functions are max, min, avg, sum, sd and
  variance; last, median, mww and mww_nonparametric are not supported.
- the *periods* rule parameter is supported for these functions only in
  the sense that thresholds are compared over the entire interval
  "window * periods", not between each period. In other words, writing a
  rule with 'window=300/periods=1|0' is equivalent to writing one with
  'window=100/periods=3'.
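The evaluation described above can be sketched as follows. This is not the library's code: the function names, the data point layout ({ts, value} with timestamps in seconds) and the set of relational operators other than '<' are assumptions, *fields* matching is left out, and only four of the aggregation functions are shown.

```lua
-- Reduce a non-empty list with a binary function.
local function fold(values, fn)
    local acc = values[1]
    for i = 2, #values do
        acc = fn(acc, values[i])
    end
    return acc
end

local AGGREGATES = {
    sum = function(v) return fold(v, function(a, b) return a + b end) end,
    min = function(v) return fold(v, math.min) end,
    max = function(v) return fold(v, math.max) end,
}
AGGREGATES.avg = function(v) return AGGREGATES.sum(v) / #v end

local COMPARE = {
    ['<'] = function(a, b) return a < b end,
    ['<='] = function(a, b) return a <= b end,
    ['>'] = function(a, b) return a > b end,
    ['>='] = function(a, b) return a >= b end,
    ['=='] = function(a, b) return a == b end,
}

-- A rule is evaluated over the whole "window * periods" interval, with
-- periods 0 or unset treated as 1.
local function rule_matches(rule, datapoints, now)
    local span = rule.window * math.max(rule.periods or 1, 1)
    local values = {}
    for _, point in ipairs(datapoints) do
        if point.ts > now - span and point.ts <= now then
            values[#values + 1] = point.value
        end
    end
    if #values == 0 then
        return false   -- assumption: no data means the rule does not fire
    end
    -- 'function' is a reserved word in Lua, hence the bracket syntax.
    local aggregated = AGGREGATES[rule['function']](values)
    return COMPARE[rule.relational_operator](aggregated, rule.threshold)
end

local function alarm_triggered(alarm, datapoints_by_metric, now)
    local rules = alarm.trigger.rules
    local op = alarm.trigger.logical_operator or 'or'
    for _, rule in ipairs(rules) do
        local hit = rule_matches(rule, datapoints_by_metric[rule.metric] or {}, now)
        if op == 'or' and hit then return true end
        if op == 'and' and not hit then return false end
    end
    return op == 'and' and #rules > 0
end

local alarm = {
    name = 'fs-warning', severity = 'warning',
    trigger = {logical_operator = 'or', rules = {{
        metric = 'fs_space_percent_free', relational_operator = '<',
        threshold = 5, window = 60, periods = 1, ['function'] = 'avg',
    }}},
}
local series = {fs_space_percent_free = {
    {ts = 950, value = 4}, {ts = 970, value = 3}, {ts = 990, value = 6},
}}
local triggered = alarm_triggered(alarm, series, 1000)
```

In this example the average over the last 60 seconds is below the threshold of 5, so the alarm fires.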
Change-Id: Ia739ceb080971e3b7bb5a2212275d2a15d65d3e9
Level 2 dependencies are only some hints about current status but don't
modify the status of the cluster.
Change-Id: I2f41bc5b26af93c9083bf92ccd4c866841826224
This change removes the Heka filters that computed the service
statuses. It also cleans up the Puppet code that referred to them.
Change-Id: Ib6c1c9054333b9e71f5a8a2f08600eae5d287816
This change introduces a new type of Heka message called 'bulk_metric'.
A bulk metric message can be emitted by any filter plugin using the
add_to_metric() and inject_bulk_metric() functions from the lma_utils
module:
    local ts = read_message('Timestamp')
    utils.add_to_metric('foo', 1, {tag1 = 'value1'})
    utils.add_to_metric('bar', 2, {})
    utils.inject_bulk_metric(ts, 'node-1', 'custom_filter')
The structure of the message injected in the Heka pipeline will be:
    Timestamp: <ts>
    Severity: INFO
    Hostname: node-1
    Payload: >
      [{"name":"foo","value":1,"tags":{"tag1":"value1"}},
       {"name":"bar","value":2,"tags":[]}]
    Fields:
      - source: custom_filter
      - hostname: node-1
Eventually the bulk metric message is caught by the InfluxDB
accumulator filter and encoded using the InfluxDB line protocol.
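To illustrate the last step, here is a rough sketch of turning decoded bulk metric entries into InfluxDB line protocol. The function names and the single 'value' field are assumptions; the accumulator filter's actual encoding may differ (for instance, it may add the hostname as a tag).

```lua
-- Line protocol escaping: measurements escape commas and spaces; tag
-- keys and values also escape the equals sign.
local function escape_measurement(s)
    return (tostring(s):gsub('([, ])', '\\%1'))
end

local function escape_tag(s)
    return (tostring(s):gsub('([,= ])', '\\%1'))
end

-- metric: {name = ..., value = <number>, tags = {...}} as found in the
-- bulk metric payload once decoded.
local function encode_point(metric, timestamp_ns)
    local keys = {}
    for key in pairs(metric.tags or {}) do
        keys[#keys + 1] = key
    end
    table.sort(keys)   -- sorted tag keys give a stable output
    local head = {escape_measurement(metric.name)}
    for _, key in ipairs(keys) do
        head[#head + 1] = escape_tag(key) .. '=' .. escape_tag(metric.tags[key])
    end
    return string.format('%s value=%s %d',
        table.concat(head, ','), tostring(metric.value), timestamp_ns)
end

local function encode_bulk(metrics, timestamp_ns)
    local lines = {}
    for _, metric in ipairs(metrics) do
        lines[#lines + 1] = encode_point(metric, timestamp_ns)
    end
    return table.concat(lines, '\n')
end

local body = encode_bulk({
    {name = 'foo', value = 1, tags = {tag1 = 'value1'}},
    {name = 'bar', value = 2, tags = {}},
}, 42)
```

One bulk message therefore becomes one multi-line write body, with one line per metric.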
Change-Id: I96986fd8287d65ae018c7636f9dd745dba2fc761
Implements: blueprint upgrade-influxdb-grafana
This patch introduces the following changes:
* Message type is now 'status' (instead of 'event').
* Send one 'status' message per service instead of a list of 'events'.
* Annotation titles are now formatted in the influxdb-annotations filter.
The status message structure is:
    {
      Timestamp = timestamp in nanoseconds,
      Payload = a list of events that occurred during the last period (JSON encoded),
      Type = 'status', -- prepended with 'heka.sandbox',
      Severity = INFO by default or mapped from the 'status code' below,
      Fields = {
        service = the service name (e.g. 'nova'),
        status = the general status code of the service,
        previous_status = the previous general status code,
        updated = a boolean indicating whether the status has been updated,
      }
    }
The mapping from 'status code' to severity is the following:
* OKAY -> INFO
* WARN -> WARNING
* FAIL -> CRITICAL
* UNKNOWN -> NOTICE
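The mapping can be sketched as a lookup table. The example assumes that status codes are represented by their names and that Severity carries the numeric syslog level; the function name status_message() is hypothetical.

```lua
-- Syslog severity levels (RFC 5424 numbering).
local SYSLOG = {CRITICAL = 2, WARNING = 4, NOTICE = 5, INFO = 6}

local STATUS_TO_SEVERITY = {
    OKAY = SYSLOG.INFO,
    WARN = SYSLOG.WARNING,
    FAIL = SYSLOG.CRITICAL,
    UNKNOWN = SYSLOG.NOTICE,
}

-- Build the table describing one 'status' message for a service.
-- events_json is the already JSON-encoded list of events.
local function status_message(service, status, previous_status, events_json, timestamp_ns)
    return {
        Timestamp = timestamp_ns,
        Payload = events_json,
        Type = 'status',
        -- INFO is the default when the status is not in the mapping.
        Severity = STATUS_TO_SEVERITY[status] or SYSLOG.INFO,
        Fields = {
            service = service,
            status = status,
            previous_status = previous_status,
            updated = status ~= previous_status,
        },
    }
end

local msg = status_message('nova', 'FAIL', 'OKAY', '[]', 0)
```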
implements blueprint alerting-lma-collector
Change-Id: Id92a5cb905fb477adb3d0455c89bf50cf51afb1a
Distinguish "global status" and "service status":
Global status is one of:
* OKAY
* WARN
* FAIL
* UNKNOWN
Service status is one of:
* UP
* DEGRADED
* DOWN
* UNKNOWN
Change-Id: Id3d8b2237788d8710b309197575aa1a82a90400a
A first Heka filter catches all service-related metrics to consolidate
service states and periodically emits a message containing the whole
information.
A second Heka filter consumes the previous message to compute the detailed
status of OpenStack services and emits two kinds of messages:
- status metrics: the general status of the service, the HAProxy backend
  server status, and the per service/agent status when available (nova,
  cinder, neutron).
- events describing status transitions. These events will be handled by a
  future filter to fill InfluxDB and enable annotations on Grafana graphs.
Note that events are only emitted by the node where the "vip__public"
Pacemaker resource is active, to avoid duplicated events.
The general status of a service depends on underlying metrics:
- API checks: openstack.<service>.check_api
- HAProxy backend states: haproxy.backend.<backend>.server.(up|down)
- service/agent states for nova, cinder, neutron:
  openstack.<service>.(services|agents).(up|down)
Status is one of OK, DEGRADED, DOWN, UNKNOWN.
Change-Id: Ifa34edadce87e1ecdd131315462f80b49c7edd6d
Some log files may contain messages generated before RSYSLOG is fully
configured (for instance, /var/log/kern.log). This change adds a
fallback Syslog grammar that handles this kind of message.
Change-Id: I9184abda924fbbb7d19884ead6775331d41f9468
This is an import of the initial LMA PoC code. For now, it only covers
the collection of logs (notifications will be added in a subsequent
commit).
There has been a bit of rewriting to:
- decouple the Heka configuration from the LMA collector.
- run the Heka service as non-root when possible (Ubuntu only for now
due to file permission issues on CentOS [1]).
- adapt to version 0.9 of Heka.
[1] https://bugs.launchpad.net/fuel/+bug/1425954
Change-Id: I4472b49a25e18e06984b5b29bdce18f917137bc8