Add documentation about alerts on InfluxDB metrics

Change-Id: Icd82f0ececeeec1117c303150a603e2c69ae54fa
This commit is contained in:
Simon Pasquier 2015-11-13 14:07:03 +01:00
parent 6a14e46a21
commit 61bb01d49c
1 changed files with 60 additions and 0 deletions

View File

@ -149,6 +149,66 @@ There are also two *virtual* hosts representing the service and node clusters:
These additional 2 entities offer the high-level view on the healthiness of the
OpenStack environment.
Configuring service checks on InfluxDB metrics
----------------------------------------------
You could configure addtional alarms (other than those already defined in the
LMA Collector) based on the metrics stored in the InfluxDB database. For
instance, if you wanted to be alerted when the system CPU usage of the
Elasticsearch process reaches a certain threshold, you could setup a 'warning'
alarm at say 30% of CPU usage threshold and a 'criticial' alarm at 50% of CPU
usage threshold. The steps to define those alarms in Nagios would be as follow:
#. Connect to the *LMA Infrastructure Alerting* node.
#. Install the Nagios plugin for querying InfluxDB::
[root@node-13 ~]# pip install influx-nagios-plugin
#. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
# Replace <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by the appropriate values for your deployment
define command {
command_line /usr/local/bin/check_influx -h localhost -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma -q "$ARG1$" -w $ARG2$ -c $ARG3$
command_name check_influx
}
define service {
service_description Elasticsearch system CPU
host node-13
check_command check_influx!select max(value) from lma_components_cputime_syst where time > now() - 5m and service='elasticsearch' group by time(5m) limit 1!30!50:
use generic-service
}
#. Verify that the Nagios configuration is valid::
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
[snip]
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
#. Restart the Nagios server::
[root@node-13 ~]# /etc/init.d/nagios3 restart
#. Go the Nagios dashboard and verify that the service check has been added.
From there, you can define additional service checks for different hosts or hostgroups using the same ``check_influx`` command. You just need to provide the 3 required arguments when defining the service checks:
* A valid InfluxDB query that should return only one row with a single value. Check the `InfluxDB documentation <https://influxdb.com/docs/v0.9/query_language/index.html>`_ to learn how to use InfluxDB query language.
* A range specification for the warning threshold.
* A range specification for the critical threshold.
.. _note: Threshold ranges are defined following the `Nagios format <https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT>`_.
Troubleshooting
---------------