Add documentation about alerts on InfluxDB metrics
Change-Id: Icd82f0ececeeec1117c303150a603e2c69ae54fa
This commit is contained in:
parent
6a14e46a21
commit
61bb01d49c
|
@ -149,6 +149,66 @@ There are also two *virtual* hosts representing the service and node clusters:
|
|||
These additional 2 entities offer the high-level view on the healthiness of the
|
||||
OpenStack environment.
|
||||
|
||||
Configuring service checks on InfluxDB metrics
|
||||
----------------------------------------------
|
||||
|
||||
You could configure addtional alarms (other than those already defined in the
|
||||
LMA Collector) based on the metrics stored in the InfluxDB database. For
|
||||
instance, if you wanted to be alerted when the system CPU usage of the
|
||||
Elasticsearch process reaches a certain threshold, you could setup a 'warning'
|
||||
alarm at say 30% of CPU usage threshold and a 'criticial' alarm at 50% of CPU
|
||||
usage threshold. The steps to define those alarms in Nagios would be as follow:
|
||||
|
||||
#. Connect to the *LMA Infrastructure Alerting* node.
|
||||
|
||||
#. Install the Nagios plugin for querying InfluxDB::
|
||||
|
||||
[root@node-13 ~]# pip install influx-nagios-plugin
|
||||
|
||||
#. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
|
||||
|
||||
# Replace <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by the appropriate values for your deployment
|
||||
define command {
|
||||
command_line /usr/local/bin/check_influx -h localhost -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma -q "$ARG1$" -w $ARG2$ -c $ARG3$
|
||||
command_name check_influx
|
||||
}
|
||||
|
||||
define service {
|
||||
service_description Elasticsearch system CPU
|
||||
host node-13
|
||||
check_command check_influx!select max(value) from lma_components_cputime_syst where time > now() - 5m and service='elasticsearch' group by time(5m) limit 1!30!50:
|
||||
use generic-service
|
||||
}
|
||||
|
||||
#. Verify that the Nagios configuration is valid::
|
||||
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
|
||||
[snip]
|
||||
|
||||
Total Warnings: 0
|
||||
Total Errors: 0
|
||||
|
||||
Things look okay - No serious problems were detected during the pre-flight check
|
||||
|
||||
|
||||
#. Restart the Nagios server::
|
||||
|
||||
[root@node-13 ~]# /etc/init.d/nagios3 restart
|
||||
|
||||
#. Go the Nagios dashboard and verify that the service check has been added.
|
||||
|
||||
|
||||
From there, you can define additional service checks for different hosts or hostgroups using the same ``check_influx`` command. You just need to provide the 3 required arguments when defining the service checks:
|
||||
|
||||
* A valid InfluxDB query that should return only one row with a single value. Check the `InfluxDB documentation <https://influxdb.com/docs/v0.9/query_language/index.html>`_ to learn how to use InfluxDB query language.
|
||||
|
||||
* A range specification for the warning threshold.
|
||||
|
||||
* A range specification for the critical threshold.
|
||||
|
||||
.. _note: Threshold ranges are defined following the `Nagios format <https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT>`_.
|
||||
|
||||
Troubleshooting
|
||||
---------------
|
||||
|
||||
|
|
Loading…
Reference in New Issue