From 61bb01d49c743b9b57db7555d33b7c2ff51bfc3f Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Fri, 13 Nov 2015 14:07:03 +0100 Subject: [PATCH] Add documentation about alerts on InfluxDB metrics Change-Id: Icd82f0ececeeec1117c303150a603e2c69ae54fa --- doc/source/user.rst | 60 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/doc/source/user.rst b/doc/source/user.rst index 0736846..0a63c54 100644 --- a/doc/source/user.rst +++ b/doc/source/user.rst @@ -149,6 +149,66 @@ There are also two *virtual* hosts representing the service and node clusters: These additional 2 entities offer the high-level view on the healthiness of the OpenStack environment. +Configuring service checks on InfluxDB metrics +---------------------------------------------- + +You could configure addtional alarms (other than those already defined in the +LMA Collector) based on the metrics stored in the InfluxDB database. For +instance, if you wanted to be alerted when the system CPU usage of the +Elasticsearch process reaches a certain threshold, you could setup a 'warning' +alarm at say 30% of CPU usage threshold and a 'criticial' alarm at 50% of CPU +usage threshold. The steps to define those alarms in Nagios would be as follow: + +#. Connect to the *LMA Infrastructure Alerting* node. + +#. Install the Nagios plugin for querying InfluxDB:: + + [root@node-13 ~]# pip install influx-nagios-plugin + +#. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file:: + + # Replace and by the appropriate values for your deployment + define command { + command_line /usr/local/bin/check_influx -h localhost -u -p -d lma -q "$ARG1$" -w $ARG2$ -c $ARG3$ + command_name check_influx + } + + define service { + service_description Elasticsearch system CPU + host node-13 + check_command check_influx!select max(value) from lma_components_cputime_syst where time > now() - 5m and service='elasticsearch' group by time(5m) limit 1!30!50: + use generic-service + } + +#. Verify that the Nagios configuration is valid:: + + [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg + + [snip] + + Total Warnings: 0 + Total Errors: 0 + + Things look okay - No serious problems were detected during the pre-flight check + + +#. Restart the Nagios server:: + + [root@node-13 ~]# /etc/init.d/nagios3 restart + +#. Go the Nagios dashboard and verify that the service check has been added. + + +From there, you can define additional service checks for different hosts or hostgroups using the same ``check_influx`` command. You just need to provide the 3 required arguments when defining the service checks: + +* A valid InfluxDB query that should return only one row with a single value. Check the `InfluxDB documentation `_ to learn how to use InfluxDB query language. + +* A range specification for the warning threshold. + +* A range specification for the critical threshold. + +.. _note: Threshold ranges are defined following the `Nagios format `_. + Troubleshooting ---------------