Add documentation about alerts on InfluxDB metrics

Change-Id: Icd82f0ececeeec1117c303150a603e2c69ae54fa
2015-11-13 14:07:03 +01:00 · 2015-11-13 14:07:03 +01:00 · 61bb01d49c
parent 6a14e46a21
commit 61bb01d49c
1 changed files with 60 additions and 0 deletions
--- a/doc/source/user.rst
+++ b/doc/source/user.rst
@ -149,6 +149,66 @@ There are also two *virtual* hosts representing the service and node clusters:
 These additional 2 entities offer the high-level view on the healthiness of the
 OpenStack environment.

+Configuring service checks on InfluxDB metrics
+----------------------------------------------
+
+You could configure addtional alarms (other than those already defined in the
+LMA Collector) based on the metrics stored in the InfluxDB database. For
+instance, if you wanted to be alerted when the system CPU usage of the
+Elasticsearch process reaches a certain threshold, you could setup a 'warning'
+alarm at say 30% of CPU usage threshold and a 'criticial' alarm at 50% of CPU
+usage threshold. The steps to define those alarms in Nagios would be as follow:
+
+#. Connect to the *LMA Infrastructure Alerting* node.
+
+#. Install the Nagios plugin for querying InfluxDB::
+
+    [root@node-13 ~]# pip install influx-nagios-plugin
+
+#. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
+
+    # Replace <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by the appropriate values for your deployment
+    define command {
+            command_line /usr/local/bin/check_influx -h localhost -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma -q "$ARG1$" -w $ARG2$ -c $ARG3$
+            command_name check_influx
+    }
+
+    define service {
+        service_description Elasticsearch system CPU
+        host                node-13
+        check_command       check_influx!select max(value) from lma_components_cputime_syst where time > now() - 5m and service='elasticsearch' group by time(5m) limit 1!30!50:
+        use                 generic-service
+    }
+
+#. Verify that the Nagios configuration is valid::
+
+    [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
+
+       [snip]
+
+    Total Warnings: 0
+    Total Errors:   0
+
+    Things look okay - No serious problems were detected during the pre-flight check
+
+
+#. Restart the Nagios server::
+
+    [root@node-13 ~]# /etc/init.d/nagios3 restart
+
+#. Go the Nagios dashboard and verify that the service check has been added.
+
+
+From there, you can define additional service checks for different hosts or hostgroups using the same ``check_influx`` command. You just need to provide the 3 required arguments when defining the service checks:
+
+* A valid InfluxDB query that should return only one row with a single value. Check the `InfluxDB documentation <https://influxdb.com/docs/v0.9/query_language/index.html>`_ to learn how to use InfluxDB query language.
+
+* A range specification for the warning threshold.
+
+* A range specification for the critical threshold.
+
+.. _note: Threshold ranges are defined following the `Nagios format <https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT>`_.
+
 Troubleshooting
 ---------------