From ae323992ae32a3bf0cbe1ecd465e3a7c407bdc1d Mon Sep 17 00:00:00 2001 From: Swann Croiset Date: Tue, 30 Jun 2015 11:24:38 +0200 Subject: [PATCH] Initial spec for the plugin implements blueprint lma-infra-alerting-plugin Change-Id: I324973840ec2de04ae1514d2eb2c71523d2895dc --- specs/lma-infra-alerting.rst | 367 +++++++++++++++++++++++++++++++++++ 1 file changed, 367 insertions(+) create mode 100644 specs/lma-infra-alerting.rst diff --git a/specs/lma-infra-alerting.rst b/specs/lma-infra-alerting.rst new file mode 100644 index 0000000..3acec57 --- /dev/null +++ b/specs/lma-infra-alerting.rst @@ -0,0 +1,367 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +===================================================== +Fuel plugin for the OpenStack Infrastructure Alarming +===================================================== + + +https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin + +The `LMA Infrastructure Alerting` plugin is composed of several services +running on a node (base-os role). It provides alerting functionality for the +OpenStack Infrastructure inside the `LMA toolchain` [1]_ plugins suite. + + +Problem description +=================== + +The current implementation of the `LMA toolchain` [1]_ doesn't provide the +alerting functionality. + +This specification aims to address the following use cases: + +* OpenStack operator(s) want to be notified when the status of a component + within the infrastructure changes: + + * OpenStack service status has changed (for example OKAY -> FAIL) + * Cluster (RabbitMQ, MySQL, ..) status has changed (for example OKAY -> WARN) + * ... + +* OpenStack operators want to configure thresholds on the metrics collected by + the LMA collector and be notified when a metric crosses its threshold. + Operators should be able to configure alarms with their own threshold against + any of the available metrics collected by `LMA collector`: + + * Load average is too high on a controller node. + * File system is nearly full on a node. + * CPU usage is too high on a controller node. + * ... + +Proposed changes +================ + +Implement a Fuel plugin that will install and configure the LMA infrastructure +alerting system for an OpenStack environment. + +The initial implementation of this plugin plans to install and configure +Nagios [2]_ to manage alerts and send notifications to operators by email. + +There are two types of alerts which are initially supported: + + * Leverage the service status determinations computed by the `LMA collector` + plugins (OKAY, WARN, FAIL, UNKNOWN). + * Provide the ability to configure alarms over metrics by querying the + time series database provided by the `Influxdb-Grafana` plugin [8]_ + +In order to implement these features into the `LMA toolchain` it's necessary +to: + +0. Configure Nagios server. + +1. Plug the `LMA collector` [3]_ to this new alerting system with the native + Hekad [4]_ NagiosOutputPlugin [5]_ with HTTP method. + Following example shows the configuration of Heka and Nagios for the + Nova status: + +.. code:: + + # Heka configuation example + [NagiosOutput] + url = "http:///nagios3/cgi-bin/cmd.cgi" + username = "nagiosadmin" + password = "supersecret" + nagios_host = openstack-services" + nagios_service_description = "openstack.nova.status" + + # Nagios configuration + define service { + check_command return-unknown-openstack.nova.status + check_freshness 1 + check_interval 30 + contact_groups openstack-admin + display_name openstack.nova.status + host_name openstack-services-env9 + freshness_threshold 45 + max_check_attempts 1 + retry_interval 30 + passive_checks_enabled 1 + active_checks_enabled 0 + process_perf_data 0 + service_description openstack.nova.status + use generic-service + } + + +2. Integrate [7]_ or develop a Nagios plugin that will query metrics from the + InfluxDB database and trigger alerts when certain thresholds are met. + Note that this implies to declare all the nodes as hosts in the Nagios + configuration. + + Following example is the configuration of an alert on CPU usage for + primary controller: + +.. code:: + + # Nagios configuration to check CPU usage of nodes + define command { + command_name = check_cpu_for_host + command_line = check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$ + } + + define host { + host_name = node-2 + display_name = primary-controller + address = 10.109.0.4 + contact_groups = openstack-admin + .. + } + + # Check CPU usage with threshold set to 75% for WARNING and 95% for critical + define service { + service_description = CPU usage + host_name = node-2 + contact_groups = openstack-admin + check_command = check_cpu_for_host!75!95 + ... + } + +The resulting InfluxDB 0.8 query would be : + +.. code:: + + select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m) + +With InfluxDB 0.9 the corresponding tag is used to filter per node: + +.. code:: + + select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m) + + +Alternatives +------------ + +There are plenty of alerting solutions but Nagios is the dominant open +source monitoring solution. Hence Nagios brings a robust and proven solution +which matches perfectly both to our alerting use case and the integration within +a legacy infrastructure monitoring. + +It may be possible to leverage other open source solutions to complete and/or +replace Nagios in future. + +Writing a new alerting system would be also possible either by polling +the time serie database or by performing realtime computation of metrics. +But this would require to be scalable and would need to reinvent lots of things +that already exist. + +Alert severities +---------------- + +The service statutes computed by the `LMA collector` are mapped with the states +defined by Nagios by this way: + ++---------------+----------+ +| LMA collector | Nagios | ++===============+==========+ +| OKAY | OK | ++---------------+----------+ +| WARN | WARNING | ++---------------+----------+ +| FAIL | CRITICAL | ++---------------+----------+ +| UNKNOWN | UNKNOWN | ++---------------+----------+ + +Contacts, Alerting and Escalation +--------------------------------- + +The plugin allows to configure one email address to receive notifications, +it's up to the user to select which kind of event he/she will receive: + +* critical +* warning +* unknown +* recovery + +There is no escalation configuration enabled by the plugin. The user still have +the possiblity to configure it manually after the deployment of the plugin. + +Limitations +----------- + +Adding and removing node(s) to/from the OpenStack cluster won't re-configure +the Nagios server. + +This is a limitation of the Fuel Plugin Framework which doesn't trigger `task` +when those actions are performed. This limitation should be addressed by a +Fuel blueprint [9]_ in the future but might be not ready for MOS 7.0. + +This limitation is leading the user to adjust manually the Nagios +configuration: + + * to not receive alert notifications about a deleted node, + * to add the new node(s) to Nagios configuration. + +A possible workaround for the 'adding case' would be to use a SSH command from +the new node(s) deployed to run the appropriate Puppet manifest on the Nagios +node. This workaround may be investigated eventually but not in the first place. + +Data model impact +----------------- + +None + +REST API impact +--------------- +None + +Upgrade impact +-------------- + +If you want to use the LMA alerting plugin, you will have to upgrade your +LMA collector plugin too. + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +The Nagios server can have several ``active checks`` which poll servers/services +and can lead to add extra workload on these targets. + +This impact is minimized here by both: + * the usage of ``passive checks`` (ie. Nagios receives status but doesn't poll) + * Nagios doesn't poll servers to retrieve metrics but queries the time series + database. + + +Other deployer impact +--------------------- + +New configuration options: + +* email address of the operator +* SMTP gateway (optional) + +Developer impact +---------------- + +None + +Infrastructure impact +--------------------- + +None + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Swann Croiset (developer) + +Other contributors: + Guillaume Thouvenin (developer) + Simon Pasquier (feature lead, developer) + +Work Items +---------- + +* Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios + + * Nagios server: main configuration. + * Nagios CGI (Web interface) served by Apache [10]_ and PhP [11]_. + * Nagios Objects configuration: Commands, Services, Hosts and Contacts. + +* Add support for Nagios output plugin of the LMA collector. + +* Implement or integrate [7]_ the Nagios plugin to query InfluxDB for alarm + evaluation over metrics. + +* Testing. + +* Write the documentation. + +Dependencies +============ + +* Fuel 6.1 and higher. + +* LMA Collector Fuel plugin. + +Testing +======= + +* Prepare a test plan. + +* Test the plugin by deploying environments with all Fuel deployment modes and + the LMA toolchain configured. + +* Create integration tests with the LMA toolchain + +Acceptance criteria +------------------- + +* The operator can login to the Nagios web interface. +* The operator must be notified by email when the state of an + OpenStack service change (OK -> DOWN, OK -> WARN, DOWN -> OK). +* The operator can define own alerts based on InfluxDB metrics and receive + notifications when the thresholds are reached. + +Documentation Impact +==================== + + +* Write the User Guide for this plugin: deploy and configure the solution. + +* Test Plan. + +* Test Report. + +References +========== + +.. [1] The LMA toolchain is currently composed of several Fuel plugins: + + * LMA collector plugin + * InfluxDB-Grafana plugin + * Elasticsearch-Kibana plugin + +.. [2] http://nagios.org + +.. [3] https://github.com/stackforge/fuel-plugin-lma-collector + +.. [4] http://hekad.readthedocs.org/ + +.. [5] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html + +.. [6] http://www.influxdb.com/ + +.. [7] https://github.com/shaharke/influx-nagios-plugin + +.. [8] https://github.com/stackforge/fuel-plugin-influxdb-grafana + +.. [9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes + +.. [10] http://httpd.apache.org + +.. [11] http://php.net