Initial spec for the plugin

implements blueprint lma-infra-alerting-plugin

Change-Id: I324973840ec2de04ae1514d2eb2c71523d2895dc
This commit is contained in:
Swann Croiset 2015-06-30 11:24:38 +02:00
parent 24b12c78cc
commit ae323992ae
1 changed files with 367 additions and 0 deletions

View File

@ -0,0 +1,367 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Fuel plugin for the OpenStack Infrastructure Alarming
=====================================================
https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin
The `LMA Infrastructure Alerting` plugin is composed of several services
running on a node (base-os role). It provides alerting functionality for the
OpenStack Infrastructure inside the `LMA toolchain` [1]_ plugins suite.
Problem description
===================
The current implementation of the `LMA toolchain` [1]_ doesn't provide the
alerting functionality.
This specification aims to address the following use cases:
* OpenStack operator(s) want to be notified when the status of a component
within the infrastructure changes:
* OpenStack service status has changed (for example OKAY -> FAIL)
* Cluster (RabbitMQ, MySQL, ..) status has changed (for example OKAY -> WARN)
* ...
* OpenStack operators want to configure thresholds on the metrics collected by
the LMA collector and be notified when a metric crosses its threshold.
Operators should be able to configure alarms with their own threshold against
any of the available metrics collected by `LMA collector`:
* Load average is too high on a controller node.
* File system is nearly full on a node.
* CPU usage is too high on a controller node.
* ...
Proposed changes
================
Implement a Fuel plugin that will install and configure the LMA infrastructure
alerting system for an OpenStack environment.
The initial implementation of this plugin plans to install and configure
Nagios [2]_ to manage alerts and send notifications to operators by email.
There are two types of alerts which are initially supported:
* Leverage the service status determinations computed by the `LMA collector`
plugins (OKAY, WARN, FAIL, UNKNOWN).
* Provide the ability to configure alarms over metrics by querying the
time series database provided by the `Influxdb-Grafana` plugin [8]_
In order to implement these features into the `LMA toolchain` it's necessary
to:
0. Configure Nagios server.
1. Plug the `LMA collector` [3]_ to this new alerting system with the native
Hekad [4]_ NagiosOutputPlugin [5]_ with HTTP method.
Following example shows the configuration of Heka and Nagios for the
Nova status:
.. code::
# Heka configuation example
[NagiosOutput]
url = "http://<node-nagios>/nagios3/cgi-bin/cmd.cgi"
username = "nagiosadmin"
password = "supersecret"
nagios_host = openstack-services"
nagios_service_description = "openstack.nova.status"
# Nagios configuration
define service {
check_command return-unknown-openstack.nova.status
check_freshness 1
check_interval 30
contact_groups openstack-admin
display_name openstack.nova.status
host_name openstack-services-env9
freshness_threshold 45
max_check_attempts 1
retry_interval 30
passive_checks_enabled 1
active_checks_enabled 0
process_perf_data 0
service_description openstack.nova.status
use generic-service
}
2. Integrate [7]_ or develop a Nagios plugin that will query metrics from the
InfluxDB database and trigger alerts when certain thresholds are met.
Note that this implies to declare all the nodes as hosts in the Nagios
configuration.
Following example is the configuration of an alert on CPU usage for
primary controller:
.. code::
# Nagios configuration to check CPU usage of nodes
define command {
command_name = check_cpu_for_host
command_line = check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$
}
define host {
host_name = node-2
display_name = primary-controller
address = 10.109.0.4
contact_groups = openstack-admin
..
}
# Check CPU usage with threshold set to 75% for WARNING and 95% for critical
define service {
service_description = CPU usage
host_name = node-2
contact_groups = openstack-admin
check_command = check_cpu_for_host!75!95
...
}
The resulting InfluxDB 0.8 query would be :
.. code::
select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m)
With InfluxDB 0.9 the corresponding tag is used to filter per node:
.. code::
select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m)
Alternatives
------------
There are plenty of alerting solutions but Nagios is the dominant open
source monitoring solution. Hence Nagios brings a robust and proven solution
which matches perfectly both to our alerting use case and the integration within
a legacy infrastructure monitoring.
It may be possible to leverage other open source solutions to complete and/or
replace Nagios in future.
Writing a new alerting system would be also possible either by polling
the time serie database or by performing realtime computation of metrics.
But this would require to be scalable and would need to reinvent lots of things
that already exist.
Alert severities
----------------
The service statutes computed by the `LMA collector` are mapped with the states
defined by Nagios by this way:
+---------------+----------+
| LMA collector | Nagios |
+===============+==========+
| OKAY | OK |
+---------------+----------+
| WARN | WARNING |
+---------------+----------+
| FAIL | CRITICAL |
+---------------+----------+
| UNKNOWN | UNKNOWN |
+---------------+----------+
Contacts, Alerting and Escalation
---------------------------------
The plugin allows to configure one email address to receive notifications,
it's up to the user to select which kind of event he/she will receive:
* critical
* warning
* unknown
* recovery
There is no escalation configuration enabled by the plugin. The user still have
the possiblity to configure it manually after the deployment of the plugin.
Limitations
-----------
Adding and removing node(s) to/from the OpenStack cluster won't re-configure
the Nagios server.
This is a limitation of the Fuel Plugin Framework which doesn't trigger `task`
when those actions are performed. This limitation should be addressed by a
Fuel blueprint [9]_ in the future but might be not ready for MOS 7.0.
This limitation is leading the user to adjust manually the Nagios
configuration:
* to not receive alert notifications about a deleted node,
* to add the new node(s) to Nagios configuration.
A possible workaround for the 'adding case' would be to use a SSH command from
the new node(s) deployed to run the appropriate Puppet manifest on the Nagios
node. This workaround may be investigated eventually but not in the first place.
Data model impact
-----------------
None
REST API impact
---------------
None
Upgrade impact
--------------
If you want to use the LMA alerting plugin, you will have to upgrade your
LMA collector plugin too.
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
The Nagios server can have several ``active checks`` which poll servers/services
and can lead to add extra workload on these targets.
This impact is minimized here by both:
* the usage of ``passive checks`` (ie. Nagios receives status but doesn't poll)
* Nagios doesn't poll servers to retrieve metrics but queries the time series
database.
Other deployer impact
---------------------
New configuration options:
* email address of the operator
* SMTP gateway (optional)
Developer impact
----------------
None
Infrastructure impact
---------------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Swann Croiset <scroiset@mirantis.com> (developer)
Other contributors:
Guillaume Thouvenin <gthouvenin@mirantis.com> (developer)
Simon Pasquier <spasquier@mirantis.com> (feature lead, developer)
Work Items
----------
* Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios
* Nagios server: main configuration.
* Nagios CGI (Web interface) served by Apache [10]_ and PhP [11]_.
* Nagios Objects configuration: Commands, Services, Hosts and Contacts.
* Add support for Nagios output plugin of the LMA collector.
* Implement or integrate [7]_ the Nagios plugin to query InfluxDB for alarm
evaluation over metrics.
* Testing.
* Write the documentation.
Dependencies
============
* Fuel 6.1 and higher.
* LMA Collector Fuel plugin.
Testing
=======
* Prepare a test plan.
* Test the plugin by deploying environments with all Fuel deployment modes and
the LMA toolchain configured.
* Create integration tests with the LMA toolchain
Acceptance criteria
-------------------
* The operator can login to the Nagios web interface.
* The operator must be notified by email when the state of an
OpenStack service change (OK -> DOWN, OK -> WARN, DOWN -> OK).
* The operator can define own alerts based on InfluxDB metrics and receive
notifications when the thresholds are reached.
Documentation Impact
====================
* Write the User Guide for this plugin: deploy and configure the solution.
* Test Plan.
* Test Report.
References
==========
.. [1] The LMA toolchain is currently composed of several Fuel plugins:
* LMA collector plugin
* InfluxDB-Grafana plugin
* Elasticsearch-Kibana plugin
.. [2] http://nagios.org
.. [3] https://github.com/stackforge/fuel-plugin-lma-collector
.. [4] http://hekad.readthedocs.org/
.. [5] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html
.. [6] http://www.influxdb.com/
.. [7] https://github.com/shaharke/influx-nagios-plugin
.. [8] https://github.com/stackforge/fuel-plugin-influxdb-grafana
.. [9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes
.. [10] http://httpd.apache.org
.. [11] http://php.net