Initial spec for the plugin
implements blueprint lma-infra-alerting-plugin

Change-Id: I324973840ec2de04ae1514d2eb2c71523d2895dc
parent 24b12c78cc
commit ae323992ae

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Fuel plugin for the OpenStack Infrastructure Alarming
=====================================================

https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin

The `LMA Infrastructure Alerting` plugin is composed of several services
running on a node (base-os role). It provides alerting functionality for the
OpenStack infrastructure within the `LMA toolchain` [1]_ suite of plugins.

Problem description
===================

The current implementation of the `LMA toolchain` [1]_ doesn't provide any
alerting functionality.

This specification aims to address the following use cases:

* OpenStack operators want to be notified when the status of a component
  within the infrastructure changes:

  * OpenStack service status has changed (for example OKAY -> FAIL)
  * Cluster (RabbitMQ, MySQL, ...) status has changed (for example OKAY -> WARN)
  * ...

* OpenStack operators want to configure thresholds on the metrics collected by
  the `LMA collector` and be notified when a metric crosses its threshold.
  Operators should be able to configure alarms with their own thresholds
  against any of the metrics collected by the `LMA collector`:

  * Load average is too high on a controller node.
  * File system is nearly full on a node.
  * CPU usage is too high on a controller node.
  * ...

Proposed changes
================

Implement a Fuel plugin that installs and configures the LMA infrastructure
alerting system for an OpenStack environment.

The initial implementation of this plugin installs and configures
Nagios [2]_ to manage alerts and send notifications to operators by email.

Two types of alerts are supported initially:

* Alerts on service status, which leverage the statuses computed by the
  `LMA collector` plugins (OKAY, WARN, FAIL, UNKNOWN).
* Alerts on metrics, which are configured as thresholds and evaluated by
  querying the time series database provided by the `InfluxDB-Grafana`
  plugin [8]_.

To implement these features in the `LMA toolchain`, it is necessary to:

0. Configure the Nagios server.

1. Plug the `LMA collector` [3]_ into this new alerting system using the native
   Hekad [4]_ NagiosOutput plugin [5]_ with its HTTP method.
   The following example shows the configuration of Heka and Nagios for the
   Nova status:

   .. code::

      # Heka configuration example
      [NagiosOutput]
      url = "http://<node-nagios>/nagios3/cgi-bin/cmd.cgi"
      username = "nagiosadmin"
      password = "supersecret"
      # Must match the host_name of the Nagios service defined below
      nagios_host = "openstack-services-env9"
      nagios_service_description = "openstack.nova.status"

      # Nagios configuration
      define service {
        check_command           return-unknown-openstack.nova.status
        check_freshness         1
        check_interval          30
        contact_groups          openstack-admin
        display_name            openstack.nova.status
        host_name               openstack-services-env9
        freshness_threshold     45
        max_check_attempts      1
        retry_interval          30
        passive_checks_enabled  1
        active_checks_enabled   0
        process_perf_data       0
        service_description     openstack.nova.status
        use                     generic-service
      }

2. Integrate [7]_ or develop a Nagios plugin that queries metrics from the
   InfluxDB [6]_ database and triggers alerts when certain thresholds are
   reached. Note that this requires declaring all the nodes as hosts in the
   Nagios configuration.

   The following example shows the configuration of an alert on CPU usage for
   the primary controller:

   .. code::

      # Nagios configuration to check CPU usage of nodes
      define command {
        command_name    check_cpu_for_host
        command_line    check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$
      }

      define host {
        host_name       node-2
        display_name    primary-controller
        address         10.109.0.4
        contact_groups  openstack-admin
        ...
      }

      # Check CPU usage with thresholds set to 75% for WARNING and 95% for CRITICAL
      define service {
        service_description  CPU usage
        host_name            node-2
        contact_groups       openstack-admin
        check_command        check_cpu_for_host!75!95
        ...
      }

   The resulting InfluxDB 0.8 query would be:

   .. code::

      select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m)

   With InfluxDB 0.9, the corresponding tag is used to filter per node:

   .. code::

      select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m)

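The threshold evaluation described in step 2 can be sketched as follows. This is only an illustration: the spec names a ``check_influx_for_host`` command but does not define its implementation, so the function names, the ``>=`` comparison and the handling of missing data are all assumptions. The sketch makes no InfluxDB call; it builds the 0.8 query string shown above and maps a value to the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def build_query(node, metric="cpu", window="1m"):
    """Build the InfluxDB 0.8 query from the spec for a node (hypothetical helper)."""
    return (
        "select mean(value) from merge(/%s.%s.\\d+.user/) "
        "where time > now() - %s group by time(%s)" % (node, metric, window, window)
    )


def evaluate(value, warning, critical):
    """Map a metric value to a Nagios exit code and a status line.

    `value` is None when the query returned no data point (assumed to be
    reported as UNKNOWN). Thresholds are assumed to trigger when reached (>=).
    """
    if value is None:
        return UNKNOWN, "UNKNOWN - no data point returned"
    if value >= critical:
        code = CRITICAL
    elif value >= warning:
        code = WARNING
    else:
        code = OK
    return code, "%s - value=%.1f (warning=%s, critical=%s)" % (
        STATE_NAMES[code], value, warning, critical)
```

A real plugin would print the status line and exit with the returned code, which is how Nagios reads the result of a check such as ``check_cpu_for_host!75!95``.
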
Alternatives
------------

There are plenty of alerting solutions, but Nagios is the dominant open source
monitoring solution. It is robust and proven, and it fits both our alerting use
case and integration into a legacy infrastructure monitoring system.

It may be possible to leverage other open source solutions to complement and/or
replace Nagios in the future.

Writing a new alerting system would also be possible, either by polling the
time series database or by computing metrics in real time. However, such a
system would have to be scalable, and building it would mean reinventing many
things that already exist.

Alert severities
----------------

The service statuses computed by the `LMA collector` are mapped to the states
defined by Nagios as follows:

+---------------+----------+
| LMA collector | Nagios   |
+===============+==========+
| OKAY          | OK       |
+---------------+----------+
| WARN          | WARNING  |
+---------------+----------+
| FAIL          | CRITICAL |
+---------------+----------+
| UNKNOWN       | UNKNOWN  |
+---------------+----------+

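The mapping above can be written as a small lookup. The sketch below also formats a passive check result in the Nagios external command syntax (``PROCESS_SERVICE_CHECK_RESULT``) to show where the return code ends up. The function names are hypothetical, and mapping unrecognized statuses to UNKNOWN is an assumption. In the proposed design Heka submits the result itself, so this is not the plugin's code.

```python
# Nagios return codes for service check results.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Mapping from the table above: LMA collector status -> Nagios state.
LMA_TO_NAGIOS = {
    "OKAY": "OK",
    "WARN": "WARNING",
    "FAIL": "CRITICAL",
    "UNKNOWN": "UNKNOWN",
}


def to_nagios_state(lma_status):
    """Return (state name, return code); unrecognized statuses map to UNKNOWN."""
    state = LMA_TO_NAGIOS.get(lma_status, "UNKNOWN")
    return state, NAGIOS_CODES[state]


def passive_check_command(host, service, lma_status, timestamp):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    The state name is reused as the plugin output for simplicity.
    """
    state, code = to_nagios_state(lma_status)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, code, state)
```

For example, ``passive_check_command("openstack-services-env9", "openstack.nova.status", "WARN", 1430000000)`` yields a line whose return code field is ``1``.
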
Contacts, Alerting and Escalation
---------------------------------

The plugin allows configuring one email address to receive notifications. It is
up to the user to select which kinds of events to be notified about:

* critical
* warning
* unknown
* recovery

The plugin doesn't enable any escalation configuration. The user can still
configure escalation manually after the plugin is deployed.

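One way to turn the selected event kinds into Nagios configuration is the ``service_notification_options`` directive of a contact, which takes the letters ``w``, ``u``, ``c`` and ``r`` for these four kinds (and ``n`` for none). The helper below is a hypothetical sketch; the spec does not say how the plugin generates the contact definition.

```python
# Letters accepted by the Nagios `service_notification_options` directive
# for the four event kinds listed above.
OPTION_LETTERS = {"warning": "w", "unknown": "u", "critical": "c", "recovery": "r"}


def notification_options(selected):
    """Return the option string for the selected event kinds ("n" if none)."""
    unsupported = set(selected) - set(OPTION_LETTERS)
    if unsupported:
        raise ValueError(
            "unsupported event kind(s): %s" % ", ".join(sorted(unsupported)))
    # Keep a stable order regardless of the order of `selected`.
    letters = [letter for kind, letter in OPTION_LETTERS.items() if kind in selected]
    return ",".join(letters) or "n"
```

A user who selects only critical and recovery events would get ``c,r``.
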
Limitations
-----------

Adding nodes to or removing nodes from the OpenStack cluster won't reconfigure
the Nagios server.

This is a limitation of the Fuel Plugin Framework, which doesn't trigger a
`task` when those actions are performed. This limitation should be addressed by
a Fuel blueprint [9]_ in the future, but it might not be ready for MOS 7.0.

Because of this limitation, the user has to adjust the Nagios configuration
manually:

* to stop receiving alert notifications about a deleted node,
* to add the new node(s) to the Nagios configuration.

A possible workaround for the node addition case would be to run an SSH command
from the newly deployed node(s) that applies the appropriate Puppet manifest on
the Nagios node. This workaround may be investigated later, but it is not part
of the initial implementation.

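Until the limitation is lifted, an operator who adds a node has to write its host object by hand. The sketch below renders one with the same directives as the ``define host`` example earlier in this spec. The helper name and the sample node are hypothetical. A complete definition needs more directives, which are typically inherited from a template through ``use``.

```python
HOST_TEMPLATE = """define host {{
  host_name       {name}
  display_name    {display_name}
  address         {address}
  contact_groups  {contact_groups}
}}
"""


def render_host(name, address, display_name=None, contact_groups="openstack-admin"):
    """Render a Nagios host object for a node added after the deployment."""
    return HOST_TEMPLATE.format(
        name=name,
        display_name=display_name or name,  # fall back to the host name
        address=address,
        contact_groups=contact_groups,
    )
```

Calling ``render_host("node-7", "10.109.0.9")`` produces a block that can be appended to the Nagios hosts configuration before reloading Nagios.
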
Data model impact
-----------------

None

REST API impact
---------------

None

Upgrade impact
--------------

If you want to use the LMA alerting plugin, you will have to upgrade your
LMA collector plugin too.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

The Nagios server can run ``active checks`` that poll servers and services,
which adds extra workload on these targets.

This impact is minimized here in two ways:

* Service statuses are received through ``passive checks`` (i.e. Nagios
  receives the status but doesn't poll).
* Nagios doesn't poll servers to retrieve metrics but queries the time series
  database instead.

Other deployer impact
---------------------

New configuration options:

* email address of the operator
* SMTP gateway (optional)

Developer impact
----------------

None

Infrastructure impact
---------------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Swann Croiset <scroiset@mirantis.com> (developer)

Other contributors:
  Guillaume Thouvenin <gthouvenin@mirantis.com> (developer)
  Simon Pasquier <spasquier@mirantis.com> (feature lead, developer)

Work Items
----------

* Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios:

  * Nagios server: main configuration.
  * Nagios CGI (Web interface) served by Apache [10]_ and PHP [11]_.
  * Nagios objects configuration: commands, services, hosts and contacts.

* Add support for the Nagios output plugin of the LMA collector.

* Implement or integrate [7]_ the Nagios plugin that queries InfluxDB to
  evaluate alarms over metrics.

* Testing.

* Write the documentation.

Dependencies
============

* Fuel 6.1 and higher.

* LMA Collector Fuel plugin.

Testing
=======

* Prepare a test plan.

* Test the plugin by deploying environments with all Fuel deployment modes and
  the LMA toolchain configured.

* Create integration tests with the LMA toolchain.

Acceptance criteria
-------------------

* The operator can log in to the Nagios web interface.
* The operator must be notified by email when the state of an
  OpenStack service changes (OKAY -> FAIL, OKAY -> WARN, FAIL -> OKAY).
* The operator can define their own alerts based on InfluxDB metrics and
  receive notifications when the thresholds are reached.

Documentation Impact
====================

* Write the User Guide for this plugin: deploy and configure the solution.

* Test Plan.

* Test Report.

References
==========

.. [1] The LMA toolchain is currently composed of several Fuel plugins:

   * LMA collector plugin
   * InfluxDB-Grafana plugin
   * Elasticsearch-Kibana plugin

.. [2] http://nagios.org

.. [3] https://github.com/stackforge/fuel-plugin-lma-collector

.. [4] http://hekad.readthedocs.org/

.. [5] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html

.. [6] http://www.influxdb.com/

.. [7] https://github.com/shaharke/influx-nagios-plugin

.. [8] https://github.com/stackforge/fuel-plugin-influxdb-grafana

.. [9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes

.. [10] http://httpd.apache.org

.. [11] http://php.net