StackLight 0.10.0 documentation updates

Change-Id: Ib7aeffae78bb1e88cdc3a654bb2825d859b60439
Patrick Petit 2016-06-28 18:40:22 +02:00
parent d7d89723c7
commit c1e2c54af0
1 changed files with 143 additions and 91 deletions


@ -8,22 +8,22 @@ User Guide
Plugin configuration Plugin configuration
-------------------- --------------------
To configure your plugin, you need to follow these steps: To configure the **StackLight Infrastructure Alerting Plugin**, you need to follow these steps:
1. `Create a new environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#launch-wizard-to-create-new-environment>`_ 1. `Create a new environment
with the Fuel web user interface. <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/create-environment/start-create-env.html>`_.
#. Click the **Settings** tab and select the **Other** category. 2. Click on the *Settings* tab of the Fuel web UI and select the *Other* category.
#. Scroll down through the settings until you find the **LMA Infrastructure Alerting 3. Scroll down through the settings until you find the *StackLight Infrastructure
Plugin** section. You should see a page like this. Alerting Plugin* section. You should see a page like this.
.. image:: ../images/lma_infrastructure_alerting_settings.png .. image:: ../images/lma_infrastructure_alerting_settings.png
:width: 800 :width: 800
:align: center :align: center
#. Check the *LMA Infrastructure Alerting Plugin* box and fill in the required fields 4. Tick the *StackLight Infrastructure Alerting Plugin* box and fill in the required
as indicated below. fields as indicated below.
a. Change the Nagios web interface password (recommended). a. Change the Nagios web interface password (recommended).
#. Check the boxes corresponding to the type of notification you would #. Check the boxes corresponding to the type of notification you would
@ -34,43 +34,53 @@ To configure your plugin, you need to follow these steps:
#. Specify the SMTP authentication method. #. Specify the SMTP authentication method.
#. Specify the SMTP username and password (required if the authentication method isn't *None*). #. Specify the SMTP username and password (required if the authentication method isn't *None*).
#. When you are done with the settings, scroll down to the bottom of the page and click 5. `Configure your environment
the **Save Settings** button. <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment.html>`_.
#. Click the *Nodes* tab and assign the *LMA Infrastructure Alerting* role to nodes .. note:: By default, StackLight is configured to use the *management network*,
as shown below. You can see in this example that the *Infrastructure_Alerting* of the so-called `Default Node Network Group
role is assigned to three different nodes along with the *Elasticsearch_Kibana* role <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment/network-settings.html>`_.
and the *InfluxDB_Grafana* role. This means that the three plugins of the LMA toolchain While this default setup may be appropriate for small deployments or
can be installed on the same nodes. evaluation purposes, it is recommended not to use this network
for StackLight in production. Instead it is recommended to create a network
dedicated to StackLight. Using a dedicated network for monitoring should
improve the performance of StackLight and minimize the monitoring footprint
on the control-plane. It will also facilitate access to the Nagios web UI
after deployment. Please refer to the *StackLight Deployment Guide*
for further information about that subject.
6. Click the *Nodes* tab and assign the *Infrastructure_Alerting* role
to the node(s) where you want to install the plugin.
You can see in the example below that the *Infrastructure_Alerting*
role is assigned to three nodes alongside the
*Elasticsearch_Kibana* role and the *InfluxDB_Grafana* role.
Here, the three backend plugins of the LMA toolchain are
installed on the same nodes.
.. image:: ../images/lma_infrastructure_alerting_role.png .. image:: ../images/lma_infrastructure_alerting_role.png
:width: 800 :width: 800
:align: center :align: center
.. note:: You can assign the *Infrastructure_Alerting* role up to three nodes. .. note:: Nagios clustering for high availability requires that you assign
Nagios clustering for high availability requires that you assign the *Infrastructure_Alerting* role to at least three nodes.
the *Infrastructure_Alerting* role to at least three nodes. Note also that Note also that it is possible to add or remove nodes with the
it is possible to add or remove a node with the *Infrastructure_Alerting* *Infrastructure_Alerting* role after deployment.
role after deployment.
#. Click on **Apply Changes**. 7. `Adjust the disk partitioning if necessary
<http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment/customize-partitions.html>`_.
#. Adjust the disk configuration if necessary (see the `Fuel User Guide By default, the StackLight Infrastructure Alerting Plugin allocates:
<http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#disk-partitioning>`_
for details). By default, the *LMA Infrastructure Alerting Plugin* allocates:
* 20% of the first available disk for the operating system by honoring a range of * 20% of the first available disk for the operating system
15GB minimum and 50GB maximum, by honoring a range of 15GB minimum and 50GB maximum,
* 10GB for */var/log*, * 10GB for */var/log*,
* At least 20 GB for the Nagios data in */var/nagios*. * At least 20 GB for the Nagios data in ``/var/nagios``.
#. `Configure your environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#configure-your-environment>`_ The deployment will fail if the above requirements are not met.
as needed.
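The default operating-system allocation described above (20% of the first available disk, bounded to a range of 15GB to 50GB) can be illustrated with a minimal sketch. The function name ``os_partition_gb`` is hypothetical and is not part of the plugin; the sketch only reproduces the arithmetic stated in the text.

```python
def os_partition_gb(first_disk_gb: float) -> float:
    """Return the default OS allocation in GB for a given first-disk size.

    Assumption: the rule is exactly "20% of the disk, clamped to the
    15GB..50GB range", as stated in the guide. The plugin's real
    partitioning logic may differ in rounding or units.
    """
    share = first_disk_gb * 20 / 100  # 20% of the first available disk
    return max(15.0, min(50.0, share))  # honor the 15GB minimum and 50GB maximum


if __name__ == "__main__":
    for size in (50, 100, 400):
        print(f"{size} GB disk: {os_partition_gb(size)} GB for the operating system")
```

Under this reading, a small disk is still charged the 15GB minimum, which is one reason the remaining space may fall short of the */var/log* and Nagios data requirements.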
#. `Verify the networks <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#verify-networks>`_ 8. `Deploy your environment
on the Networks tab of the Fuel web UI. <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/deploy-environment.html>`_.
#. And finally, `Deploy <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#deploy-changes>`_ your changes.
.. _plugin_install_verification: .. _plugin_install_verification:
@ -78,81 +88,123 @@ Plugin verification
------------------- -------------------
Be aware, that depending on the number of nodes and deployment setup, Be aware, that depending on the number of nodes and deployment setup,
deploying a Mirantis OpenStack environment can typically take anything deploying a Mirantis OpenStack environment may typically take between
from 30 minutes to several hours. But once your deployment is complete, 20 minutes and several hours. Once your deployment is complete,
you should see a deployment success notification message with you should see a deployment success notification message with
a link to the Nagios dashboard as shown below. a link to the Nagios web UI as shown below.
.. image:: ../images/deployment_notification.png .. image:: ../images/deployment_notification.png
:align: center :align: center
:width: 800 :width: 800
From the Fuel web UI **Dashboard** view, click on the **Nagios** link. Click on the *Nagios* link.
Once you have authenticated (username is ``nagiosadmin`` and the
password is defined in the settings of the plugin), you should be directed to
the *Nagios Home Page* as shown below.
.. note:: Be aware that Nagios is attached to the *management network*. Once you are authenticated,
Your desktop machine must have access to the OpenStack environment's you should be redirected to the **Nagios Home Page** as shown below.
*management network* you just created to get access to the Nagios dashboard.
.. image:: ../images/nagios_homepage.png .. image:: ../images/nagios_homepage.png
:align: center :align: center
:width: 800 :width: 800
Managing Nagios .. note:: *username* is ``nagiosadmin`` by default, *password* is defined
--------------- in the settings.
You can get the current status of the OpenStack environment by clicking on .. note:: Be aware that if Nagios is installed on the *management network*,
the *Services* menu item as shown below. you may not have direct access to the Nagios web UI. Some extra network
configuration may be required to create a tunnel to the *management network*.
Using Nagios
------------
The StackLight Infrastructure Alerting Plugin configures Nagios
to display the health status of all the nodes and services running
in the OpenStack environment. The alarms (or service checks in Nagios
terms) are created in **passive mode** which means that the actual
checks are not performed by Nagios itself, but by the Collector
and Aggregator agents of the LMA toolchain.
The best place to get an overview of your OpenStack environment
is the **Services Dashboard**.
If you click the *Services* link in the left panel of the
Nagios web UI, you should see a page like this:
.. image:: ../images/nagios_services.png .. image:: ../images/nagios_services.png
:align: center :align: center
:width: 800 :width: 800
The *LMA Infrastructure Alerting Plugin* configures Nagios for all the In this dashboard, there are two 'virtual hosts' representing
hosts and services that have been deployed in the environment. The alarms (or the health status of the so-called **global clusters** and
service checks in Nagios terms) are created in **passive mode** as **node clusters** entities:
they are received from the *LMA Collector* and *Aggregator* (see the `LMA
Collector documentation <http://fuel-plugin-lma-collector.readthedocs.io/>`_
for more details).
.. note:: The alert notifications for the nodes and clusters of nodes are * *00-global-clusters-env${ENVID}* is used to represent the
disabled by default to avoid the alert fatigue and because they are not aggregated health status of global clusters like 'Nova',
necessarily indicative of a condition affecting the overall health state 'Keystone' or 'RabbitMQ' to name a few.
of an OpenStack service cluster. If you nonetheless want to enable those alerts,
go to the service details page and click on the *Enable notifications * *00-node-clusters-env${ENVID}* is used to represent the
for this service* link within the *Service Commands* panel as shown below. aggregated health status of node clusters like
'Controller', 'Compute' and 'Storage'.
Following the 'virtual hosts' sections, there is a list
of checks received for each of the nodes provisioned in the
environment. These checks may vary depending on the role of
the node being monitored.
Alerting for the global cluster entities is enabled by default.
Alerting for the nodes and clusters of nodes is disabled
by default to avoid alert fatigue, since those alerts are
not necessarily representative of a critical condition affecting
the overall health status of the global cluster entities.
If you nonetheless want to enable those alerts, you can go
to the service details page and click on the *Enable notifications
for this service* link within the *Service Commands* panel as shown below.
.. image:: ../images/nagios_enable_notifs.png .. image:: ../images/nagios_enable_notifs.png
:align: center :align: center
:width: 800 :width: 800
There are also two *Virtual Hosts* representing the health state of the Finally, you should pay attention to the fact that there is
*service clusters* and *node clusters*: a direct dependency between the configuration of the passive
checks in Nagios and the `configuration of the alarms in
the Collectors
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/alarms.html>`_.
A change in ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` on any of the
nodes monitored by StackLight requires Nagios to be reconfigured.
It also implies that these two files should be kept
rigorously identical on all the nodes of the environment
**including those where Nagios is installed**. Fortunately,
StackLight provides Puppet artefacts to help you out with
that task. To reconfigure the passive checks in Nagios
when ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` are modified
you should run the command shown below on all the nodes where
Nagios is installed::
* *00-global-clusters-env${ENVID}* for the service clusters like the Nova # puppet apply --modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules/ \
cluster, the Keystone cluster, the RabbitMQ cluster and so on. /etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
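Because ``alarming.yaml`` and ``gse_filters.yaml`` are expected to be identical on every node, it can help to compare copies of them before re-running Puppet. The following is a minimal sketch that assumes the files have already been copied from each node into local paths (how they are fetched is out of scope). The helper names ``sha256_of`` and ``group_by_digest`` are hypothetical and are not provided by StackLight.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_by_digest(paths) -> dict:
    """Group file paths by content digest.

    A result with a single key means all copies are byte-for-byte
    identical; more than one key means at least one node has drifted.
    """
    groups = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(str(path))
    return groups


def report(paths) -> None:
    """Print which copies share the same content (illustrative only)."""
    for digest, members in group_by_digest(paths).items():
        print(digest[:12], members)
```

For example, calling ``report`` on the copies of ``alarming.yaml`` gathered from all nodes prints one line per distinct version of the file.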
* *00-node-clusters-env${ENVID}* for the physical node clusters like the Configuring service checks using the InfluxDB metrics
cluster of controller nodes, the cluster of storage nodes and so on. -----------------------------------------------------
These *Virtual Hosts* entities offer a high-level health state view for You could also configure Nagios to perform active checks,
those clusters in the OpenStack environment. which are not performed by StackLight by default, using the
metrics stored in InfluxDB's time-series.
For example, you could define active checks to be notified
when the CPU activity of a particular process is too high.
Configuring service checks on InfluxDB metrics Let's assume the following scenario.
----------------------------------------------
You can configure additional alarms (other than those already defined in the * You want to monitor the Elasticsearch server
*LMA Collector*) based on the metrics stored in the InfluxDB database. You * The CPU activity of the Elasticsearch server is captured
can, for example, define an alert to be notified when the CPU activity for a in a time-series stored in InfluxDB.
particular process crosses a particular threshold. * You want to receive an alert at the 'warning' level
Say for example, you would like to set a 'warning' when the CPU load exceeds 30% of system activity.
alarm at 30% of system CPU usage and a 'critical' alarm at 50% system CPU usage for the * You want to receive an alert at the 'critical' level
Elasticsearch process. when the CPU load exceeds 50% of system activity.
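The scenario above maps a single metric value onto a Nagios state. As a minimal sketch, the thresholds can be expressed as follows, using the standard Nagios plugin return codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function ``classify`` is hypothetical; it illustrates the threshold logic only and is not the implementation of the ``check_influx`` plugin.

```python
# Standard Nagios plugin return codes.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


def classify(value, warning=30.0, critical=50.0):
    """Map a CPU usage percentage onto a Nagios state.

    Assumption: "exceeds" means strictly greater than the threshold,
    and a missing value (None) is reported as UNKNOWN.
    """
    if value is None:
        return NAGIOS_UNKNOWN
    if value > critical:
        return NAGIOS_CRITICAL
    if value > warning:
        return NAGIOS_WARNING
    return NAGIOS_OK


if __name__ == "__main__":
    for sample in (12.5, 42.0, 75.0, None):
        print(sample, "->", classify(sample))
```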
The steps to define those alarms in Nagios would be as follows:
#. Connect to the *LMA Infrastructure Alerting* node. The steps to create such alarms in Nagios would be as follows:
#. Connect to each of the nodes running Nagios.
#. Install the Nagios plugin for querying InfluxDB:: #. Install the Nagios plugin for querying InfluxDB::
@ -190,14 +242,14 @@ The steps to define those alarms in Nagios would be as follow:
Here, things look okay. No serious problems were detected during the pre-flight check. Here, things look okay. No serious problems were detected during the pre-flight check.
5. Restart the Nagios server,:: #. Restart the Nagios server::
[root@node-13 ~]# /etc/init.d/nagios3 restart [root@node-13 ~]# /etc/init.d/nagios3 restart
#. Go the Nagios dashboard and verify that the service check has been added. #. Go to the Nagios Web UI to verify that the service check has been added.
From there, you can define additional service checks for different hosts or You can define additional service checks for different nodes or
host groups using the same ``check_influx`` command. node groups using the same ``check_influx`` command.
You will just need to provide these three required arguments for defining new service checks: You will just need to provide these three required arguments for defining new service checks:
* A valid InfluxDB query that should return only one row with a single value. * A valid InfluxDB query that should return only one row with a single value.
@ -262,9 +314,9 @@ your environment.
Troubleshooting Troubleshooting
--------------- ---------------
If you cannot access the Nagios UI, follow these troubleshooting tips. If you cannot access the Nagios web UI, follow these troubleshooting tips.
#. Check that the *LMA Collector* nodes are able to connect to the Nagios #. Check that the StackLight Collectors are able to connect to the Nagios
VIP address on port *8001*. VIP address on port *8001*.
#. Check that the Nagios configuration is valid:: #. Check that the Nagios configuration is valid::
@ -286,7 +338,7 @@ If you cannot access the Nagios UI, follow these troubleshooting tips.
[root@node-13 ~]# /etc/init.d/nagios3 start [root@node-13 ~]# /etc/init.d/nagios3 start
#. Check if Apache is up and running:: #. Check that Apache is up and running::
[root@node-13 ~]# /etc/init.d/apache2 status [root@node-13 ~]# /etc/init.d/apache2 status
@ -294,9 +346,9 @@ If you cannot access the Nagios UI, follow these troubleshooting tips.
[root@node-13 ~]# /etc/init.d/apache2 start [root@node-13 ~]# /etc/init.d/apache2 start
#. Look for errors in the Nagios log file (located at /var/nagios/nagios.log). #. Look for errors in the Nagios log file ``/var/nagios/nagios.log``.
#. Look for errors in the Apache log file (located at /var/log/apache2/nagios_error.log). #. Look for errors in the Apache log file ``/var/log/apache2/nagios_error.log``.
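The first tip in the list above, checking that a Collector node can reach the Nagios VIP address on port 8001, can be sketched as a plain TCP connection test. The helper ``can_connect`` is hypothetical and the VIP address must be supplied by you; a successful TCP connection only shows that the port is reachable, not that Nagios accepts the passive checks.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    import sys

    # Usage (placeholder address): python check_port.py <nagios-vip> 8001
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        print("reachable" if can_connect(host, port) else "unreachable")
```

Running it from a node where the Collector is installed narrows the problem down to either the network path or the services behind the VIP.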
Finally, Nagios may report a host or service state as *UNKNOWN*. Finally, Nagios may report a host or service state as *UNKNOWN*.
Two cases can be distinguished: Two cases can be distinguished:
@ -305,12 +357,12 @@ Two cases can be distinguished:
* 'UNKNOWN: No datapoint have been received over the last X seconds'. * 'UNKNOWN: No datapoint have been received over the last X seconds'.
Both cases indicate that Nagios doesn't receive regular passive checks from Both cases indicate that Nagios doesn't receive regular passive checks from
the *LMA Collector*. This may be due to different problems: the StackLight Collector. This may be due to different problems:
* The 'hekad' process of the *LMA Collector* fails to communicate with Nagios, * The 'hekad' process fails to communicate with Nagios,
* The 'collectd' and/or 'hekad' process of the *LMA Collector* has crashed, * The 'collectd' and/or 'hekad' processes have crashed,
* One or several alarm rules are misconfigured. * One or several alarm rules are misconfigured.
To remedy the above situations, follow the `troubleshooting tips To remedy the above situations, follow the `troubleshooting tips
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_ <http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
of the *LMA Collector Plugin User Guide*. of the *StackLight Collector Plugin User Guide*.