From c1e2c54af01ff82c895278b9be0f8093084a17e9 Mon Sep 17 00:00:00 2001
From: Patrick Petit <ppetit@mirantis.com>
Date: Tue, 28 Jun 2016 18:40:22 +0200
Subject: [PATCH] StackLight 0.10.0 documentation updates

Change-Id: Ib7aeffae78bb1e88cdc3a654bb2825d859b60439
---
 doc/source/user.rst | 234 +++++++++++++++++++++++++++-----------------
 1 file changed, 143 insertions(+), 91 deletions(-)

diff --git a/doc/source/user.rst b/doc/source/user.rst
index 734279c..f7f4a90 100644
--- a/doc/source/user.rst
+++ b/doc/source/user.rst
@@ -8,22 +8,22 @@ User Guide
 Plugin configuration
 --------------------
 
-To configure your plugin, you need to follow these steps:
+To configure the **StackLight Infrastructure Alerting Plugin**, you need to follow these steps:
 
-1. `Create a new environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#launch-wizard-to-create-new-environment>`_
-   with the Fuel web user interface.
+1. `Create a new environment
+   <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/create-environment/start-create-env.html>`_.
 
-#. Click the **Settings** tab and select the **Other** category.
+2. Click on the *Settings* tab of the Fuel web UI and select the *Other* category.
 
-#. Scroll down through the settings until you find the **LMA Infrastructure Alerting
-   Plugin** section. You should see a page like this.
+3. Scroll down through the settings until you find the *StackLight Infrastructure
+   Alerting Plugin* section. You should see a page like this.
 
    .. image:: ../images/lma_infrastructure_alerting_settings.png
       :width: 800
       :align: center
 
-#. Check the *LMA Infrastructure Alerting Plugin* box and fill-in the required fields
-   as indicated below.
+4. Tick the *StackLight Infrastructure Alerting Plugin* box and fill in the required
+   fields as indicated below.
 
    a. Change the Nagios web interface password (recommended).
    #. Check the boxes corresponding to the type of notification you would
@@ -34,43 +34,53 @@ To configure your plugin, you need to follow these steps:
    #. Specify the SMTP authentication method.
    #. Specify the SMTP username and password (required if the authentication method isn't *None*).
 
-#. When you are done with the settings, scroll down to the bottom of the page and click
-   the **Save Settings** button.
+5. `Configure your environment
+   <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment.html>`_.
 
-#. Click the *Nodes* tab and assign the *LMA Infrastructure Alerting* role to nodes
-   as shown below. You can see in this example that the *Infrastructure_Alerting*
-   role is assigned to three different nodes along with the *Elasticsearch_Kibana* role
-   and the *InfluxDB_Grafana* role. This means that the three plugins of the LMA toolchain
-   can be installed on the same nodes.
+   .. note:: By default, StackLight is configured to use the *management network*
+      of the so-called `Default Node Network Group
+      <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment/network-settings.html>`_.
+      While this default setup may be appropriate for small deployments or
+      evaluation purposes, it is not recommended for production.
+      Instead, create a network dedicated to StackLight.
+      Using a dedicated network for monitoring should
+      improve the performance of StackLight and minimize the monitoring footprint
+      on the control plane. It will also facilitate access to the Nagios web UI
+      after deployment. Please refer to the *StackLight Deployment Guide*
+      for further information.
+
+6. Click the *Nodes* tab and assign the *Infrastructure_Alerting* role
+   to the node(s) where you want to install the plugin.
+
+   You can see in the example below that the *Infrastructure_Alerting*
+   role is assigned to three nodes alongside the
+   *Elasticsearch_Kibana* role and the *InfluxDB_Grafana* role.
+   Here, the three plugins of the LMA toolchain backend servers are
+   installed on the same nodes.
 
    .. image:: ../images/lma_infrastructure_alerting_role.png
       :width: 800
       :align: center
 
-   .. note:: You can assign the *Infrastructure_Alerting* role up to three nodes.
-      Nagios clustering for high availability requires that you assign
-      the *Infrastructure_Alerting* role to at least three nodes. Note also that
-      it is possible to add or remove a node with the *Infrastructure_Alerting*
-      role after deployment.
+   .. note:: Nagios clustering for high availability requires that you assign
+      the *Infrastructure_Alerting* role to at least three nodes.
+      Note also that it is possible to add or remove nodes with the
+      *Infrastructure_Alerting* role after deployment.
 
-#. Click on **Apply Changes**.
+7. `Adjust the disk partitioning if necessary
+   <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/configure-environment/customize-partitions.html>`_.
 
-#. Adjust the disk configuration if necessary (see the `Fuel User Guide
-   <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#disk-partitioning>`_
-   for details). By default, the *LMA Infrastructure Alerting Plugin* allocates:
+   By default, the StackLight Infrastructure Alerting Plugin allocates:
 
-     * 20% of the first available disk for the operating system by honoring a range of
-       15GB minimum and 50GB maximum,
+     * 20% of the first available disk for the operating system
+       by honoring a range of 15GB minimum and 50GB maximum,
      * 10GB for */var/log*,
-     * At least 20 GB for the Nagios data in */var/nagios*.
+     * At least 20 GB for the Nagios data in ``/var/nagios``.
 
-#. `Configure your environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#configure-your-environment>`_
-   as needed.
+   The deployment will fail if the above requirements are not met.
 
-#. `Verify the networks <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#verify-networks>`_
-   on the Networks tab of the Fuel web UI.
-
-#. And finally, `Deploy <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#deploy-changes>`_ your changes.
+8. `Deploy your environment
+   <http://docs.openstack.org/developer/fuel-docs/userdocs/fuel-user-guide/deploy-environment.html>`_.
 
 .. _plugin_install_verification:
 
@@ -78,81 +88,123 @@ Plugin verification
 -------------------
 
 Be aware, that depending on the number of nodes and deployment setup,
-deploying a Mirantis OpenStack environment can typically take anything
-from 30 minutes to several hours. But once your deployment is complete,
+deploying a Mirantis OpenStack environment can take anywhere from
+20 minutes to several hours. Once your deployment is complete,
 you should see a deployment success notification message with
-a link to the Nagios dashboard as shown below.
+a link to the Nagios web UI as shown below.
 
 .. image:: ../images/deployment_notification.png
    :align: center
    :width: 800
 
-From the Fuel web UI **Dashboard** view, click on the **Nagios** link.
-Once you have authenticated (username is ``nagiosadmin`` and the
-password is defined in the settings of the plugin), you should be directed to
-the *Nagios Home Page* as shown below.
+Click on the *Nagios* link.
 
-.. note:: Be aware that Nagios is attached to the *management network*.
-   Your desktop machine must have access to the OpenStack environment's
-   *management network* you just created to get access to the Nagios dashboard.
+Once you are authenticated,
+you should be redirected to the **Nagios Home Page** as shown below.
 
 .. image:: ../images/nagios_homepage.png
    :align: center
    :width: 800
 
-Managing Nagios
----------------
+.. note:: The username is ``nagiosadmin`` by default. The password is defined
+   in the plugin settings.
 
-You can get the current status of the OpenStack environment by clicking on
-the *Services* menu item as shown below.
+.. note:: Be aware that if Nagios is installed on the *management network*,
+   you may not have direct access to the Nagios web UI. Some extra network
+   configuration may be required to create a tunnel to the *management network*. 
+
+Using Nagios
+------------
+
+The StackLight Infrastructure Alerting Plugin configures Nagios
+to display the health status of all the nodes and services running
+in the OpenStack environment. The alarms (or service checks in Nagios
+terms) are created in **passive mode** which means that the actual
+checks are not performed by Nagios itself, but by the Collector
+and Aggregator agents of the LMA toolchain.
+
+The best place to get an overview of your OpenStack environment
+is the **Services Dashboard**.
+If you click the *Services* link in the left panel of the
+Nagios web UI, you should see a page like this:
 
 .. image:: ../images/nagios_services.png
    :align: center
    :width: 800
 
-The *LMA Infrastructure Alerting Plugin* configures Nagios for all the
-hosts and services that have been deployed in the environment. The alarms (or
-service checks in Nagios terms) are created in **passive mode** as
-they are received from the *LMA Collector* and *Aggregator* (see the `LMA
-Collector documentation <http://fuel-plugin-lma-collector.readthedocs.io/>`_
-for more details).
+In this dashboard, there are two 'virtual hosts' representing
+the health status of the so-called **global clusters** and
+**node clusters** entities:
 
-.. note:: The alert notifications for the nodes and clusters of nodes are
-   disabled by default to avoid the alert fatigue and because they are not
-   necessarily indicative of a condition affecting the overall health state
-   of an OpenStack service cluster. If you nonetheless want to enable those alerts,
-   go to the service details page and click on the *Enable notifications
-   for this service* link within the *Service Commands* panel as shown below.
+  * *00-global-clusters-env${ENVID}* is used to represent the
+    aggregated health status of global clusters like 'Nova',
+    'Keystone' or 'RabbitMQ' to name a few.
+
+  * *00-node-clusters-env${ENVID}* is used to represent the
+    aggregated health status of node clusters like
+    'Controller', 'Compute' and 'Storage'.
+
+Following the 'virtual hosts' sections, there is a list
+of checks received for each of the nodes provisioned in the 
+environment. These checks may vary depending on the role of
+the node being monitored.
+ 
+Alerting for the global cluster entities is enabled by default.
+Alerting for the nodes and clusters of nodes is disabled
+by default to avoid alert fatigue, since those alerts are
+not necessarily indicative of a critical condition affecting
+the overall health status of the global cluster entities.
+If you nonetheless want to enable those alerts, go
+to the service details page and click on the *Enable notifications
+for this service* link within the *Service Commands* panel as shown below.
 
 .. image:: ../images/nagios_enable_notifs.png
    :align: center
    :width: 800
 
-There are also two *Virtual Hosts* representing the health state of the
-*service clusters* and *node clusters*:
+Finally, note that there is
+a direct dependency between the configuration of the passive
+checks in Nagios and the `configuration of the alarms in
+the Collectors
+<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/alarms.html>`_.
+A change in ``/etc/hiera/override/alarming.yaml`` or
+``/etc/hiera/override/gse_filters.yaml`` on any of the
+nodes monitored by StackLight requires reconfiguring Nagios.
+It also implies that these two files should be kept
+rigorously identical on all the nodes of the environment,
+**including those where Nagios is installed**. Fortunately,
+StackLight provides Puppet artefacts to help you with
+that task. To reconfigure the passive checks in Nagios
+when ``/etc/hiera/override/alarming.yaml`` or
+``/etc/hiera/override/gse_filters.yaml`` is modified,
+run the command shown below on all the nodes where
+Nagios is installed::
 
-  * *00-global-clusters-env${ENVID}* for the service clusters like the Nova
-    cluster, the Keystone cluster, the RabbiMQ cluster and so on.
+  # puppet apply --modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules/ \
+  /etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp  
 
-  * *00-node-clusters-env${ENVID}* for the physical node clusters like the
-    cluster of controller nodes, the cluster of storage nodes and so on.
+Configuring service checks using the InfluxDB metrics
+-----------------------------------------------------
 
-These *Virtual Hosts* entities offer a high-level health state view for
-those clusters in the OpenStack environment.
+You could also configure Nagios to perform active checks,
+which are not performed by StackLight by default, using the
+metrics stored in InfluxDB's time-series.
+For example, you could define active checks to be notified
+when the CPU activity of a particular process is too high.
 
-Configuring service checks on InfluxDB metrics
-----------------------------------------------
+Let's assume the following scenario.
 
-You can configure additional alarms (other than those already defined in the
-*LMA Collector*) based on the metrics stored in the InfluxDB database. You
-can, for example, define an alert to be notified when the CPU activity for a
-particular process crosses a particular threshold.
-Say for example, you would like to set a 'warning'
-alarm at 30% of system CPU usage and a 'criticial' alarm at 50% system CPU usage for the
-Elasticsearch process.
-The steps to define those alarms in Nagios would be as follow:
+  * You want to monitor the Elasticsearch server.
+  * The CPU activity of the Elasticsearch server is captured
+    in a time-series stored in InfluxDB.
+  * You want to receive an alert at the 'warning' level
+    when the CPU load exceeds 30% of system activity.
+  * You want to receive an alert at the 'critical' level
+    when the CPU load exceeds 50% of system activity.
 
-#. Connect to the *LMA Infrastructure Alerting* node.
+The steps to create such alarms in Nagios are as follows:
+
+#. Connect to each of the nodes running Nagios.
 
 #. Install the Nagios plugin for querying InfluxDB::
 
@@ -190,14 +242,14 @@ The steps to define those alarms in Nagios would be as follow:
 
   Here, things look okay. No serious problems were detected during the pre-flight check.
 
-5. Restart the Nagios server,::
+#. Restart the Nagios server::
 
     [root@node-13 ~]# /etc/init.d/nagios3 restart
 
-#. Go the Nagios dashboard and verify that the service check has been added.
+#. Go to the Nagios web UI to verify that the service check has been added.
 
-From there, you can define additional service checks for different hosts or
-host groups using the same ``check_influx`` command.
+You can define additional service checks for different nodes or
+node groups using the same ``check_influx`` command.
 You will just need to provide these three required arguments for defining new service checks:
 
   * A valid InfluxDB query that should return only one row with a single value.
@@ -262,9 +314,9 @@ your environment.
 Troubleshooting
 ---------------
 
-If you cannot access the Nagios UI, follow these troubleshooting tips.
+If you cannot access the Nagios web UI, follow these troubleshooting tips.
 
-#. Check that the *LMA Collector* nodes are able to connect to the Nagios
+#. Check that the StackLight Collectors are able to connect to the Nagios
    VIP address on port *8001*.
 
 #. Check that the Nagios configuration is valid::
@@ -286,7 +338,7 @@ If you cannot access the Nagios UI, follow these troubleshooting tips.
 
     [root@node-13 ~]# /etc/init.d/nagios3 start
 
-#. Check if Apache is up and running::
+#. Check that Apache is up and running::
 
     [root@node-13 ~]# /etc/init.d/apache2 status
 
@@ -294,9 +346,9 @@ If you cannot access the Nagios UI, follow these troubleshooting tips.
 
     [root@node-13 ~]# /etc/init.d/apache2 start
 
-#. Look for errors in the Nagios log file (located at /var/nagios/nagios.log).
+#. Look for errors in the Nagios log file ``/var/nagios/nagios.log``.
 
-#. Look for errors in the Apache log file (located at /var/log/apache2/nagios_error.log).
+#. Look for errors in the Apache log file ``/var/log/apache2/nagios_error.log``.
 
 Finally, Nagios may report a host or service state as *UNKNOWN*.
 Two cases can be distinguished:
@@ -305,12 +357,12 @@ Two cases can be distinguished:
   * 'UNKNOWN: No datapoint have been received over the last X seconds'.
 
 Both cases indicate that Nagios doesn't receive regular passive checks from
-the *LMA Collector*. This may be due to different problems:
+the StackLight Collector. This may be due to different problems:
 
-  * The 'hekad' process of the *LMA Collector* fails to communicate with Nagios,
-  * The 'collectd' and/or 'hekad' process of the *LMA Collector* has crashed,
+  * The 'hekad' process fails to communicate with Nagios,
+  * The 'collectd' and/or 'hekad' processes have crashed,
   * One or several alarm rules are misconfigured.
 
 To remedy to the above situations, follow the `troubleshooting tips
 <http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
-of the *LMA Collector Plugin User Guide*.
+of the *StackLight Collector Plugin User Guide*.