[docs] Edits Alarms and Appendix

Edits the following sections of the StackLight Collector
plugin 0.10.0 documentation:

* Configuring alarms
* Appendix

Change-Id: I534611a4eae9aeb97bfedb3971d7a8ec76e20bac
Maria Zlatkova 2016-07-19 19:56:52 +03:00
parent 5c0d43aaec
commit 8581289600
17 changed files with 760 additions and 654 deletions


@ -1,9 +1,13 @@
.. _alarms:
.. raw:: latex
\pagebreak
List of built-in alarms
-----------------------
The following is a list of StackLight built-in alarms::
alarms:
- name: 'cpu-critical-controller'
@ -732,5 +736,4 @@ Here is a list of all the alarms that are built-in in StackLight::
threshold: 5
window: 60
periods: 0
function: min


@ -3,8 +3,8 @@
List of metrics
---------------
The following is a list of metrics that are emitted by the StackLight Collector.
The metrics are listed by category, then by metric name.
System
++++++
@ -63,7 +63,7 @@ Clusters
.. include:: metrics/clusters.rst
Self-monitoring
+++++++++++++++
.. include:: metrics/lma.rst
@ -78,4 +78,4 @@ Elasticsearch
InfluxDB
++++++++
.. include:: metrics/influxdb.rst


@ -3,139 +3,130 @@
Overview
--------
The process of running alarms in StackLight is not centralized, as it is often
the case in more conventional monitoring systems, but distributed across all
the StackLight Collectors.
Each Collector is individually responsible for monitoring the resources and
services that are deployed on the node and for reporting any anomaly or fault
it has detected to the Aggregator.
The anomaly and fault detection logic in StackLight is designed more like an
*expert system* in that the Collector and the Aggregator use artifacts we
can refer to as *facts* and *rules*.
The *facts* are the operational data ingested in the StackLight's
stream-processing pipeline. The *rules* are either alarm rules or aggregation
rules. They are declaratively defined in YAML files that can be modified.
Those rules are turned into a collection of Lua plugins that are executed by
the Collector and the Aggregator. They are generated dynamically using the
Puppet modules of the StackLight Collector Plugin.
The following are the two types of Lua plugins related to the processing of
alarms:
* The **AFD plugin** -- Anomaly and Fault Detection plugin
* The **GSE plugin** -- Global Status Evaluation plugin
These plugins create special types of metrics, as follows:
* The **AFD metric**, which contains information about the health status of a
node or service in the OpenStack environment. The AFD metrics are sent on a
regular basis to the Aggregator where they are further processed by the GSE
plugins.
* The **GSE metric**, which contains information about the health status of a
cluster in the OpenStack environment. A cluster is a logical grouping of
nodes or services. We call them node clusters and service clusters hereafter.
A service cluster can be anything like a cluster of API endpoints or a
cluster of workers. A cluster of nodes is a grouping of nodes that have the
same role. For example, *compute* or *storage*.
.. note:: The AFD and GSE metrics are new types of metrics introduced in
StackLight version 0.8. They contain detailed information about the fault
and anomalies detected by StackLight. For more information about the
message structure of these metrics, refer to
`Metrics section of the Developer Guide
<http://lma-developer-guide.readthedocs.io/en/latest/metrics.html>`_.
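As a rough illustration only, an AFD metric for a service can be pictured as
a metric named ``afd_service_metric`` whose fields identify the service
cluster and the alarm group it reports on. The field names below are taken
from the aggregation-rule examples later in this section; the hostname and
the numeric value are purely hypothetical, and the authoritative message
structure is described in the Developer Guide linked above::

    name:  afd_service_metric     # AFD metric emitted by a Collector
    value: 1                      # encoded health status of the member
    Fields:
      service:  nova-api          # the service cluster being reported on
      source:   http_errors       # the alarm group that produced the status
      hostname: node-3            # illustrative node name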
The following figure shows the StackLight stream-processing pipeline workflow:
.. figure:: ../../images/AFD_and_GSE_message_flow.*
:width: 800
:alt: Message flow for the AFD and GSE metrics
:align: center
.. raw:: latex
\pagebreak
The AFD and GSE plugins
-----------------------
The current version of StackLight contains the following three types of GSE
plugins:
* The **Service Cluster GSE Plugin**, which receives AFD metrics for services
from the AFD plugins.
* The **Node Cluster GSE Plugin**, which receives AFD metrics for nodes
from the AFD plugins.
* The **Global Cluster GSE Plugin**, which receives GSE metrics from the
GSE plugins above. It aggregates and correlates the GSE metrics to issue a
global health status for the top-level clusters like Nova, MySQL, and others.
The health status exposed in the GSE metrics is as follows:
* ``Down``: One or several primary functions of a cluster have failed or are
failing. For example, the API service for Nova or Cinder is not accessible.
* ``Critical``: One or several primary functions of a cluster are severely
degraded. The quality of service delivered to the end user is severely
impacted.
* ``Warning``: One or several primary functions of the cluster are slightly
degraded. The quality of service delivered to the end user is slightly
impacted.
* ``Unknown``: There is not enough data to infer the actual health status of
the cluster.
* ``Okay``: None of the above was found to be true.
The AFD and GSE persisters
--------------------------
The AFD and GSE metrics are also consumed by other types of Lua plugins called
**persisters**:
* The **InfluxDB persister** transforms the GSE metrics into InfluxDB data
points and Grafana annotations. They are used in Grafana to graph the health
status of the OpenStack clusters.
* The **Elasticsearch persister** transforms the AFD metrics into events that
are indexed in Elasticsearch. Using Kibana, these events can be searched to
display a fault or an anomaly that occurred in the environment (not yet
implemented).
* The **Nagios persister** transforms the GSE and AFD metrics into passive
checks that are sent to Nagios for alerting and escalation.
New persisters can be easily created to feed other systems with the
operational insight contained in the AFD and GSE metrics.
.. _alarm_configuration:
Alarms configuration
--------------------
StackLight comes with a predefined set of alarm rules. We have tried to make
these rules as comprehensive and relevant as possible, but your mileage may
vary depending on the specifics of your OpenStack environment and monitoring
requirements. Therefore, it is possible to modify those predefined rules and
create new ones. To do so, modify the ``/etc/hiera/override/alarming.yaml``
file and apply the :ref:`Puppet manifest <puppet_apply>` that will dynamically
generate Lua plugins, known as the AFD Plugins, which are the actuators of the
alarm rules. Before you proceed, make sure that you understand the structure
of that file.
.. _alarm_structure:
Alarm structure
+++++++++++++++
An alarm rule is defined declaratively using the YAML syntax. For example::
name: 'fs-warning'
description: 'Filesystem free space is low'
@ -180,7 +171,7 @@ as shown in the example below::
| logical_operator
| Type: Enum('and' | '&&' | 'or' | '||')
| The conjunction relation for the alarm rules
| metric
| Type: unicode
@ -192,24 +183,25 @@ as shown in the example below::
| fields
| Type: list
| List of field name / value pairs, also known as dimensions, used to select
a particular device for the metric, such as a network interface name or
file system mount point. If the value is specified as an empty string (""),
then the rule is applied to all the aggregated values for the specified
field name. For example, the file system mount point. If value is
specified as the '*' wildcard character, then the rule is applied to each
of the metrics matching the metric name and field name. For example, the
alarm definition sample given above would run the rule for each of the
file system mount points associated with the *fs_space_percent_free*
metric.
| window
| Type: integer
| The in-memory time-series analysis window in seconds
| periods
| Type: integer
| The number of prior time-series analysis windows to compare the window
| with (this is not implemented yet).
| function
| Type: enum('last' | 'min' | 'max' | 'sum' | 'count' | 'avg' | 'median' | 'mode' | 'roc' | 'mww' | 'mww_nonparametric')
@ -232,46 +224,49 @@ as shown in the example below::
| returns the value that occurs most often in all the values
| (not implemented yet)
| roc:
| The 'roc' function detects a significant rate of change when comparing
current metrics values with historical data. To achieve this, it
computes the average of the values in the current window and the
average of the values in the window before the current window and
compares the difference against the standard deviation of the
historical window. The function returns ``true`` if the difference
exceeds the standard deviation multiplied by the 'threshold' value.
This function uses the rate of change algorithm already available in the
anomaly detection module of Heka. It can only be applied to normal
distributions. With an alarm rule using the 'roc' function, the
'window' parameter specifies the duration in seconds of the current
window, and the 'periods' parameter specifies the number of windows
used for the historical data. You need at least one period and the
'periods' parameter must not be zero. If you choose a period of 'p',
the function will compute the rate of change using a historical data
window of ('p' * window) seconds. For example, if you specify the
following in the alarm rule:
|
| window = 60
| periods = 3
| threshold = 1.5
|
| the function will store in a circular buffer the value of the metrics
received during the last 300 seconds (5 minutes) where:
|
| Current window (CW) = 60 sec
| Previous window (PW) = 60 sec
| Historical window (HW) = 180 sec
|
| and apply the following formula:
|
| abs(avg(CW) - avg(PW)) > std(HW) * 1.5 ? true : false
| mww:
| returns the result (true, false) of the Mann-Whitney-Wilcoxon test
function of Heka that can be used only with normal distributions (not
implemented yet)
| mww-nonparametric:
| returns the result (true, false) of the Mann-Whitney-Wilcoxon test
function of Heka that can be used with non-normal distributions (not
implemented yet)
| diff:
| returns the difference between the last value and the first value of
all the values
| threshold
| Type: float
@ -281,15 +276,13 @@ as shown in the example below::
Modify or create an alarm
+++++++++++++++++++++++++
To modify or create an alarm, edit the ``/etc/hiera/override/alarming.yaml``
file. This file has the following sections:
#. The ``alarms`` section contains a global list of alarms that are executed
by the Collectors. These alarms are global to the LMA toolchain and should
be kept identical on all nodes of the OpenStack environment. The following
is another example of the definition of an alarm::
alarms:
- name: 'cpu-critical-controller'
@ -312,30 +305,29 @@ This file has four sections:
periods: 0
function: avg
This alarm is called 'cpu-critical-controller'. It says that CPU activity
is critical (severity: 'critical') if any of the rules in the alarm
definition evaluate to true.
The rule says that the alarm will evaluate to 'true' if the value of the
metric ``cpu_idle`` has been on average (function: avg) below or equal
(relational_operator: <=) to 5 for the last 2 minutes (window: 120).
OR (logical_operator: 'or')
If the value of the metric ``cpu_wait`` has been on average (function: avg)
greater than or equal (relational_operator: >=) to 35 for the last 2 minutes
(window: 120).
Note that these metrics are expressed as percentages.
What alarms are executed on which node depends on the mapping between the
alarm definition and the definition of a cluster as described in the
following sections.
#. The ``node_cluster_roles`` section defines the mapping between the internal
definition of a cluster of nodes and one or several Fuel roles.
For example::
node_cluster_roles:
controller: ['primary-controller', 'controller']
@ -343,22 +335,19 @@ This file has four sections:
storage: ['cinder', 'ceph-osd']
[ ... ]
Creates a mapping between the 'primary-controller' and 'controller' Fuel
roles, and the internal definition of a cluster of nodes called 'controller'.
Likewise, the internal definition of a cluster of nodes called 'storage' is
mapped to the 'cinder' and 'ceph-osd' Fuel roles. The internal definition
of a cluster of nodes is used to assign the alarms to the relevant category
of nodes. This mapping is also used to configure the **passive checks**
in Nagios. Therefore, it is critically important to keep exactly the same
copy of ``/etc/hiera/override/alarming.yaml`` across all nodes of the
OpenStack environment including the node(s) where Nagios is installed.
#. The ``service_cluster_roles`` section defines the mapping between the
internal definition of a cluster of services and one or several Fuel roles.
For example::
service_cluster_roles:
rabbitmq: ['primary-controller', 'controller']
@ -366,18 +355,17 @@ This file has four sections:
elasticsearch: ['primary-elasticsearch_kibana', 'elasticsearch_kibana']
[ ... ]
Creates a mapping between the 'primary-controller' and 'controller' Fuel
roles, and the internal definition of a cluster of services called 'rabbitmq'.
Likewise, the internal definition of a cluster of services called
'elasticsearch' is mapped to the 'primary-elasticsearch_kibana' and
'elasticsearch_kibana' Fuel roles. As for the clusters of nodes, the
internal definition of a cluster of services is used to assign the alarms
to the relevant category of services.
#. The ``node_cluster_alarms`` section defines the mapping between the
internal definition of a cluster of nodes and the alarms that are assigned
to that category of nodes. For example::
node_cluster_alarms:
controller:
@ -385,121 +373,105 @@ This file has four sections:
root-fs: ['root-fs-critical', 'root-fs-warning']
log-fs: ['log-fs-critical', 'log-fs-warning']
Creates three alarm groups for the cluster of nodes called 'controller':
* The *cpu* alarm group is mapped to two alarms defined in the ``alarms``
section known as the 'cpu-critical-controller' and
'cpu-warning-controller' alarms. These alarms monitor the CPU on the
controller nodes. The order matters here since the first alarm that
evaluates to 'true' stops the evaluation. Therefore, it is important
to start the list with the most critical alarms.
* The *root-fs* alarm group is mapped to two alarms defined in the
``alarms`` section known as the 'root-fs-critical' and 'root-fs-warning'
alarms. These alarms monitor the root file system on the controller nodes.
* The *log-fs* alarm group is mapped to two alarms defined in the ``alarms``
section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These
alarms monitor the file system where the logs are created on the
controller nodes.
.. note:: An *alarm group* is a mere implementation artifact (although it
has functional value) that is primarily used to distribute the alarms
evaluation workload across several Lua plugins. Since the Lua plugins
runtime is sandboxed within Heka, it is preferable to run smaller sets
of alarms in different plugins rather than a large set of alarms in a
single plugin. This is to avoid having alarms evaluation plugins
shut down by Heka. Furthermore, the alarm groups are used to identify
what is called a *source*. A *source* is a tuple in which we associate
a cluster with an alarm group. For example, the tuple
['controller', 'cpu'] is a *source*. It associates a 'controller'
cluster with the 'cpu' alarm group. The tuple ['controller', 'root-fs']
is another *source* example. The *source* is used by a GSE Plugin to remember
the AFD metrics it has received. If a GSE Plugin stops receiving
AFD metrics it used to get, then the GSE Plugin infers that the health
status of the cluster associated with the source is *Unknown*.
This is evaluated every *ticker-interval*. By default, the
*ticker interval* for the GSE Plugins is set to 10 seconds.
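Putting these sections together, adding a brand-new alarm and assigning it to
a category of nodes touches the ``alarms``, ``node_cluster_roles``, and
``node_cluster_alarms`` sections. The following is a hypothetical sketch
only: the alarm name, the 'compute' mappings, and the ``trigger`` nesting are
illustrative assumptions, so align them with the existing entries in your
``/etc/hiera/override/alarming.yaml`` before applying the
:ref:`Puppet manifest <puppet_apply>`::

    alarms:
      - name: 'cpu-warning-compute'       # hypothetical alarm name
        description: 'CPU wait is high on the compute node'
        severity: 'warning'
        trigger:                          # assumed nesting, mirror the built-in alarms
          logical_operator: 'or'
          rules:
            - metric: cpu_wait
              relational_operator: '>='
              threshold: 25
              window: 120
              periods: 0
              function: avg

    node_cluster_roles:
      compute: ['compute']                # map the 'compute' Fuel role to a node cluster

    node_cluster_alarms:
      compute:
        cpu: ['cpu-warning-compute']      # assign the new alarm group to the cluster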
.. _aggreg_correl_config:
Aggregation and correlation configuration
-----------------------------------------
StackLight comes with a predefined set of aggregation rules and correlation
policies. However, you can create new aggregation rules and correlation
policies or modify the existing ones. To do so, modify the ``/etc/hiera/override/gse_filters.yaml`` file and apply the
:ref:`Puppet manifest <puppet_apply>` that will generate Lua plugins known as
the GSE Plugins, which are the actuators of these aggregation rules and
correlation policies. But before you proceed, verify that you understand the
structure of that file.
.. note:: As for ``/etc/hiera/override/alarming.yaml``, it is critically
important to keep exactly the same copy of
``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the
OpenStack environment including the node(s) where Nagios is installed.
The aggregation rules and correlation policies are defined in the ``/etc/hiera/override/gse_filters.yaml`` configuration file.
This file has the following sections:
#. The ``gse_policies`` section contains the :ref:`health status correlation
policies <gse_policies>` that apply to the node clusters and service
clusters.
#. The ``gse_cluster_service`` section contains the :ref:`aggregation rules
<gse_cluster_service>` for the service clusters. These aggregation rules
are actuated by the Service Cluster GSE Plugin that runs on the Aggregator.
#. The ``gse_cluster_node`` section contains the :ref:`aggregation rules
<gse_cluster_node>` for the node clusters. These aggregation rules are
actuated by the Node Cluster GSE Plugin that runs on the Aggregator.
#. The ``gse_cluster_global`` section contains the :ref:`aggregation
rules <gse_cluster_global>` for the so-called top-level clusters. A global
cluster is a kind of logical construct of node clusters and service
clusters. These aggregation rules are actuated by the Global Cluster GSE
Plugin that runs on the Aggregator.
.. _gse_policies:
Health status policies
++++++++++++++++++++++
The correlation logic implemented by the GSE plugins is policy-based. The
policies define how the GSE plugins infer the health status of a cluster.
By default, there are two policies:
* The **highest_severity** policy defines that the cluster's status depends on
the member with the highest severity, typically used for a cluster of
services.
* The **majority_of_members** policy defines that the cluster is healthy as
long as (N+1)/2 members of the cluster are healthy. This is typically used
for clusters managed by Pacemaker.
A policy consists of a list of rules that are evaluated against the current
status of the cluster's members. When one of the rules matches, the cluster's
status gets the value associated with the rule and the evaluation stops. The
last rule of the list is usually a catch-all rule that defines the default
status if none of the previous rules matches.
The following example shows the policy rule definition::
# The following rule definition reads as: "the cluster's status is critical
# if more than 50% of its members are either down or critical"
- status: critical
trigger:
logical_operator: or
@ -517,7 +489,7 @@ Where
| logical_operator
| Type: Enum('and' | '&&' | 'or' | '||')
| The conjunction relation for the condition rules
| rules
| Type: list
@ -543,7 +515,7 @@ Where
| Type: float
| The threshold value
Consider the policy called *highest_severity*::
gse_policies:
@ -582,28 +554,31 @@ Lets take a closer look at the policy called *highest_severity*::
threshold: 0
- status: unknown
The policy definition reads as follows:
* The status of the cluster is ``Down`` if the status of at least one
cluster's member is ``Down``.
* Otherwise, the status of the cluster is ``Critical`` if the status of at
least one cluster's member is ``Critical``.
* Otherwise, the status of the cluster is ``Warning`` if the status of at
least one cluster's member is ``Warning``.
* Otherwise, the status of the cluster is ``Okay`` if the status of at least
one cluster's member is ``Okay``.
* Otherwise, the status of the cluster is ``Unknown``.
.. _gse_cluster_service:
Service cluster aggregation rules
+++++++++++++++++++++++++++++++++
The service cluster aggregation rules are used to designate the members of a
service cluster along with the AFD metrics that must be taken into account to
derive a health status for the service cluster. The following is an example of
the service cluster aggregation rules::
gse_cluster_service:
input_message_types:
@ -673,7 +648,7 @@ Where
Service cluster definition
++++++++++++++++++++++++++
The following example shows the service clusters definition::
gse_cluster_service:
[...]
@ -691,36 +666,36 @@ Where
| members
| Type: list
| The list of cluster members.
The AFD messages are associated with the cluster when the ``cluster_field``
value is equal to the cluster name and the ``member_field`` value is in
this list.
| group_by
| Type: Enum(member, hostname)
| This parameter defines how the incoming AFD metrics are aggregated.
|
| member:
| aggregation by member, irrespective of the host that emitted the AFD
| metric. This setting is typically used for AFD metrics that are not
| host-centric.
|
| hostname:
| aggregation by hostname then by member.
| This setting is typically used for AFD metrics that are host-centric,
| such as those working on the file system or CPU usage metrics.
| policy:
| Type: unicode
| The policy to use for computing the service cluster status.
See :ref:`gse_policies` for details.
A closer look at the example above shows that the Service Cluster GSE
plugin resulting from those rules will emit a *gse_service_cluster_metric*
message every 10 seconds to report the current status of the *nova-api*
cluster. This status is computed using the *afd_service_metric* metric for
which Fields[service] is 'nova-api' and Fields[source] is one of 'backends',
'endpoint', or 'http_errors'. The 'nova-api' cluster's status is computed using
the 'highest_severity' policy, which means that it will be equal to the 'worst'
status across all members.
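To monitor an additional service cluster, you would add a similar entry next
to *nova-api*. The following sketch is hypothetical: it assumes that the
cluster definitions are nested under a ``clusters`` key and it reuses the
member names from the example above, so check the shipped
``/etc/hiera/override/gse_filters.yaml`` for the exact layout::

    gse_cluster_service:
      [...]
      clusters:
        cinder-api:                  # hypothetical additional service cluster
          policy: highest_severity   # the worst member status wins
          group_by: member
          members:
            - backends
            - endpoint
            - http_errors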
.. _gse_cluster_node:
@ -728,11 +703,10 @@ status across all members.
Node cluster aggregation rules
++++++++++++++++++++++++++++++
The node cluster aggregation rules are used to designate the members of a node
cluster along with the AFD metrics that must be taken into account to derive
a health status for the node cluster. The following is an example of the node
cluster aggregation rules::
gse_cluster_node:
input_message_types:
@ -804,7 +778,7 @@ Where
Node cluster definition
+++++++++++++++++++++++
The following example shows the node clusters definition::
gse_cluster_node:
[...]
@ -822,36 +796,35 @@ Where
| members
| Type: list
| The list of cluster members.
The AFD messages are associated with the cluster when the ``cluster_field``
value is equal to the cluster name and the ``member_field`` value is in
this list.
| group_by
| Type: Enum(member, hostname)
| This parameter defines how the incoming AFD metrics are aggregated.
|
| member:
| aggregation by member, irrespective of the host that emitted the AFD
| metric. This setting is typically used for AFD metrics that are not
| host-centric.
|
| hostname:
| aggregation by hostname then by member.
| This setting is typically used for AFD metrics that are host-centric,
| such as those working on the file system or CPU usage metrics.
| policy:
| Type: unicode
| The policy to use for computing the node cluster status.
See :ref:`gse_policies` for details.
A closer look at the example above shows that the Node Cluster GSE plugin
resulting from those rules will emit a *gse_node_cluster_metric* message every
10 seconds to report the current status of the *controller* cluster. This
status is computed using the *afd_node_metric* metric for which
Fields[node_role] is 'controller' and Fields[source] is one of 'cpu',
'root-fs', or 'log-fs'. The 'controller' cluster's status is computed using the
'majority_of_members' policy, which means that it will be equal to the 'majority'
status across all members.
.. _gse_cluster_global:
@ -859,23 +832,20 @@ status across all members.
Top-level cluster aggregation rules
+++++++++++++++++++++++++++++++++++
The top-level aggregation rules aggregate GSE metrics from the Service
Cluster GSE Plugin and the Node Cluster GSE Plugin. This is the last
aggregation stage that issues health status for the top-level clusters.
A top-level cluster is a logical construct of service and node clustering.
By default, we define that the health status of Nova, as a top-level cluster,
depends on the health status of several service clusters related to Nova and
the health status of the 'controller' and 'compute' node clusters. But it can
be anything. For example, you can define a 'control-plane' top-level cluster
that would exclude the health status of the 'compute' node cluster if required.
The top-level cluster aggregation rules are used to designate the node
clusters and service clusters members of a top-level cluster along with the
GSE metrics that must be taken into account to derive a health status for the
top-level cluster. The following is an example of top-level cluster
aggregation rules::
gse_cluster_global:
input_message_types:
@ -954,7 +924,7 @@ Where
Top-level cluster definition
++++++++++++++++++++++++++++
The following example shows the top-level clusters definition::
gse_cluster_global:
[...]
@ -987,15 +957,16 @@ Where
| members
| Type: list
| The list of cluster members.
| The GSE messages are associated with the cluster when the ``member_field``
| value (``cluster_name``) is in this list.
| hints
| Type: list
| The list of clusters that are indirectly associated with the top-level
| cluster. The GSE messages are indirectly associated with the cluster when
| the ``member_field`` value (``cluster_name``) is in this list. This means
| that they are not used to derive the health status of the top-level
| cluster but serve as 'hints' for root cause analysis.
| group_by
| Type: Enum(member, hostname)
@ -1004,8 +975,8 @@ Where
| policy:
| Type: unicode
| The policy to use for computing the top-level cluster status.
See :ref:`gse_policies` for details.
.. _puppet_apply:
@ -1015,11 +986,10 @@ Apply your configuration changes
Once you have edited and saved your changes in
``/etc/hiera/override/alarming.yaml`` and/or
``/etc/hiera/override/gse_filters.yaml``,
apply the following Puppet manifest on all the nodes of your OpenStack
environment **including the node(s) where Nagios is installed**
for the changes to take effect::
# puppet apply --modulepath=/etc/fuel/plugins/lma_collector-<version>/puppet/modules:\
/etc/puppet/modules \
/etc/fuel/plugins/lma_collector-<version>/puppet/manifests/configure_afd_filters.pp


@ -77,6 +77,10 @@ Plugin configuration
.. _plugin_verification:
.. raw:: latex
\pagebreak
Plugin verification
-------------------


@ -1,108 +1,128 @@
.. _Ceph_metrics:
All Ceph metrics have a ``cluster`` field containing the name of the Ceph
cluster (*ceph* by default).
For details, see
`Cluster monitoring <http://docs.ceph.com/docs/master/rados/operations/monitoring/>`_
and `RADOS monitoring <http://docs.ceph.com/docs/master/rados/operations/monitoring-osd-pg/>`_.
Cluster
^^^^^^^
* ``ceph_health``, the health status of the entire cluster where values
``1``, ``2``, ``3`` represent ``OK``, ``WARNING`` and ``ERROR``, respectively.
* ``ceph_monitor_count``, the number of ceph-mon processes.
* ``ceph_quorum_count``, the number of ceph-mon processes participating in the
quorum.
Pools
^^^^^
* ``ceph_pool_total_avail_bytes``, the total available size in bytes for all
pools.
* ``ceph_pool_total_bytes``, the total number of bytes for all pools.
* ``ceph_pool_total_number``, the total number of pools.
* ``ceph_pool_total_used_bytes``, the total used size in bytes by all pools.
The following metrics have a ``pool`` field that contains the name of the
Ceph pool.
* ``ceph_pool_bytes_used``, the amount of data in bytes used by the pool.
* ``ceph_pool_max_avail``, the available size in bytes for the pool.
* ``ceph_pool_objects``, the number of objects in the pool.
* ``ceph_pool_op_per_sec``, the number of operations per second for the pool.
* ``ceph_pool_pg_num``, the number of placement groups for the pool.
* ``ceph_pool_read_bytes_sec``, the number of bytes read per second for the pool.
* ``ceph_pool_size``, the number of data replications for the pool.
* ``ceph_pool_write_bytes_sec``, the number of bytes written per second for the
pool.
Placement Groups
^^^^^^^^^^^^^^^^
* ``ceph_pg_bytes_avail``, the available size in bytes.
* ``ceph_pg_bytes_total``, the cluster total size in bytes.
* ``ceph_pg_bytes_used``, the data stored size in bytes.
* ``ceph_pg_data_bytes``, the stored data size in bytes before it is
replicated, cloned or snapshotted.
* ``ceph_pg_state``, the number of placement groups in a given state. The
metric contains a ``state`` field whose value is a combination of two or
more states from the following list, separated by ``+``: ``creating``,
``active``, ``clean``, ``down``, ``replay``, ``splitting``, ``scrubbing``,
``degraded``, ``inconsistent``, ``peering``, ``repair``, ``recovering``,
``recovery_wait``, ``backfill``, ``backfill-wait``, ``backfill_toofull``,
``incomplete``, ``stale``, ``remapped``.
* ``ceph_pg_total``, the total number of placement groups.
OSD Daemons
^^^^^^^^^^^
* ``ceph_osd_down``, the number of OSD daemons DOWN.
* ``ceph_osd_in``, the number of OSD daemons IN.
* ``ceph_osd_out``, the number of OSD daemons OUT.
* ``ceph_osd_up``, the number of OSD daemons UP.
The following metrics have an ``osd`` field that contains the OSD identifier:
* ``ceph_osd_apply_latency``, apply latency in ms for the given OSD.
* ``ceph_osd_commit_latency``, commit latency in ms for the given OSD.
* ``ceph_osd_total``, the total size in bytes for the given OSD.
* ``ceph_osd_used``, the data stored size in bytes for the given OSD.
OSD Performance
^^^^^^^^^^^^^^^
All the following metrics are retrieved per OSD daemon from the corresponding
``/var/run/ceph/ceph-osd.<ID>.asok`` socket by issuing the :command:`perf dump`
command.
All metrics have an ``osd`` field that contains the OSD identifier.
.. note:: These metrics are not collected when a node has both the ceph-osd
and controller roles.
For details, see `OSD performance counters <http://ceph.com/docs/firefly/dev/perf_counters/>`_.
* ``ceph_perf_osd_op``, the number of client operations.
* ``ceph_perf_osd_op_in_bytes``, the number of bytes received from clients for
write operations.
* ``ceph_perf_osd_op_latency``, the average latency in ms for client operations
(including queue time).
* ``ceph_perf_osd_op_out_bytes``, the number of bytes sent to clients for read
operations.
* ``ceph_perf_osd_op_process_latency``, the average latency in ms for client
operations (excluding queue time).
* ``ceph_perf_osd_op_r``, the number of client read operations.
* ``ceph_perf_osd_op_r_latency``, the average latency in ms for read operations
(including queue time).
* ``ceph_perf_osd_op_r_out_bytes``, the number of bytes sent to clients for
read operations.
* ``ceph_perf_osd_op_r_process_latency``, the average latency in ms for read
operations (excluding queue time).
* ``ceph_perf_osd_op_rw``, the number of client read-modify-write operations.
* ``ceph_perf_osd_op_rw_in_bytes``, the number of bytes per second received
from clients for read-modify-write operations.
* ``ceph_perf_osd_op_rw_latency``, the average latency in ms for
read-modify-write operations (including queue time).
* ``ceph_perf_osd_op_rw_out_bytes``, the number of bytes per second sent to
clients for read-modify-write operations.
* ``ceph_perf_osd_op_rw_process_latency``, the average latency in ms for
read-modify-write operations (excluding queue time).
* ``ceph_perf_osd_op_rw_rlat``, the average latency in ms for read-modify-write
operations with readable/applied.
* ``ceph_perf_osd_op_w``, the number of client write operations.
* ``ceph_perf_osd_op_wip``, the number of replication operations currently
being processed (primary).
* ``ceph_perf_osd_op_w_in_bytes``, the number of bytes received from clients
for write operations.
* ``ceph_perf_osd_op_w_latency``, the average latency in ms for write
operations (including queue time).
* ``ceph_perf_osd_op_w_process_latency``, the average latency in ms for write
operations (excluding queue time).
* ``ceph_perf_osd_op_w_rlat``, the average latency in ms for write operations
with readable/applied.
* ``ceph_perf_osd_recovery_ops``, the number of recovery operations in progress.


@ -3,24 +3,23 @@
The cluster metrics are emitted by the GSE plugins. For details, see
:ref:`Configuring alarms <configure_alarms>`.
* ``cluster_node_status``, the status of the node cluster.
The metric contains a ``cluster_name`` field that identifies the node cluster.
* ``cluster_node_status``, the status of the node cluster. The metric contains
a ``cluster_name`` field that identifies the node cluster.
* ``cluster_service_status``, the status of the service cluster.
The metric contains a ``cluster_name`` field that identifies the service cluster.
* ``cluster_status``, the status of the global cluster.
The metric contains a ``cluster_name`` field that identifies the global cluster.
* ``cluster_service_status``, the status of the service cluster. The metric
contains a ``cluster_name`` field that identifies the service cluster.
* ``cluster_status``, the status of the global cluster. The metric contains a
``cluster_name`` field that identifies the global cluster.
The supported values for these metrics are:
* `0` for the *Okay* status.
* ``0`` for the *Okay* status.
* `1` for the *Warning* status.
* ``1`` for the *Warning* status.
* `2` for the *Unknown* status.
* ``2`` for the *Unknown* status.
* `3` for the *Critical* status.
* ``3`` for the *Critical* status.
* `4` for the *Down* status.
* ``4`` for the *Down* status.
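For example, the following minimal Python sketch (the mapping comes from the
list above; the helper name is illustrative only) translates a numeric status
into its symbolic name::

    # Numeric values of the cluster_*_status metrics and their meaning.
    CLUSTER_STATUS = {
        0: 'Okay',
        1: 'Warning',
        2: 'Unknown',
        3: 'Critical',
        4: 'Down',
    }

    def status_name(value):
        """Return the symbolic name of a cluster status metric value."""
        return CLUSTER_STATUS.get(int(value), 'Unknown')

    print(status_name(3))  # Critical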


@ -1,20 +1,19 @@
.. _Elasticsearch:
The following metrics represent the overall health status of the cluster.
See `cluster health`_ for further details.
For details, see `Cluster health <https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster-health.html>`_.
* ``elasticsearch_cluster_active_primary_shards``, the number of active primary
shards.
* ``elasticsearch_cluster_active_shards``, the number of active shards.
* ``elasticsearch_cluster_health``, the health status of the entire cluster
where values ``1``, ``2`` , ``3`` represent respectively ``green``,
``yellow`` and ``red``. The ``red`` status may also be reported when the
Elasticsearch API returns an unexpected result (network failure for instance).
where the values ``1``, ``2``, and ``3`` represent ``green``, ``yellow``, and
``red``, respectively. The ``red`` status may also be reported when the
Elasticsearch API returns an unexpected result, for example, a network
failure.
* ``elasticsearch_cluster_initializing_shards``, the number of initializing
shards.
* ``elasticsearch_cluster_number_of_nodes``, the number of nodes in the cluster.
* ``elasticsearch_cluster_number_of_pending_tasks``, the number of pending tasks.
* ``elasticsearch_cluster_relocating_shards``, the number of relocating shards.
* ``elasticsearch_cluster_unassigned_shards``, the number of unassigned shards.
.. _cluster health: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster-health.html
* ``elasticsearch_cluster_unassigned_shards``, the number of unassigned shards.
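As an illustration, the same health information can be fetched directly from
the Elasticsearch API. The following Python sketch assumes a default endpoint
on ``localhost:9200`` and maps the textual status to the numeric value reported
by ``elasticsearch_cluster_health``::

    import json
    from urllib.request import urlopen

    # The endpoint is an assumption for a default deployment; adjust as needed.
    URL = 'http://localhost:9200/_cluster/health'

    # Textual API statuses mapped to the values emitted by the Collector.
    HEALTH_TO_METRIC = {'green': 1, 'yellow': 2, 'red': 3}

    health = json.loads(urlopen(URL, timeout=5).read().decode('utf-8'))
    # Unknown or unexpected results are reported as 'red' (3).
    print(health['status'], HEALTH_TO_METRIC.get(health['status'], 3))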


@ -1,6 +1,6 @@
.. _haproxy_metrics:
``frontend`` and ``backend`` field values can be:
The ``frontend`` and ``backend`` field values can be as follows:
* cinder-api
* glance-api
@ -35,7 +35,8 @@ Server
Frontends
^^^^^^^^^
The following metrics have a ``frontend`` field that contains the name of the frontend server.
The following metrics have a ``frontend`` field that contains the name of the
front-end server:
* ``haproxy_frontend_bytes_in``, the number of bytes received by the frontend.
* ``haproxy_frontend_bytes_out``, the number of bytes transmitted by the frontend.
@ -55,25 +56,33 @@ Backends
^^^^^^^^
.. _haproxy_backend_metric:
The following metrics have a ``backend`` field that contains the name of the backend server.
The following metrics have a ``backend`` field that contains the name of the
back-end server:
* ``haproxy_backend_bytes_in``, the number of bytes received by the backend.
* ``haproxy_backend_bytes_out``, the number of bytes transmitted by the backend.
* ``haproxy_backend_bytes_in``, the number of bytes received by the back end.
* ``haproxy_backend_bytes_out``, the number of bytes transmitted by the back end.
* ``haproxy_backend_denied_requests``, the number of denied requests.
* ``haproxy_backend_denied_responses``, the number of denied responses.
* ``haproxy_backend_downtime``, the total downtime in second.
* ``haproxy_backend_downtime``, the total downtime in seconds.
* ``haproxy_backend_error_connection``, the number of error connections.
* ``haproxy_backend_error_responses``, the number of error responses.
* ``haproxy_backend_queue_current``, the number of requests in queue.
* ``haproxy_backend_redistributed``, the number of times a request was redispatched to another server.
* ``haproxy_backend_redistributed``, the number of times a request was
redispatched to another server.
* ``haproxy_backend_response_1xx``, the number of HTTP responses with 1xx code.
* ``haproxy_backend_response_2xx``, the number of HTTP responses with 2xx code.
* ``haproxy_backend_response_3xx``, the number of HTTP responses with 3xx code.
* ``haproxy_backend_response_4xx``, the number of HTTP responses with 4xx code.
* ``haproxy_backend_response_5xx``, the number of HTTP responses with 5xx code.
* ``haproxy_backend_response_other``, the number of HTTP responses with other code.
* ``haproxy_backend_retries``, the number of times a connection to a server was retried.
* ``haproxy_backend_servers``, the count of servers grouped by state. This metric has an additional ``state`` field that contains the state of the backends (either 'down' or 'up').
* ``haproxy_backend_response_other``, the number of HTTP responses with other
code.
* ``haproxy_backend_retries``, the number of times a connection to a server
was retried.
* ``haproxy_backend_servers``, the count of servers grouped by state. This
metric has an additional ``state`` field that contains the state of the
back ends (either 'down' or 'up').
* ``haproxy_backend_session_current``, the number of current sessions.
* ``haproxy_backend_session_total``, the cumulative number of sessions.
* ``haproxy_backend_status``, the global backend status where values ``0`` and ``1`` represent respectively ``DOWN`` (all backends are down) and ``UP`` (at least one backend is up).
* ``haproxy_backend_status``, the global back-end status where values ``0``
and ``1`` represent, respectively, ``DOWN`` (all back ends are down) and ``UP``
(at least one back end is up).


@ -1,37 +1,47 @@
.. InfluxDB:
The following metrics are extracted from the output of ``show stats`` command.
The values are reset to zero when InfluxDB is restarted.
The following metrics are extracted from the output of the :command:`show stats`
command. The values are reset to zero when InfluxDB is restarted.
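For reference, these statistics can also be queried over the InfluxDB HTTP API.
The following Python sketch assumes a default endpoint on ``localhost:8086``
and no authentication::

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Host and port are assumptions for a default InfluxDB deployment.
    url = 'http://localhost:8086/query?' + urlencode({'q': 'SHOW STATS'})

    # One series is returned per module (cluster, httpd, write, runtime, ...);
    # all counters reset to zero when InfluxDB restarts.
    stats = json.loads(urlopen(url, timeout=5).read().decode('utf-8'))
    for series in stats['results'][0]['series']:
        print(series['name'], dict(zip(series['columns'], series['values'][0])))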
cluster
^^^^^^^
These metrics are only available if there are more than one node in the cluster.
The following metrics are only available if there is more than one node in the
cluster:
* ``influxdb_cluster_write_shard_points_requests``, the number of requests for writing a time series points to a shard.
* ``influxdb_cluster_write_shard_requests``, the number of requests for writing to a shard.
* ``influxdb_cluster_write_shard_points_requests``, the number of requests for
writing time-series points to a shard.
* ``influxdb_cluster_write_shard_requests``, the number of requests for writing
to a shard.
httpd
^^^^^
* ``influxdb_httpd_failed_auths``, the number of times failed authentications.
* ``influxdb_httpd_failed_auths``, the number of failed authentications.
* ``influxdb_httpd_ping_requests``, the number of ping requests.
* ``influxdb_httpd_query_requests``, the number of query requests received.
* ``influxdb_httpd_query_response_bytes``, the number of bytes returned to the client.
* ``influxdb_httpd_query_response_bytes``, the number of bytes returned to the
client.
* ``influxdb_httpd_requests``, the number of requests received.
* ``influxdb_httpd_write_points_ok``, the number of points successfully written.
* ``influxdb_httpd_write_request_bytes``, the number of bytes received for write requests.
* ``influxdb_httpd_write_request_bytes``, the number of bytes received for
write requests.
* ``influxdb_httpd_write_requests``, the number of write requests received.
write
^^^^^
* ``influxdb_write_local_point_requests``, the number of write points requests from the local data node.
* ``influxdb_write_local_point_requests``, the number of write points requests
from the local data node.
* ``influxdb_write_ok``, the number of successful writes at the requested
consistency level.
* ``influxdb_write_point_requests``, the number of write points requests across all data nodes.
* ``influxdb_write_remote_point_requests``, the number of write points requests to remote data nodes.
* ``influxdb_write_requests``, the number of write requests across all data nodes.
* ``influxdb_write_sub_ok``, the number of successful points send to subscriptions.
* ``influxdb_write_point_requests``, the number of write points requests across
all data nodes.
* ``influxdb_write_remote_point_requests``, the number of write points requests
to remote data nodes.
* ``influxdb_write_requests``, the number of write requests across all data
nodes.
* ``influxdb_write_sub_ok``, the number of successful points sent to
subscriptions.
runtime
^^^^^^^
@ -41,11 +51,12 @@ runtime
* ``influxdb_heap_idle``, the number of bytes in idle spans.
* ``influxdb_heap_in_use``, the number of bytes in non-idle spans.
* ``influxdb_heap_objects``, the total number of allocated objects.
* ``influxdb_heap_released``, the number of bytes released to the operating system.
* ``influxdb_heap_released``, the number of bytes released to the operating
system.
* ``influxdb_heap_system``, the number of bytes obtained from the system.
* ``influxdb_memory_alloc``, the number of bytes allocated and not yet freed.
* ``influxdb_memory_frees``, the number of free operations.
* ``influxdb_memory_lookups``, the number of pointer lookups.
* ``influxdb_memory_mallocs``, the number of malloc operations.
* ``influxdb_memory_system``, the number of bytes obtained from the system.
* ``influxdb_memory_total_alloc``, the number of bytes allocated (even if freed).
* ``influxdb_memory_total_alloc``, the number of bytes allocated (even if freed).


@ -1,6 +1,6 @@
.. _libvirt-metrics:
Every metric contains an ``instance_id`` field which is the UUID of the
Every metric contains an ``instance_id`` field, which is the UUID of the
instance for the Nova service.
CPU
@ -17,7 +17,7 @@ Disk
^^^^
Metrics have a ``device`` field that contains the virtual disk device to which
the metric applies (eg 'vda', 'vdb' and so on).
the metric applies. For example, 'vda', 'vdb', and others.
* ``virt_disk_octets_read``, the number of octets (bytes) read per second.
@ -37,7 +37,7 @@ Network
^^^^^^^
Metrics have an ``interface`` field that contains the interface name to which
the metric applies (eg 'tap0dc043a6-dd', 'tap769b123a-2e' and so on).
the metric applies. For example, 'tap0dc043a6-dd', 'tap769b123a-2e', and others.
* ``virt_if_dropped_rx``, the number of dropped packets per second when
receiving from the interface.
@ -61,4 +61,4 @@ the metric applies (eg 'tap0dc043a6-dd', 'tap769b123a-2e' and so on).
interface.
* ``virt_if_packets_tx``, the number of packets transmitted per second by the
interface.
interface.


@ -3,49 +3,67 @@
System
^^^^^^
Metrics have a ``service`` field with the name of the service it applies to. Values can be: hekad, collectd, influxd, grafana-server or elasticsearch.
The metrics have a ``service`` field with the name of the service they apply
to. The values can be: ``hekad``, ``collectd``, ``influxd``, ``grafana-server``,
or ``elasticsearch``.
* ``lma_components_count_processes``, number of processes currently running.
* ``lma_components_count_threads``, number of threads currently running.
* ``lma_components_cputime_syst``, percentage of CPU time spent in system mode by the service.
It can be greater than 100% when the node has more than one CPU.
* ``lma_components_cputime_user``, percentage of CPU time spent in user mode by the service.
It can be greater than 100% when the node has more than one CPU.
* ``lma_components_disk_bytes_read``, number of bytes read from disk(s) per second.
* ``lma_components_disk_bytes_write``, number of bytes written to disk(s) per second.
* ``lma_components_disk_ops_read``, number of read operations from disk(s) per second.
* ``lma_components_disk_ops_write``, number of write operations to disk(s) per second.
* ``lma_components_memory_code``, physical memory devoted to executable code (bytes).
* ``lma_components_memory_data``, physical memory devoted to other than executable code (bytes).
* ``lma_components_memory_rss``, non-swapped physical memory used (bytes).
* ``lma_components_memory_vm``, virtual memory size (bytes).
* ``lma_components_count_processes``, the number of processes currently running.
* ``lma_components_count_threads``, the number of threads currently running.
* ``lma_components_cputime_syst``, the percentage of CPU time spent in system
mode by the service. It can be greater than 100% when the node has more than
one CPU.
* ``lma_components_cputime_user``, the percentage of CPU time spent in user
mode by the service. It can be greater than 100% when the node has more than
one CPU.
* ``lma_components_disk_bytes_read``, the number of bytes read from disk(s) per
second.
* ``lma_components_disk_bytes_write``, the number of bytes written to disk(s)
per second.
* ``lma_components_disk_ops_read``, the number of read operations from disk(s)
per second.
* ``lma_components_disk_ops_write``, the number of write operations to disk(s)
per second.
* ``lma_components_memory_code``, the physical memory devoted to executable code
in bytes.
* ``lma_components_memory_data``, the physical memory devoted to other than
executable code in bytes.
* ``lma_components_memory_rss``, the non-swapped physical memory used in bytes.
* ``lma_components_memory_vm``, the virtual memory size in bytes.
* ``lma_components_pagefaults_majflt``, major page faults per second.
* ``lma_components_pagefaults_minflt``, minor page faults per second.
* ``lma_components_stacksize``, absolute value of the start address (the bottom)
* ``lma_components_stacksize``, the absolute value of the start address (the bottom)
of the stack minus the address of the current stack pointer.
Heka pipeline
^^^^^^^^^^^^^
Metrics have two fields: ``name`` that contains the name of the decoder or filter as defined by *Heka* and ``type`` that is either *decoder* or *filter*.
The metrics have two fields: ``name`` that contains the name of the decoder
or filter as defined by *Heka* and ``type`` that is either *decoder* or
*filter*.
Metrics for both types:
The metrics for both types are as follows:
* ``hekad_memory``, the total memory used by the Sandbox (in bytes).
* ``hekad_msg_avg_duration``, the average time for processing the message (in nanoseconds).
* ``hekad_msg_count``, the total number of messages processed by the decoder. This will reset to 0 when the process is restarted.
* ``hekad_memory``, the total memory in bytes used by the Sandbox.
* ``hekad_msg_avg_duration``, the average time in nanoseconds for processing
the message.
* ``hekad_msg_count``, the total number of messages processed by the decoder.
This resets to ``0`` when the process is restarted.
Additional metrics for *filter* type:
* ``heakd_timer_event_avg_duration``, the average time for executing the *timer_event* function (in nanoseconds).
* ``hekad_timer_event_count``, the total number of executions of the *timer_event* function. This will reset to 0 when the process is restarted.
* ``hekad_timer_event_avg_duration``, the average time in nanoseconds for
executing the *timer_event* function.
* ``hekad_timer_event_count``, the total number of executions of the
*timer_event* function. This resets to ``0`` when the process is restarted.
Backend checks
^^^^^^^^^^^^^^
Back-end checks
^^^^^^^^^^^^^^^
* ``http_check``, the backend's API status, 1 if it is responsive, if not 0.
The metric contains a ``service`` field that identifies the LMA backend service being checked.
* ``http_check``, the API status of the back end, ``1`` if it is responsive,
if not, then ``0``. The metric contains a ``service`` field that identifies
the LMA back-end service being checked.
``<service>`` is one of the following values (depending of which Fuel plugins are deployed in the environment):
``<service>`` is one of the following values, depending on which Fuel plugins
are deployed in the environment:
* 'influxdb'
* 'influxdb'


@ -1,25 +1,26 @@
.. _memcached_metrics:
* ``memcached_command_flush``, cumulative number of flush reqs.
* ``memcached_command_get``, cumulative number of retrieval reqs.
* ``memcached_command_set``, cumulative number of storage reqs.
* ``memcached_command_touch``, cumulative number of touch reqs.
* ``memcached_connections_current``, number of open connections.
* ``memcached_df_cache_free``, current number of free bytes to store items.
* ``memcached_df_cache_used``, current number of bytes used to store items.
* ``memcached_items_current``, current number of items stored.
* ``memcached_octets_rx``, total number of bytes read by this server from network.
* ``memcached_octets_tx``, total number of bytes sent by this server to network.
* ``memcached_ops_decr_hits``, number of successful decr reqs.
* ``memcached_ops_decr_misses``, number of decr reqs against missing keys.
* ``memcached_ops_evictions``, number of valid items removed from cache to free memory for new items.
* ``memcached_ops_hits``, number of keys that have been requested.
* ``memcached_ops_incr_hits``, number of successful incr reqs.
* ``memcached_ops_incr_misses``, number of successful incr reqs.
* ``memcached_ops_misses``, number of items that have been requested and not found.
* ``memcached_percent_hitratio``, percentage of get command hits (in cache).
* ``memcached_command_flush``, the cumulative number of flush reqs.
* ``memcached_command_get``, the cumulative number of retrieval reqs.
* ``memcached_command_set``, the cumulative number of storage reqs.
* ``memcached_command_touch``, the cumulative number of touch reqs.
* ``memcached_connections_current``, the number of open connections.
* ``memcached_df_cache_free``, the current number of free bytes to store items.
* ``memcached_df_cache_used``, the current number of bytes used to store items.
* ``memcached_items_current``, the current number of items stored.
* ``memcached_octets_rx``, the total number of bytes read by this server from
the network.
* ``memcached_octets_tx``, the total number of bytes sent by this server to
the network.
* ``memcached_ops_decr_hits``, the number of successful decr reqs.
* ``memcached_ops_decr_misses``, the number of decr reqs against missing keys.
* ``memcached_ops_evictions``, the number of valid items removed from cache to
free memory for new items.
* ``memcached_ops_hits``, the number of keys that have been requested.
* ``memcached_ops_incr_hits``, the number of successful incr reqs.
* ``memcached_ops_incr_misses``, the number of incr reqs against missing keys.
* ``memcached_ops_misses``, the number of items that have been requested and
not found.
* ``memcached_percent_hitratio``, the percentage of get command hits (in cache).
See `memcached documentation`_ for further details.
.. _memcached documentation: https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L488
For details, see the `Memcached documentation <https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L488>`_.
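These counters are exposed by the ``stats`` command of the memcached text
protocol. The following Python sketch reads them directly, assuming a default
instance listening on ``127.0.0.1:11211``::

    import socket

    # Host and port are assumptions for a default memcached instance.
    sock = socket.create_connection(('127.0.0.1', 11211), timeout=5)
    sock.sendall(b'stats\r\n')

    data = b''
    while not data.endswith(b'END\r\n'):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()

    # Each statistic is reported as a line of the form: STAT <name> <value>
    for line in data.decode().splitlines():
        if line.startswith('STAT '):
            _, name, value = line.split(' ', 2)
            print(name, value)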


@ -4,8 +4,8 @@ Commands
^^^^^^^^
``mysql_commands``, the number of times per second a given statement has been
executed. The metric has a ``statement`` field that contains the statement to
which it applies. The values can be:
executed. The metric has a ``statement`` field that contains the statement to
which it applies. The values can be as follows:
* ``change_db`` for the USE statement.
* ``commit`` for the COMMIT statement.
@ -29,7 +29,7 @@ Handlers
``mysql_handler``, the number of times per second a given handler has been
executed. The metric has a ``handler`` field that contains the handler
it applies to. The values can be:
it applies to. The values can be as follows:
* ``commit`` for the internal COMMIT statements.
* ``delete`` for the internal DELETE statements.
@ -40,56 +40,69 @@ it applies to. The values can be:
* ``read_prev`` for the requests that read the previous row in key order.
* ``read_rnd`` for the requests that read a row based on a fixed position.
* ``read_rnd_next`` for the requests that read the next row in the data file.
* ``rollback`` the requests that perform rollback operation.
* ``rollback`` for the requests that perform a rollback operation.
* ``update`` for the requests that update a row in a table.
* ``write`` for the requests that insert a row in a table.
Locks
^^^^^
* ``mysql_locks_immediate``, the number of times per second the requests for table locks could be granted immediately.
* ``mysql_locks_waited``, the number of times per second the requests for table locks had to wait.
* ``mysql_locks_immediate``, the number of times per second the requests for
table locks could be granted immediately.
* ``mysql_locks_waited``, the number of times per second the requests for
table locks had to wait.
Network
^^^^^^^
* ``mysql_octets_rx``, the number of bytes received per second by the server.
* ``mysql_octets_tx``, the number of bytes sent per second by the server.
* ``mysql_octets_rx``, the number of bytes per second received by the server.
* ``mysql_octets_tx``, the number of bytes per second sent by the server.
Threads
^^^^^^^
* ``mysql_threads_cached``, the number of threads in the thread cache.
* ``mysql_threads_connected``, the number of currently open connections.
* ``mysql_threads_created``, the number of threads created per second to handle connections.
* ``mysql_threads_created``, the number of threads created per second to
handle connections.
* ``mysql_threads_running``, the number of threads that are not sleeping.
Cluster
^^^^^^^
These metrics are collected with statement 'SHOW STATUS'. see `Percona documentation`_
for further details.
The following metrics are collected with the 'SHOW STATUS' statement. For
details, see `Percona documentation <http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html>`_.
* ``mysql_cluster_connected``, ``1`` when the node is connected to the cluster, if not ``0``.
* ``mysql_cluster_local_cert_failures``, number of writesets that failed the certification test.
* ``mysql_cluster_local_commits``, number of writesets commited on the node.
* ``mysql_cluster_local_recv_queue``, the number of writesets waiting to be applied.
* ``mysql_cluster_local_send_queue``, the number of writesets waiting to be sent.
* ``mysql_cluster_ready``, ``1`` when the node is ready to accept queries, if not ``0``.
* ``mysql_cluster_received``, total number of writesets received from other nodes.
* ``mysql_cluster_received_bytes``, total size in bytes of writesets received from other nodes.
* ``mysql_cluster_replicated``, total number of writesets sent to other nodes.
* ``mysql_cluster_replicated_bytes`` total size in bytes of writesets sent to other nodes.
* ``mysql_cluster_size``, current number of nodes in the cluster.
* ``mysql_cluster_status``, ``1`` when the node is 'Primary', ``2`` if 'Non-Primary' and ``3`` if 'Disconnected'.
* ``mysql_cluster_connected``, ``1`` when the node is connected to the cluster,
if not, then ``0``.
* ``mysql_cluster_local_cert_failures``, the number of write sets that failed
the certification test.
* ``mysql_cluster_local_commits``, the number of write sets committed on the
node.
* ``mysql_cluster_local_recv_queue``, the number of write sets waiting to be
applied.
* ``mysql_cluster_local_send_queue``, the number of write sets waiting to be
sent.
* ``mysql_cluster_ready``, ``1`` when the node is ready to accept queries, if
not, then ``0``.
* ``mysql_cluster_received``, the total number of write sets received from
other nodes.
* ``mysql_cluster_received_bytes``, the total size in bytes of write sets
received from other nodes.
* ``mysql_cluster_replicated``, the total number of write sets sent to other
nodes.
* ``mysql_cluster_replicated_bytes``, the total size in bytes of write sets
sent to other nodes.
* ``mysql_cluster_size``, the current number of nodes in the cluster.
* ``mysql_cluster_status``, ``1`` when the node is 'Primary', ``2`` if
'Non-Primary', and ``3`` if 'Disconnected'.
.. _Percona documentation: http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html
Slow Queries
Slow queries
^^^^^^^^^^^^
This metric is collected with statement 'SHOW STATUS where Variable_name = 'Slow_queries'.
* ``mysql_slow_queries``, number of queries that have taken more than X seconds,
depending of the MySQL configuration parameter 'long_query_time' (10s per default)
The following metric is collected with the statement
``SHOW STATUS WHERE Variable_name = 'Slow_queries'``:
* ``mysql_slow_queries``, the number of queries that have taken more than X
seconds, depending on the MySQL configuration parameter 'long_query_time'
(10 seconds by default).
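A minimal Python sketch of reading the same counter directly is shown below;
the PyMySQL driver and the connection credentials are assumptions and must be
adapted to the deployment::

    import pymysql  # assumption: the PyMySQL driver is installed

    # Connection parameters are assumptions; adjust to your deployment.
    conn = pymysql.connect(host='127.0.0.1', user='monitor', password='secret')
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW STATUS WHERE Variable_name = 'Slow_queries'")
            name, value = cursor.fetchone()
            print(name, value)  # for example: Slow_queries 42
    finally:
        conn.close()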


@ -4,10 +4,12 @@ Service checks
^^^^^^^^^^^^^^
.. _service_checks:
* ``openstack_check_api``, the service's API status, 1 if it is responsive, if not 0.
The metric contains a ``service`` field that identifies the OpenStack service being checked.
* ``openstack_check_api``, the service's API status, ``1`` if it is responsive,
if not, then ``0``. The metric contains a ``service`` field that identifies
the OpenStack service being checked.
``<service>`` is one of the following values with their respective resource checks:
``<service>`` is one of the following values with their respective resource
checks:
* 'ceilometer-api': '/v2/capabilities'
* 'cinder-api': '/'
@ -21,61 +23,75 @@ Service checks
* 'swift-api': '/healthcheck'
* 'swift-s3-api': '/healthcheck'
.. note:: All checks are performed without authentication except for Ceilometer.
.. note:: All checks except for Ceilometer are performed without authentication.
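As an illustration, the following simplified Python sketch performs the same
kind of unauthenticated check. It is not the Collector's actual implementation,
and the endpoint URL is an assumption that must be replaced with the URL of the
service to verify::

    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    # The URL is an assumption; use the endpoint of the service to check,
    # for example the Cinder API root ('/') listed above.
    URL = 'http://127.0.0.1:8776/'

    try:
        urlopen(URL, timeout=5)
        status = 1  # the API answered
    except HTTPError:
        status = 1  # the API answered, even if with an HTTP error code
    except URLError:
        status = 0  # the API is unreachable

    print('openstack_check_api =', status)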
Compute
^^^^^^^
These metrics are emitted per compute node.
The following metrics are emitted per compute node:
* ``openstack_nova_free_disk``, the disk space (in GB) available for new instances.
* ``openstack_nova_free_ram``, the memory (in MB) available for new instances.
* ``openstack_nova_free_vcpus``, the number of virtual CPU available for new instances.
* ``openstack_nova_instance_creation_time``, the time (in seconds) it took to launch a new instance.
* ``openstack_nova_instance_state``, the number of instances which entered a given state (the value is always 1).
* ``openstack_nova_free_disk``, the disk space in GB available for new instances.
* ``openstack_nova_free_ram``, the memory in MB available for new instances.
* ``openstack_nova_free_vcpus``, the number of virtual CPUs available for new
instances.
* ``openstack_nova_instance_creation_time``, the time in seconds it took to
launch a new instance.
* ``openstack_nova_instance_state``, the number of instances which entered a
given state (the value is always ``1``).
The metric contains a ``state`` field.
* ``openstack_nova_running_instances``, the number of running instances.
* ``openstack_nova_running_tasks``, the number of tasks currently executed.
* ``openstack_nova_used_disk``, the disk space (in GB) used by the instances.
* ``openstack_nova_used_ram``, the memory (in MB) used by the instances.
* ``openstack_nova_used_vcpus``, the number of virtual CPU used by the instances.
* ``openstack_nova_used_disk``, the disk space in GB used by the instances.
* ``openstack_nova_used_ram``, the memory in MB used by the instances.
* ``openstack_nova_used_vcpus``, the number of virtual CPUs used by the
instances.
These metrics are retrieved from the Nova API and represent the aggregated
values across all compute nodes.
The following metrics are retrieved from the Nova API and represent the
aggregated values across all compute nodes:
* ``openstack_nova_total_free_disk``, the total amount of disk space (in GB) available for new instances.
* ``openstack_nova_total_free_ram``, the total amount of memory (in MB) available for new instances.
* ``openstack_nova_total_free_vcpus``, the total number of virtual CPU available for new instances.
* ``openstack_nova_total_running_instances``, the total number of running instances.
* ``openstack_nova_total_running_tasks``, the total number of tasks currently executed.
* ``openstack_nova_total_used_disk``, the total amount of disk space (in GB) used by the instances.
* ``openstack_nova_total_used_ram``, the total amount of memory (in MB) used by the instances.
* ``openstack_nova_total_used_vcpus``, the total number of virtual CPU used by the instances.
* ``openstack_nova_total_free_disk``, the total amount of disk space in GB
available for new instances.
* ``openstack_nova_total_free_ram``, the total amount of memory in MB available
for new instances.
* ``openstack_nova_total_free_vcpus``, the total number of virtual CPUs
available for new instances.
* ``openstack_nova_total_running_instances``, the total number of running
instances.
* ``openstack_nova_total_running_tasks``, the total number of tasks currently
executed.
* ``openstack_nova_total_used_disk``, the total amount of disk space in GB
used by the instances.
* ``openstack_nova_total_used_ram``, the total amount of memory in MB used by
the instances.
* ``openstack_nova_total_used_vcpus``, the total number of virtual CPUs used by
the instances.
These metrics are retrieved from the Nova API.
The following metrics are retrieved from the Nova API:
* ``openstack_nova_instances``, the total count of instances in a given state.
The metric contains a ``state`` field which is one of 'active', 'deleted',
'error', 'paused', 'resumed', 'rescued', 'resized', 'shelved_offloaded' or
'suspended'.
These metrics are retrieved from the Nova database.
The following metrics are retrieved from the Nova database:
.. _compute-service-state-metrics:
* ``openstack_nova_service``, the Nova service state (either 0 for 'up', 1 for 'down' or 2 for 'disabled').
The metric contains a ``service`` field (one of 'compute', 'conductor', 'scheduler', 'cert'
or 'consoleauth') and a ``state`` field (one of 'up', 'down' or 'disabled').
* ``openstack_nova_service``, the Nova service state (either ``0`` for 'up',
``1`` for 'down' or ``2`` for 'disabled').
The metric contains a ``service`` field (one of 'compute', 'conductor',
'scheduler', 'cert' or 'consoleauth') and a ``state`` field (one of 'up',
'down' or 'disabled').
* ``openstack_nova_services``, the total count of Nova
services by state. The metric contains a ``service`` field (one of 'compute',
'conductor', 'scheduler', 'cert' or 'consoleauth') and a ``state`` field (one
of 'up', 'down' or 'disabled').
of 'up', 'down', or 'disabled').
Identity
^^^^^^^^
These metrics are retrieved from the Keystone API.
The following metrics are retrieved from the Keystone API:
* ``openstack_keystone_roles``, the total number of roles.
* ``openstack_keystone_tenants``, the number of tenants by state. The metric
@ -86,28 +102,37 @@ These metrics are retrieved from the Keystone API.
Volume
^^^^^^
These metrics are emitted per volume node.
The following metrics are emitted per volume node:
* ``openstack_cinder_volume_creation_time``, the time (in seconds) it took to create a new volume.
* ``openstack_cinder_volume_creation_time``, the time in seconds it took to
create a new volume.
.. note:: When using Ceph as the backend storage for volumes, the ``hostname`` value is always set to ``rbd``.
.. note:: When using Ceph as the back-end storage for volumes, the ``hostname``
value is always set to ``rbd``.
These metrics are retrieved from the Cinder API.
The following metrics are retrieved from the Cinder API:
* ``openstack_cinder_snapshots``, the number of snapshots by state. The metric contains a ``state`` field.
* ``openstack_cinder_snapshots_size``, the total size (in bytes) of snapshots by state. The metric contains a ``state`` field.
* ``openstack_cinder_volumes``, the number of volumes by state. The metric contains a ``state`` field.
* ``openstack_cinder_volumes_size``, the total size (in bytes) of volumes by state. The metric contains a ``state`` field.
* ``openstack_cinder_snapshots``, the number of snapshots by state. The metric
contains a ``state`` field.
* ``openstack_cinder_snapshots_size``, the total size (in bytes) of snapshots
by state. The metric contains a ``state`` field.
* ``openstack_cinder_volumes``, the number of volumes by state. The metric
contains a ``state`` field.
* ``openstack_cinder_volumes_size``, the total size (in bytes) of volumes by
state. The metric contains a ``state`` field.
``state`` is one of 'available', 'creating', 'attaching', 'in-use', 'deleting', 'backing-up', 'restoring-backup', 'error', 'error_deleting', 'error_restoring', 'error_extending'.
``state`` is one of 'available', 'creating', 'attaching', 'in-use', 'deleting',
'backing-up', 'restoring-backup', 'error', 'error_deleting', 'error_restoring',
'error_extending'.
These metrics are retrieved from the Cinder database.
The following metrics are retrieved from the Cinder database:
.. _volume-service-state-metrics:
* ``openstack_cinder_service``, the Cinder service state (either 0 for 'up', 1 for 'down' or 2 for 'disabled').
The metric contains a ``service`` field (one of 'volume', 'backup', 'scheduler'),
and a ``state`` field (one of 'up', 'down' or 'disabled').
* ``openstack_cinder_service``, the Cinder service state (either ``0`` for
'up', ``1`` for 'down', or ``2`` for 'disabled'). The metric contains a
``service`` field (one of 'volume', 'backup', 'scheduler') and a ``state``
field (one of 'up', 'down' or 'disabled').
* ``openstack_cinder_services``, the total count of Cinder services by state.
The metric contains a ``service`` field (one of 'volume', 'backup',
@ -116,17 +141,18 @@ These metrics are retrieved from the Cinder database.
Image
^^^^^
These metrics are retrieved from the Glance API.
The following metrics are retrieved from the Glance API:
* ``openstack_glance_images``, the number of images by state and visibility.
The metric contains ``state`` and ``visibility`` field.
The metric contains ``state`` and ``visibility`` fields.
* ``openstack_glance_images_size``, the total size (in bytes) of images by
state and visibility. The metric contains ``state`` and ``visibility`` field.
state and visibility. The metric contains ``state`` and ``visibility``
fields.
* ``openstack_glance_snapshots``, the number of snapshot images by state and
visibility. The metric contains ``state`` and ``visibility`` field.
visibility. The metric contains ``state`` and ``visibility`` fields.
* ``openstack_glance_snapshots_size``, the total size (in bytes) of snapshots
by state and visibility. The metric contains ``state`` and ``visibility``
field.
fields.
``state`` is one of 'queued', 'saving', 'active', 'killed', 'deleted',
'pending_delete'. ``visibility`` is either 'public' or 'private'.
@ -134,27 +160,32 @@ These metrics are retrieved from the Glance API.
Network
^^^^^^^
These metrics are retrieved from the Neutron API.
The following metrics are retrieved from the Neutron API:
* ``openstack_neutron_floatingips``, the total number of floating IP addresses.
* ``openstack_neutron_networks``, the number of virtual networks by state. The metric contains a ``state`` field.
* ``openstack_neutron_ports``, the number of virtual ports by owner and state. The metric contains ``owner`` and ``state`` fields.
* ``openstack_neutron_routers``, the number of virtual routers by state. The metric contains a ``state`` field.
* ``openstack_neutron_networks``, the number of virtual networks by state. The
metric contains a ``state`` field.
* ``openstack_neutron_ports``, the number of virtual ports by owner and state.
The metric contains ``owner`` and ``state`` fields.
* ``openstack_neutron_routers``, the number of virtual routers by state. The
metric contains a ``state`` field.
* ``openstack_neutron_subnets``, the number of virtual subnets.
``<state>`` is one of 'active', 'build', 'down' or 'error'.
``<owner>`` is one of 'compute', 'dhcp', 'floatingip', 'floatingip_agent_gateway', 'router_interface', 'router_gateway', 'router_ha_interface', 'router_interface_distributed' or 'router_centralized_snat'.
``<owner>`` is one of 'compute', 'dhcp', 'floatingip', 'floatingip_agent_gateway', 'router_interface', 'router_gateway', 'router_ha_interface',
'router_interface_distributed', or 'router_centralized_snat'.
These metrics are retrieved from the Neutron database.
The following metrics are retrieved from the Neutron database:
.. _network-agent-state-metrics:
.. note:: These metrics are not collected when the Contrail plugin is deployed.
* ``openstack_neutron_agent``, the Neutron agent state (either 0 for 'up', 1 for 'down' or 2 for 'disabled').
The metric contains a ``service`` field (one of 'dhcp', 'l3', 'metadata' or 'openvswitch'),
and a ``state`` field (one of 'up', 'down' or 'disabled').
* ``openstack_neutron_agent``, the Neutron agent state (either ``0`` for 'up',
``1`` for 'down', or ``2`` for 'disabled').
The metric contains a ``service`` field (one of 'dhcp', 'l3', 'metadata', or
'openvswitch'), and a ``state`` field (one of 'up', 'down' or 'disabled').
* ``openstack_neutron_agents``, the total number of Neutron agents by service
and state. The metric contains ``service`` (one of 'dhcp', 'l3', 'metadata'
@ -164,12 +195,17 @@ API response times
^^^^^^^^^^^^^^^^^^
* ``openstack_<service>_http_response_times``, HTTP response time statistics.
The statistics are ``min``, ``max``, ``sum``, ``count``, ``upper_90`` (90 percentile) over 10 seconds.
The metric contains ``http_method`` (eg 'GET', 'POST', and so forth) and ``http_status`` (eg '2xx', '4xx', and so forth) fields.
The statistics are ``min``, ``max``, ``sum``, ``count``, ``upper_90``
(90th percentile) over 10 seconds. The metric contains an ``http_method`` field,
for example, 'GET', 'POST', and others, and an ``http_status`` field, for
example, '2xx', '4xx', and others.
``<service>`` is one of 'cinder', 'glance', 'heat' 'keystone', 'neutron' or 'nova'.
``<service>`` is one of 'cinder', 'glance', 'heat', 'keystone', 'neutron', or
'nova'.
Logs
^^^^
* ``log_messages``, the number of log messages per second for the given service and severity level. The metric contains ``service`` and ``level`` (one of 'debug', 'info', ... ) fields.
* ``log_messages``, the number of log messages per second for the given
service and severity level. The metric contains ``service`` and ``level``
(one of 'debug', 'info', and others) fields.


@ -4,6 +4,6 @@ Resource location
^^^^^^^^^^^^^^^^^
* ``pacemaker_resource_local_active``, ``1`` when the resource is located on
the host reporting the metric, if not ``0``. The metric contains a
the host reporting the metric, if not, then ``0``. The metric contains a
``resource`` field which is one of 'vip__public', 'vip__management',
'vip__vrouter_pub' or 'vip__vrouter'.
'vip__vrouter_pub', or 'vip__vrouter'.


@ -3,16 +3,23 @@
Cluster
^^^^^^^
* ``rabbitmq_connections``, total number of connections.
* ``rabbitmq_consumers``, total number of consumers.
* ``rabbitmq_channels``, total number of channels.
* ``rabbitmq_exchanges``, total number of exchanges.
* ``rabbitmq_messages``, total number of messages which are ready to be consumed or not yet acknowledged.
* ``rabbitmq_queues``, total number of queues.
* ``rabbitmq_running_nodes``, total number of running nodes in the cluster.
* ``rabbitmq_disk_free``, the disk free space.
* ``rabbitmq_disk_free_limit``, the minimum amount of free disk for RabbitMQ. When ``rabbitmq_disk_free`` drops below this value, all producers are blocked.
* ``rabbitmq_remaining_disk``, the difference between ``rabbitmq_disk_free`` and ``rabbitmq_disk_free_limit``.
* ``rabbitmq_connections``, the total number of connections.
* ``rabbitmq_consumers``, the total number of consumers.
* ``rabbitmq_channels``, the total number of channels.
* ``rabbitmq_exchanges``, the total number of exchanges.
* ``rabbitmq_messages``, the total number of messages which are ready to be
consumed or not yet acknowledged.
* ``rabbitmq_queues``, the total number of queues.
* ``rabbitmq_running_nodes``, the total number of running nodes in the cluster.
* ``rabbitmq_disk_free``, the free disk space.
* ``rabbitmq_disk_free_limit``, the minimum amount of free disk space for
RabbitMQ.
When ``rabbitmq_disk_free`` drops below this value, all producers are blocked.
* ``rabbitmq_remaining_disk``, the difference between ``rabbitmq_disk_free``
and ``rabbitmq_disk_free_limit``.
* ``rabbitmq_used_memory``, the number of bytes of memory used by the whole
RabbitMQ process.
* ``rabbitmq_vm_memory_limit``, the maximum amount of memory allocated for RabbitMQ. When ``rabbitmq_used_memory`` uses more than this value, all producers are blocked.
* ``rabbitmq_remaining_memory``, the difference between ``rabbitmq_vm_memory_limit`` and ``rabbitmq_used_memory``.
* ``rabbitmq_vm_memory_limit``, the maximum amount of memory allocated for
RabbitMQ. When ``rabbitmq_used_memory`` uses more than this value, all
producers are blocked.
* ``rabbitmq_remaining_memory``, the difference between
``rabbitmq_vm_memory_limit`` and ``rabbitmq_used_memory``.
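For clarity, the two ``remaining`` metrics above are simple differences of the
reported values. The following Python sketch uses illustrative figures only::

    # The figures below are illustrative; the real values come from the
    # RabbitMQ node being monitored.
    rabbitmq_disk_free = 50 * 1024 ** 3        # 50 GiB free on the partition
    rabbitmq_disk_free_limit = 5 * 1024 ** 3   # producers blocked below 5 GiB
    rabbitmq_used_memory = 2 * 1024 ** 3
    rabbitmq_vm_memory_limit = 6 * 1024 ** 3

    rabbitmq_remaining_disk = rabbitmq_disk_free - rabbitmq_disk_free_limit
    rabbitmq_remaining_memory = rabbitmq_vm_memory_limit - rabbitmq_used_memory

    print(rabbitmq_remaining_disk, rabbitmq_remaining_memory)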


@ -3,36 +3,45 @@
CPU
^^^
Metrics have a ``cpu_number`` field that contains the CPU number to which the metric applies.
Metrics have a ``cpu_number`` field that contains the CPU number to which the
metric applies.
* ``cpu_idle``, percentage of CPU time spent in the idle task.
* ``cpu_interrupt``, percentage of CPU time spent servicing interrupts.
* ``cpu_nice``, percentage of CPU time spent in user mode with low priority (nice).
* ``cpu_softirq``, percentage of CPU time spent servicing soft interrupts.
* ``cpu_steal``, percentage of CPU time spent in other operating systems.
* ``cpu_system``, percentage of CPU time spent in system mode.
* ``cpu_user``, percentage of CPU time spent in user mode.
* ``cpu_wait``, percentage of CPU time spent waiting for I/O operations to complete.
* ``cpu_idle``, the percentage of CPU time spent in the idle task.
* ``cpu_interrupt``, the percentage of CPU time spent servicing interrupts.
* ``cpu_nice``, the percentage of CPU time spent in user mode with low
priority (nice).
* ``cpu_softirq``, the percentage of CPU time spent servicing soft interrupts.
* ``cpu_steal``, the percentage of CPU time spent in other operating systems.
* ``cpu_system``, the percentage of CPU time spent in system mode.
* ``cpu_user``, the percentage of CPU time spent in user mode.
* ``cpu_wait``, the percentage of CPU time spent waiting for I/O operations to
complete.
Disk
^^^^
Metrics have a ``device`` field that contains the disk device number the metric applies to (eg 'sda', 'sdb' and so on).
Metrics have a ``device`` field that contains the disk device name the metric
applies to. For example, 'sda', 'sdb', and others.
* ``disk_merged_read``, the number of read operations per second that could be merged with already queued operations.
* ``disk_merged_write``, the number of write operations per second that could be merged with already queued operations.
* ``disk_merged_read``, the number of read operations per second that could be
merged with already queued operations.
* ``disk_merged_write``, the number of write operations per second that could
be merged with already queued operations.
* ``disk_octets_read``, the number of octets (bytes) read per second.
* ``disk_octets_write``, the number of octets (bytes) written per second.
* ``disk_ops_read``, the number of read operations per second.
* ``disk_ops_write``, the number of write operations per second.
* ``disk_time_read``, the average time for a read operation to complete in the last interval.
* ``disk_time_write``, the average time for a write operation to complete in the last interval.
* ``disk_time_read``, the average time for a read operation to complete in the
last interval.
* ``disk_time_write``, the average time for a write operation to complete in
the last interval.
File system
^^^^^^^^^^^
Metrics have a ``fs`` field that contains the partition's mount point to which the metric applies (eg '/', '/var/lib' and so on).
Metrics have a ``fs`` field that contains the partition's mount point to which
the metric applies. For example, '/', '/var/lib', and others.
* ``fs_inodes_free``, the number of free inodes on the file system.
* ``fs_inodes_percent_free``, the percentage of free inodes on the file system.
@ -52,46 +61,53 @@ System load
* ``load_longterm``, the system load average over the last 15 minutes.
* ``load_midterm``, the system load average over the last 5 minutes.
* ``load_shortterm``, the system load averge over the last minute.
* ``load_shortterm``, the system load average over the last minute.
Memory
^^^^^^
* ``memory_buffered``, the amount of memory (in bytes) which is buffered.
* ``memory_cached``, the amount of memory (in bytes) which is cached.
* ``memory_free``, the amount of memory (in bytes) which is free.
* ``memory_used``, the amount of memory (in bytes) which is used.
* ``memory_buffered``, the amount of buffered memory in bytes.
* ``memory_cached``, the amount of cached memory in bytes.
* ``memory_free``, the amount of free memory in bytes.
* ``memory_used``, the amount of used memory in bytes.
Network
^^^^^^^
Metrics have a ``interface`` field that contains the interface name the metric applies to (eg 'eth0', 'eth1' and so on).
Metrics have an ``interface`` field that contains the interface name the
metric applies to. For example, 'eth0', 'eth1', and others.
* ``if_errors_rx``, the number of errors per second detected when receiving from the interface.
* ``if_errors_tx``, the number of errors per second detected when transmitting from the interface.
* ``if_octets_rx``, the number of octets (bytes) received per second by the interface.
* ``if_octets_tx``, the number of octets (bytes) transmitted per second by the interface.
* ``if_packets_rx``, the number of packets received per second by the interface.
* ``if_packets_tx``, the number of packets transmitted per second by the interface.
* ``if_errors_rx``, the number of errors per second detected when receiving
from the interface.
* ``if_errors_tx``, the number of errors per second detected when transmitting
from the interface.
* ``if_octets_rx``, the number of octets (bytes) received per second by the
interface.
* ``if_octets_tx``, the number of octets (bytes) transmitted per second by the
interface.
* ``if_packets_rx``, the number of packets received per second by the
interface.
* ``if_packets_tx``, the number of packets transmitted per second by the
interface.
Processes
^^^^^^^^^
* ``processes_count``, the number of processes in a given state. The metric has
a ``state`` field (one of 'blocked', 'paging', 'running', 'sleeping', 'stopped'
or 'zombies').
a ``state`` field (one of 'blocked', 'paging', 'running', 'sleeping',
'stopped' or 'zombies').
* ``processes_fork_rate``, the number of processes forked per second.
Swap
^^^^
* ``swap_cached``, the amount of cached memory (in bytes) which is in the swap.
* ``swap_free``, the amount of free memory (in bytes) which is in the swap.
* ``swap_cached``, the amount of cached memory (in bytes) that is in the swap.
* ``swap_free``, the amount of free memory (in bytes) that is in the swap.
* ``swap_io_in``, the number of swap pages written per second.
* ``swap_io_out``, the number of swap pages read per second.
* ``swap_used``, the amount of used memory (in bytes) which is in the swap.
* ``swap_used``, the amount of used memory (in bytes) that is in the swap.
Users
^^^^^
* ``logged_users``, the number of users currently logged-in.
* ``logged_users``, the number of users currently logged in.