diff --git a/doc/user/source/appendix_alarms.rst b/doc/user/source/appendix_alarms.rst index 17ddcb66d..e3302cda5 100644 --- a/doc/user/source/appendix_alarms.rst +++ b/doc/user/source/appendix_alarms.rst @@ -1,9 +1,13 @@ .. _alarms: +.. raw:: latex + + \pagebreak + List of built-in alarms ----------------------- -Here is a list of all the alarms that are built-in in StackLight:: +The following is a list of StackLight built-in alarms:: alarms: - name: 'cpu-critical-controller' @@ -732,5 +736,4 @@ Here is a list of all the alarms that are built-in in StackLight:: threshold: 5 window: 60 periods: 0 - function: min - + function: min \ No newline at end of file diff --git a/doc/user/source/appendix_metrics.rst b/doc/user/source/appendix_metrics.rst index aa3f03abf..00a829f2f 100644 --- a/doc/user/source/appendix_metrics.rst +++ b/doc/user/source/appendix_metrics.rst @@ -3,8 +3,8 @@ List of metrics --------------- -Here is a list of metrics that are emitted by the StackLight Collector. -They are listed by category then by metric name. +The following is a list of metrics that are emitted by the StackLight Collector. +The metrics are listed by category, then by metric name. System ++++++ @@ -63,7 +63,7 @@ Clusters .. include:: metrics/clusters.rst -Self Monitoring +Self-monitoring +++++++++++++++ .. include:: metrics/lma.rst @@ -78,4 +78,4 @@ Elasticsearch InfluxDB ++++++++ -.. include:: metrics/influxdb.rst +.. include:: metrics/influxdb.rst \ No newline at end of file diff --git a/doc/user/source/configure_alarms.rst b/doc/user/source/configure_alarms.rst index 9472dcac1..af578f068 100644 --- a/doc/user/source/configure_alarms.rst +++ b/doc/user/source/configure_alarms.rst @@ -3,139 +3,130 @@ Overview -------- -The process of running alarms in StackLight is not centralized -(as it is often the case in more conventional monitoring systems) -but distributed across all the StackLight Collectors. +The process of running alarms in StackLight is not centralized, as it is often +the case in more conventional monitoring systems, but distributed across all +the StackLight Collectors. -Each Collector is individually responsible for monitoring the -resources and the services that are deployed on the node and for reporting -any anomaly or fault it has detected to the Aggregator. +Each Collector is individually responsible for monitoring the resources and +services that are deployed on the node and for reporting any anomaly or fault +it has detected to the Aggregator. -The anomaly and fault detection logic in StackLight is designed -more like an *expert system* in that the Collector and the Aggregator -use artifacts we could refer to as *facts* and *rules*. +The anomaly and fault detection logic in StackLight is designed more like an +*expert system* in that the Collector and the Aggregator use artifacts we +can refer to as *facts* and *rules*. The *facts* are the operational data ingested in the StackLight's -stream processing pipeline. -The *rules* are either alarm rules or aggregation rules. -They are declaratively defined in YAML files that can be modified. -Those rules are turned into a collection of Lua plugins -that are executed by the Collector and the Aggregator. -They are generated dynamically using the Puppet modules of the StackLight -Collector Plugin. +stream-processing pipeline. The *rules* are either alarm rules or aggregation +rules. They are declaratively defined in YAML files that can be modified. 
+Those rules are turned into a collection of Lua plugins that are executed by +the Collector and the Aggregator. They are generated dynamically using the +Puppet modules of the StackLight Collector Plugin. -There are two types of Lua plugins related to the processing -of alarms. +The following are the two types of Lua plugins related to the processing of +alarms: -* The **AFD plugin** for Anomaly and Fault Detection plugin. -* The **GSE plugin** for Global Status Evaluation plugin. +* The **AFD plugin** -- Anomaly and Fault Detection plugin +* The **GSE plugin** -- Global Status Evaluation plugin -These plugins create a special type of metric called respectively -the **AFD metric** and the **GSE metric**. +These plugins create special types of metrics, as follows: -* The AFD metric contains information about the health status - of a node or service in the OpenStack environment. - The AFD metrics are sent on a regular basis to the Aggregator - where they are further processed by the GSE plugins. -* The GSE metric contains information about the health status - of a cluster in the OpenStack environment. A cluster is a - logical grouping of nodes or services. We call - them node clusters and service clusters hereafter. - A service cluster can be anything like a cluster of API endpoints - or a cluster of workers. A cluster of nodes is a grouping of - nodes that have the same role. For example 'compute' or 'storage'. +* The **AFD metric**, which contains information about the health status of a + node or service in the OpenStack environment. The AFD metrics are sent on a + regular basis to the Aggregator where they are further processed by the GSE + plugins. -.. note:: The AFD and GSE metrics are new types of metrics introduced - in StackLight version 0.8. - They contain detailed information about the fault and anomalies - detected by StackLight. Please refer to the +* The **GSE metric**, which contains information about the health status of a + cluster in the OpenStack environment. A cluster is a logical grouping of + nodes or services. We call them node clusters and service clusters hereafter. + A service cluster can be anything like a cluster of API endpoints or a + cluster of workers. A cluster of nodes is a grouping of nodes that have the + same role. For example, *compute* or *storage*. + +.. note:: The AFD and GSE metrics are new types of metrics introduced in + StackLight version 0.8. They contain detailed information about the fault + and anomalies detected by StackLight. For more information about the + message structure of these metrics, refer to `Metrics section of the Developer Guide - `_ - for more information about the message structure of these metrics. + `_. -The StackLight stream processing pipeline workflow is shown in the figure below: +The following figure shows the StackLight stream-processing pipeline workflow: .. figure:: ../../images/AFD_and_GSE_message_flow.* :width: 800 :alt: Message flow for the AFD and GSE metrics - :align: center + +.. raw:: latex + + \pagebreak The AFD and GSE plugins ----------------------- -In the current version of StackLight, there are three types of GSE plugins: +The current version of StackLight contains the following three types of GSE +plugins: -* The **Service Cluster GSE Plugin** which receives AFD metrics for services +* The **Service Cluster GSE Plugin**, which receives AFD metrics for services from the AFD plugins. 
-* The **Node Cluster GSE Plugin** which receives AFD metrics for nodes +* The **Node Cluster GSE Plugin**, which receives AFD metrics for nodes from the AFD plugins. -* The **Global Cluster GSE Plugin** which receives GSE metrics from the - GSE plugins above. It aggregates and correlates the GSE metrics to issue a global - health status for the top-level clusters like Nova, MySQL and so forth. +* The **Global Cluster GSE Plugin**, which receives GSE metrics from the + GSE plugins above. It aggregates and correlates the GSE metrics to issue a + global health status for the top-level clusters like Nova, MySQL, and others. -The health status exposed in the GSE metrics is as follow: +The health status exposed in the GSE metrics is as follows: -* *Down*: One or several primary functions of a cluster has failed or is failing. - For example, the API service for Nova or Cinder isn't accessible. -* *Critical*: One or several primary functions of a - cluster are severely degraded. The quality - of service delivered to the end-user is severely impacted. -* *Warning*: One or several primary functions of the - cluster are slightly degraded. The quality - of service delivered to the end-user is slightly +* ``Down``: One or several primary functions of a cluster has failed or is + failing. For example, the API service for Nova or Cinder is not accessible. +* ``Critical``: One or several primary functions of a cluster are severely + degraded. The quality of service delivered to the end user is severely impacted. -* *Unknown*: There is not enough data to infer the actual - health status of the cluster. -* *Okay*: None of the above was found to be true. +* ``Warning``: One or several primary functions of the cluster are slightly + degraded. The quality of service delivered to the end user is slightly + impacted. +* ``Unknown``: There is not enough data to infer the actual health status of + the cluster. +* ``Okay``: None of the above was found to be true. The AFD and GSE persisters -------------------------- -The AFD and GSE metrics are also consumed by other types -of Lua plugins called the **persisters**. +The AFD and GSE metrics are also consumed by other types of Lua plugins called +**persisters**: -* The **InfluxDB persister** transforms the GSE metrics - into InfluxDB data-points and Grafana annotations. They - are used in Grafana to graph the health status of - the OpenStack clusters. -* The **Elasticsearch persister** transforms the AFD metrics - into events that are indexed in Elasticsearch. Using Kibana, - these events can be searched to display a fault or an anomaly - that occured in the environment (not implemented yet). -* The **Nagios persister** transforms the GSE and AFD metrics - into passive checks that are sent to Nagios for alerting and - escalation. +* The **InfluxDB persister** transforms the GSE metrics into InfluxDB data + points and Grafana annotations. They are used in Grafana to graph the health + status of the OpenStack clusters. +* The **Elasticsearch persister** transforms the AFD metrics into events that + are indexed in Elasticsearch. Using Kibana, these events can be searched to + display a fault or an anomaly that occurred in the environment (not yet + implemented). +* The **Nagios persister** transforms the GSE and AFD metrics into passive + checks that are sent to Nagios for alerting and escalation. -New persisters could be created easely to feed other -systems with the operational insight contained in the -AFD and GSE metrics. 
+New persisters can be easily created to feed other systems with the
+operational insight contained in the AFD and GSE metrics.
 
 .. _alarm_configuration:
 
 Alarms configuration
 --------------------
 
-StackLight comes with a predefined set of alarm rules.
-We have tried to make these rules as comprehensive and relevant
-as possible, but your mileage may vary depending on the specifics of
-your OpenStack environment and monitoring requirements.
-Therefore, it is possible to modify those predefined rules
-and create new ones.
-To do so, you will be required to modify the
-``/etc/hiera/override/alarming.yaml`` file
-and apply the :ref:`Puppet manifest `
-that will dynamically generate Lua plugins known as
-the AFD Plugins which are the actuators of the alarm rules.
-But before you proceed, you need to understand the structure
-of that file.
+StackLight comes with a predefined set of alarm rules. We have tried to make
+these rules as comprehensive and relevant as possible, but your mileage may
+vary depending on the specifics of your OpenStack environment and monitoring
+requirements. Therefore, it is possible to modify those predefined rules and
+create new ones. To do so, modify the ``/etc/hiera/override/alarming.yaml``
+file and apply the :ref:`Puppet manifest ` that will dynamically
+generate Lua plugins, known as the AFD Plugins, which are the actuators of the
+alarm rules. But before you proceed, verify that you understand the structure
+of that file.
 
 .. _alarm_structure:
 
 Alarm structure
 +++++++++++++++
 
-An alarm rule is defined declaratively using the YAML syntax
-as shown in the example below::
+An alarm rule is defined declaratively using the YAML syntax. For example::
 
    name: 'fs-warning'
    description: 'Filesystem free space is low'
@@ -180,7 +171,7 @@ as shown in the example below::
 
| logical_operator
|   Type: Enum('and' | '&&' | 'or' | '||')
-|   The conjonction relation for the alarm rules.
+|   The conjunction relation for the alarm rules
 
| metric
|   Type: unicode
@@ -192,24 +183,25 @@ as shown in the example below::
 
| fields
|   Type: list
-|   List of field name / value pairs (a.k.a dimensions) used to select
-    a particular device for the metric such as a network interface name or file
-    system mount point. If value is specified as an empty string (""), then the rule
-    is applied to all the aggregated values for the specified field name. For example
-    the file system mount point.
-    If value is specified as the '*' wildcard character,
-    then the rule is applied to each of the metrics matching the metric name and field name.
-    For example, the alarm definition sample given above would run the rule
-    for each of the file system mount points associated with the *fs_space_percent_free* metric.
+|   List of field name / value pairs, also known as dimensions, used to select
+    a particular device for the metric, such as a network interface name or
+    file system mount point. If the value is specified as an empty string (""),
+    then the rule is applied to all the aggregated values for the specified
+    field name. For example, the file system mount point. If the value is
+    specified as the '*' wildcard character, then the rule is applied to each
+    of the metrics matching the metric name and field name. For example, the
+    alarm definition sample given above would run the rule for each of the
+    file system mount points associated with the *fs_space_percent_free*
+    metric.
| window | Type: integer -| The in memory time-series analysis window in seconds +| The in-memory time-series analysis window in seconds | periods | Type: integer -| The number of prior time-series analysis window to compare the window with (this is -| not implemented yet) +| The number of prior time-series analysis window to compare the window with +| (this is not implemented yet). | function | Type: enum('last' | 'min' | 'max' | 'sum' | 'count' | 'avg' | 'median' | 'mode' | 'roc' | 'mww' | 'mww_nonparametric') @@ -232,46 +224,49 @@ as shown in the example below:: | returns the value that occurs most often in all the values | (not implemented yet) | roc: -| The 'roc' function detects a significant rate - of change when comparing current metrics values with historical data. - To achieve this, it computes the average of the values in the current window, - and the average of the values in the window before the current window and - compare the difference against the standard deviation of the - historical window. The function returns true if the difference +| The 'roc' function detects a significant rate of change when comparing + current metrics values with historical data. To achieve this, it + computes the average of the values in the current window and the + average of the values in the window before the current window and + compares the difference against the standard deviation of the + historical window. The function returns ``true`` if the difference exceeds the standard deviation multiplied by the 'threshold' value. This function uses the rate of change algorithm already available in the - anomaly detection module of Heka. It can only be applied on normal - distributions. - With an alarm rule using the 'roc' function, the 'window' parameter - specifies the duration in seconds of the current window and the 'periods' - parameter specifies the number of windows used for the historical data. - You need at least one period and so, the 'periods' parameter must not be zero. - If you choose a period of 'p', the function will compute the rate of - change using an historical data window of ('p' * window) seconds. - For example, if you specify in the alarm rule: + anomaly detection module of Heka. It can only be applied to normal + distributions. With an alarm rule using the 'roc' function, the + 'window' parameter specifies the duration in seconds of the current + window, and the 'periods' parameter specifies the number of windows + used for the historical data. You need at least one period and the + 'periods' parameter must not be zero. If you choose a period of 'p', + the function will compute the rate of change using a historical data + window of ('p' * window) seconds. For example, if you specify the + following in the alarm rule: | | window = 60 | periods = 3 | threshold = 1.5 | -| The function will store in a circular buffer the value of the metrics +| the function will store in a circular buffer the value of the metrics received during the last 300 seconds (5 minutes) where: | | Current window (CW) = 60 sec | Previous window (PW) = 60 sec | Historical window (HW) = 180 sec | -| And apply the following formula: +| and apply the following formula: | | abs(avg(CW) - avg(PW)) > std(HW) * 1.5 ? 
true : false | mww: -| returns the result (true, false) of the Mann-Whitney-Wilcoxon test function - of Heka that can be used only with normal distributions (not implemented yet) +| returns the result (true, false) of the Mann-Whitney-Wilcoxon test + function of Heka that can be used only with normal distributions (not + implemented yet) | mww-nonparametric: -| returns the result (true, false) of the Mann-Whitney-Wilcoxon - test function of Heka that can be used with non-normal distributions (not implemented yet) +| returns the result (true, false) of the Mann-Whitney-Wilcoxon test + function of Heka that can be used with non-normal distributions (not + implemented yet) | diff: -| returns the difference between the last value and the first value of all the values +| returns the difference between the last value and the first value of + all the values | threshold | Type: float @@ -281,15 +276,13 @@ as shown in the example below:: Modify or create an alarm +++++++++++++++++++++++++ -To modify (or create) an alarm, you need to edit the -``/etc/hiera/override/alarming.yaml`` file. -This file has four sections: +To modify or create an alarm, edit the ``/etc/hiera/override/alarming.yaml`` +file. This file has the following sections: -1. The *alarms* section contains a global list of alarms that - are executed by the Collectors. These alarms are global to - the LMA toolchain and should be kept identical - on all nodes of the OpenStack environment. - Here is another example of the definition of an alarm:: +#. The ``alarms`` section contains a global list of alarms that are executed + by the Collectors. These alarms are global to the LMA toolchain and should + be kept identical on all nodes of the OpenStack environment. The following + is another example of the definition of an alarm:: alarms: - name: 'cpu-critical-controller' @@ -312,30 +305,29 @@ This file has four sections: periods: 0 function: avg - This alarm is called 'cpu-critical-controller'. - It says that CPU activity is critical (severity: 'critical') - if any of the rules in the alarm definition evaluates to true. + This alarm is called 'cpu-critical-controller'. It says that CPU activity + is critical (severity: 'critical') if any of the rules in the alarm + definition evaluate to true. - The rule says that the alarm - will evaluate to 'true' if the value of the metric *cpu_idle* - has been in average (function: avg) below or equal + The rule says that the alarm will evaluate to 'true' if the value of the + metric ``cpu_idle`` has been in average (function: avg), below or equal (relational_operator: <=) to 5 for the last 5 minutes (window: 120). OR (logical_operator: 'or') - If the value of the metric **cpu_wait** has been in average - (function: avg) superior or equal (relational_operator: >=) to 35 - for the last 5 minutes (window: 120) + If the value of the metric **cpu_wait** has been in average (function: avg), + superior or equal (relational_operator: >=) to 35 for the last 5 minutes + (window: 120) Note that these metrics are expressed in percentage. - What alarms are executed on which node depends on - the mapping between the alarm definition and the - definition of a cluster as described in the following sections. + What alarms are executed on which node depends on the mapping between the + alarm definition and the definition of a cluster as described in the + following sections. -2. The *node_cluster_roles* section defines the mapping between - the internal definition of a cluster of nodes and one or - several Fuel roles. 
For example:: +#. The ``node_cluster_roles`` section defines the mapping between the internal + definition of a cluster of nodes and one or several Fuel roles. + For example:: node_cluster_roles: controller: ['primary-controller', 'controller'] @@ -343,22 +335,19 @@ This file has four sections: storage: ['cinder', 'ceph-osd'] [ ... ] - Creates a mapping between the 'primary-controller' - and 'controller' Fuel roles and the internal defintion of a cluster - of nodes called 'controller'. - Likewise, the internal definition of a cluster of nodes called - 'storage' is mapped to the 'cinder' and 'ceph-osd' Fuel roles. - The internal definition of a cluster of nodes is used to assign - the alarms to the relevant category of nodes. - This mapping is also used to configure the **passive checks** - in Nagios. This is the reason why, it is criticaly important - to keep the exact same copy of ``/etc/hiera/override/alarming.yaml`` - across all the nodes of the OpenStack environment including the - node(s) where Nagios is installed. + Creates a mapping between the 'primary-controller' and 'controller' Fuel + roles, and the internal definition of a cluster of nodes called 'controller'. + Likewise, the internal definition of a cluster of nodes called 'storage' is + mapped to the 'cinder' and 'ceph-osd' Fuel roles. The internal definition + of a cluster of nodes is used to assign the alarms to the relevant category + of nodes. This mapping is also used to configure the **passive checks** + in Nagios. Therefore, it is critically important to keep exactly the same + copy of ``/etc/hiera/override/alarming.yaml`` across all nodes of the + OpenStack environment including the node(s) where Nagios is installed. -3. The *service_cluster_roles* section defines the mapping between - the internal definition of a cluster of services and one or - several Fuel roles. For example:: +#. The ``service_cluster_roles`` section defines the mapping between the + internal definition of a cluster of services and one or several Fuel roles. + For example:: service_cluster_roles: rabbitmq: ['primary-controller', 'controller'] @@ -366,18 +355,17 @@ This file has four sections: elasticsearch: ['primary-elasticsearch_kibana', 'elasticsearch_kibana'] [ ... ] - Creates a mapping between the 'primary-controller' - and 'controller' Fuel roles and the internal defintion of a cluster - of services called 'rabbitmq'. + Creates a mapping between the 'primary-controller' and 'controller' Fuel + roles, and the internal definition of a cluster of services called 'rabbitmq'. Likewise, the internal definition of a cluster of services called - 'elasticsearch' is mapped to the 'primary-elasticsearch_kibana' - and 'elasticsearch_kibana' Fuel roles. - As for the clusters of nodes, the internal definition of a cluster - of services is used to assign the alarns to the relevant category of services. + 'elasticsearch' is mapped to the 'primary-elasticsearch_kibana' and + 'elasticsearch_kibana' Fuel roles. As for the clusters of nodes, the + internal definition of a cluster of services is used to assign the alarms + to the relevant category of services. -4. The *node_cluster_alarms* section defines the mapping between - the internal definition of a cluster of nodes and the alarms that - are assigned to that category of nodes. For example:: +#. The ``node_cluster_alarms`` section defines the mapping between the + internal definition of a cluster of nodes and the alarms that are assigned + to that category of nodes. 
For example:: node_cluster_alarms: controller: @@ -385,121 +373,105 @@ This file has four sections: root-fs: ['root-fs-critical', 'root-fs-warning'] log-fs: ['log-fs-critical', 'log-fs-warning'] - Creates three alarm groups for the cluster of nodes called - 'controller'. + Creates three alarm groups for the cluster of nodes called 'controller': - * The *cpu* alarm group is mapped to two alarms defined in the - *alarms* section known as the 'cpu-critical-controller' and - 'cpu-warning-controller' alarms. Those alarms monitor the - CPU on the controller nodes. Note that the order matters - here since the first alarm which evaluates to 'true' stops - the evaluation. Hence, it is important to start the list - with the most critical alarms. - * The *root-fs* alarm group is mapped to two alarms defined - in the *alarms* section known as the 'root-fs-critical' - and 'root-fs-warning' alarms. Those alarms monitor the - root file system on the controller nodes. - * The *log-fs* alarm group is mapped to two alarms defined - in the *alarms* section known as the 'log-fs-critical' and - 'log-fs-warning' alarms. Those alarms monitor the file - system where the logs are created on the controller - nodes. + * The *cpu* alarm group is mapped to two alarms defined in the ``alarms`` + section known as the 'cpu-critical-controller' and + 'cpu-warning-controller' alarms. These alarms monitor the CPU on the + controller nodes. The order matters here since the first alarm that + evaluates to 'true' stops the evaluation. Therefore, it is important + to start the list with the most critical alarms. + * The *root-fs* alarm group is mapped to two alarms defined in the + ``alarms`` section known as the 'root-fs-critical' and 'root-fs-warning' + alarms. These alarms monitor the root file system on the controller nodes. + * The *log-fs* alarm group is mapped to two alarms defined in the ``alarms`` + section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These + alarms monitor the file system where the logs are created on the + controller nodes. - .. note:: An *alarm group* is a mere implementaton artifact - (although it has several functional usefulness) that is - primarily used to distribute the alarms evaluation workload - across several Lua plugins. Since the Lua plugins - runtime is sandboxed within Heka, it is preferable to run - smaller sets of alarms in different plugins rather than a - large set of alarms in a single plugin. This is to avoid - having alarms evaluation plugins shutdown by Heka. - Furthermore, the alarm groups are used to identify what is - called a *source*. A *source* is a tuple in which we associate - a cluster with an alarm group. For example the tuple ['controller', 'cpu'] - is a *source*. It associates a 'controller' cluster with the 'cpu' - alarm group. The tuple ['controller', 'root-fs'] is another *source* - example. The *source* is used by the GSE Plugins to remember the - AFD metrics it has received. If a GSE Plugin stops receiving - AFD metrics it used to get, then the GSE Plugin will - infer that the health status for the cluster associated - with the source is *Unknown*. + .. note:: An *alarm group* is a mere implementation artifact (although it + has functional value) that is primarily used to distribute the alarms + evaluation workload across several Lua plugins. Since the Lua plugins + runtime is sandboxed within Heka, it is preferable to run smaller sets + of alarms in different plugins rather than a large set of alarms in a + single plugin. 
This is to avoid having alarms evaluation plugins + shut down by Heka. Furthermore, the alarm groups are used to identify + what is called a *source*. A *source* is a tuple in which we associate + a cluster with an alarm group. For example, the tuple + ['controller', 'cpu'] is a *source*. It associates a 'controller' + cluster with the 'cpu' alarm group. The tuple ['controller', 'root-fs'] + is another *source* example. The *source* is used by the GSE Plugins to + remember the AFD metrics it has received. If a GSE Plugin stops receiving + AFD metrics it used to get, then the GSE Plugin infers that the health + status of the cluster associated with the source is *Unknown*. - This is evaluated every *ticker-interval*. By default, - the *ticker interval* for the GSE Plugins is set to - 10 seconds. + This is evaluated every *ticker-interval*. By default, the + *ticker interval* for the GSE Plugins is set to 10 seconds. .. _aggreg_correl_config: Aggregation and correlation configuration ----------------------------------------- -StackLight comes with a predefined set of aggregation rules and -correlation policies. As for the alarms, it is possible to -create new aggregation rules and correlation policies or modify -existing ones. To do so, you will be required to modify the -``/etc/hiera/override/gse_filters.yaml`` file -and apply the :ref:`Puppet manifest ` -that will generate Lua plugins known as -the GSE Plugins which are the actuators of these aggregation -rules and correlation policies. -But before you proceed, you need to undestand the structure -of that file. +StackLight comes with a predefined set of aggregation rules and correlation +policies. However, you can create new aggregation rules and correlation +policies or modify the existing ones. To do so, modify the ``/etc/hiera/override/gse_filters.yaml`` file and apply the +:ref:`Puppet manifest ` that will generate Lua plugins known as +the GSE Plugins, which are the actuators of these aggregation rules and +correlation policies. But before you proceed, verify that you understand the +structure of that file. -.. note:: As for ``/etc/hiera/override/alarming.yaml``, - it is criticaly important to keep the exact same copy of - ``/etc/hiera/override/gse_filters.yaml`` - across all the nodes of the OpenStack environment including the - node(s) where Nagios is installed. +.. note:: As for ``/etc/hiera/override/alarming.yaml``, it is critically + important to keep exactly the same copy of + ``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the + OpenStack environment including the node(s) where Nagios is installed. -The aggregation rules and correlation policies are defined -in the ``/etc/hiera/override/gse_filters.yaml`` configuration file. +The aggregation rules and correlation policies are defined in the ``/etc/hiera/override/gse_filters.yaml`` configuration file. -This file has four sections: +This file has the following sections: -1. The *gse_policies* section contains the :ref:`health status - correlation policies ` that apply to the node - clusters and service clusters. -2. The *gse_cluster_service* section contains the :ref:`aggregation rules - ` for the service clusters. These - aggregation rules are actuated by the Service Cluster GSE - Plugin which runs on the Aggregator. -3. The *gse_cluster_node* section contains the :ref:`aggreagion rules - ` for the node clusters. These aggregation rules - are actuated by the Node Cluster GSE Plugin which runs on the - Aggregator. -4. 
The *gse_cluster_global* section contains the :ref:`aggregation
-   rules ` for the so-called top-level clusters.
-   A global cluster is a kind of logical construct of node clusters
-   and service clusters. These aggregation rules are actuated by
-   the Global Cluster GSE Plugin which runs on the Aggregator.
+#. The ``gse_policies`` section contains the :ref:`health status correlation
+   policies ` that apply to the node clusters and service
+   clusters.
+#. The ``gse_cluster_service`` section contains the :ref:`aggregation rules
+   ` for the service clusters. These aggregation rules
+   are actuated by the Service Cluster GSE Plugin that runs on the Aggregator.
+#. The ``gse_cluster_node`` section contains the :ref:`aggregation rules
+   ` for the node clusters. These aggregation rules are
+   actuated by the Node Cluster GSE Plugin that runs on the Aggregator.
+#. The ``gse_cluster_global`` section contains the :ref:`aggregation
+   rules ` for the so-called top-level clusters. A global
+   cluster is a kind of logical construct of node clusters and service
+   clusters. These aggregation rules are actuated by the Global Cluster GSE
+   Plugin that runs on the Aggregator.
 
 .. _gse_policies:
 
 Health status policies
 ++++++++++++++++++++++
 
-The correlation logic implemented by the GSE plugins is policy-based.
-The policies define how the GSE plugins infer the health status of a
-cluster.
+The correlation logic implemented by the GSE plugins is policy-based. The
+policies define how the GSE plugins infer the health status of a cluster.
 
-By default, two policies are defined:
+By default, there are two policies:
 
-* The **highest_severity** policy defines that the cluster's status depends on the
-  member with the highest severity, typically used for a cluster of services.
-* The **majority_of_members** policy defines that the cluster is healthy as long as
-  (N+1)/2 members of the cluster are healthy. This is typically used for
-  clusters managed by Pacemaker.
+* The **highest_severity** policy defines that the cluster's status depends on
+  the member with the highest severity, typically used for a cluster of
+  services.
+* The **majority_of_members** policy defines that the cluster is healthy as
+  long as (N+1)/2 members of the cluster are healthy. This is typically used
+  for clusters managed by Pacemaker.
 
-A policy consists of a list of rules that are evaluated against the
-current status of the cluster's members. When one of the rules matches, the
-cluster's status gets the value associated with the rule and the evaluation
-stops here. The last rule of the list is usually a catch-all rule that
-defines the default status in case none of the previous rules could be matched.
+A policy consists of a list of rules that are evaluated against the current
+status of the cluster's members. When one of the rules matches, the cluster's
+status gets the value associated with the rule and the evaluation stops. The
+last rule of the list is usually a catch-all rule that defines the default
+status if none of the previous rules matches.
-A policy rule is defined as shown in the example below:: +The following example shows the policy rule definition:: # The following rule definition reads as: "the cluster's status is critical - # if more than 50% of its members are either down or criticial" + # if more than 50% of its members are either down or critical" - status: critical trigger: logical_operator: or @@ -517,7 +489,7 @@ Where | logical_operator | Type: Enum('and' | '&&' | 'or' | '||') -| The conjonction relation for the condition rules +| The conjunction relation for the condition rules | rules | Type: list @@ -543,7 +515,7 @@ Where | Type: float | The threshold value -Lets take a closer look at the policy called *highest_severity*:: +Consider the policy called *highest_severity*:: gse_policies: @@ -582,28 +554,31 @@ Lets take a closer look at the policy called *highest_severity*:: threshold: 0 - status: unknown -The policy definition reads as: +The policy definition reads as follows: -* The status of the cluster is *Down* if the status of at least one cluster's member is *Down*. +* The status of the cluster is ``Down`` if the status of at least one + cluster's member is ``Down``. -* Otherwise the status of the cluster is *Critical* if the status of at least one cluster's member is *Critical*. +* Otherwise, the status of the cluster is ``Critical`` if the status of at + least one cluster's member is ``Critical``. -* Otherwise the status of the cluster is *Warning* if the status of at least one cluster's member is *Warning*. +* Otherwise, the status of the cluster is ``Warning`` if the status of at + least one cluster's member is ``Warning``. -* Otherwise the status of the cluster is *Okay* if the status of at least one cluster's entity is *Okay*. +* Otherwise, the status of the cluster is ``Okay`` if the status of at least + one cluster's entity is *Okay*. -* Otherwise the status of the cluster is *Unknown*. +* Otherwise, the status of the cluster is ``Unknown``. .. _gse_cluster_service: Service cluster aggregation rules +++++++++++++++++++++++++++++++++ -The service cluster aggregation rules are used to designate -the members of a service cluster along with -the AFD metrics that must be taken into account to derive an -health status for the service cluster. -Here is an example of the service cluster aggregation rules:: +The service cluster aggregation rules are used to designate the members of a +service cluster along with the AFD metrics that must be taken into account to +derive a health status for the service cluster. The following is an example of +the service cluster aggregation rules:: gse_cluster_service: input_message_types: @@ -673,7 +648,7 @@ Where Service cluster definition ++++++++++++++++++++++++++ -The service clusters are defined as shown in the example below:: +The following example shows the service clusters definition:: gse_cluster_service: [...] @@ -691,36 +666,36 @@ Where | members | Type: list | The list of cluster members. - The AFD messages that are associated to the cluster when the *cluster_field* - value is equal to the cluster name and the *member_field* value is in this - list. + The AFD messages that are associated with the cluster when the + ``cluster_field`` value is equal to the cluster name and the + ``member_field`` value is in this list. | group_by | Type: Enum(member, hostname) | This parameter defines how the incoming AFD metrics are aggregated. | | member: -| aggregation by member, irrespective of the host that emitted the AFD metric. 
-| This setting is typically used for AFD metrics that are not host-centric. +| aggregation by member, irrespective of the host that emitted the AFD +| metric. This setting is typically used for AFD metrics that are not +| host-centric. | | hostname: | aggregation by hostname then by member. -| This setting is typically used for AFD metrics that are host-centric such as -| those working on filesystem or CPU usage metrics. +| This setting is typically used for AFD metrics that are host-centric, +| such as those working on the file system or CPU usage metrics. | policy: | Type: unicode -| The policy to use for computing the service cluster status. See :ref:`gse_policies` - for details. +| The policy to use for computing the service cluster status. + See :ref:`gse_policies` for details. -If we look more closely into the example above, it defines that the Service -Cluster GSE plugin resulting from those rules will emit a -*gse_service_cluster_metric* message every 10 -seconds to report the current status of the *nova-api* cluster. This -status is computed using the *afd_service_metric* metric for which -Fields[service] is 'nova-api' and Fields[source] is one of 'backends', -'endpoint' or 'http_errors'. The 'nova-api' cluster's status is computed using -the 'highest_severity' policy which means that it will be equal to the 'worst' +A closer look into the example above defines that the Service Cluster GSE +plugin resulting from those rules will emit a *gse_service_cluster_metric* +message every 10 seconds to report the current status of the *nova-api* +cluster. This status is computed using the *afd_service_metric* metric for +which Fields[service] is 'nova-api' and Fields[source] is one of 'backends', +'endpoint', or 'http_errors'. The 'nova-api' cluster's status is computed using +the 'highest_severity' policy, which means that it will be equal to the 'worst' status across all members. .. _gse_cluster_node: @@ -728,11 +703,10 @@ status across all members. Node cluster aggregation rules ++++++++++++++++++++++++++++++ -The node cluster aggregation rules are used to designate -the members of a node cluster along with -the AFD metrics that must be taken into account to derive -an health status for the node cluster. -Here is an example of the node cluster aggregation rules:: +The node cluster aggregation rules are used to designate the members of a node +cluster along with the AFD metrics that must be taken into account to derive +a health status for the node cluster. The following is an example of the node +cluster aggregation rules:: gse_cluster_node: input_message_types: @@ -804,7 +778,7 @@ Where Node cluster definition +++++++++++++++++++++++ -The node clusters are defined as shown in the example below:: +The following example shows the node clusters definition:: gse_cluster_node: [...] @@ -822,36 +796,35 @@ Where | members | Type: list | The list of cluster members. - The AFD messages are associated to the cluster when the *cluster_field* - value is equal to the cluster name and the *member_field* value is in this - list. + The AFD messages are associated to the cluster when the ``cluster_field`` + value is equal to the cluster name and the ``member_field`` value is in + this list. | group_by | Type: Enum(member, hostname) | This parameter defines how the incoming AFD metrics are aggregated. | | member: -| aggregation by member, irrespective of the host that emitted the AFD metric. -| This setting is typically used for AFD metrics that are not host-centric. 
+| aggregation by member, irrespective of the host that emitted the AFD +| metric. This setting is typically used for AFD metrics that are not +| host-centric. | | hostname: | aggregation by hostname then by member. -| This setting is typically used for AFD metrics that are host-centric such as -| those working on filesystem or CPU usage metrics. +| This setting is typically used for AFD metrics that are host-centric, +| such as those working on the file system or CPU usage metrics. | policy: | Type: unicode -| The policy to use for computing the node cluster status. See :ref:`gse_policies` - for details. +| The policy to use for computing the node cluster status. + See :ref:`gse_policies` for details. -If we look more closely into the example above, it defines that the Node -Cluster GSE plugin resulting from those rules will emit a -*gse_node_cluster_metric* message every 10 -seconds to report the current status of the *controller* cluster. This +A closer look into the example above defines that the Node Cluster GSE plugin +resulting from those rules will emit a *gse_node_cluster_metric* message every +10 seconds to report the current status of the *controller* cluster. This status is computed using the *afd_node_metric* metric for which Fields[node_role] is 'controller' and Fields[source] is one of 'cpu', -'root-fs' or 'log-fs'. The 'controller' cluster's status is computed using -the 'majority_of_members' policy which means that it will be equal to the 'majority' +'root-fs' or 'log-fs'. The 'controller' cluster's status is computed using the 'majority_of_members' policy which means that it will be equal to the 'majority' status across all members. .. _gse_cluster_global: @@ -859,23 +832,20 @@ status across all members. Top-level cluster aggregation rules +++++++++++++++++++++++++++++++++++ -The top-level agggregation rules aggregate GSE metrics from the -Service Cluster GSE Plugin and the Node Cluster GSE Plugin. -This is the last aggregation stage that issues health status -for the top-level clusters. A top-level cluster is a logical -contruct of service and node clustering. By default, we define -that the health status of Nova, as a top-level cluster, -depends on the health status of several service clusters -related to Nova and the health status of the 'controller' and -'compute' node clusters. But it can be anything. For example, you -could define a 'control-plane' top-level cluster that would -exclude the health status of the 'compute' node cluster if -you wanted to... In summary, the top-level cluster aggregation -rules are used to designate the node clusters and service -clusters members of a top-level cluster along with -the GSE metrics that must be taken into account to derive -an health status for the top-level cluster. -Here is an example of a top-level cluster aggregation rules:: +The top-level aggregation rules aggregate GSE metrics from the Service +Cluster GSE Plugin and the Node Cluster GSE Plugin. This is the last +aggregation stage that issues health status for the top-level clusters. +A top-level cluster is a logical construct of service and node clustering. +By default, we define that the health status of Nova, as a top-level cluster, +depends on the health status of several service clusters related to Nova and +the health status of the 'controller' and 'compute' node clusters. But it can +be anything. For example, you can define a 'control-plane' top-level cluster +that would exclude the health status of the 'compute' node cluster if required. 
+The top-level cluster aggregation rules are used to designate the node
+clusters and service clusters members of a top-level cluster along with the
+GSE metrics that must be taken into account to derive a health status for the
+top-level cluster. The following is an example of top-level cluster
+aggregation rules::
 
    gse_cluster_global:
      input_message_types:
@@ -954,7 +924,7 @@ Where
 
 Top-level cluster definition
 ++++++++++++++++++++++++++++
 
-The top-level clusters are defined as shown in the example below::
+The following example shows the top-level clusters definition::
 
    gse_cluster_global:
      [...]
@@ -987,15 +957,16 @@ Where
 
| members
|   Type: list
|   The list of cluster members.
-|   The GSE messages are associated to the cluster when the *member_field* value
-|   (i.e *cluster_name*) is in this list.
+|   The GSE messages are associated to the cluster when the ``member_field``
+|   value (``cluster_name``) is in this list.
 
| hints
|   Type: list
-|   The list of clusters that are indirectly associated with the top-level cluster.
-|   The GSE messages are indirectly associated to the cluster when the *member_field* value
-|   (i.e *cluster_name*) is in this list. This means that they are not used to derive
-|   the health status of the top-level cluster but as 'hints' for root cause analysis.
+|   The list of clusters that are indirectly associated with the top-level
+|   cluster. The GSE messages are indirectly associated to the cluster when
+|   the ``member_field`` value (``cluster_name``) is in this list. This means
+|   that they are not used to derive the health status of the top-level
+|   cluster but as 'hints' for root cause analysis.
 
| group_by
|   Type: Enum(member, hostname)
@@ -1004,8 +975,8 @@ Where
 
| policy:
|   Type: unicode
-|   The policy to use for computing the top-level cluster status. See :ref:`gse_policies`
-  for details.
+|   The policy to use for computing the top-level cluster status.
+  See :ref:`gse_policies` for details.
 
 .. _puppet_apply:
 
 Apply your configuration changes
 
 Once you have edited and saved your changes in
 ``/etc/hiera/override/alarmaing.yaml`` and / or
 ``/etc/hiera/override/gse_filters.yaml``,
-you need to apply the following Puppet manifest on
-all the nodes of your OpenStack
-environment (**including the node(s) where Nagios is installed**)
+apply the following Puppet manifest on all the nodes of your OpenStack
+environment **including the node(s) where Nagios is installed**
 for the changes to take effect::
 
    # puppet apply --modulepath=/etc/fuel/plugins/lma_collector-/puppet/modules:\
    /etc/puppet/modules \
-    /etc/fuel/plugins/lma_collector-/puppet/manifests/configure_afd_filters.pp
+    /etc/fuel/plugins/lma_collector-/puppet/manifests/configure_afd_filters.pp
\ No newline at end of file
diff --git a/doc/user/source/configure_plugin.rst b/doc/user/source/configure_plugin.rst
index 01c99ebb2..a2b320c5a 100644
--- a/doc/user/source/configure_plugin.rst
+++ b/doc/user/source/configure_plugin.rst
@@ -77,6 +77,10 @@ Plugin configuration
 
 .. _plugin_verification:
 
+.. raw:: latex
+
+  \pagebreak
+
 Plugin verification
 -------------------
 
diff --git a/doc/user/source/metrics/ceph.rst b/doc/user/source/metrics/ceph.rst
index 7516fe954..8b4746d97 100644
--- a/doc/user/source/metrics/ceph.rst
+++ b/doc/user/source/metrics/ceph.rst
@@ -1,108 +1,128 @@
 .. _Ceph_metrics:
 
-All Ceph metrics have a ``cluster`` field containing the name of the Ceph cluster
-(*ceph* by default).
+All Ceph metrics have a ``cluster`` field containing the name of the Ceph
+cluster (*ceph* by default).
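+
+For illustration only, the ``cluster`` field can be used as a dimension in the
+``fields`` selector of an alarm rule, as described in :ref:`alarm_configuration`.
+The following sketch shows a hypothetical entry for the ``alarms`` section of
+``/etc/hiera/override/alarming.yaml`` based on the ``ceph_health`` metric
+listed below; it is not part of the built-in alarm set, and the name,
+threshold, and window are assumptions to adapt to your own requirements::
+
+    - name: 'example-ceph-health-error'
+      description: 'Example only: the Ceph cluster reports an error state'
+      severity: 'critical'
+      trigger:
+        logical_operator: 'or'
+        rules:
+          - metric: ceph_health
+            fields:
+              cluster: 'ceph'
+            relational_operator: '>='
+            threshold: 3
+            window: 60
+            periods: 0
+            function: max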
-See `cluster monitoring`_ and `RADOS monitoring`_ for further details.
+For details, see
+`Cluster monitoring `_
+and `RADOS monitoring `_.
 
 Cluster
 ^^^^^^^
 
-* ``ceph_health``, the health status of the entire cluster where values ``1``, ``2``
-  , ``3`` represent respectively ``OK``, ``WARNING`` and ``ERROR``.
+* ``ceph_health``, the health status of the entire cluster where values
+  ``1``, ``2``, ``3`` represent ``OK``, ``WARNING`` and ``ERROR``, respectively.
 
-* ``ceph_monitor_count``, number of ceph-mon processes.
+* ``ceph_monitor_count``, the number of ceph-mon processes.
 
-* ``ceph_quorum_count``, number of ceph-mon processes participating in the
+* ``ceph_quorum_count``, the number of ceph-mon processes participating in the
   quorum.
 
 Pools
 ^^^^^
 
-* ``ceph_pool_total_avail_bytes``, total available size in bytes for all pools.
-* ``ceph_pool_total_bytes``, total number of bytes for all pools.
-* ``ceph_pool_total_number``, total number of pools.
-* ``ceph_pool_total_used_bytes``, total used size in bytes by all pools.
+* ``ceph_pool_total_avail_bytes``, the total available size in bytes for all
+  pools.
+* ``ceph_pool_total_bytes``, the total number of bytes for all pools.
+* ``ceph_pool_total_number``, the total number of pools.
+* ``ceph_pool_total_used_bytes``, the total used size in bytes by all pools.
 
-The folllowing metrics have a ``pool`` field that contains the name of the Ceph pool.
+The following metrics have a ``pool`` field that contains the name of the
+Ceph pool.
 
-* ``ceph_pool_bytes_used``, amount of data in bytes used by the pool.
-* ``ceph_pool_max_avail``, available size in bytes for the pool.
-* ``ceph_pool_objects``, number of objects in the pool.
-* ``ceph_pool_op_per_sec``, number of operations per second for the pool.
-* ``ceph_pool_pg_num``, number of placement groups for the pool.
-* ``ceph_pool_read_bytes_sec``, number of bytes read by second for the pool.
-* ``ceph_pool_size``, number of data replications for the pool.
-* ``ceph_pool_write_bytes_sec``, number of bytes written by second for the pool.
+* ``ceph_pool_bytes_used``, the amount of data in bytes used by the pool.
+* ``ceph_pool_max_avail``, the available size in bytes for the pool.
+* ``ceph_pool_objects``, the number of objects in the pool.
+* ``ceph_pool_op_per_sec``, the number of operations per second for the pool.
+* ``ceph_pool_pg_num``, the number of placement groups for the pool.
+* ``ceph_pool_read_bytes_sec``, the number of bytes read per second for the pool.
+* ``ceph_pool_size``, the number of data replications for the pool.
+* ``ceph_pool_write_bytes_sec``, the number of bytes written per second for the
+  pool.
 
 Placement Groups
 ^^^^^^^^^^^^^^^^
 
-* ``ceph_pg_bytes_avail``, available size in bytes.
-* ``ceph_pg_bytes_total``, cluster total size in bytes.
-* ``ceph_pg_bytes_used``, data stored size in bytes.
-* ``ceph_pg_data_bytes``, stored data size in bytes before it is replicated, cloned
-  or snapshotted.
-* ``ceph_pg_state``, number of placement groups in a given state. The metric
-  contains a ``state`` field whose value is ```` is a combination
+* ``ceph_pg_bytes_avail``, the available size in bytes.
+* ``ceph_pg_bytes_total``, the cluster total size in bytes.
+* ``ceph_pg_bytes_used``, the data stored size in bytes.
+* ``ceph_pg_data_bytes``, the stored data size in bytes before it is
+  replicated, cloned or snapshotted.
+* ``ceph_pg_state``, the number of placement groups in a given state.
The + metric contains a ``state`` field whose ```` value is a combination separated by ``+`` of 2 or more states of this list: ``creating``, ``active``, ``clean``, ``down``, ``replay``, ``splitting``, ``scrubbing``, ``degraded``, ``inconsistent``, ``peering``, ``repair``, ``recovering``, ``recovery_wait``, ``backfill``, ``backfill-wait``, ``backfill_toofull``, ``incomplete``, ``stale``, ``remapped``. -* ``ceph_pg_total``, total number of placement groups. +* ``ceph_pg_total``, the total number of placement groups. OSD Daemons ^^^^^^^^^^^ -* ``ceph_osd_down``, number of OSD daemons DOWN. -* ``ceph_osd_in``, number of OSD daemons IN. -* ``ceph_osd_out``, number of OSD daemons OUT. -* ``ceph_osd_up``, number of OSD daemons UP. +* ``ceph_osd_down``, the number of OSD daemons DOWN. +* ``ceph_osd_in``, the number of OSD daemons IN. +* ``ceph_osd_out``, the number of OSD daemons OUT. +* ``ceph_osd_up``, the number of OSD daemons UP. -The following metrics have an ``osd`` field that contains the OSD identifier. +The following metrics have an ``osd`` field that contains the OSD identifier: * ``ceph_osd_apply_latency``, apply latency in ms for the given OSD. * ``ceph_osd_commit_latency``, commit latency in ms for the given OSD. -* ``ceph_osd_total``, total size in bytes for the given OSD. -* ``ceph_osd_used``, data stored size in bytes for the given OSD. +* ``ceph_osd_total``, the total size in bytes for the given OSD. +* ``ceph_osd_used``, the data stored size in bytes for the given OSD. OSD Performance ^^^^^^^^^^^^^^^ All the following metrics are retrieved per OSD daemon from the corresponding -socket ``/var/run/ceph/ceph-osd..asok`` by issuing the command ``perf dump``. +``/var/run/ceph/ceph-osd..asok`` socket by issuing the :command:`perf dump` +command. All metrics have an ``osd`` field that contains the OSD identifier. -.. note:: These metrics are not collected when a node has both the ceph-osd and controller roles. +.. note:: These metrics are not collected when a node has both the ceph-osd + and controller roles. -See `OSD performance counters`_ for further details. +For details, see `OSD performance counters `_. -* ``ceph_perf_osd_op``, number of client operations. -* ``ceph_perf_osd_op_in_bytes``, number of bytes received from clients for write operations. -* ``ceph_perf_osd_op_latency``, average latency in ms for client operations (including queue time). -* ``ceph_perf_osd_op_out_bytes``, number of bytes sent to clients for read operations. -* ``ceph_perf_osd_op_process_latency``, average latency in ms for client operations (excluding queue time). -* ``ceph_perf_osd_op_r``, number of client read operations. -* ``ceph_perf_osd_op_r_latency``, average latency in ms for read operation (including queue time). -* ``ceph_perf_osd_op_r_out_bytes``, number of bytes sent to clients for read operations. -* ``ceph_perf_osd_op_r_process_latency``, average latency in ms for read operation (excluding queue time). -* ``ceph_perf_osd_op_rw``, number of client read-modify-write operations. -* ``ceph_perf_osd_op_rw_in_bytes``, number of bytes per second received from clients for read-modify-write operations. -* ``ceph_perf_osd_op_rw_latency``, average latency in ms for read-modify-write operations (including queue time). -* ``ceph_perf_osd_op_rw_out_bytes``, number of bytes per second sent to clients for read-modify-write operations. -* ``ceph_perf_osd_op_rw_process_latency``, average latency in ms for read-modify-write operations (excluding queue time). 
-* ``ceph_perf_osd_op_rw_rlat``, average latency in ms for read-modify-write operations with readable/applied. -* ``ceph_perf_osd_op_w``, number of client write operations. -* ``ceph_perf_osd_op_wip``, number of replication operations currently being processed (primary). -* ``ceph_perf_osd_op_w_in_bytes``, number of bytes received from clients for write operations. -* ``ceph_perf_osd_op_w_latency``, average latency in ms for write operations (including queue time). -* ``ceph_perf_osd_op_w_process_latency``, average latency in ms for write operation (excluding queue time). -* ``ceph_perf_osd_op_w_rlat``, average latency in ms for write operations with readable/applied. -* ``ceph_perf_osd_recovery_ops``, number of recovery operations in progress. - -.. _cluster monitoring: http://docs.ceph.com/docs/master/rados/operations/monitoring/ -.. _RADOS monitoring: http://docs.ceph.com/docs/master/rados/operations/monitoring-osd-pg/ -.. _OSD performance counters: http://ceph.com/docs/firefly/dev/perf_counters/ +* ``ceph_perf_osd_op``, the number of client operations. +* ``ceph_perf_osd_op_in_bytes``, the number of bytes received from clients for + write operations. +* ``ceph_perf_osd_op_latency``, the average latency in ms for client operations + (including queue time). +* ``ceph_perf_osd_op_out_bytes``, the number of bytes sent to clients for read + operations. +* ``ceph_perf_osd_op_process_latency``, the average latency in ms for client + operations (excluding queue time). +* ``ceph_perf_osd_op_r``, the number of client read operations. +* ``ceph_perf_osd_op_r_latency``, the average latency in ms for read operation + (including queue time). +* ``ceph_perf_osd_op_r_out_bytes``, the number of bytes sent to clients for + read operations. +* ``ceph_perf_osd_op_r_process_latency``, the average latency in ms for read + operation (excluding queue time). +* ``ceph_perf_osd_op_rw``, the number of client read-modify-write operations. +* ``ceph_perf_osd_op_rw_in_bytes``, the number of bytes per second received + from clients for read-modify-write operations. +* ``ceph_perf_osd_op_rw_latency``, the average latency in ms for + read-modify-write operations (including queue time). +* ``ceph_perf_osd_op_rw_out_bytes``, the number of bytes per second sent to + clients for read-modify-write operations. +* ``ceph_perf_osd_op_rw_process_latency``, the average latency in ms for + read-modify-write operations (excluding queue time). +* ``ceph_perf_osd_op_rw_rlat``, the average latency in ms for read-modify-write + operations with readable/applied. +* ``ceph_perf_osd_op_w``, the number of client write operations. +* ``ceph_perf_osd_op_wip``, the number of replication operations currently + being processed (primary). +* ``ceph_perf_osd_op_w_in_bytes``, the number of bytes received from clients + for write operations. +* ``ceph_perf_osd_op_w_latency``, the average latency in ms for write + operations (including queue time). +* ``ceph_perf_osd_op_w_process_latency``, the average latency in ms for write + operation (excluding queue time). +* ``ceph_perf_osd_op_w_rlat``, the average latency in ms for write operations + with readable/applied. +* ``ceph_perf_osd_recovery_ops``, the number of recovery operations in progress. \ No newline at end of file diff --git a/doc/user/source/metrics/clusters.rst b/doc/user/source/metrics/clusters.rst index de49abf18..5b40ef2fd 100644 --- a/doc/user/source/metrics/clusters.rst +++ b/doc/user/source/metrics/clusters.rst @@ -3,24 +3,23 @@ The cluster metrics are emitted by the GSE plugins. 
For details, see :ref:`Configuring alarms `.
-* ``cluster_node_status``, the status of the node cluster.
- The metric contains a ``cluster_name`` field that identifies the node cluster.
+* ``cluster_node_status``, the status of the node cluster. The metric contains
+ a ``cluster_name`` field that identifies the node cluster.
-* ``cluster_service_status``, the status of the service cluster.
- The metric contains a ``cluster_name`` field that identifies the service cluster.
-
-* ``cluster_status``, the status of the global cluster.
- The metric contains a ``cluster_name`` field that identifies the global cluster.
+* ``cluster_service_status``, the status of the service cluster. The metric
+ contains a ``cluster_name`` field that identifies the service cluster.
+* ``cluster_status``, the status of the global cluster. The metric contains a
+ ``cluster_name`` field that identifies the global cluster.
The supported values for these metrics are:
-* `0` for the *Okay* status.
+* ``0`` for the *Okay* status.
-* `1` for the *Warning* status.
+* ``1`` for the *Warning* status.
-* `2` for the *Unknown* status.
+* ``2`` for the *Unknown* status.
-* `3` for the *Critical* status.
+* ``3`` for the *Critical* status.
-* `4` for the *Down* status.
+* ``4`` for the *Down* status.
\ No newline at end of file
diff --git a/doc/user/source/metrics/elasticsearch.rst b/doc/user/source/metrics/elasticsearch.rst
index 33a5d0e3d..b7168e1cc 100644
--- a/doc/user/source/metrics/elasticsearch.rst
+++ b/doc/user/source/metrics/elasticsearch.rst
@@ -1,20 +1,19 @@
.. _Elasticsearch:
The following metrics represent the simple status on the health of the cluster.
-See `cluster health`_ for further details.
+For details, see `Cluster health <https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster-health.html>`_.
* ``elasticsearch_cluster_active_primary_shards``, the number of active primary shards.
* ``elasticsearch_cluster_active_shards``, the number of active shards.
* ``elasticsearch_cluster_health``, the health status of the entire cluster
- where values ``1``, ``2`` , ``3`` represent respectively ``green``,
- ``yellow`` and ``red``. The ``red`` status may also be reported when the
- Elasticsearch API returns an unexpected result (network failure for instance).
+ where values ``1``, ``2``, ``3`` represent ``green``, ``yellow`` and
+ ``red``, respectively. The ``red`` status may also be reported when the
+ Elasticsearch API returns an unexpected result, for example, a network
+ failure.
* ``elasticsearch_cluster_initializing_shards``, the number of initializing shards.
* ``elasticsearch_cluster_number_of_nodes``, the number of nodes in the cluster.
* ``elasticsearch_cluster_number_of_pending_tasks``, the number of pending tasks.
* ``elasticsearch_cluster_relocating_shards``, the number of relocating shards.
-* ``elasticsearch_cluster_unassigned_shards``, the number of unassigned shards.
-
-.. _cluster health: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster-health.html
+* ``elasticsearch_cluster_unassigned_shards``, the number of unassigned shards.
\ No newline at end of file
diff --git a/doc/user/source/metrics/haproxy.rst b/doc/user/source/metrics/haproxy.rst
index f560aea46..4fc62477e 100644
--- a/doc/user/source/metrics/haproxy.rst
+++ b/doc/user/source/metrics/haproxy.rst
@@ -1,6 +1,6 @@ ..
_haproxy_metrics: -``frontend`` and ``backend`` field values can be: +The ``frontend`` and ``backend`` field values can be as follows: * cinder-api * glance-api @@ -35,7 +35,8 @@ Server Frontends ^^^^^^^^^ -The following metrics have a ``frontend`` field that contains the name of the frontend server. +The following metrics have a ``frontend`` field that contains the name of the +front-end server: * ``haproxy_frontend_bytes_in``, the number of bytes received by the frontend. * ``haproxy_frontend_bytes_out``, the number of bytes transmitted by the frontend. @@ -55,25 +56,33 @@ Backends ^^^^^^^^ .. _haproxy_backend_metric: -The following metrics have a ``backend`` field that contains the name of the backend server. +The following metrics have a ``backend`` field that contains the name of the +back-end server: -* ``haproxy_backend_bytes_in``, the number of bytes received by the backend. -* ``haproxy_backend_bytes_out``, the number of bytes transmitted by the backend. +* ``haproxy_backend_bytes_in``, the number of bytes received by the back end. +* ``haproxy_backend_bytes_out``, the number of bytes transmitted by the back end. * ``haproxy_backend_denied_requests``, the number of denied requests. * ``haproxy_backend_denied_responses``, the number of denied responses. -* ``haproxy_backend_downtime``, the total downtime in second. +* ``haproxy_backend_downtime``, the total downtime in seconds. * ``haproxy_backend_error_connection``, the number of error connections. * ``haproxy_backend_error_responses``, the number of error responses. * ``haproxy_backend_queue_current``, the number of requests in queue. -* ``haproxy_backend_redistributed``, the number of times a request was redispatched to another server. +* ``haproxy_backend_redistributed``, the number of times a request was + redispatched to another server. * ``haproxy_backend_response_1xx``, the number of HTTP responses with 1xx code. * ``haproxy_backend_response_2xx``, the number of HTTP responses with 2xx code. * ``haproxy_backend_response_3xx``, the number of HTTP responses with 3xx code. * ``haproxy_backend_response_4xx``, the number of HTTP responses with 4xx code. * ``haproxy_backend_response_5xx``, the number of HTTP responses with 5xx code. -* ``haproxy_backend_response_other``, the number of HTTP responses with other code. -* ``haproxy_backend_retries``, the number of times a connection to a server was retried. -* ``haproxy_backend_servers``, the count of servers grouped by state. This metric has an additional ``state`` field that contains the state of the backends (either 'down' or 'up'). +* ``haproxy_backend_response_other``, the number of HTTP responses with other + code. +* ``haproxy_backend_retries``, the number of times a connection to a server + was retried. +* ``haproxy_backend_servers``, the count of servers grouped by state. This + metric has an additional ``state`` field that contains the state of the + back ends (either 'down' or 'up'). * ``haproxy_backend_session_current``, the number of current sessions. * ``haproxy_backend_session_total``, the cumulative number of sessions. -* ``haproxy_backend_status``, the global backend status where values ``0`` and ``1`` represent respectively ``DOWN`` (all backends are down) and ``UP`` (at least one backend is up). +* ``haproxy_backend_status``, the global back-end status where values ``0`` + and ``1`` represent, respectively, ``DOWN`` (all back ends are down) and ``UP`` + (at least one back end is up). 
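As a minimal illustration of the ``haproxy_backend_status`` aggregation described above (the function name and input format below are assumptions made for the example, not the actual StackLight collector code), the status could be derived from the per-server states as follows::

    # Illustrative sketch only: derive the global back-end status (0 = DOWN,
    # 1 = UP) from the states reported for the individual back-end servers.
    def backend_status(server_states):
        """Return 1 (UP) if at least one server is 'up', otherwise 0 (DOWN)."""
        return 1 if any(state == 'up' for state in server_states) else 0

    # Two servers down and one up: the back end is still reported as UP (1).
    print(backend_status(['down', 'down', 'up']))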
diff --git a/doc/user/source/metrics/influxdb.rst b/doc/user/source/metrics/influxdb.rst
index ebacc4fd9..81bc1acd8 100644
--- a/doc/user/source/metrics/influxdb.rst
+++ b/doc/user/source/metrics/influxdb.rst
@@ -1,37 +1,47 @@
.. InfluxDB:
-The following metrics are extracted from the output of ``show stats`` command.
-The values are reset to zero when InfluxDB is restarted.
+The following metrics are extracted from the output of the :command:`show stats`
+command. The values are reset to zero when InfluxDB is restarted.
cluster
^^^^^^^
-These metrics are only available if there are more than one node in the cluster.
+The following metrics are only available if there is more than one node in the
+cluster:
-* ``influxdb_cluster_write_shard_points_requests``, the number of requests for writing a time series points to a shard.
-* ``influxdb_cluster_write_shard_requests``, the number of requests for writing to a shard.
+* ``influxdb_cluster_write_shard_points_requests``, the number of requests for
+ writing time series points to a shard.
+* ``influxdb_cluster_write_shard_requests``, the number of requests for writing
+ to a shard.
httpd
^^^^^
-* ``influxdb_httpd_failed_auths``, the number of times failed authentications.
+* ``influxdb_httpd_failed_auths``, the number of failed authentications.
* ``influxdb_httpd_ping_requests``, the number of ping requests.
* ``influxdb_httpd_query_requests``, the number of query requests received.
-* ``influxdb_httpd_query_response_bytes``, the number of bytes returned to the client.
+* ``influxdb_httpd_query_response_bytes``, the number of bytes returned to the
+ client.
* ``influxdb_httpd_requests``, the number of requests received.
* ``influxdb_httpd_write_points_ok``, the number of points successfully written.
-* ``influxdb_httpd_write_request_bytes``, the number of bytes received for write requests.
+* ``influxdb_httpd_write_request_bytes``, the number of bytes received for
+ write requests.
* ``influxdb_httpd_write_requests``, the number of write requests received.
write
^^^^^
-* ``influxdb_write_local_point_requests``, the number of write points requests from the local data node.
+* ``influxdb_write_local_point_requests``, the number of write points requests
+ from the local data node.
* ``influxdb_write_ok``, the number of successful writes of consistency level.
-* ``influxdb_write_point_requests``, the number of write points requests across all data nodes.
-* ``influxdb_write_remote_point_requests``, the number of write points requests to remote data nodes.
-* ``influxdb_write_requests``, the number of write requests across all data nodes.
-* ``influxdb_write_sub_ok``, the number of successful points send to subscriptions.
+* ``influxdb_write_point_requests``, the number of write points requests across
+ all data nodes.
+* ``influxdb_write_remote_point_requests``, the number of write points requests
+ to remote data nodes.
+* ``influxdb_write_requests``, the number of write requests across all data
+ nodes.
+* ``influxdb_write_sub_ok``, the number of successful points sent to
+ subscriptions.
runtime
^^^^^^^
@@ -41,11 +51,12 @@ runtime
* ``influxdb_heap_idle``, the number of bytes in idle spans.
* ``influxdb_heap_in_use``, the number of bytes in non-idle spans.
* ``influxdb_heap_objects``, the total number of allocated objects.
-* ``influxdb_heap_released``, the number of bytes released to the operating system.
+* ``influxdb_heap_released``, the number of bytes released to the operating
+ system.
* ``influxdb_heap_system``, the number of bytes obtained from the system. * ``influxdb_memory_alloc``, the number of bytes allocated and not yet freed. * ``influxdb_memory_frees``, the number of free operations. * ``influxdb_memory_lookups``, the number of pointer lookups. * ``influxdb_memory_mallocs``, the number of malloc operations. * ``influxdb_memory_system``, the number of bytes obtained from the system. -* ``influxdb_memory_total_alloc``, the number of bytes allocated (even if freed). +* ``influxdb_memory_total_alloc``, the number of bytes allocated (even if freed). \ No newline at end of file diff --git a/doc/user/source/metrics/libvirt.rst b/doc/user/source/metrics/libvirt.rst index 2fb366ecc..24cae1d7f 100644 --- a/doc/user/source/metrics/libvirt.rst +++ b/doc/user/source/metrics/libvirt.rst @@ -1,6 +1,6 @@ .. _libvirt-metrics: -Every metric contains an ``instance_id`` field which is the UUID of the +Every metric contains an ``instance_id`` field, which is the UUID of the instance for the Nova service. CPU @@ -17,7 +17,7 @@ Disk ^^^^ Metrics have a ``device`` field that contains the virtual disk device to which -the metric applies (eg 'vda', 'vdb' and so on). +the metric applies. For example, 'vda', 'vdb', and others. * ``virt_disk_octets_read``, the number of octets (bytes) read per second. @@ -37,7 +37,7 @@ Network ^^^^^^^ Metrics have an ``interface`` field that contains the interface name to which -the metric applies (eg 'tap0dc043a6-dd', 'tap769b123a-2e' and so on). +the metric applies. For example, 'tap0dc043a6-dd', 'tap769b123a-2e', and others. * ``virt_if_dropped_rx``, the number of dropped packets per second when receiving from the interface. @@ -61,4 +61,4 @@ the metric applies (eg 'tap0dc043a6-dd', 'tap769b123a-2e' and so on). interface. * ``virt_if_packets_tx``, the number of packets transmitted per second by the - interface. + interface. \ No newline at end of file diff --git a/doc/user/source/metrics/lma.rst b/doc/user/source/metrics/lma.rst index dcf546493..6fd867a47 100644 --- a/doc/user/source/metrics/lma.rst +++ b/doc/user/source/metrics/lma.rst @@ -3,49 +3,67 @@ System ^^^^^^ -Metrics have a ``service`` field with the name of the service it applies to. Values can be: hekad, collectd, influxd, grafana-server or elasticsearch. +The metrics have a ``service`` field with the name of the service it applies +to. The values can be: ``hekad``, ``collectd``, ``influxd``, ``grafana-server`` +or ``elasticsearch``. -* ``lma_components_count_processes``, number of processes currently running. -* ``lma_components_count_threads``, number of threads currently running. -* ``lma_components_cputime_syst``, percentage of CPU time spent in system mode by the service. - It can be greater than 100% when the node has more than one CPU. -* ``lma_components_cputime_user``, percentage of CPU time spent in user mode by the service. - It can be greater than 100% when the node has more than one CPU. -* ``lma_components_disk_bytes_read``, number of bytes read from disk(s) per second. -* ``lma_components_disk_bytes_write``, number of bytes written to disk(s) per second. -* ``lma_components_disk_ops_read``, number of read operations from disk(s) per second. -* ``lma_components_disk_ops_write``, number of write operations to disk(s) per second. -* ``lma_components_memory_code``, physical memory devoted to executable code (bytes). -* ``lma_components_memory_data``, physical memory devoted to other than executable code (bytes). 
-* ``lma_components_memory_rss``, non-swapped physical memory used (bytes). -* ``lma_components_memory_vm``, virtual memory size (bytes). +* ``lma_components_count_processes``, the number of processes currently running. +* ``lma_components_count_threads``, the number of threads currently running. +* ``lma_components_cputime_syst``, the percentage of CPU time spent in system + mode by the service. It can be greater than 100% when the node has more than + one CPU. +* ``lma_components_cputime_user``, the percentage of CPU time spent in user + mode by the service. It can be greater than 100% when the node has more than + one CPU. +* ``lma_components_disk_bytes_read``, the number of bytes read from disk(s) per + second. +* ``lma_components_disk_bytes_write``, the number of bytes written to disk(s) + per second. +* ``lma_components_disk_ops_read``, the number of read operations from disk(s) + per second. +* ``lma_components_disk_ops_write``, the number of write operations to disk(s) + per second. +* ``lma_components_memory_code``, the physical memory devoted to executable code + in bytes. +* ``lma_components_memory_data``, the physical memory devoted to other than + executable code in bytes. +* ``lma_components_memory_rss``, the non-swapped physical memory used in bytes. +* ``lma_components_memory_vm``, the virtual memory size in bytes. * ``lma_components_pagefaults_majflt``, major page faults per second. * ``lma_components_pagefaults_minflt``, minor page faults per second. -* ``lma_components_stacksize``, absolute value of the start address (the bottom) +* ``lma_components_stacksize``, the absolute value of the start address (the bottom) of the stack minus the address of the current stack pointer. Heka pipeline ^^^^^^^^^^^^^ -Metrics have two fields: ``name`` that contains the name of the decoder or filter as defined by *Heka* and ``type`` that is either *decoder* or *filter*. +The metrics have two fields: ``name`` that contains the name of the decoder +or filter as defined by *Heka* and ``type`` that is either *decoder* or +*filter*. -Metrics for both types: +The metrics for both types are as follows: -* ``hekad_memory``, the total memory used by the Sandbox (in bytes). -* ``hekad_msg_avg_duration``, the average time for processing the message (in nanoseconds). -* ``hekad_msg_count``, the total number of messages processed by the decoder. This will reset to 0 when the process is restarted. +* ``hekad_memory``, the total memory in bytes used by the Sandbox. +* ``hekad_msg_avg_duration``, the average time in nanoseconds for processing + the message. +* ``hekad_msg_count``, the total number of messages processed by the decoder. + This resets to ``0`` when the process is restarted. Additional metrics for *filter* type: -* ``heakd_timer_event_avg_duration``, the average time for executing the *timer_event* function (in nanoseconds). -* ``hekad_timer_event_count``, the total number of executions of the *timer_event* function. This will reset to 0 when the process is restarted. +* ``heakd_timer_event_avg_duration``, the average time in nanoseconds for + executing the *timer_event* function. +* ``hekad_timer_event_count``, the total number of executions of the + *timer_event* function. This resets to ``0`` when the process is restarted. -Backend checks -^^^^^^^^^^^^^^ +Back-end checks +^^^^^^^^^^^^^^^ -* ``http_check``, the backend's API status, 1 if it is responsive, if not 0. - The metric contains a ``service`` field that identifies the LMA backend service being checked. 
+* ``http_check``, the API status of the back end, ``1`` if it is responsive,
+ if not, then ``0``. The metric contains a ``service`` field that identifies
+ the LMA back-end service being checked.
-```` is one of the following values (depending of which Fuel plugins are deployed in the environment):
+```` is one of the following values, depending on which Fuel plugins
+are deployed in the environment:
-* 'influxdb'
+* 'influxdb'
\ No newline at end of file
diff --git a/doc/user/source/metrics/memcached.rst b/doc/user/source/metrics/memcached.rst
index fe4ecab04..718037d1a 100644
--- a/doc/user/source/metrics/memcached.rst
+++ b/doc/user/source/metrics/memcached.rst
@@ -1,25 +1,26 @@
.. _memcached_metrics:
-* ``memcached_command_flush``, cumulative number of flush reqs.
-* ``memcached_command_get``, cumulative number of retrieval reqs.
-* ``memcached_command_set``, cumulative number of storage reqs.
-* ``memcached_command_touch``, cumulative number of touch reqs.
-* ``memcached_connections_current``, number of open connections.
-* ``memcached_df_cache_free``, current number of free bytes to store items.
-* ``memcached_df_cache_used``, current number of bytes used to store items.
-* ``memcached_items_current``, current number of items stored.
-* ``memcached_octets_rx``, total number of bytes read by this server from network.
-* ``memcached_octets_tx``, total number of bytes sent by this server to network.
-* ``memcached_ops_decr_hits``, number of successful decr reqs.
-* ``memcached_ops_decr_misses``, number of decr reqs against missing keys.
-* ``memcached_ops_evictions``, number of valid items removed from cache to free memory for new items.
-* ``memcached_ops_hits``, number of keys that have been requested.
-* ``memcached_ops_incr_hits``, number of successful incr reqs.
-* ``memcached_ops_incr_misses``, number of successful incr reqs.
-* ``memcached_ops_misses``, number of items that have been requested and not found.
-* ``memcached_percent_hitratio``, percentage of get command hits (in cache).
+* ``memcached_command_flush``, the cumulative number of flush reqs.
+* ``memcached_command_get``, the cumulative number of retrieval reqs.
+* ``memcached_command_set``, the cumulative number of storage reqs.
+* ``memcached_command_touch``, the cumulative number of touch reqs.
+* ``memcached_connections_current``, the number of open connections.
+* ``memcached_df_cache_free``, the current number of free bytes to store items.
+* ``memcached_df_cache_used``, the current number of bytes used to store items.
+* ``memcached_items_current``, the current number of items stored.
+* ``memcached_octets_rx``, the total number of bytes read by this server from
+ the network.
+* ``memcached_octets_tx``, the total number of bytes sent by this server to
+ the network.
+* ``memcached_ops_decr_hits``, the number of successful decr reqs.
+* ``memcached_ops_decr_misses``, the number of decr reqs against missing keys.
+* ``memcached_ops_evictions``, the number of valid items removed from cache to
+ free memory for new items.
+* ``memcached_ops_hits``, the number of keys that have been requested.
+* ``memcached_ops_incr_hits``, the number of successful incr reqs.
+* ``memcached_ops_incr_misses``, the number of incr reqs against missing keys.
+* ``memcached_ops_misses``, the number of items that have been requested and
+ not found.
+* ``memcached_percent_hitratio``, the percentage of get command hits (in cache).
-
-See `memcached documentation`_ for further details.
-
-..
_memcached documentation: https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L488
+For details, see the `Memcached documentation <https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L488>`_.
\ No newline at end of file
diff --git a/doc/user/source/metrics/mysql.rst b/doc/user/source/metrics/mysql.rst
index e677c5419..436457f33 100644
--- a/doc/user/source/metrics/mysql.rst
+++ b/doc/user/source/metrics/mysql.rst
@@ -4,8 +4,8 @@ Commands
^^^^^^^^
``mysql_commands``, the number of times per second a given statement has been
-executed. The metric has a ``statement`` field that contains the statement to
-which it applies. The values can be:
+executed. The metric has a ``statement`` field that contains the statement to
+which it applies. The values can be as follows:
* ``change_db`` for the USE statement.
* ``commit`` for the COMMIT statement.
@@ -29,7 +29,7 @@ Handlers
``mysql_handler``, the number of times per second a given handler has been
executed. The metric has a ``handler`` field that contains the handler
-it applies to. The values can be:
+it applies to. The values can be as follows:
* ``commit`` for the internal COMMIT statements.
* ``delete`` for the internal DELETE statements.
@@ -40,56 +40,69 @@ it applies to. The values can be:
* ``read_prev`` for the requests that read the previous row in key order.
* ``read_rnd`` for the requests that read a row based on a fixed position.
* ``read_rnd_next`` for the requests that read the next row in the data file.
-* ``rollback`` the requests that perform rollback operation.
+* ``rollback`` for the requests that perform the rollback operation.
* ``update`` the requests that update a row in a table.
* ``write`` the requests that insert a row in a table.
Locks
^^^^^
-* ``mysql_locks_immediate``, the number of times per second the requests for table locks could be granted immediately.
-* ``mysql_locks_waited``, the number of times per second the requests for table locks had to wait.
+* ``mysql_locks_immediate``, the number of times per second the requests for
+ table locks could be granted immediately.
+* ``mysql_locks_waited``, the number of times per second the requests for
+ table locks had to wait.
Network
^^^^^^^
-* ``mysql_octets_rx``, the number of bytes received per second by the server.
-* ``mysql_octets_tx``, the number of bytes sent per second by the server.
+* ``mysql_octets_rx``, the number of bytes per second received by the server.
+* ``mysql_octets_tx``, the number of bytes per second sent by the server.
Threads
^^^^^^^
* ``mysql_threads_cached``, the number of threads in the thread cache.
* ``mysql_threads_connected``, the number of currently open connections.
-* ``mysql_threads_created``, the number of threads created per second to handle connections.
+* ``mysql_threads_created``, the number of threads created per second to
+ handle connections.
* ``mysql_threads_running``, the number of threads that are not sleeping.
Cluster
^^^^^^^
-These metrics are collected with statement 'SHOW STATUS'. see `Percona documentation`_
-for further details.
+The following metrics are collected with the 'SHOW STATUS' statement. For details,
+see `Percona documentation <http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html>`_.
-* ``mysql_cluster_connected``, ``1`` when the node is connected to the cluster, if not ``0``.
-* ``mysql_cluster_local_cert_failures``, number of writesets that failed the certification test.
-* ``mysql_cluster_local_commits``, number of writesets commited on the node.
-* ``mysql_cluster_local_recv_queue``, the number of writesets waiting to be applied.
-* ``mysql_cluster_local_send_queue``, the number of writesets waiting to be sent.
-* ``mysql_cluster_ready``, ``1`` when the node is ready to accept queries, if not ``0``.
-* ``mysql_cluster_received``, total number of writesets received from other nodes.
-* ``mysql_cluster_received_bytes``, total size in bytes of writesets received from other nodes.
-* ``mysql_cluster_replicated``, total number of writesets sent to other nodes.
-* ``mysql_cluster_replicated_bytes`` total size in bytes of writesets sent to other nodes.
-* ``mysql_cluster_size``, current number of nodes in the cluster.
-* ``mysql_cluster_status``, ``1`` when the node is 'Primary', ``2`` if 'Non-Primary' and ``3`` if 'Disconnected'.
+* ``mysql_cluster_connected``, ``1`` when the node is connected to the cluster,
+ if not, then ``0``.
+* ``mysql_cluster_local_cert_failures``, the number of write sets that failed
+ the certification test.
+* ``mysql_cluster_local_commits``, the number of write sets committed on the
+ node.
+* ``mysql_cluster_local_recv_queue``, the number of write sets waiting to be
+ applied.
+* ``mysql_cluster_local_send_queue``, the number of write sets waiting to be
+ sent.
+* ``mysql_cluster_ready``, ``1`` when the node is ready to accept queries, if
+ not, then ``0``.
+* ``mysql_cluster_received``, the total number of write sets received from
+ other nodes.
+* ``mysql_cluster_received_bytes``, the total size in bytes of write sets
+ received from other nodes.
+* ``mysql_cluster_replicated``, the total number of write sets sent to other
+ nodes.
+* ``mysql_cluster_replicated_bytes``, the total size in bytes of write sets sent
+ to other nodes.
+* ``mysql_cluster_size``, the current number of nodes in the cluster.
+* ``mysql_cluster_status``, ``1`` when the node is 'Primary', ``2`` if
+ 'Non-Primary', and ``3`` if 'Disconnected'.
-.. _Percona documentation: http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html
-
-Slow Queries
+Slow queries
^^^^^^^^^^^^
-This metric is collected with statement 'SHOW STATUS where Variable_name = 'Slow_queries'.
-
-* ``mysql_slow_queries``, number of queries that have taken more than X seconds,
- depending of the MySQL configuration parameter 'long_query_time' (10s per default)
+The following metric is collected with the statement
+``SHOW STATUS where Variable_name = 'Slow_queries'``.
+* ``mysql_slow_queries``, the number of queries that have taken more than X
+ seconds, depending on the MySQL configuration parameter 'long_query_time'
+ (10s per default).
\ No newline at end of file
diff --git a/doc/user/source/metrics/openstack.rst b/doc/user/source/metrics/openstack.rst
index 0271b05bc..3f03c01c6 100644
--- a/doc/user/source/metrics/openstack.rst
+++ b/doc/user/source/metrics/openstack.rst
@@ -4,10 +4,12 @@ Service checks
^^^^^^^^^^^^^^
.. _service_checks:
-* ``openstack_check_api``, the service's API status, 1 if it is responsive, if not 0.
- The metric contains a ``service`` field that identifies the OpenStack service being checked.
+* ``openstack_check_api``, the service's API status, ``1`` if it is responsive,
+ if not, then ``0``. The metric contains a ``service`` field that identifies
+ the OpenStack service being checked.
-```` is one of the following values with their respective resource checks:
+```` is one of the following values with their respective resource
+checks:
* 'ceilometer-api': '/v2/capabilities'
* 'cinder-api': '/'
@@ -21,61 +23,75 @@ Service checks
* 'swift-api': '/healthcheck'
* 'swift-s3-api': '/healthcheck'
-..
note:: All checks are performed without authentication except for Ceilometer. +.. note:: All checks except for Ceilometer are performed without authentication. Compute ^^^^^^^ -These metrics are emitted per compute node. +The following metrics are emitted per compute node: -* ``openstack_nova_free_disk``, the disk space (in GB) available for new instances. -* ``openstack_nova_free_ram``, the memory (in MB) available for new instances. -* ``openstack_nova_free_vcpus``, the number of virtual CPU available for new instances. -* ``openstack_nova_instance_creation_time``, the time (in seconds) it took to launch a new instance. -* ``openstack_nova_instance_state``, the number of instances which entered a given state (the value is always 1). +* ``openstack_nova_free_disk``, the disk space in GB available for new instances. +* ``openstack_nova_free_ram``, the memory in MB available for new instances. +* ``openstack_nova_free_vcpus``, the number of virtual CPU available for new + instances. +* ``openstack_nova_instance_creation_time``, the time in seconds it took to + launch a new instance. +* ``openstack_nova_instance_state``, the number of instances which entered a + given state (the value is always ``1``). The metric contains a ``state`` field. * ``openstack_nova_running_instances``, the number of running instances. * ``openstack_nova_running_tasks``, the number of tasks currently executed. -* ``openstack_nova_used_disk``, the disk space (in GB) used by the instances. -* ``openstack_nova_used_ram``, the memory (in MB) used by the instances. -* ``openstack_nova_used_vcpus``, the number of virtual CPU used by the instances. +* ``openstack_nova_used_disk``, the disk space in GB used by the instances. +* ``openstack_nova_used_ram``, the memory in MB used by the instances. +* ``openstack_nova_used_vcpus``, the number of virtual CPU used by the + instances. -These metrics are retrieved from the Nova API and represent the aggregated -values across all compute nodes. +The following metrics are retrieved from the Nova API and represent the +aggregated values across all compute nodes. -* ``openstack_nova_total_free_disk``, the total amount of disk space (in GB) available for new instances. -* ``openstack_nova_total_free_ram``, the total amount of memory (in MB) available for new instances. -* ``openstack_nova_total_free_vcpus``, the total number of virtual CPU available for new instances. -* ``openstack_nova_total_running_instances``, the total number of running instances. -* ``openstack_nova_total_running_tasks``, the total number of tasks currently executed. -* ``openstack_nova_total_used_disk``, the total amount of disk space (in GB) used by the instances. -* ``openstack_nova_total_used_ram``, the total amount of memory (in MB) used by the instances. -* ``openstack_nova_total_used_vcpus``, the total number of virtual CPU used by the instances. +* ``openstack_nova_total_free_disk``, the total amount of disk space in GB + available for new instances. +* ``openstack_nova_total_free_ram``, the total amount of memory in MB available + for new instances. +* ``openstack_nova_total_free_vcpus``, the total number of virtual CPU + available for new instances. +* ``openstack_nova_total_running_instances``, the total number of running + instances. +* ``openstack_nova_total_running_tasks``, the total number of tasks currently + executed. +* ``openstack_nova_total_used_disk``, the total amount of disk space in GB + used by the instances. 
+* ``openstack_nova_total_used_ram``, the total amount of memory in MB used by + the instances. +* ``openstack_nova_total_used_vcpus``, the total number of virtual CPU used by + the instances. -These metrics are retrieved from the Nova API. +The following metrics are retrieved from the Nova API: * ``openstack_nova_instances``, the total count of instances in a given state. The metric contains a ``state`` field which is one of 'active', 'deleted', 'error', 'paused', 'resumed', 'rescued', 'resized', 'shelved_offloaded' or 'suspended'. -These metrics are retrieved from the Nova database. +The following metrics are retrieved from the Nova database: .. _compute-service-state-metrics: -* ``openstack_nova_service``, the Nova service state (either 0 for 'up', 1 for 'down' or 2 for 'disabled'). - The metric contains a ``service`` field (one of 'compute', 'conductor', 'scheduler', 'cert' - or 'consoleauth') and a ``state`` field (one of 'up', 'down' or 'disabled'). +* ``openstack_nova_service``, the Nova service state (either ``0`` for 'up', + ``1`` for 'down' or ``2`` for 'disabled'). + The metric contains a ``service`` field (one of 'compute', 'conductor', + 'scheduler', 'cert' or 'consoleauth') and a ``state`` field (one of 'up', + 'down' or 'disabled'). * ``openstack_nova_services``, the total count of Nova services by state. The metric contains a ``service`` field (one of 'compute', 'conductor', 'scheduler', 'cert' or 'consoleauth') and a ``state`` field (one - of 'up', 'down' or 'disabled'). + of 'up', 'down', or 'disabled'). Identity ^^^^^^^^ -These metrics are retrieved from the Keystone API. +The following metrics are retrieved from the Keystone API: * ``openstack_keystone_roles``, the total number of roles. * ``openstack_keystone_tenants``, the number of tenants by state. The metric @@ -86,28 +102,37 @@ These metrics are retrieved from the Keystone API. Volume ^^^^^^ -These metrics are emitted per volume node. +The following metrics are emitted per volume node: -* ``openstack_cinder_volume_creation_time``, the time (in seconds) it took to create a new volume. +* ``openstack_cinder_volume_creation_time``, the time in seconds it took to + create a new volume. -.. note:: When using Ceph as the backend storage for volumes, the ``hostname`` value is always set to ``rbd``. +.. note:: When using Ceph as the back end storage for volumes, the ``hostname`` + value is always set to ``rbd``. -These metrics are retrieved from the Cinder API. +The following metrics are retrieved from the Cinder API: -* ``openstack_cinder_snapshots``, the number of snapshots by state. The metric contains a ``state`` field. -* ``openstack_cinder_snapshots_size``, the total size (in bytes) of snapshots by state. The metric contains a ``state`` field. -* ``openstack_cinder_volumes``, the number of volumes by state. The metric contains a ``state`` field. -* ``openstack_cinder_volumes_size``, the total size (in bytes) of volumes by state. The metric contains a ``state`` field. +* ``openstack_cinder_snapshots``, the number of snapshots by state. The metric + contains a ``state`` field. +* ``openstack_cinder_snapshots_size``, the total size (in bytes) of snapshots + by state. The metric contains a ``state`` field. +* ``openstack_cinder_volumes``, the number of volumes by state. The metric + contains a ``state`` field. +* ``openstack_cinder_volumes_size``, the total size (in bytes) of volumes by + state. The metric contains a ``state`` field. 
-``state`` is one of 'available', 'creating', 'attaching', 'in-use', 'deleting', 'backing-up', 'restoring-backup', 'error', 'error_deleting', 'error_restoring', 'error_extending'. +``state`` is one of 'available', 'creating', 'attaching', 'in-use', 'deleting', +'backing-up', 'restoring-backup', 'error', 'error_deleting', 'error_restoring', +'error_extending'. -These metrics are retrieved from the Cinder database. +The following metrics are retrieved from the Cinder database: .. _volume-service-state-metrics: -* ``openstack_cinder_service``, the Cinder service state (either 0 for 'up', 1 for 'down' or 2 for 'disabled'). - The metric contains a ``service`` field (one of 'volume', 'backup', 'scheduler'), - and a ``state`` field (one of 'up', 'down' or 'disabled'). +* ``openstack_cinder_service``, the Cinder service state (either ``0`` for + 'up', ``1`` for 'down', or ``2`` for 'disabled'). The metric contains a + ``service`` field (one of 'volume', 'backup', 'scheduler') and a ``state`` + field (one of 'up', 'down' or 'disabled'). * ``openstack_cinder_services``, the total count of Cinder services by state. The metric contains a ``service`` field (one of 'volume', 'backup', @@ -116,17 +141,18 @@ These metrics are retrieved from the Cinder database. Image ^^^^^ -These metrics are retrieved from the Glance API. +The following metrics are retrieved from the Glance API: * ``openstack_glance_images``, the number of images by state and visibility. - The metric contains ``state`` and ``visibility`` field. + The metric contains ``state`` and ``visibility`` fields. * ``openstack_glance_images_size``, the total size (in bytes) of images by - state and visibility. The metric contains ``state`` and ``visibility`` field. + state and visibility. The metric contains ``state`` and ``visibility`` + fields. * ``openstack_glance_snapshots``, the number of snapshot images by state and - visibility. The metric contains ``state`` and ``visibility`` field. + visibility. The metric contains ``state`` and ``visibility`` fields. * ``openstack_glance_snapshots_size``, the total size (in bytes) of snapshots by state and visibility. The metric contains ``state`` and ``visibility`` - field. + fields. ``state`` is one of 'queued', 'saving', 'active', 'killed', 'deleted', 'pending_delete'. ``visibility`` is either 'public' or 'private'. @@ -134,27 +160,32 @@ These metrics are retrieved from the Glance API. Network ^^^^^^^ -These metrics are retrieved from the Neutron API. +The following metrics are retrieved from the Neutron API: * ``openstack_neutron_floatingips``, the total number of floating IP addresses. -* ``openstack_neutron_networks``, the number of virtual networks by state. The metric contains a ``state`` field. -* ``openstack_neutron_ports``, the number of virtual ports by owner and state. The metric contains ``owner`` and ``state`` fields. -* ``openstack_neutron_routers``, the number of virtual routers by state. The metric contains a ``state`` field. +* ``openstack_neutron_networks``, the number of virtual networks by state. The + metric contains a ``state`` field. +* ``openstack_neutron_ports``, the number of virtual ports by owner and state. + The metric contains ``owner`` and ``state`` fields. +* ``openstack_neutron_routers``, the number of virtual routers by state. The + metric contains a ``state`` field. * ``openstack_neutron_subnets``, the number of virtual subnets. ```` is one of 'active', 'build', 'down' or 'error'. 
-```` is one of 'compute', 'dhcp', 'floatingip', 'floatingip_agent_gateway', 'router_interface', 'router_gateway', 'router_ha_interface', 'router_interface_distributed' or 'router_centralized_snat'.
+```` is one of 'compute', 'dhcp', 'floatingip', 'floatingip_agent_gateway', 'router_interface', 'router_gateway', 'router_ha_interface',
+'router_interface_distributed', or 'router_centralized_snat'.
-These metrics are retrieved from the Neutron database.
+The following metrics are retrieved from the Neutron database:
.. _network-agent-state-metrics:
.. note:: These metrics are not collected when the Contrail plugin is deployed.
-* ``openstack_neutron_agent``, the Neutron agent state (either 0 for 'up', 1 for 'down' or 2 for 'disabled').
- The metric contains a ``service`` field (one of 'dhcp', 'l3', 'metadata' or 'openvswitch'),
- and a ``state`` field (one of 'up', 'down' or 'disabled').
+* ``openstack_neutron_agent``, the Neutron agent state (either ``0`` for 'up',
+ ``1`` for 'down', or ``2`` for 'disabled').
+ The metric contains a ``service`` field (one of 'dhcp', 'l3', 'metadata', or
+ 'openvswitch'), and a ``state`` field (one of 'up', 'down' or 'disabled').
* ``openstack_neutron_agents``, the total number of Neutron agents by service
and state. The metric contains ``service`` (one of 'dhcp', 'l3', 'metadata'
@@ -164,12 +195,17 @@ API response times
^^^^^^^^^^^^^^^^^^
* ``openstack__http_response_times``, HTTP response time statistics.
- The statistics are ``min``, ``max``, ``sum``, ``count``, ``upper_90`` (90 percentile) over 10 seconds.
- The metric contains ``http_method`` (eg 'GET', 'POST', and so forth) and ``http_status`` (eg '2xx', '4xx', and so forth) fields.
+ The statistics are ``min``, ``max``, ``sum``, ``count``, ``upper_90``
+ (90th percentile) over 10 seconds. The metric contains an ``http_method`` field,
+ for example, 'GET', 'POST', and others, and an ``http_status`` field, for
+ example, '2xx', '4xx', and others.
-```` is one of 'cinder', 'glance', 'heat' 'keystone', 'neutron' or 'nova'.
+```` is one of 'cinder', 'glance', 'heat', 'keystone', 'neutron' or
+'nova'.
Logs
^^^^
-* ``log_messages``, the number of log messages per second for the given service and severity level. The metric contains ``service`` and ``level`` (one of 'debug', 'info', ... ) fields.
+* ``log_messages``, the number of log messages per second for the given
+ service and severity level. The metric contains ``service`` and ``level``
+ (one of 'debug', 'info', and others) fields.
diff --git a/doc/user/source/metrics/pacemaker.rst b/doc/user/source/metrics/pacemaker.rst
index 2888aa3be..9da3a8184 100644
--- a/doc/user/source/metrics/pacemaker.rst
+++ b/doc/user/source/metrics/pacemaker.rst
@@ -4,6 +4,6 @@ Resource location
^^^^^^^^^^^^^^^^^
* ``pacemaker_resource_local_active``, ``1`` when the resource is located on
- the host reporting the metric, if not ``0``. The metric contains a
+ the host reporting the metric, if not, then ``0``. The metric contains a
``resource`` field which is one of 'vip__public', 'vip__management',
- 'vip__vrouter_pub' or 'vip__vrouter'.
+ 'vip__vrouter_pub', or 'vip__vrouter'.
diff --git a/doc/user/source/metrics/rabbitmq.rst b/doc/user/source/metrics/rabbitmq.rst
index 7bd33a34d..b40761bd1 100644
--- a/doc/user/source/metrics/rabbitmq.rst
+++ b/doc/user/source/metrics/rabbitmq.rst
@@ -3,16 +3,23 @@ Cluster
^^^^^^^
-* ``rabbitmq_connections``, total number of connections.
-* ``rabbitmq_consumers``, total number of consumers.
-* ``rabbitmq_channels``, total number of channels.
-* ``rabbitmq_exchanges``, total number of exchanges. -* ``rabbitmq_messages``, total number of messages which are ready to be consumed or not yet acknowledged. -* ``rabbitmq_queues``, total number of queues. -* ``rabbitmq_running_nodes``, total number of running nodes in the cluster. -* ``rabbitmq_disk_free``, the disk free space. -* ``rabbitmq_disk_free_limit``, the minimum amount of free disk for RabbitMQ. When ``rabbitmq_disk_free`` drops below this value, all producers are blocked. -* ``rabbitmq_remaining_disk``, the difference between ``rabbitmq_disk_free`` and ``rabbitmq_disk_free_limit``. +* ``rabbitmq_connections``, the total number of connections. +* ``rabbitmq_consumers``, the total number of consumers. +* ``rabbitmq_channels``, the total number of channels. +* ``rabbitmq_exchanges``, the total number of exchanges. +* ``rabbitmq_messages``, the total number of messages which are ready to be + consumed or not yet acknowledged. +* ``rabbitmq_queues``, the total number of queues. +* ``rabbitmq_running_nodes``, the total number of running nodes in the cluster. +* ``rabbitmq_disk_free``, the free disk space. +* ``rabbitmq_disk_free_limit``, the minimum amount of free disk space for + RabbitMQ. + When ``rabbitmq_disk_free`` drops below this value, all producers are blocked. +* ``rabbitmq_remaining_disk``, the difference between ``rabbitmq_disk_free`` + and ``rabbitmq_disk_free_limit``. * ``rabbitmq_used_memory``, bytes of memory used by the whole RabbitMQ process. -* ``rabbitmq_vm_memory_limit``, the maximum amount of memory allocated for RabbitMQ. When ``rabbitmq_used_memory`` uses more than this value, all producers are blocked. -* ``rabbitmq_remaining_memory``, the difference between ``rabbitmq_vm_memory_limit`` and ``rabbitmq_used_memory``. +* ``rabbitmq_vm_memory_limit``, the maximum amount of memory allocated for + RabbitMQ. When ``rabbitmq_used_memory`` uses more than this value, all + producers are blocked. +* ``rabbitmq_remaining_memory``, the difference between + ``rabbitmq_vm_memory_limit`` and ``rabbitmq_used_memory``. diff --git a/doc/user/source/metrics/system.rst b/doc/user/source/metrics/system.rst index 75cde2485..0f0181140 100644 --- a/doc/user/source/metrics/system.rst +++ b/doc/user/source/metrics/system.rst @@ -3,36 +3,45 @@ CPU ^^^ -Metrics have a ``cpu_number`` field that contains the CPU number to which the metric applies. +Metrics have a ``cpu_number`` field that contains the CPU number to which the +metric applies. -* ``cpu_idle``, percentage of CPU time spent in the idle task. -* ``cpu_interrupt``, percentage of CPU time spent servicing interrupts. -* ``cpu_nice``, percentage of CPU time spent in user mode with low priority (nice). -* ``cpu_softirq``, percentage of CPU time spent servicing soft interrupts. -* ``cpu_steal``, percentage of CPU time spent in other operating systems. -* ``cpu_system``, percentage of CPU time spent in system mode. -* ``cpu_user``, percentage of CPU time spent in user mode. -* ``cpu_wait``, percentage of CPU time spent waiting for I/O operations to complete. +* ``cpu_idle``, the percentage of CPU time spent in the idle task. +* ``cpu_interrupt``, the percentage of CPU time spent servicing interrupts. +* ``cpu_nice``, the percentage of CPU time spent in user mode with low + priority (nice). +* ``cpu_softirq``, the percentage of CPU time spent servicing soft interrupts. +* ``cpu_steal``, the percentage of CPU time spent in other operating systems. +* ``cpu_system``, the percentage of CPU time spent in system mode. 
+* ``cpu_user``, the percentage of CPU time spent in user mode. +* ``cpu_wait``, the percentage of CPU time spent waiting for I/O operations to + complete. Disk ^^^^ -Metrics have a ``device`` field that contains the disk device number the metric applies to (eg 'sda', 'sdb' and so on). +Metrics have a ``device`` field that contains the disk device number the metric +applies to. For example, 'sda', 'sdb', and others. -* ``disk_merged_read``, the number of read operations per second that could be merged with already queued operations. -* ``disk_merged_write``, the number of write operations per second that could be merged with already queued operations. +* ``disk_merged_read``, the number of read operations per second that could be + merged with already queued operations. +* ``disk_merged_write``, the number of write operations per second that could + be merged with already queued operations. * ``disk_octets_read``, the number of octets (bytes) read per second. * ``disk_octets_write``, the number of octets (bytes) written per second. * ``disk_ops_read``, the number of read operations per second. * ``disk_ops_write``, the number of write operations per second. -* ``disk_time_read``, the average time for a read operation to complete in the last interval. -* ``disk_time_write``, the average time for a write operation to complete in the last interval. +* ``disk_time_read``, the average time for a read operation to complete in the + last interval. +* ``disk_time_write``, the average time for a write operation to complete in + the last interval. File system ^^^^^^^^^^^ -Metrics have a ``fs`` field that contains the partition's mount point to which the metric applies (eg '/', '/var/lib' and so on). +Metrics have a ``fs`` field that contains the partition's mount point to which +the metric applies. For example, '/', '/var/lib', and others. * ``fs_inodes_free``, the number of free inodes on the file system. * ``fs_inodes_percent_free``, the percentage of free inodes on the file system. @@ -52,46 +61,53 @@ System load * ``load_longterm``, the system load average over the last 15 minutes. * ``load_midterm``, the system load average over the last 5 minutes. -* ``load_shortterm``, the system load averge over the last minute. +* ``load_shortterm``, the system load average over the last minute. Memory ^^^^^^ -* ``memory_buffered``, the amount of memory (in bytes) which is buffered. -* ``memory_cached``, the amount of memory (in bytes) which is cached. -* ``memory_free``, the amount of memory (in bytes) which is free. -* ``memory_used``, the amount of memory (in bytes) which is used. +* ``memory_buffered``, the amount of buffered memory in bytes. +* ``memory_cached``, the amount of cached memory in bytes. +* ``memory_free``, the amount of free memory in bytes. +* ``memory_used``, the amount of used memory in bytes. Network ^^^^^^^ -Metrics have a ``interface`` field that contains the interface name the metric applies to (eg 'eth0', 'eth1' and so on). +Metrics have an ``interface`` field that contains the interface name the +metric applies to. For example, 'eth0', 'eth1', and others. -* ``if_errors_rx``, the number of errors per second detected when receiving from the interface. -* ``if_errors_tx``, the number of errors per second detected when transmitting from the interface. -* ``if_octets_rx``, the number of octets (bytes) received per second by the interface. -* ``if_octets_tx``, the number of octets (bytes) transmitted per second by the interface. 
-* ``if_packets_rx``, the number of packets received per second by the interface. -* ``if_packets_tx``, the number of packets transmitted per second by the interface. +* ``if_errors_rx``, the number of errors per second detected when receiving + from the interface. +* ``if_errors_tx``, the number of errors per second detected when transmitting + from the interface. +* ``if_octets_rx``, the number of octets (bytes) received per second by the + interface. +* ``if_octets_tx``, the number of octets (bytes) transmitted per second by the + interface. +* ``if_packets_rx``, the number of packets received per second by the + interface. +* ``if_packets_tx``, the number of packets transmitted per second by the + interface. Processes ^^^^^^^^^ * ``processes_count``, the number of processes in a given state. The metric has - a ``state`` field (one of 'blocked', 'paging', 'running', 'sleeping', 'stopped' - or 'zombies'). + a ``state`` field (one of 'blocked', 'paging', 'running', 'sleeping', + 'stopped' or 'zombies'). * ``processes_fork_rate``, the number of processes forked per second. Swap ^^^^ -* ``swap_cached``, the amount of cached memory (in bytes) which is in the swap. -* ``swap_free``, the amount of free memory (in bytes) which is in the swap. +* ``swap_cached``, the amount of cached memory (in bytes) that is in the swap. +* ``swap_free``, the amount of free memory (in bytes) that is in the swap. * ``swap_io_in``, the number of swap pages written per second. * ``swap_io_out``, the number of swap pages read per second. -* ``swap_used``, the amount of used memory (in bytes) which is in the swap. +* ``swap_used``, the amount of used memory (in bytes) that is in the swap. Users ^^^^^ -* ``logged_users``, the number of users currently logged-in. +* ``logged_users``, the number of users currently logged in. \ No newline at end of file
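As an illustration of how a per-state count such as the ``processes_count`` metric listed above could be produced, the following minimal Python sketch counts processes by state from ``/proc``. It is not the actual StackLight collectd plugin; the helper name and the mapping of ``/proc`` state letters to ``state`` field values are assumptions made for the example only::

    # Illustrative sketch only: count processes by state, similar in spirit
    # to what the processes_count metric reports per ``state`` value.
    import os

    # Assumed mapping of /proc/<pid>/stat state letters to ``state`` values.
    STATE_MAP = {'R': 'running', 'S': 'sleeping', 'D': 'blocked',
                 'T': 'stopped', 'Z': 'zombies', 'W': 'paging'}

    def processes_by_state():
        counts = dict((label, 0) for label in STATE_MAP.values())
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/stat' % pid) as stat_file:
                    # Format is "<pid> (<comm>) <state> ..."; take the field
                    # right after the last closing parenthesis.
                    state = stat_file.read().rsplit(')', 1)[1].split()[0]
            except (IOError, IndexError):
                continue  # the process may have exited in the meantime
            if state in STATE_MAP:
                counts[STATE_MAP[state]] += 1
        return counts

    print(processes_by_state())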