Merge "copy admin-guide"
commit 1b78978cba
@ -0,0 +1,7 @@
==========================
Telemetry Alarming service
==========================

.. toctree::

   telemetry-alarms.rst

@ -0,0 +1,343 @@
.. _telemetry-alarms:

======
Alarms
======

Alarms provide user-oriented Monitoring-as-a-Service for resources
running on OpenStack. This type of monitoring lets you automatically
scale a group of instances in or out through the Orchestration
service, but you can also use alarms for general-purpose
awareness of your cloud resources' health.

These alarms follow a tri-state model:

ok
  The rule governing the alarm has been evaluated as ``False``.

alarm
  The rule governing the alarm has been evaluated as ``True``.

insufficient data
  There are not enough datapoints available in the evaluation periods
  to meaningfully determine the alarm state.

Alarm definitions
~~~~~~~~~~~~~~~~~

The definition of an alarm provides the rules that govern when a state
transition should occur, and the actions to be taken thereon. The
nature of these rules depends on the alarm type.

Threshold rule alarms
---------------------

For conventional threshold-oriented alarms, state transitions are
governed by:

* A static threshold value with a comparison operator such as greater
  than or less than.

* A statistic selection to aggregate the data.

* A sliding time window to indicate how far back into the recent past
  you want to look.
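
The three ingredients above can be illustrated with a minimal Python
sketch. It is not the evaluator's actual code: the function name
``evaluate_threshold``, its parameters, and the choice to report
``alarm`` only when every period in the window breaches the threshold
are assumptions made for illustration.

```python
import operator
from statistics import mean

# Comparison operators, keyed by short names such as "gt" and "lt".
COMPARATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def evaluate_threshold(period_samples, threshold, comparison="gt",
                       aggregate=mean):
    """Return 'ok', 'alarm', or 'insufficient data' for one cycle.

    period_samples holds one list of raw datapoints per evaluation
    period in the sliding window (hypothetical input format).
    """
    # Any empty period means the window cannot be judged meaningfully.
    if not period_samples or any(len(s) == 0 for s in period_samples):
        return "insufficient data"
    compare = COMPARATORS[comparison]
    # Aggregate each period with the chosen statistic, then compare.
    breaches = [compare(aggregate(s), threshold) for s in period_samples]
    # Assumption of this sketch: the rule is True only if all periods breach.
    return "alarm" if all(breaches) else "ok"
```

For example, ``evaluate_threshold([[80, 90], [75], [71, 72]], 70.0)``
yields ``alarm``, while a window containing an empty period yields
``insufficient data``.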

Valid threshold alarm rules are ``gnocchi_resources_threshold_rule``,
``gnocchi_aggregation_by_metrics_threshold_rule``, and
``gnocchi_aggregation_by_resources_threshold_rule``.

.. note::

   As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's
   native storage API is deprecated.

Composite rule alarms
---------------------

Composite alarms enable users to define an alarm with multiple triggering
conditions, using a combination of ``and`` and ``or`` relations.
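
How nested ``and``/``or`` relations combine can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
``evaluate_composite`` and the ``evaluate_leaf`` callback are
hypothetical names, each leaf is reduced to a plain boolean, and the
``insufficient data`` state is ignored for brevity.

```python
def evaluate_composite(rule, evaluate_leaf):
    """Recursively evaluate a nested and/or rule.

    rule is {"and": [...]}, {"or": [...]}, or a leaf rule dict that is
    handed to the caller-supplied evaluate_leaf callback.
    """
    if isinstance(rule, dict) and len(rule) == 1:
        (key, children), = rule.items()
        if key == "and":
            return all(evaluate_composite(c, evaluate_leaf) for c in children)
        if key == "or":
            return any(evaluate_composite(c, evaluate_leaf) for c in children)
    # Anything else is treated as a leaf condition.
    return evaluate_leaf(rule)
```

With a callback that looks up precomputed leaf results, a rule such as
``{"or": [A, {"and": [B, C]}]}`` is true when ``A`` is true or when
both ``B`` and ``C`` are true.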


Combination rule alarms
-----------------------

.. note::

   Combination alarms are deprecated as of Newton in favor of composite
   alarms. Combination alarm functionality is removed in Pike.

The Telemetry service also supports the concept of a meta-alarm, which
aggregates over the current state of a set of underlying basic alarms
combined via a logical operator (``and`` or ``or``).

Alarm dimensioning
~~~~~~~~~~~~~~~~~~

A key associated concept is the notion of *dimensioning*, which
defines the set of matching meters that feed into an alarm
evaluation. Recall that meters are per-resource-instance, so in the
simplest case an alarm might be defined over a particular meter
applied to all resources visible to a particular user. More useful,
however, would be the option to explicitly select which specific
resources you are interested in alarming on.

At one extreme you might have narrowly dimensioned alarms where this
selection would have only a single target (identified by resource
ID). At the other extreme, you could have widely dimensioned alarms
where this selection identifies many resources over which the
statistic is aggregated. For example, all instances booted from a
particular image or all instances with matching user metadata (the
latter is how the Orchestration service identifies autoscaling
groups).

Alarm evaluation
~~~~~~~~~~~~~~~~

Alarms are evaluated by the ``alarm-evaluator`` service on a periodic
basis, defaulting to once every minute.

Alarm actions
-------------

Any state transition of an individual alarm (to ``ok``, ``alarm``, or
``insufficient data``) may have one or more actions associated with
it. These actions effectively send a signal to a consumer that the
state transition has occurred, and provide some additional context.
This includes the new and previous states, with some reason data
describing the disposition with respect to the threshold, the number
of datapoints involved, and the most recent of these. State transitions
are detected by the ``alarm-evaluator``, whereas the
``alarm-notifier`` effects the actual notification action.

**Webhooks**

These are the *de facto* notification type used by Telemetry alarming
and simply involve an HTTP POST request being sent to an endpoint,
with a request body containing a description of the state transition
encoded as a JSON fragment.
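
A receiving endpoint for such a webhook could look roughly like the
following Python sketch, which uses only the standard library. The
handler class, the ``parse_notification`` helper, and the host and port
are hypothetical; the sketch deliberately makes no assumption about
which fields the JSON body contains and simply prints whatever arrives.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_notification(body: bytes) -> dict:
    """Decode the JSON request body of a webhook POST."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class AlarmWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_notification(self.rfile.read(length))
        except ValueError:
            # Malformed or non-object JSON: reject the request.
            self.send_response(400)
            self.end_headers()
            return
        # Field names are not assumed here; log the payload as received.
        print("alarm notification:", json.dumps(payload, sort_keys=True))
        self.send_response(204)
        self.end_headers()


def serve(host="127.0.0.1", port=8080):
    """Block forever, handling webhook POSTs (illustrative defaults)."""
    HTTPServer((host, port), AlarmWebhookHandler).serve_forever()
```

Calling ``serve()`` and pointing an alarm action at the resulting URL
would be enough to watch state transitions arrive during testing.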

**Log actions**

These are a lightweight alternative to webhooks, whereby the state
transition is simply logged by the ``alarm-notifier``, and are
intended primarily for testing purposes.

Workload partitioning
---------------------

The alarm evaluation process uses the same mechanism for workload
partitioning as the central and compute agents. The
`Tooz <https://pypi.python.org/pypi/tooz>`_ library provides the
coordination within the groups of service instances. For further
information about this approach, see the `high availability guide
<https://docs.openstack.org/ha-guide/controller-ha-telemetry.html>`_.

To use this workload partitioning solution, set the
``evaluation_service`` option to ``default``. For more
information, see the alarm section in the
`OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_.

Using alarms
~~~~~~~~~~~~

Alarm creation
--------------

An example of creating a Gnocchi threshold-oriented alarm, based on an upper
bound on the CPU utilization for a particular instance:

.. code-block:: console

   $ aodh alarm create --name cpu_hi \
     --type gnocchi_resources_threshold \
     --description 'instance running hot' \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method avg \
     --granularity 600 --evaluation-periods 3 \
     --alarm-action 'log://' --resource-id INSTANCE_ID

This creates an alarm that will fire when the average CPU utilization
for an individual instance exceeds 70% for three consecutive 10
minute periods. The notification in this case is simply a log message,
though it could alternatively be a webhook URL.

.. note::

   Alarm names must be unique for the alarms associated with an
   individual project. Administrators can limit the maximum number of
   resulting actions for each of the three states, and the
   ability for a normal user to create ``log://`` and ``test://``
   notifiers is disabled. This prevents unintentional
   consumption of disk and memory resources by the
   Telemetry service.

The sliding time window over which the alarm is evaluated is 30
minutes in this example. This window is not clamped to wall-clock
time boundaries; rather, it is anchored on the current time for each
evaluation cycle, and continually creeps forward as each evaluation
cycle rolls around (by default, this occurs every minute).
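
The arithmetic behind the 30 minute figure, and the way the window is
anchored on the current time, can be sketched as follows. The helper
name ``evaluation_window`` is hypothetical; only the relationship
between granularity, evaluation periods, and window length is taken
from the example above.

```python
from datetime import datetime, timedelta, timezone


def evaluation_window(granularity_s, evaluation_periods, now=None):
    """Return (start, end) of the sliding window ending at `now`."""
    now = now or datetime.now(timezone.utc)
    # Window length = period length x number of periods (600 s x 3 = 30 min).
    length = timedelta(seconds=granularity_s * evaluation_periods)
    return now - length, now
```

Calling the helper again one minute later returns a window whose start
and end have both moved forward by one minute, which is the "creeping"
behaviour described above.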

.. note::

   The alarm granularity must match the granularities of the metric configured
   in Gnocchi.

   Otherwise the alarm will tend to flit in and out of the
   ``insufficient data`` state due to the mismatch between the actual
   frequency of datapoints in the metering store and the statistics
   queries used to compare against the alarm threshold. If a shorter
   alarm period is needed, then the corresponding interval should be
   adjusted in the ``pipeline.yaml`` file.

Other notable alarm attributes that may be set on creation, or via a
subsequent update, include:

state
  The initial alarm state (defaults to ``insufficient data``).

description
  A free-text description of the alarm (defaults to a synopsis of the
  alarm rule).

enabled
  True if evaluation and actioning is to be enabled for this alarm
  (defaults to ``True``).

repeat-actions
  True if actions should be repeatedly notified while the alarm
  remains in the target state (defaults to ``False``).

ok-action
  An action to invoke when the alarm state transitions to ``ok``.

insufficient-data-action
  An action to invoke when the alarm state transitions to
  ``insufficient data``.

time-constraint
  Used to restrict evaluation of the alarm to certain times of the
  day or days of the week (expressed as a ``cron`` expression with an
  optional timezone).

An example of creating a composite alarm, based on the combined
state of two underlying alarms:

.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or":[{"threshold": 0.8, "metric": "cpu_util",
       "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID",
       "aggregation-method": "last"},
       {"threshold": 0.8, "metric": "cpu_util",
       "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID2",
       "aggregation-method": "last"}]}' \
     --alarm-action 'http://example.org/notify'

This creates an alarm that will fire when either one of two underlying
alarms transitions into the alarm state. The notification in this case
is a webhook call. Any number of underlying alarms can be combined in
this way, using either ``and`` or ``or``. Additionally, combinations
can contain nested conditions:

.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or":[ALARM_1, {"and":[ALARM_2, ALARM_3]}]}' \
     --alarm-action 'http://example.org/notify'


Alarm retrieval
---------------

You can display all your alarms via (some attributes are omitted for
brevity):

.. code-block:: console

   $ aodh alarm list
   +----------+-----------+--------+-------------------+----------+---------+
   | Alarm ID | Type      | Name   | State             | Severity | Enabled |
   +----------+-----------+--------+-------------------+----------+---------+
   | ALARM_ID | threshold | cpu_hi | insufficient data | high     | True    |
   +----------+-----------+--------+-------------------+----------+---------+

In this case, the state is reported as ``insufficient data`` which
could indicate that:

* meters have not yet been gathered about this instance over the
  evaluation window into the recent past (for example a brand-new
  instance)

* *or*, that the identified instance is not visible to the
  user/project owning the alarm

* *or*, simply that an alarm evaluation cycle hasn't kicked off since
  the alarm was created (by default, alarms are evaluated once per
  minute).

.. note::

   The visibility of alarms depends on the role and project
   associated with the user issuing the query:

   * admin users see *all* alarms, regardless of the owner

   * non-admin users see only the alarms associated with their project
     (as per the normal project segregation in OpenStack)

Alarm update
------------

Once the state of the alarm has settled down, we might decide that we
set that bar too low with 70%, in which case the threshold (or almost
any other alarm attribute) can be updated as follows:

.. code-block:: console

   $ aodh alarm update ALARM_ID --threshold 75

The change will take effect from the next evaluation cycle, which by
default occurs every minute.

Most alarm attributes can be changed in this way, but there is also
a convenient short-cut for getting and setting the alarm state:

.. code-block:: console

   $ openstack alarm state get ALARM_ID
   $ openstack alarm state set --state ok ALARM_ID

Over time the state of the alarm may change often, especially if the
threshold is chosen to be close to the trending value of the
statistic. You can follow the history of an alarm over its lifecycle
via the audit API:

.. code-block:: console

   $ aodh alarm-history show ALARM_ID
   +------------------+-----------+---------------------------------------+
   | Type             | Timestamp | Detail                                |
   +------------------+-----------+---------------------------------------+
   | creation         | time0     | name: cpu_hi                          |
   |                  |           | description: instance running hot     |
   |                  |           | type: threshold                       |
   |                  |           | rule: cpu_util > 70.0 during 3 x 600s |
   | state transition | time1     | state: ok                             |
   | rule change      | time2     | rule: cpu_util > 75.0 during 3 x 600s |
   +------------------+-----------+---------------------------------------+

Alarm deletion
--------------

An alarm that is no longer required can be disabled so that it is no
longer actively evaluated:

.. code-block:: console

   $ aodh alarm update --enabled False -a ALARM_ID

or even deleted permanently (an irreversible step):

.. code-block:: console

   $ aodh alarm delete ALARM_ID

@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi.

   install/index
   contributor/index
   admin/index

.. toctree::
   :maxdepth: 1
