Merge "copy admin-guide"

This commit is contained in:
Jenkins 2017-07-13 17:53:26 +00:00 committed by Gerrit Code Review
commit 1b78978cba
3 changed files with 351 additions and 0 deletions


@@ -0,0 +1,7 @@
==========================
Telemetry Alarming service
==========================

.. toctree::

   telemetry-alarms.rst


@@ -0,0 +1,343 @@
.. _telemetry-alarms:

======
Alarms
======
Alarms provide user-oriented Monitoring-as-a-Service for resources
running on OpenStack. This type of monitoring ensures you can
automatically scale in or out a group of instances through the
Orchestration service, but you can also use alarms for general-purpose
awareness of your cloud resources' health.
These alarms follow a tri-state model:

ok
  The rule governing the alarm has been evaluated as ``False``.

alarm
  The rule governing the alarm has been evaluated as ``True``.

insufficient data
  There are not enough datapoints available in the evaluation periods
  to meaningfully determine the alarm state.
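
As an illustrative sketch (assuming an alarm already exists and
``ALARM_ID`` stands in for its UUID), the current state of a single
alarm can be queried directly with the ``aodh`` client:

.. code-block:: console

   $ aodh alarm state get ALARM_ID

Any of the three states above may be reported; listing alarms, as shown
later in this guide, displays the state of every alarm at once.
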
Alarm definitions
~~~~~~~~~~~~~~~~~
The definition of an alarm provides the rules that govern when a state
transition should occur, and the actions to be taken thereon. The
nature of these rules depends on the alarm type.
Threshold rule alarms
---------------------
For conventional threshold-oriented alarms, state transitions are
governed by:

* A static threshold value with a comparison operator such as greater
  than or less than.
* A statistic selection to aggregate the data.
* A sliding time window to indicate how far back into the recent past
  you want to look.

Valid threshold alarms are: ``gnocchi_resources_threshold_rule``,
``gnocchi_aggregation_by_metrics_threshold_rule``, or
``gnocchi_aggregation_by_resources_threshold_rule``.
.. note::

   As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's
   native storage API is deprecated.
Composite rule alarms
---------------------
Composite alarms enable users to define an alarm with multiple triggering
conditions, using a combination of ``and`` and ``or`` relations.
Combination rule alarms
-----------------------
.. note::

   Combination alarms were deprecated in the Newton release in favor of
   composite alarms, and combination alarm functionality was removed in
   Pike.
The Telemetry service also supports the concept of a meta-alarm, which
aggregates over the current state of a set of underlying basic alarms
combined via a logical operator (``and`` or ``or``).
Alarm dimensioning
~~~~~~~~~~~~~~~~~~
A key associated concept is the notion of *dimensioning* which
defines the set of matching meters that feed into an alarm
evaluation. Recall that meters are per-resource-instance, so in the
simplest case an alarm might be defined over a particular meter
applied to all resources visible to a particular user. More useful
however would be the option to explicitly select which specific
resources you are interested in alarming on.
At one extreme you might have narrowly dimensioned alarms where this
selection would have only a single target (identified by resource
ID). At the other extreme, you could have widely dimensioned alarms
where this selection identifies many resources over which the
statistic is aggregated. For example, all instances booted from a
particular image or all instances with matching user metadata (the
latter is how the Orchestration service identifies autoscaling
groups).
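
As a hedged sketch of a widely dimensioned alarm (the alarm name,
threshold, and resource query below are illustrative assumptions, not
taken from this guide), an aggregation-by-resources alarm can target
every instance booted from a given image via a Gnocchi resource query:

.. code-block:: console

   $ aodh alarm create --name image_pool_cpu_hi \
     --type gnocchi_aggregation_by_resources_threshold \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method mean \
     --granularity 600 --evaluation-periods 3 \
     --resource-type instance \
     --query '{"=": {"image_ref": "IMAGE_ID"}}' \
     --alarm-action 'log://'

A narrowly dimensioned alarm would instead use the
``gnocchi_resources_threshold`` type with an explicit ``--resource-id``,
as in the creation example later in this guide.
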
Alarm evaluation
~~~~~~~~~~~~~~~~
Alarms are evaluated by the ``alarm-evaluator`` service on a periodic
basis, defaulting to once every minute.
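
The cycle length is governed by the ``evaluation_interval`` option in
``aodh.conf``; a minimal sketch, assuming you want to lengthen the
default 60-second cycle to five minutes:

.. code-block:: ini

   [DEFAULT]
   # Period of each alarm evaluation cycle, in seconds.
   evaluation_interval = 300
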
Alarm actions
-------------
Any state transition of an individual alarm (to ``ok``, ``alarm``, or
``insufficient data``) may have one or more actions associated with
it. These actions effectively send a signal to a consumer that the
state transition has occurred, and provide some additional context.
This includes the new and previous states, with some reason data
describing the disposition with respect to the threshold, the number
of datapoints involved, and the most recent of these. State transitions
are detected by the ``alarm-evaluator``, whereas the
``alarm-notifier`` effects the actual notification action.
**Webhooks**
  These are the *de facto* notification type used by Telemetry alarming
  and simply involve an HTTP POST request being sent to an endpoint,
  with a request body containing a description of the state transition
  encoded as a JSON fragment.

**Log actions**
  These are a lightweight alternative to webhooks, whereby the state
  transition is simply logged by the ``alarm-notifier``, and are
  intended primarily for testing purposes.
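
For reference, the request body delivered by a webhook action is a small
JSON document broadly of the following shape (the values shown are
illustrative, and the exact ``reason_data`` content varies with the
alarm type and release):

.. code-block:: json

   {
       "alarm_name": "cpu_hi",
       "alarm_id": "ALARM_ID",
       "severity": "low",
       "previous": "ok",
       "current": "alarm",
       "reason": "Transition to alarm due to 3 samples outside threshold",
       "reason_data": {"type": "threshold", "disposition": "outside"}
   }
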
Workload partitioning
---------------------
The alarm evaluation process uses the same mechanism for workload
partitioning as the central and compute agents. The
`Tooz <https://pypi.python.org/pypi/tooz>`_ library provides the
coordination within the groups of service instances. For further
information about this approach, see the `high availability guide
<https://docs.openstack.org/ha-guide/controller-ha-telemetry.html>`_.
To use this workload partitioning solution, set the
``evaluation_service`` option to ``default``. For more
information, see the alarm section in the
`OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_.
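
A minimal configuration sketch for enabling coordination, assuming a
Redis server is available to act as the Tooz backend (the URL and
hostname below are illustrative assumptions):

.. code-block:: ini

   [coordination]
   # Tooz backend URL shared by all alarm-evaluator instances in the group.
   backend_url = redis://controller:6379
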
Using alarms
~~~~~~~~~~~~
Alarm creation
--------------
An example of creating a Gnocchi threshold-oriented alarm, based on an upper
bound on the CPU utilization for a particular instance:
.. code-block:: console

   $ aodh alarm create --name cpu_hi \
     --type gnocchi_resources_threshold \
     --description 'instance running hot' \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method mean \
     --granularity 600 --evaluation-periods 3 \
     --alarm-action 'log://' --resource-id INSTANCE_ID
This creates an alarm that will fire when the average CPU utilization
for an individual instance exceeds 70% for three consecutive 10
minute periods. The notification in this case is simply a log message,
though it could alternatively be a webhook URL.
.. note::

   Alarm names must be unique for the alarms associated with an
   individual project. The administrator can limit the maximum number
   of resulting actions for the three different states, and the
   ability for a normal user to create ``log://`` and ``test://``
   notifiers is disabled. This prevents unintentional consumption of
   disk and memory resources by the Telemetry service.
The sliding time window over which the alarm is evaluated is 30
minutes in this example. This window is not clamped to wall-clock
time boundaries; rather, it is anchored on the current time for each
evaluation cycle, and continually creeps forward as each evaluation
cycle rolls around (by default, this occurs every minute).
.. note::

   The alarm granularity must match the granularities of the metric
   configured in Gnocchi. Otherwise the alarm will tend to flit in and
   out of the ``insufficient data`` state due to the mismatch between
   the actual frequency of datapoints in the metering store and the
   statistics queries used to compare against the alarm threshold. If a
   shorter alarm period is needed, then the corresponding interval
   should be adjusted in the ``pipeline.yaml`` file.
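
One way to check which granularities are actually available is to look
up the metric and its archive policy with the Gnocchi client; a sketch,
assuming the metric uses the ``low`` archive policy:

.. code-block:: console

   $ gnocchi metric show cpu_util --resource-id INSTANCE_ID
   $ gnocchi archive-policy show low
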
Other notable alarm attributes that may be set on creation, or via a
subsequent update, include:

state
  The initial alarm state (defaults to ``insufficient data``).

description
  A free-text description of the alarm (defaults to a synopsis of the
  alarm rule).

enabled
  True if evaluation and actioning is to be enabled for this alarm
  (defaults to ``True``).

repeat-actions
  True if actions should be repeatedly notified while the alarm
  remains in the target state (defaults to ``False``).

ok-action
  An action to invoke when the alarm state transitions to ``ok``.

insufficient-data-action
  An action to invoke when the alarm state transitions to
  ``insufficient data``.

time-constraint
  Used to restrict evaluation of the alarm to certain times of the
  day or days of the week (expressed as a ``cron`` expression with an
  optional timezone); see the sketch after this list.
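
As a sketch of the ``time-constraint`` attribute (the constraint name,
cron expression, duration, and timezone are illustrative assumptions,
and the quoting may need adjusting for your shell), evaluation of an
alarm can be limited to weekday office hours at creation time:

.. code-block:: console

   $ aodh alarm create --name cpu_hi_office_hours \
     --type gnocchi_resources_threshold \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method mean \
     --granularity 600 --evaluation-periods 3 \
     --alarm-action 'log://' --resource-id INSTANCE_ID \
     --time-constraint "name=office-hours;start='0 9 * * 1-5';duration=32400;timezone=Europe/Paris"
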
An example of creating a composite alarm, based on the combined
state of two underlying alarm conditions:
.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or": [{"threshold": 0.8, "metric": "cpu_util",
       "type": "gnocchi_resources_threshold", "resource_id": INSTANCE_ID,
       "aggregation_method": "last"}, {"threshold": 0.8, "metric": "cpu_util",
       "type": "gnocchi_resources_threshold", "resource_id": INSTANCE_ID2,
       "aggregation_method": "last"}]}' \
     --alarm-action 'http://example.org/notify'
This creates an alarm that will fire when either one of two underlying
alarms transitions into the alarm state. The notification in this case
is a webhook call. Any number of underlying alarms can be combined in
this way, using either ``and`` or ``or``. Additionally, combinations
can contain nested conditions:
.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or": [ALARM_1, {"and": [ALARM_2, ALARM_3]}]}' \
     --alarm-action 'http://example.org/notify'
Alarm retrieval
---------------
You can display all your alarms with the following command (some
attributes are omitted for brevity):
.. code-block:: console

   $ aodh alarm list
   +----------+-----------+--------+-------------------+----------+---------+
   | Alarm ID | Type      | Name   | State             | Severity | Enabled |
   +----------+-----------+--------+-------------------+----------+---------+
   | ALARM_ID | threshold | cpu_hi | insufficient data | high     | True    |
   +----------+-----------+--------+-------------------+----------+---------+
In this case, the state is reported as ``insufficient data`` which
could indicate that:

* meters have not yet been gathered about this instance over the
  evaluation window into the recent past (for example a brand-new
  instance)
* *or*, that the identified instance is not visible to the
  user/project owning the alarm
* *or*, simply that an alarm evaluation cycle hasn't kicked off since
  the alarm was created (by default, alarms are evaluated once per
  minute).
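
To dig into the cause, the full definition and current attributes of the
alarm can be inspected with the ``aodh`` client, where ``ALARM_ID`` is
the identifier reported by the listing above:

.. code-block:: console

   $ aodh alarm show ALARM_ID
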
.. note::

   The visibility of alarms depends on the role and project
   associated with the user issuing the query:

   * admin users see *all* alarms, regardless of the owner
   * non-admin users see only the alarms associated with their project
     (as per the normal project segregation in OpenStack)
Alarm update
------------
Once the state of the alarm has settled down, we might decide that we
set that bar too low at 70%, in which case the threshold (or almost any
other alarm attribute) can be updated as follows:
.. code-block:: console

   $ aodh alarm update ALARM_ID --threshold 75
The change will take effect from the next evaluation cycle, which by
default occurs every minute.
Most alarm attributes can be changed in this way, but there is also
a convenient short-cut for getting and setting the alarm state:
.. code-block:: console

   $ aodh alarm state get ALARM_ID
   $ aodh alarm state set --state ok ALARM_ID
Over time the state of the alarm may change often, especially if the
threshold is chosen to be close to the trending value of the
statistic. You can follow the history of an alarm over its lifecycle
via the audit API:
.. code-block:: console

   $ aodh alarm-history show ALARM_ID
   +------------------+-----------+---------------------------------------+
   | Type             | Timestamp | Detail                                |
   +------------------+-----------+---------------------------------------+
   | creation         | time0     | name: cpu_hi                          |
   |                  |           | description: instance running hot     |
   |                  |           | type: threshold                       |
   |                  |           | rule: cpu_util > 70.0 during 3 x 600s |
   | state transition | time1     | state: ok                             |
   | rule change      | time2     | rule: cpu_util > 75.0 during 3 x 600s |
   +------------------+-----------+---------------------------------------+
Alarm deletion
--------------
An alarm that is no longer required can be disabled so that it is no
longer actively evaluated:
.. code-block:: console

   $ aodh alarm update --enabled False ALARM_ID
or even deleted permanently (an irreversible step):
.. code-block:: console

   $ aodh alarm delete ALARM_ID


@@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi.
install/index
contributor/index
admin/index
.. toctree::
:maxdepth: 1