From c9f2c43dae032febdbe30b9a98084dc4ef19af16 Mon Sep 17 00:00:00 2001 From: gord chung Date: Wed, 21 Jun 2017 20:07:49 +0000 Subject: [PATCH] copy admin-guide copy alarms admin-guide from openstack-manuals Change-Id: Ib52147918db7c3d5ee5ba504b61657aa483abee2 --- doc/source/admin/index.rst | 7 + doc/source/admin/telemetry-alarms.rst | 343 ++++++++++++++++++++++++++ doc/source/index.rst | 1 + 3 files changed, 351 insertions(+) create mode 100644 doc/source/admin/index.rst create mode 100644 doc/source/admin/telemetry-alarms.rst diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst new file mode 100644 index 00000000..0f2d5b0c --- /dev/null +++ b/doc/source/admin/index.rst @@ -0,0 +1,7 @@ +========================== +Telemetry Alarming service +========================== + +.. toctree:: + + telemetry-alarms.rst diff --git a/doc/source/admin/telemetry-alarms.rst b/doc/source/admin/telemetry-alarms.rst new file mode 100644 index 00000000..e6afd2de --- /dev/null +++ b/doc/source/admin/telemetry-alarms.rst @@ -0,0 +1,343 @@ +.. _telemetry-alarms: + +====== +Alarms +====== + +Alarms provide user-oriented Monitoring-as-a-Service for resources +running on OpenStack. This type of monitoring ensures you can +automatically scale in or out a group of instances through the +Orchestration service, but you can also use alarms for general-purpose +awareness of your cloud resources' health. + +These alarms follow a tri-state model: + +ok + The rule governing the alarm has been evaluated as ``False``. + +alarm + The rule governing the alarm have been evaluated as ``True``. + +insufficient data + There are not enough datapoints available in the evaluation periods + to meaningfully determine the alarm state. + +Alarm definitions +~~~~~~~~~~~~~~~~~ + +The definition of an alarm provides the rules that govern when a state +transition should occur, and the actions to be taken thereon. The +nature of these rules depend on the alarm type. + +Threshold rule alarms +--------------------- + +For conventional threshold-oriented alarms, state transitions are +governed by: + +* A static threshold value with a comparison operator such as greater + than or less than. + +* A statistic selection to aggregate the data. + +* A sliding time window to indicate how far back into the recent past + you want to look. + +Valid threshold alarms are: ``gnocchi_resources_threshold_rule``, +``gnocchi_aggregation_by_metrics_threshold_rule``, or +``gnocchi_aggregation_by_resources_threshold_rule``. + +.. note:: + + As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's + native storage API is deprecated. + +Composite rule alarms +--------------------- + +Composite alarms enable users to define an alarm with multiple triggering +conditions, using a combination of ``and`` and ``or`` relations. + + +Combination rule alarms +----------------------- + +.. note:: + + Combination alarms are deprecated as of Newton for composite alarms. + Combination alarm functionality is removed in Pike. + +The Telemetry service also supports the concept of a meta-alarm, which +aggregates over the current state of a set of underlying basic alarms +combined via a logical operator (``and`` or ``or``). + +Alarm dimensioning +~~~~~~~~~~~~~~~~~~ + +A key associated concept is the notion of *dimensioning* which +defines the set of matching meters that feed into an alarm +evaluation. Recall that meters are per-resource-instance, so in the +simplest case an alarm might be defined over a particular meter +applied to all resources visible to a particular user. More useful +however would be the option to explicitly select which specific +resources you are interested in alarming on. + +At one extreme you might have narrowly dimensioned alarms where this +selection would have only a single target (identified by resource +ID). At the other extreme, you could have widely dimensioned alarms +where this selection identifies many resources over which the +statistic is aggregated. For example all instances booted from a +particular image or all instances with matching user metadata (the +latter is how the Orchestration service identifies autoscaling +groups). + +Alarm evaluation +~~~~~~~~~~~~~~~~ + +Alarms are evaluated by the ``alarm-evaluator`` service on a periodic +basis, defaulting to once every minute. + +Alarm actions +------------- + +Any state transition of individual alarm (to ``ok``, ``alarm``, or +``insufficient data``) may have one or more actions associated with +it. These actions effectively send a signal to a consumer that the +state transition has occurred, and provide some additional context. +This includes the new and previous states, with some reason data +describing the disposition with respect to the threshold, the number +of datapoints involved and most recent of these. State transitions +are detected by the ``alarm-evaluator``, whereas the +``alarm-notifier`` effects the actual notification action. + +**Webhooks** + +These are the *de facto* notification type used by Telemetry alarming +and simply involve an HTTP POST request being sent to an endpoint, +with a request body containing a description of the state transition +encoded as a JSON fragment. + +**Log actions** + +These are a lightweight alternative to webhooks, whereby the state +transition is simply logged by the ``alarm-notifier``, and are +intended primarily for testing purposes. + +Workload partitioning +--------------------- + +The alarm evaluation process uses the same mechanism for workload +partitioning as the central and compute agents. The +`Tooz `_ library provides the +coordination within the groups of service instances. For further +information about this approach, see the `high availability guide +`_. + +To use this workload partitioning solution set the +``evaluation_service`` option to ``default``. For more +information, see the alarm section in the +`OpenStack Configuration Reference `_. + +Using alarms +~~~~~~~~~~~~ + +Alarm creation +-------------- + +An example of creating a Gnocchi threshold-oriented alarm, based on an upper +bound on the CPU utilization for a particular instance: + +.. code-block:: console + + $ aodh alarm create --name cpu_hi \ + --type gnocchi_resources_threshold \ + --description 'instance running hot' \ + --metric cpu_util --threshold 70.0 \ + --comparison-operator gt --aggregation_method avg \ + --granularity 600 --evaluation-periods 3 \ + --alarm-action 'log://' --resource_id INSTANCE_ID + +This creates an alarm that will fire when the average CPU utilization +for an individual instance exceeds 70% for three consecutive 10 +minute periods. The notification in this case is simply a log message, +though it could alternatively be a webhook URL. + +.. note:: + + Alarm names must be unique for the alarms associated with an + individual project. Administrator can limit the maximum + resulting actions for three different states, and the + ability for a normal user to create ``log://`` and ``test://`` + notifiers is disabled. This prevents unintentional + consumption of disk and memory resources by the + Telemetry service. + +The sliding time window over which the alarm is evaluated is 30 +minutes in this example. This window is not clamped to wall-clock +time boundaries, rather it's anchored on the current time for each +evaluation cycle, and continually creeps forward as each evaluation +cycle rolls around (by default, this occurs every minute). + +.. note:: + + The alarm granularity must match the granularities of the metric configured + in Gnocchi. + +Otherwise the alarm will tend to flit in and out of the +``insufficient data`` state due to the mismatch between the actual +frequency of datapoints in the metering store and the statistics +queries used to compare against the alarm threshold. If a shorter +alarm period is needed, then the corresponding interval should be +adjusted in the ``pipeline.yaml`` file. + +Other notable alarm attributes that may be set on creation, or via a +subsequent update, include: + +state + The initial alarm state (defaults to ``insufficient data``). + +description + A free-text description of the alarm (defaults to a synopsis of the + alarm rule). + +enabled + True if evaluation and actioning is to be enabled for this alarm + (defaults to ``True``). + +repeat-actions + True if actions should be repeatedly notified while the alarm + remains in the target state (defaults to ``False``). + +ok-action + An action to invoke when the alarm state transitions to ``ok``. + +insufficient-data-action + An action to invoke when the alarm state transitions to + ``insufficient data``. + +time-constraint + Used to restrict evaluation of the alarm to certain times of the + day or days of the week (expressed as ``cron`` expression with an + optional timezone). + +An example of creating a combination alarm, based on the combined +state of two underlying alarms: + +.. code-block:: console + + $ aodh alarm create --name meta --type composite \ + --composite-rule '{"or":[{"threshold": 0.8,"metric": "cpu_util", "type": \ + "gnocchi_resources_threshold", "resource_id": INSTANCE_ID, \ + "aggregation-method": "last"},{"threshold": 0.8,"metric": "cpu_util", \ + "type": "gnocchi_resources_threshold", "resource_id": INSTANCE_ID2, \ + "aggregation-method": "last"}]}' \ + --alarm-action 'http://example.org/notify' + +This creates an alarm that will fire when either one of two underlying +alarms transition into the alarm state. The notification in this case +is a webhook call. Any number of underlying alarms can be combined in +this way, using either ``and`` or ``or``. Additionally, combinations +can contain nested conditions: + +.. code-block:: console + + $ aodh alarm create --name meta --type composite \ + --composite-rule '{"or":[ALARM_1, {"and":[ALARM2, ALARM3]}]}' + --alarm-action 'http://example.org/notify' + + +Alarm retrieval +--------------- + +You can display all your alarms via (some attributes are omitted for +brevity): + +.. code-block:: console + + $ aodh alarm list + +----------+-----------+--------+-------------------+----------+---------+ + | Alarm ID | Type | Name | State | Severity | Enabled | + +----------+-----------+--------+-------------------+----------+---------+ + | ALARM_ID | threshold | cpu_hi | insufficient data | high | True | + +----------+-----------+--------+-------------------+----------+---------+ + +In this case, the state is reported as ``insufficient data`` which +could indicate that: + +* meters have not yet been gathered about this instance over the + evaluation window into the recent past (for example a brand-new + instance) + +* *or*, that the identified instance is not visible to the + user/project owning the alarm + +* *or*, simply that an alarm evaluation cycle hasn't kicked off since + the alarm was created (by default, alarms are evaluated once per + minute). + +.. note:: + + The visibility of alarms depends on the role and project + associated with the user issuing the query: + + * admin users see *all* alarms, regardless of the owner + + * non-admin users see only the alarms associated with their project + (as per the normal project segregation in OpenStack) + +Alarm update +------------ + +Once the state of the alarm has settled down, we might decide that we +set that bar too low with 70%, in which case the threshold (or most +any other alarm attribute) can be updated thusly: + +.. code-block:: console + + $ aodh alarm update ALARM_ID --threshold 75 + +The change will take effect from the next evaluation cycle, which by +default occurs every minute. + +Most alarm attributes can be changed in this way, but there is also +a convenient short-cut for getting and setting the alarm state: + +.. code-block:: console + + $ openstack alarm state get ALARM_ID + $ openstack alarm state set --state ok ALARM_ID + +Over time the state of the alarm may change often, especially if the +threshold is chosen to be close to the trending value of the +statistic. You can follow the history of an alarm over its lifecycle +via the audit API: + +.. code-block:: console + + $ aodh alarm-history show ALARM_ID + +------------------+-----------+---------------------------------------+ + | Type | Timestamp | Detail | + +------------------+-----------+---------------------------------------+ + | creation | time0 | name: cpu_hi | + | | | description: instance running hot | + | | | type: threshold | + | | | rule: cpu_util > 70.0 during 3 x 600s | + | state transition | time1 | state: ok | + | rule change | time2 | rule: cpu_util > 75.0 during 3 x 600s | + +------------------+-----------+---------------------------------------+ + +Alarm deletion +-------------- + +An alarm that is no longer required can be disabled so that it is no +longer actively evaluated: + +.. code-block:: console + + $ aodh alarm update --enabled False -a ALARM_ID + +or even deleted permanently (an irreversible step): + +.. code-block:: console + + $ aodh alarm delete ALARM_ID diff --git a/doc/source/index.rst b/doc/source/index.rst index 5118e7e6..6d14b2b5 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi. install/index contributor/index + admin/index .. toctree:: :maxdepth: 1