Merge "copy admin-guide"

2017-07-13 17:53:26 +00:00 · 2017-07-13 17:53:26 +00:00 · 1b78978cba
parent 990414712f c9f2c43dae
commit 1b78978cba
3 changed files with 351 additions and 0 deletions
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@@ -0,0 +1,7 @@
 ==========================
 Telemetry Alarming service
 ==========================
 .. toctree::
   telemetry-alarms.rst
--- a/doc/source/admin/telemetry-alarms.rst
+++ b/doc/source/admin/telemetry-alarms.rst
@@ -0,0 +1,343 @@
 .. _telemetry-alarms:
 ======
 Alarms
 ======
 Alarms provide user-oriented Monitoring-as-a-Service for resources
 running on OpenStack. This type of monitoring ensures you can
 automatically scale in or out a group of instances through the
 Orchestration service, but you can also use alarms for general-purpose
 awareness of your cloud resources' health.
 These alarms follow a tri-state model:
 ok
  The rule governing the alarm has been evaluated as ``False``.
 alarm
  The rule governing the alarm has been evaluated as ``True``.
 insufficient data
  There are not enough datapoints available in the evaluation periods
  to meaningfully determine the alarm state.
 Alarm definitions
 ~~~~~~~~~~~~~~~~~
 The definition of an alarm provides the rules that govern when a state
 transition should occur, and the actions to be taken when it does. The
 nature of these rules depends on the alarm type.
 Threshold rule alarms
 ---------------------
 For conventional threshold-oriented alarms, state transitions are
 governed by:
 * A static threshold value with a comparison operator such as greater
  than or less than.
 * A statistic selection to aggregate the data.
 * A sliding time window to indicate how far back into the recent past
  you want to look.
 Valid threshold alarm types are ``gnocchi_resources_threshold``,
 ``gnocchi_aggregation_by_metrics_threshold``, and
 ``gnocchi_aggregation_by_resources_threshold``.
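 The three ingredients listed above (threshold with comparison operator,
 statistic, and sliding window) can be sketched as follows. This is a
 simplified illustration, not the evaluator's actual code: the function
 name ``evaluate_threshold`` and the shape of ``periods`` are assumptions
 made for this example, and any non-breaching period is reported as
 ``ok``, which a real evaluator may handle differently.

```python
import operator
from statistics import mean
from typing import Callable, Dict, Optional, Sequence

# Comparison operators by short name ("gt" is the spelling used on the CLI).
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}


def evaluate_threshold(
    periods: Sequence[Optional[Sequence[float]]],
    threshold: float,
    comparison: str = "gt",
    statistic: Callable[[Sequence[float]], float] = mean,
) -> str:
    """Return "ok", "alarm" or "insufficient data" for one sliding window.

    ``periods`` holds the datapoints of each evaluation period in the
    window, oldest first. An empty or missing period means no data.
    (Hypothetical input shape, chosen for this sketch only.)
    """
    if not periods or any(not p for p in periods):
        return "insufficient data"
    compare = COMPARATORS[comparison]
    # The alarm fires only if every period in the window breaches.
    breached = all(compare(statistic(p), threshold) for p in periods)
    return "alarm" if breached else "ok"
```

 For instance, ``evaluate_threshold([[80, 90], [75], [71, 72]], 70.0)``
 aggregates each period with the mean and compares every result against
 the threshold using ``gt``.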
 .. note::
  As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's
  native storage API is deprecated.
 Composite rule alarms
 ---------------------
 Composite alarms enable users to define an alarm with multiple triggering
 conditions, using a combination of ``and`` and ``or`` relations.
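 A minimal sketch of how such ``and``/``or`` relations could be combined,
 assuming each triggering condition has already been reduced to a boolean.
 The function ``evaluate_composite`` and the leaf names are hypothetical
 and only illustrate the idea; they are not part of the service's API.

```python
from typing import Mapping, Union

# A rule is either a leaf name (looked up in ``leaf_results``) or a
# one-key mapping {"and": [...]} / {"or": [...]} holding nested rules.
Rule = Union[str, Mapping[str, list]]


def evaluate_composite(rule: Rule, leaf_results: Mapping[str, bool]) -> bool:
    """Recursively combine leaf results with ``and`` / ``or``."""
    if isinstance(rule, str):
        return leaf_results[rule]
    if len(rule) != 1:
        raise ValueError("a composite node must have exactly one key")
    relation, children = next(iter(rule.items()))
    results = [evaluate_composite(child, leaf_results) for child in children]
    if relation == "and":
        return all(results)
    if relation == "or":
        return any(results)
    raise ValueError(f"unknown relation: {relation!r}")
```

 With ``rule = {"or": ["A", {"and": ["B", "C"]}]}``, the result is true
 when ``A`` holds, or when both ``B`` and ``C`` hold.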
 Combination rule alarms
 -----------------------
 .. note::
   Combination alarms are deprecated as of Newton in favor of composite
   alarms. Combination alarm functionality is removed in Pike.
 The Telemetry service also supports the concept of a meta-alarm, which
 aggregates over the current state of a set of underlying basic alarms
 combined via a logical operator (``and`` or ``or``).
 Alarm dimensioning
 ~~~~~~~~~~~~~~~~~~
 A key associated concept is the notion of *dimensioning*, which
 defines the set of matching meters that feed into an alarm
 evaluation. Recall that meters are per-resource-instance, so in the
 simplest case an alarm might be defined over a particular meter
 applied to all resources visible to a particular user. More useful,
 however, is the option to explicitly select which specific
 resources you are interested in alarming on.
 At one extreme you might have narrowly dimensioned alarms where this
 selection would have only a single target (identified by resource
 ID). At the other extreme, you could have widely dimensioned alarms
 where this selection identifies many resources over which the
 statistic is aggregated, for example all instances booted from a
 particular image or all instances with matching user metadata (the
 latter is how the Orchestration service identifies autoscaling
 groups).
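 The difference between narrow and wide dimensioning can be illustrated
 with a small sketch. The sample records, the ``group`` metadata key, and
 the ``select`` helper are all hypothetical; they only show how a
 selection by resource ID yields one target while a selection by metadata
 yields several values to aggregate.

```python
from statistics import mean

# Hypothetical samples: one latest value per resource.
SAMPLES = [
    {"resource_id": "vm-1", "metadata": {"group": "web"}, "value": 40.0},
    {"resource_id": "vm-2", "metadata": {"group": "web"}, "value": 80.0},
    {"resource_id": "vm-3", "metadata": {"group": "db"}, "value": 10.0},
]


def select(samples, resource_id=None, metadata=None):
    """Return the values of the samples matching the given dimensions."""
    metadata = metadata or {}
    return [
        s["value"]
        for s in samples
        if (resource_id is None or s["resource_id"] == resource_id)
        and all(s["metadata"].get(k) == v for k, v in metadata.items())
    ]


# Narrow: a single target identified by resource ID.
narrow = select(SAMPLES, resource_id="vm-2")

# Wide: every resource with matching metadata, aggregated with a statistic.
wide_mean = mean(select(SAMPLES, metadata={"group": "web"}))
```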
 Alarm evaluation
 ~~~~~~~~~~~~~~~~
 Alarms are evaluated by the ``alarm-evaluator`` service on a periodic
 basis, defaulting to once every minute.
 Alarm actions
 -------------
 Any state transition of an individual alarm (to ``ok``, ``alarm``, or
 ``insufficient data``) may have one or more actions associated with
 it. These actions effectively send a signal to a consumer that the
 state transition has occurred, and provide some additional context.
 This includes the new and previous states, with some reason data
 describing the disposition with respect to the threshold, the number
 of datapoints involved, and the most recent of these. State transitions
 are detected by the ``alarm-evaluator``, whereas the
 ``alarm-notifier`` effects the actual notification action.
 **Webhooks**
 These are the *de facto* notification type used by Telemetry alarming
 and simply involve an HTTP POST request being sent to an endpoint,
 with a request body containing a description of the state transition
 encoded as a JSON fragment.
 **Log actions**
 These are a lightweight alternative to webhooks, whereby the state
 transition is simply logged by the ``alarm-notifier``, and are
 intended primarily for testing purposes.
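 The two action types can be sketched as a dispatch on the URL scheme of
 the action. This is an illustration under stated assumptions, not the
 notifier's implementation: ``build_body`` and ``prepare_action`` are
 hypothetical helpers, and the JSON field names are placeholders rather
 than the exact payload schema. The webhook request is only constructed
 here, not sent.

```python
import json
import logging
from urllib.parse import urlsplit
from urllib.request import Request

LOG = logging.getLogger("alarm-notifier-sketch")


def build_body(alarm_id, previous, current, reason):
    """Encode a state transition as JSON (illustrative field names)."""
    payload = {
        "alarm_id": alarm_id,
        "previous": previous,
        "current": current,
        "reason": reason,
    }
    return json.dumps(payload).encode("utf-8")


def prepare_action(action_url, body):
    """Return a POST request for a webhook, or None after logging."""
    scheme = urlsplit(action_url).scheme
    if scheme in ("http", "https"):
        return Request(
            action_url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    if scheme == "log":
        LOG.info("alarm state transition: %s", body.decode("utf-8"))
        return None
    raise ValueError(f"unsupported action scheme: {scheme!r}")
```

 A caller would pass the returned request to ``urllib.request.urlopen``
 to deliver the webhook.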
 Workload partitioning
 ---------------------
 The alarm evaluation process uses the same mechanism for workload
 partitioning as the central and compute agents. The
 `Tooz <https://pypi.python.org/pypi/tooz>`_ library provides the
 coordination within the groups of service instances. For further
 information about this approach, see the `high availability guide
 <https://docs.openstack.org/ha-guide/controller-ha-telemetry.html>`_.
 To use this workload partitioning solution, set the
 ``evaluation_service`` option to ``default``. For more
 information, see the alarm section in the
 `OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_.
 Using alarms
 ~~~~~~~~~~~~
 Alarm creation
 --------------
 The following example creates a Gnocchi threshold-oriented alarm based on
 an upper bound on the CPU utilization for a particular instance:
 .. code-block:: console
   $ aodh alarm create --name cpu_hi \
     --type gnocchi_resources_threshold \
     --description 'instance running hot' \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method mean \
     --granularity 600 --evaluation-periods 3 \
     --alarm-action 'log://' --resource-id INSTANCE_ID \
     --resource-type instance
 This creates an alarm that fires when the average CPU utilization
 for an individual instance exceeds 70% for three consecutive 10-minute
 periods. The notification in this case is simply a log message,
 though it could alternatively be a webhook URL.
 .. note::
    Alarm names must be unique for the alarms associated with an
    individual project. Administrators can limit the maximum number of
    resulting actions for each of the three states, and the
    ability of a normal user to create ``log://`` and ``test://``
    notifiers is disabled. This prevents unintentional
    consumption of disk and memory resources by the
    Telemetry service.
 The sliding time window over which the alarm is evaluated is 30
 minutes in this example (three evaluation periods of 600 seconds each).
 This window is not clamped to wall-clock time boundaries. Instead, it
 is anchored on the current time for each evaluation cycle, and
 continually creeps forward as each evaluation cycle rolls around (by
 default, this occurs every minute).
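 The window arithmetic can be checked with a short sketch. The helper
 ``evaluation_window`` is hypothetical; it only shows that a 600-second
 granularity with three evaluation periods gives a 30-minute window that
 ends at the current time and moves forward with each cycle.

```python
from datetime import datetime, timedelta, timezone


def evaluation_window(now, granularity, evaluation_periods):
    """Return (start, end) of the window anchored on ``now``.

    ``granularity`` is in seconds, as in the CLI example.
    """
    length = timedelta(seconds=granularity * evaluation_periods)
    return now - length, now


# Two consecutive evaluation cycles, one minute apart.
first_cycle = datetime(2017, 7, 13, 17, 53, 0, tzinfo=timezone.utc)
second_cycle = first_cycle + timedelta(minutes=1)

start_1, end_1 = evaluation_window(first_cycle, 600, 3)
start_2, end_2 = evaluation_window(second_cycle, 600, 3)
```

 Both windows span 30 minutes, and the second one starts and ends one
 minute later than the first.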
 .. note::
   The alarm granularity must match the granularities of the metric configured
   in Gnocchi.
 Otherwise the alarm will tend to flit in and out of the
 ``insufficient data`` state due to the mismatch between the actual
 frequency of datapoints in the metering store and the statistics
 queries used to compare against the alarm threshold. If a shorter
 alarm period is needed, then the corresponding interval should be
 adjusted in the ``pipeline.yaml`` file.
 Other notable alarm attributes that may be set on creation, or via a
 subsequent update, include:
 state
  The initial alarm state (defaults to ``insufficient data``).
 description
  A free-text description of the alarm (defaults to a synopsis of the
  alarm rule).
 enabled
  True if evaluation and actioning are to be enabled for this alarm
  (defaults to ``True``).
 repeat-actions
  True if actions should be repeatedly notified while the alarm
  remains in the target state (defaults to ``False``).
 ok-action
  An action to invoke when the alarm state transitions to ``ok``.
 insufficient-data-action
  An action to invoke when the alarm state transitions to
  ``insufficient data``.
 time-constraint
  Used to restrict evaluation of the alarm to certain times of the
  day or days of the week (expressed as a ``cron`` expression with an
  optional timezone).
 The following example creates a composite alarm based on the combined
 state of two underlying threshold rules:
 .. code-block:: console
   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or": [
       {"threshold": 0.8, "metric": "cpu_util",
        "type": "gnocchi_resources_threshold",
        "resource_id": "INSTANCE_ID",
        "aggregation_method": "last"},
       {"threshold": 0.8, "metric": "cpu_util",
        "type": "gnocchi_resources_threshold",
        "resource_id": "INSTANCE_ID2",
        "aggregation_method": "last"}]}' \
     --alarm-action 'http://example.org/notify'
 This creates an alarm that fires when either of the two underlying
 rules transitions into the alarm state. The notification in this case
 is a webhook call. Any number of underlying rules can be combined in
 this way, using either ``and`` or ``or``. Additionally, composite rules
 can contain nested conditions:
 .. code-block:: console
   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or":[ALARM_1, {"and":[ALARM_2, ALARM_3]}]}' \
     --alarm-action 'http://example.org/notify'
 Alarm retrieval
 ---------------
 You can display all your alarms via (some attributes are omitted for
 brevity):
 .. code-block:: console
   $ aodh alarm list
   +----------+-----------+--------+-------------------+----------+---------+
   | Alarm ID | Type      | Name   | State             | Severity | Enabled |
   +----------+-----------+--------+-------------------+----------+---------+
   | ALARM_ID | threshold | cpu_hi | insufficient data | high     | True    |
   +----------+-----------+--------+-------------------+----------+---------+
 In this case, the state is reported as ``insufficient data`` which
 could indicate that:
 * meters have not yet been gathered about this instance over the
  evaluation window into the recent past (for example a brand-new
  instance)
 * *or*, that the identified instance is not visible to the
  user/project owning the alarm
 * *or*, simply that an alarm evaluation cycle hasn't kicked off since
  the alarm was created (by default, alarms are evaluated once per
  minute).
 .. note::
   The visibility of alarms depends on the role and project
   associated with the user issuing the query:
   * admin users see *all* alarms, regardless of the owner
   * non-admin users see only the alarms associated with their project
     (as per the normal project segregation in OpenStack)
 Alarm update
 ------------
 Once the state of the alarm has settled down, we might decide that we
 set the bar too low at 70%, in which case the threshold (or almost
 any other alarm attribute) can be updated as follows:
 .. code-block:: console
   $ aodh alarm update ALARM_ID --threshold 75
 The change will take effect from the next evaluation cycle, which by
 default occurs every minute.
 Most alarm attributes can be changed in this way, but there is also
 a convenient short-cut for getting and setting the alarm state:
 .. code-block:: console
   $ openstack alarm state get ALARM_ID
   $ openstack alarm state set --state ok ALARM_ID
 Over time the state of the alarm may change often, especially if the
 threshold is chosen to be close to the trending value of the
 statistic. You can follow the history of an alarm over its lifecycle
 via the audit API:
 .. code-block:: console
   $ aodh alarm-history show ALARM_ID
   +------------------+-----------+---------------------------------------+
   | Type             | Timestamp | Detail                                |
   +------------------+-----------+---------------------------------------+
   | creation         | time0     | name: cpu_hi                          |
   |                  |           | description: instance running hot     |
   |                  |           | type: threshold                       |
   |                  |           | rule: cpu_util > 70.0 during 3 x 600s |
   | state transition | time1     | state: ok                             |
   | rule change      | time2     | rule: cpu_util > 75.0 during 3 x 600s |
   +------------------+-----------+---------------------------------------+
 Alarm deletion
 --------------
 An alarm that is no longer required can be disabled so that it is no
 longer actively evaluated:
 .. code-block:: console
   $ aodh alarm update --enabled False ALARM_ID
 or even deleted permanently (an irreversible step):
 .. code-block:: console
   $ aodh alarm delete ALARM_ID
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi.
   install/index
   contributor/index
   admin/index
 .. toctree::
   :maxdepth: 1