Add general Logging/Monitoring/Alerting spec

This adds a spec discussing the general Logging, Monitoring, Alerting architecture for MCP.

Change-Id: I033a6815a304b5f7f18989d60e5781c2565de942
2016-03-30 10:03:42 +02:00 · 9dda25e70b
parent 8129d01324
commit 9dda25e70b
1 changed files with 297 additions and 0 deletions
--- a/specs/general-lma-spec.rst
+++ b/specs/general-lma-spec.rst
@@ -0,0 +1,297 @@
 ==========================================================
 General Logging, Monitoring, Alerting architecture for MCP
 ==========================================================
 [No Jira Epic for this spec]
 This specification describes the general Logging, Monitoring, Alerting
 architecture for MCP (Mirantis Cloud Platform).
 Problem description
 ===================
 Logging, Monitoring and Alerting are key aspects which need to be taken into
 account from the very beginning of the MCP project.
 This specification only describes the general architecture for Logging,
 Monitoring and Alerting. Details on the different parts will be provided in
 more specific specifications.
 In the rest of the document we will use LMA to refer to Logging, Monitoring and
 Alerting.
 Use Cases
 ---------
 The final goal is to provide tools that help OpenStack operators diagnose and
 troubleshoot problems.
 Proposed change
 ===============
 We propose to add LMA components to MCP. The proposed software and architecture
 are based on the current Fuel StackLight product (composed of four Fuel
 plugins), with adjustments and improvements to meet the requirements of MCP.
 General Architecture
 --------------------
 The following diagram describes the general architecture::
        OpenStack nodes
    +-------------------+
    | +-------------------+
    | |            +----+ |
    | | Logs+-+  +-+Snap| |             +-------------+
    | |       |  | +----+ |             |             |
    | |      +v--v+       |      +------>Elasticsearch|
    | |      |Heka+--------------+      |             |
    +-+      +----+       |      |      +-------------+
      +-------------------+      |
                                 |      +-------------+
                                 |      |             |
                                 +------+InfluxDB     |
        k8s master node          |      |             |
    +-------------------+        |      +-------------+
    | +-------------------+      |
    | |            +----+ |      |      +-------------+
    | | Logs+-+  +-+Snap| |   +--+      |             |
    | |       |  | +----+ |   |  +------>Nagios       |
    | |      +v--v+       |   |         |             |
    | |      |Heka+-----------+         +-------------+
    +-+      +----+       |
      +-------------------+
 The boxes on the top-left corner of the diagram represent the nodes where the
 OpenStack services run. The boxes on the bottom-left corner of the diagram
 represent the nodes where the Kubernetes infrastructure services run. The
 boxes on the right of the diagram represent the nodes where the LMA backends
 are run.
 Each node runs two services: Heka and Snap. Although it is not depicted in the
 diagram, Heka and Snap also run on the backend nodes, where we also want to
 collect logs and telemetry data.
 `Snap`_ is the telemetry framework created by Intel that we will use in MCP for
 collecting telemetry data (CPU usage, etc.). The current StackLight product
 uses Collectd instead of Snap, so this is an area where StackLight and MCP will
 differ. The telemetry data collected by Snap will be sent to Heka.
 `Heka`_ is a stream processing software created and maintained by Mozilla. We
 will use Heka for collecting logs and notifications, deriving new metrics from
 the telemetry data received from Snap, and sending the results to
 Elasticsearch, InfluxDB and Nagios.
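
 The fan-out described above can be pictured with a small router. This is an
 illustrative model only, not Heka configuration: the message types (``log``,
 ``notification``, ``metric``, ``status``) and the ``route`` function are
 assumptions introduced for this sketch.

```python
from collections import defaultdict

# Hypothetical mapping of message type to backend, following the text above:
# logs and notifications are indexed, telemetry is stored as time series,
# and status messages go to the alerting backend.
ROUTES = {
    "log": "elasticsearch",
    "notification": "elasticsearch",
    "metric": "influxdb",
    "status": "nagios",
}


def route(messages):
    """Group messages by destination backend; unknown types are dropped."""
    outboxes = defaultdict(list)
    for message in messages:
        backend = ROUTES.get(message.get("type"))
        if backend is not None:
            outboxes[backend].append(message)
    return dict(outboxes)
```

 In the real system this routing would be expressed through Heka's own
 configuration rather than application code.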
 `Elasticsearch`_ will be used for indexing logs and notifications, and
 `Kibana`_ will be used for visualizing the data indexed in Elasticsearch.
 Default Kibana dashboards will be shipped in MCP.
 `InfluxDB`_ is a database optimized for time-series data. It will be used for
 storing the telemetry data, and `Grafana`_ will be used for visualizing the
 telemetry data stored in InfluxDB. Default Grafana dashboards will be shipped
 in MCP.
 `Nagios`_ is a full-featured monitoring tool. In MCP we may use it for
 handling status messages sent by Heka and reporting on the current status of
 nodes and services. For that, Nagios's `Passive Checks`_ would be used. We have
 looked at alternatives such as `Sensu`_ and `Icinga`_, but so far we have not
 found anything with the level of functionality of Nagios. Another
 alternative is to rely only on Heka's `Alert module`_ and `SMTPOutput plugin`_
 for notifications. Whether Nagios will be used in MCP will be discussed
 in a more specific specification. Note also that Alerting should
 be an optional part of the monitoring system in MCP.
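
 To make the passive-check option concrete, here is a minimal sketch of the
 line a sender would write to the Nagios external command file. The
 ``PROCESS_SERVICE_CHECK_RESULT`` command and the return codes are standard
 Nagios; the helper name and the host and service names in the usage note are
 hypothetical, and how Heka would actually emit such lines is left to the
 alerting specification.

```python
import time

# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_check_line(host, service, state, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command."""
    if timestamp is None:
        timestamp = int(time.time())
    # The command file is line based, so a raw newline would end the command.
    clean = output.replace("\n", " ")
    return "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}".format(
        timestamp, host, service, STATES[state], clean)
```

 For example, ``passive_check_line("node-1", "nova-api", "CRITICAL", "no
 workers up")`` yields a single line that Nagios processes as a check result
 for the ``nova-api`` service on ``node-1``.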
 .. _Snap: https://github.com/intelsdi-x/snap
 .. _Heka: http://hekad.readthedocs.org/
 .. _Elasticsearch: https://www.elastic.co/products/elasticsearch
 .. _Kibana: https://www.elastic.co/products/kibana
 .. _InfluxDB: https://influxdata.com/
 .. _Grafana: http://grafana.org/
 .. _Nagios: https://www.nagios.org/
 .. _Passive Checks: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/passivechecks.html
 .. _Sensu: https://sensuapp.org/
 .. _Icinga: https://www.icinga.org/
 .. _Alert module: http://hekad.readthedocs.io/en/latest/sandbox/index.html#alert-module
 .. _SMTPOutput plugin: http://hekad.readthedocs.io/en/latest/config/outputs/smtp.html
 Kubernetes Logging and Monitoring
 ---------------------------------
 Kubernetes comes with its own monitoring and logging stack, so the question of
 what we will use and not use of this stack should be raised. This section
 discusses that.
 Monitoring
 ~~~~~~~~~~
 `Kubernetes Monitoring`_ uses `Heapster`_. Heapster runs as a pod on a node of
 the Kubernetes cluster. Heapster gets container statistics by querying the
 cluster nodes' Kubelets. The Kubelet itself fetches the data from cAdvisor.
 Heapster groups the information by pod and sends the data to a backend for
 storage and visualization (InfluxDB is supported).
 Collecting container and pod statistics is necessary for MCP, but it is not
 sufficient. For example, we also want to collect metrics from the OpenStack
 services, to be able to report on the health of the services that run on the
 cluster.
 Also, Heapster does not currently support any form of alerting.
 The proposal is to use Snap on each node (see the previous section). Snap
 already includes `plugins for OpenStack`_. For container statistics the `Docker
 plugin`_ may be used, and, if necessary, a Kubernetes/Kubelet-specific Snap
 plugin may be developed.
 Relying on Snap on each node, instead of a centralized Heapster instance, will
 also result in a more scalable solution.
 However, it is to be noted that `Kubernetes Autoscaling`_ currently requires
 Heapster. This means that Heapster must be used if the Autoscaling
 functionality is required for MCP. But in that case, no storage backend should
 be set in the Heapster configuration, as Heapster will just be used for the
 Autoscaling functionality.
 .. _Kubernetes Monitoring: http://kubernetes.io/docs/user-guide/monitoring/
 .. _Heapster: https://github.com/kubernetes/heapster
 .. _plugins for OpenStack: https://github.com/intelsdi-x?utf8=%E2%9C%93&query=snap-plugin-collector
 .. _Docker plugin: https://github.com/intelsdi-x/snap-plugin-collector-docker
 .. _Kubernetes Autoscaling: http://kubernetes.io/docs/user-guide/horizontal-pod-autoscaling/
 Logging
 ~~~~~~~
 `Kubernetes Logging`_ relies on Fluentd, with a Fluentd agent running on each
 node. That agent collects container logs (through the Docker Engine running on
 the node) and sends them to Google Cloud Logging or Elasticsearch (the backend
 used is specified through the ``KUBE_LOGGING_DESTINATION`` variable).
 The main problem with this solution is our inability to act on the logs before
 they're stored into Elasticsearch. For instance we want to be able to monitor
 the logs, to be able to detect spikes of errors. We also want to be able to
 derive metrics from logs, such as HTTP response time metrics. Also, we may want
 to use Kafka in the future (see below). In summary, Kubernetes Logging does not
 provide us with the flexibility we need.
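
 The kind of processing we want to apply before storage can be sketched as
 follows. This is not a Heka plugin: the log format (``<epoch seconds> <LEVEL>
 <text>``), the ``time=`` marker and all function names are assumptions made
 for the example.

```python
import re
from collections import Counter

# Hypothetical log format: "<epoch seconds> <LEVEL> <free text>".
LINE_RE = re.compile(r"^(\d+) (\w+) (.*)$")
# Hypothetical marker carrying the HTTP response time in seconds.
TIME_RE = re.compile(r"time=(\d+(?:\.\d+)?)")


def errors_per_window(lines, window=60):
    """Count ERROR lines per time window, keyed by window start."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            ts = int(match.group(1))
            counts[ts - ts % window] += 1
    return dict(counts)


def spikes(counts, threshold):
    """Return the sorted window starts whose error count exceeds threshold."""
    return sorted(start for start, n in counts.items() if n > threshold)


def response_times(lines):
    """Extract HTTP response times from the lines that carry one."""
    times = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times
```

 The first two functions correspond to log monitoring (error spikes), the
 third to deriving a metric from logs. Both need access to the log stream
 before it reaches Elasticsearch.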
 Our proposal is to use Heka instead of Fluentd. The benefits are:
 * Flexibility (e.g. use Kafka between Heka and Elasticsearch in the future).
 * Ability to collect logs from services that can't log to stdout.
 * The team's experience in using Heka and running it in production.
 * Re-use all the Heka plugins we've developed (parsers for OpenStack logs, log
  monitoring filters, etc.).
 .. _Kubernetes Logging: http://kubernetes.io/docs/getting-started-guides/logging/
 Use Kafka
 ---------
 Another component that we're considering introducing is `Apache Kafka`_. Kafka
 will sit between Heka and the backends, and it will be used as a robust and
 scalable messaging system for the communications between the Heka instances and
 the backends. Heka has the capability of buffering messages, but we believe
 that Kafka would allow for a more robust and resilient system. We may make
 Kafka optional, but highly recommended for medium and large clusters.
 The following diagram depicts the architecture when Kafka is used::
        OpenStack nodes
    +-------------------+
    | +-------------------+       Kafka cluster
    | |            +----+ |         +-------+             +-------------+
    | | Logs+-+  +-+Snap| |         |       |             |             |
    | |       |  | +----+ |      +--+Kafka  +--+    +----->Elasticsearch|
    | |      +v--v+       |      |  |       |  |    |     |             |
    | |      |Heka+-------------->  +-------+  +----+     +-------------+
    +-+      +----+       |      |             |
      +-------------------+      |  +-------+  |          +-------------+
                                 |  |       |  |          |             |
                                 +--+Kafka  +--+      +--->InfluxDB     |
                                 |  |       |  |      |   |             |
        k8s master nodes         |  +-------+  +------+   +-------------+
    +-------------------+        |             |
    | +-------------------+   +-->  +-------+  +----+     +-------------+
    | |            +----+ |   |  |  |       |  |    |     |             |
    | | Logs+-+  +-+Snap| |   |  +--+Kafka  +--+    +----->Nagios       |
    | |       |  | +----+ |   |     |       |             |             |
    | |      +v--v+       |   |     +-------+             +-------------+
    | |      |Heka+-----------+
    +-+      +----+       |
      +-------------------+
 The Heka instances running on the OpenStack and Kubernetes nodes are Kafka
 producers. Although not depicted on the diagram, Heka instances will also
 probably be used as Kafka consumers between the Kafka cluster and the backends.
 We will need to run performance tests to determine if Heka will be able to keep
 up with the load when used as a Kafka consumer.
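
 The decoupling that Kafka brings can be illustrated with an in-memory
 stand-in for a single topic partition: an append-only log where each
 consumer group tracks its own read offset. This does not use a Kafka client
 and leaves out persistence, replication and partitioning; the class and
 method names are hypothetical.

```python
class TopicLog:
    """In-memory stand-in for one Kafka topic partition (illustration only)."""

    def __init__(self):
        self._messages = []   # append-only log
        self._offsets = {}    # next offset to read, per consumer group

    def append(self, message):
        """Producer side: add a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group, max_messages=10):
        """Consumer side: return unread messages and advance the group offset."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch
```

 Producers (the Heka instances on the nodes) keep appending regardless of how
 fast each backend consumes, and a consumer that falls behind or restarts
 resumes from its own offset without affecting the others.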
 A specific specification will be written for the introduction of Kafka.
 .. _Apache Kafka: https://kafka.apache.org/
 Packaging and deployment
 ------------------------
 All the services participating in the LMA architecture will run in Docker
 containers, following the MCP approach to packaging and service execution.
 Relying on `Kubernetes Daemon Sets`_ for deploying Heka and Snap on all the
 cluster nodes sounds like a good approach. The Kubernetes documentation even
 mentions logstash and collectd as good candidates for running as Daemon Sets.
 .. _Kubernetes Daemon Sets: http://kubernetes.io/docs/admin/daemons/
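
 As a rough sketch of what a Daemon Set for one of the collectors could look
 like, the following builds a minimal manifest as a Python dictionary. The
 helper name and the image name in the usage note are hypothetical, and the
 ``apiVersion`` is an assumption: current Kubernetes releases use ``apps/v1``,
 whereas releases contemporary with this spec exposed Daemon Sets under
 ``extensions/v1beta1``.

```python
import json


def daemon_set(name, image):
    """Build a minimal DaemonSet manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",  # assumption, see the note above
        "kind": "DaemonSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # The selector must match the labels of the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

 For example, ``to_json(daemon_set("heka", "example/heka:latest"))`` produces
 a document describing one Heka pod per node. A real manifest would also need
 volume mounts for the log directories, which are out of scope here.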
 Alternatives
 ------------
 The possible alternatives will be discussed in more specific specifications.
 Implementation
 ==============
 The implementation will be described in more specific specifications.
 Assignee(s)
 -----------
 Primary assignee:
  elemoine (elemoine@mirantis.com)
 Other contributors:
  obourdon (obourdon@mirantis.com)
 Work Items
 ----------
 Other specification documents will be written:
 * Logging with Heka
 * Logs storage and analytics with Elasticsearch and Kibana
 * Monitoring with Snap
 * Metrics storage and analytics with InfluxDB and Grafana
 * Alerting in MCP
 * Introducing Kafka to the MCP Monitoring stack
 Dependencies
 ============
 None.
 Testing
 =======
 The testing strategy will be described in more specific specifications.
 Documentation Impact
 ====================
 The MCP monitoring system will be documented.
 References
 ==========
 None.
 History
 =======
 None.