Add "Introduce Monitoring to MCP" specification

Change-Id: I0d2277a9df8973f044a644f8fa2afa1487c54f70
Éric Lemoine 2016-05-20 18:39:01 +02:00
parent 760961fc5b
commit 24c7fae07a

=====================
Add Monitoring to MCP
=====================
This specification describes the initial work that will be done for adding
monitoring capabilities to MCP. The goal is to lay out the architecture for
collecting, storing and visualizing basic system metrics (CPU, memory, etc.).
Problem description
===================
Monitoring is a core aspect of MCP. The goal of the work discussed in this
specification is to introduce the first elements of Monitoring in MCP.
Use Cases
---------
This specification covers the following:
* Collecting data with Snap
* Processing the data with Hindsight
* Storing the data into InfluxDB
* Visualizing the data with Grafana
This specification focuses on Monitoring; aspects related to Alarming are out
of scope.
Proposed change
===============
Collecting data with Snap
-------------------------
We will use `Snap`_ for collecting monitoring data. Snap will run on every
cluster node. At this stage only system statistics will be collected (CPU,
memory, disk, network, etc.). The list of collected metrics will depend on what
is available in Snap.
The data collected by Snap will then be published to Hindsight (described in
the next section). For that we will use Snap's `Heka publisher plugin`_.
.. _Snap: http://intelsdi-x.github.io/snap/
.. _Heka publisher plugin: https://github.com/intelsdi-x/snap-plugin-publisher-heka
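
As an illustration, here is a minimal sketch of a Snap task manifest that
collects CPU and memory metrics and publishes them through the Heka publisher
plugin. The metric namespaces and the publisher's ``host``/``port`` settings
are assumptions to be validated against the actual plugins:

.. code-block:: yaml

   ---
   version: 1
   schedule:
     type: "simple"
     interval: "10s"
   workflow:
     collect:
       # Metric namespaces assumed from the procfs collector plugins.
       metrics:
         /intel/procfs/cpu/*: {}
         /intel/procfs/meminfo/*: {}
       publish:
         # Publish to the Hindsight instance running in the same Pod.
         - plugin_name: "heka"
           config:
             host: "127.0.0.1"
             port: 5565
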
Snap in Docker
--------------
Snap will run in a Docker container, so a ``Dockerfile`` will be written for
Snap in MCP.
Some Snap plugins aren't currently compatible with running in Docker
containers. We've created GitHub Issues for these incompatibilities:
* https://github.com/intelsdi-x/snap-plugin-collector-cpu/issues/14
* https://github.com/intelsdi-x/snap-plugin-collector-processes/issues/11
* https://github.com/intelsdi-x/snap-plugin-collector-meminfo/issues/11
* https://github.com/intelsdi-x/snap-plugin-collector-disk/issues/6
* https://github.com/intelsdi-x/snap-plugin-collector-df/issues/8
* https://github.com/intelsdi-x/snap-plugin-collector-load/issues/8
* https://github.com/intelsdi-x/snap-plugin-collector-iostat/issues/9
* https://github.com/intelsdi-x/snap-plugin-collector-swap/issues/7
We will address these incompatibilities with Pull Requests to Snap. Plugins
like ``df`` and ``iostat`` use Unix commands internally (the ``iostat`` plugin
uses the ``iostat`` command for example), and in most cases these Unix commands
cannot be parameterized to read from a directory other than ``/proc``. This
means that these plugins will eventually need to be rewritten.
Snap is under active development, and we ourselves need to change Snap and
create Pull Requests. It is therefore important that we are able to build Snap
from source, as opposed to depending on binary releases created by the Snap
team.
Docker Hub includes `official images for Go (golang)`_, which are based on
`buildpack-deps`_ images. We're not going to rely on these images. Instead we
will rely on MCP's ``ms-debian-base`` image, and we will install the build
tools needed for building Snap ourselves. In this way we can remove the build
tools afterwards, and thereby minimize the size of the Docker image. The final
Snap image will only include the binaries required for running Snap; the Go
compiler, for example, will not be present in the final image.
.. _official images for Go (golang): https://hub.docker.com/r/library/golang/
.. _buildpack-deps: https://hub.docker.com/_/buildpack-deps/
Processing the data with Hindsight
----------------------------------
The data collected with Snap will be published to `Hindsight`_. Like Snap,
Hindsight will run on each cluster node, and Snap will publish the collected
data to the Hindsight instance running on the same node (in the same ``Pod``).
Hindsight is a rewrite of Heka in C. It was created to address some issues
found in Heka (mainly performance issues). Hindsight is compatible with Heka,
in the sense that Lua plugins that work in Heka also work in Hindsight, and
Hindsight supports Heka's Protobuf format.
For the communication between Snap and Hindsight we will use TCP and Heka's
Protobuf format, which Hindsight supports. On the Snap side the Heka publisher
plugin will be used. On the Hindsight side we will use the `Heka TCP Input
Plugin`_. The plugin will listen on port ``5565``, and decode the Protobuf
messages sent by Snap. The resulting messages injected into the Hindsight
pipeline will then be grouped and sent to InfluxDB in batches. For that a
Hindsight Output Plugin will be developed. That plugin will reuse some `Lua
code`_ from StackLight.
Note that Hindsight just acts as a "pass-through" here. In other
words, for the basic needs described in this specification, we could do without
Hindsight and have Snap directly publish the metrics to InfluxDB (through
Snap's InfluxDB publisher plugin). But in the future we will use Hindsight for
evaluating alarms, deriving metrics from logs, etc. So it is important to use
Hindsight from the beginning.
.. _Hindsight: https://github.com/trink/hindsight/
.. _Heka TCP Input Plugin: https://github.com/mozilla-services/lua_sandbox/blob/master/sandboxes/heka/input/heka_tcp.lua
.. _Lua code: https://github.com/openstack/fuel-plugin-lma-collector/blob/master/deployment_scripts/puppet/modules/lma_collector/files/plugins/filters/influxdb_accumulator.lua
Snap and Hindsight deployment
-----------------------------
Snap and Hindsight will be deployed by Kubernetes on every Kubernetes minion.
We will rely on a ``DaemonSet`` for that. In this way, Kubernetes will start
Snap and Hindsight on any new node added to the cluster.
Snap and Hindsight will run in separate containers within the same ``Pod``.
Only one ``DaemonSet`` will be required.
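
A minimal sketch of such a ``DaemonSet`` follows. The image names, volume
names and mount paths are hypothetical and only meant to illustrate the
two-container ``Pod`` layout:

.. code-block:: yaml

   apiVersion: extensions/v1beta1
   kind: DaemonSet
   metadata:
     name: stacklight-collector
   spec:
     template:
       metadata:
         labels:
           app: stacklight-collector
       spec:
         containers:
           - name: snap
             image: mcp/snap              # hypothetical image name
             volumeMounts:
               - name: snap-config
                 mountPath: /opt/snap/autoload
           - name: hindsight
             image: mcp/hindsight         # hypothetical image name
             ports:
               - containerPort: 5565      # Heka TCP input
             volumeMounts:
               - name: hindsight-config
                 mountPath: /etc/hindsight
         volumes:
           - name: snap-config
             configMap:
               name: snap-config
           - name: hindsight-config
             configMap:
               name: hindsight-config

Since the two containers share the ``Pod``'s network namespace, Snap can reach
Hindsight's TCP input on ``localhost:5565``.
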
At this stage we won't use `Snap Tribes`_. We don't need that level of
sophistication for now, because all the Snap instances will be identical: they
will run the same plugins and tasks.
The Snap configuration will be stored in a ``ConfigMap``, and we will rely on
Snap's plugin/task `auto_discovery_path`_ functionality for Snap to load the
plugins and create the tasks at start-up time. Currently the
``auto_discovery_path`` functionality only works for the loading of plugins. We
will extend it to also work for the creation of tasks.
Likewise, the Hindsight configuration will also be stored in a ``ConfigMap``.
.. _Snap Tribes: https://github.com/intelsdi-x/snap/blob/master/docs/TRIBE.md
.. _auto_discovery_path: https://github.com/intelsdi-x/snap/blob/master/docs/SNAPD_CONFIGURATION.md#snapd-control-configurations
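
For illustration, a sketch of the Snap ``ConfigMap`` is given below. The exact
layout of the snapd configuration file, and the name of the auto-discovery
setting, are assumptions to be confirmed against the Snap documentation:

.. code-block:: yaml

   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: snap-config
   data:
     # snapd configuration, pointing the auto-discovery mechanism at the
     # directory where the plugins and task manifests are mounted
     # (setting name to be confirmed).
     snapd.conf: |
       control:
         auto_discover_path: /opt/snap/autoload

The task manifests themselves would be added as additional keys of this
``ConfigMap`` and mounted into the same directory.
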
Git repositories
----------------
As discussed previously, Snap and Hindsight will run in the same ``Pod``, and
they will be deployed by the same ``DaemonSet``. This means that the
``DaemonSet`` spec file will be common to Snap and Hindsight. For that reason
we will use just one Git repository: ``ms-lma`` or ``ms-stacklight``.
Storing the data into InfluxDB
------------------------------
As described in the previous section, Hindsight will send the data to InfluxDB
for storage.
.. note:: We will investigate using Cassandra in the future. But we will start
with InfluxDB, because we know how to run and operate InfluxDB.
InfluxDB deployment
-------------------
InfluxDB will be deployed by Kubernetes. At this point we will not run InfluxDB
in cluster mode. We will use a ``ReplicaSet`` (in a ``Deployment``) with one
replica.
Storage is another issue. To simplify the problem we will dedicate a node to
InfluxDB (using a node label). InfluxDB will run on that node and it will not
run on any other node. At this point we will use an ``emptyDir`` or
``hostPath`` Kubernetes volume on a local disk for the data storage. In the
future we may use LVM, depending on the outcome of our `LVM driver for
Volumes`_ work.
For Hindsight and Grafana to be able to access InfluxDB, a Kubernetes Service
will be created for InfluxDB. The ``ClusterIP`` service type will be used.
.. note:: Using Ceph RBD for the data storage is not an option: we know from
   experience that a local SSD is required to get decent performance.
.. note:: In the future we will need to support the case of a remote InfluxDB
backend deployed outside the Kubernetes cluster. This means that it will be
possible to configure Hindsight to use a different InfluxDB endpoint.
.. note:: For deploying InfluxDB/Grafana with Kubernetes we can get inspiration
from Heapster. See https://github.com/kubernetes/heapster/tree/master/deploy/kube-config/influxdb.
.. _LVM driver for Volumes: https://mirantis.jira.com/browse/MCP-692
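
For illustration, a sketch of the InfluxDB ``Deployment`` and ``Service``
follows. The node label used to pin InfluxDB, the image name and the labels
are assumptions:

.. code-block:: yaml

   apiVersion: extensions/v1beta1
   kind: Deployment
   metadata:
     name: influxdb
   spec:
     replicas: 1
     template:
       metadata:
         labels:
           app: influxdb
       spec:
         # Pin InfluxDB to the node dedicated to it (hypothetical label).
         nodeSelector:
           mcp-influxdb: "true"
         containers:
           - name: influxdb
             image: mcp/influxdb          # hypothetical image name
             ports:
               - containerPort: 8086      # InfluxDB HTTP API
             volumeMounts:
               - name: influxdb-data
                 mountPath: /var/lib/influxdb
         volumes:
           # Local, non-persistent storage to start with.
           - name: influxdb-data
             emptyDir: {}
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: influxdb
   spec:
     type: ClusterIP
     selector:
       app: influxdb
     ports:
       - port: 8086
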
Grafana
-------
Grafana will run in a Docker container, and it will be deployed by Kubernetes
through a dedicated ``Pod`` and a dedicated ``ReplicaSet``.
The number of replicas will be set to one, ensuring that there will be at most
one running Grafana instance at a given time. In the future we will be able to
scale Grafana by using more replicas, but we don't need that level of
sophistication for the moment.
Grafana needs a database to store users and dashboards. By default Grafana uses
SQLite. To simplify the deployment we will use SQLite and an ``emptyDir``
Kubernetes volume. This means that any custom settings will be lost if Grafana
is restarted on another node. In the future we could rely on an RBD or LVM
volume to avoid that problem. We may also consider using a DBMS other than
SQLite.
Grafana will be pre-configured with a default data source (connected to the
InfluxDB instance) and default dashboards. Adding data sources and dashboards
to Grafana is done using the Grafana HTTP API. The configuration operations
will be done in the ``start.sh`` script that starts the Grafana service. See
https://github.com/grafana/grafana-docker/blob/master/run.sh for an example.
The operator will need to access the Grafana UI. For that we will create
a Kubernetes Service for Grafana. We will use a `NodePort Service`_ for the
moment, but we will probably need to rely on a Load Balancer in the future.
This depends on what will be used in MCP.
.. _NodePort Service: http://kubernetes.io/docs/user-guide/services/#type-nodeport
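
For illustration, a sketch of the Grafana ``Deployment`` and its ``NodePort``
Service follows. The image name, labels and SQLite data path are assumptions:

.. code-block:: yaml

   apiVersion: extensions/v1beta1
   kind: Deployment
   metadata:
     name: grafana
   spec:
     replicas: 1
     template:
       metadata:
         labels:
           app: grafana
       spec:
         containers:
           - name: grafana
             image: mcp/grafana           # hypothetical image name
             ports:
               - containerPort: 3000      # Grafana web UI
             volumeMounts:
               # Holds the SQLite database; lost if the Pod is rescheduled
               # on another node.
               - name: grafana-data
                 mountPath: /var/lib/grafana
         volumes:
           - name: grafana-data
             emptyDir: {}
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: grafana
   spec:
     type: NodePort
     selector:
       app: grafana
     ports:
       - port: 3000
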
Users and Passwords
-------------------
Both InfluxDB and Grafana require creating users and passwords. We will use
`Kubernetes Secrets`_ for that.
.. _Kubernetes Secrets: http://kubernetes.io/docs/user-guide/secrets/
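
For illustration, a sketch of such a ``Secret`` is given below. The key names
and values are placeholders:

.. code-block:: yaml

   apiVersion: v1
   kind: Secret
   metadata:
     name: influxdb-grafana-credentials
   type: Opaque
   data:
     # Values are base64-encoded placeholders.
     influxdb-user: cm9vdA==           # "root"
     influxdb-password: Y2hhbmdlbWU=   # "changeme"
     grafana-user: YWRtaW4=            # "admin"
     grafana-password: Y2hhbmdlbWU=    # "changeme"
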
Alternatives
------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Éric Lemoine
Other contributors:
Olivier Bourdon
Work Items
----------
* Create a ``Dockerfile`` for Snap
* Create a ``Dockerfile`` for Hindsight
* Create a ``Dockerfile`` for InfluxDB
* Create a ``Dockerfile`` for Grafana
* Create ``DaemonSet`` and ``Pod`` definitions for Snap/Hindsight
* Create ``ReplicaSet`` for InfluxDB/Grafana
* Create ``ConfigMap`` for Snap and Hindsight configurations
* Extend Snap to support the auto-discovery of tasks
* Make Snap plugins compatible with Docker
Dependencies
============
* Working development environment
* Working CI
Testing
=======
We will develop functional tests to verify that our Snap/Hindsight pipeline
works as expected. This can be done by using a mock Snap collector plugin and
checking that the output of the pipeline is as expected.
Documentation Impact
====================
We will document how to set up Monitoring in MCP.
References
==========
None.