Spec: Monitoring plugins + tool stack

This spec proposes the development of monitoring plugins for OpenStack
and system services as well as an opinionated monitoring stack
deployment.

Change-Id: I8c807248f531ead255eda571cba9eea5a29c5ed2
This commit is contained in:
Major Hayden 2017-02-23 07:45:38 -06:00
parent 4a09ae0247
commit 73f0262ec0
1 changed files with 183 additions and 0 deletions

183
specs/pike/monitoring.rst Normal file
View File

@ -0,0 +1,183 @@
Monitoring for an OpenStack-Ansible deployment
##############################################
:date: 2017-02-21 00:00
:tags: monitoring, operations
Include the URL of your launchpad blueprint:
* https://blueprints.launchpad.net/openstack-ansible/+spec/example
The goal of the efforts described in this spec is to provide an easy method for
monitoring an OpenStack cloud. This would initially include basic service state
monitoring with extra functionality added as it matures.
Problem description
===================
OpenStack clouds are complex systems of hardware, software, and networks.
Deployers need to monitor the health of all of these components to ensure that
end users have access to resources. OpenStack-Ansible does not offer any
components for monitoring at this time, and this forces deployers to build
their own monitoring plugins and tool stacks.
Deployers and operators need to know:
* Are my OpenStack services up or down?
* Are my additional services (Galera, RabbitMQ, etc) up or down?
* What is the state of cluster partitions for Galera and RabbitMQ?
* Are my APIs responding within a reasonable time period?
* Are my management and tenant networks accessible?
* Is the hardware underneath my cloud operating normally?
Proposed change
===============
The proposed changes fall into two main buckets:
Monitoring plugins
This is the **primary** work effort for the spec.
Deployers need a solid set of monitoring plugins that gather information from
various services or entities, and those plugins should output data in common
formats for the most popular monitoring tool stacks.
Monitoring tool stack
This is the **secondary** work effort for the spec.
There are many open source and commercially available monitoring tool stacks
available for Linux. Deployers should have the option to deploy an
opinionated tool stack via Ansible if they don't have one of their own
already. The tool stack should offer up its time series data for searching
and also have an alerting mechanism that can hook into a deployer's existing
notification tools.
Alternatives
------------
There are loose collections of monitoring plugins available within OpenStack,
but those plugins aren't being actively maintained. Many of the other plugin
sets available today only output their data in a specific format. Deployers
could choose to use these plugins instead.
Deployers could also deploy their own monitoring tool stacks if needed. They
could use the monitoring plugins created in this spec with their existing
tools.
Playbook/Role impact
--------------------
The monitoring plugins should be installable via pip and they can be added into
existing roles or playbooks *(perhaps the openstack-ansible-openstack_hosts
role)*. The plugins themselves should be released independently of an
OpenStack release.
The monitoring tool stack would be implemented in a new role with an additional
playbook. The role that deploys this stack would be versioned along with
OpenStack-Ansible releases so that it can utilize the existing variables and
modules from each release.
Upgrade impact
--------------
This would be the first implementation of monitoring plugins and tools in
OpenStack-Ansible, so there is no upgrade concern at the moment. However, the
plugins and tool stack installation should be written such that upgrades are
reliable.
Security impact
---------------
Some monitoring plugins will need some level of privileged access to OpenStack
services or the other services running in the cloud. This requires accounts to
be created and new secrets to be stored. It is possible to use accounts that
have fewer privileges so that a compromise of a monitoring plugin would have
a limited security impact.
The monitoring tool stack itself has important security concerns to address.
Data from the monitoring plugins running on each host or container must be able
to reach a centralized database for storage and processing. Access to any
web frontends or databases should be handled carefully, just as we do for
Horizon or Galera today.
Performance impact
------------------
Some monitoring plugins will need to make requests to OpenStack APIs or access
certain other services. These plugins must be written carefully to avoid
negative performance impacts on the system.
End user impact
---------------
End users should not notice the changes from this work.
However, they should get a better user experience if the environment is closely
monitored and operations teams have access to valuable performance data.
Deployer impact
---------------
The monitoring plugins should be distributed as a pip package, so this should
have a small impact on deployers. Some plugins will need accounts on the
system, so deployers will need to create additional secrets for those accounts.
Deployers would have the option to deploy the monitoring tool stack if they do
not have one of their own.
Developer impact
----------------
The developer impact from these changes is very low. The monitoring plugins
should be easy to use and heavily tested. Developers should be able to modify
existing plugins and create new ones with ease.
Dependencies
------------
There are no dependencies.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Major Hayden (mhayden)
Other contributors:
Kevin Carter (cloudnull)
Antony Messerli (antonym)
Work items
----------
1. Write a small class that can be extended for new monitoring plugins.
2. Begin writing monitoring plugins that are executable via setuptools entry
points.
3. Ensure that tests are available for each plugin as well as the base class.
4. Create a role to deploy a monitoring tool stack that uses these plugins.
5. Document the plugins and the tool stack.
Testing
=======
The monitoring plugins should be tested on each commit using tox.
The monitoring tool stack role should be tested independently (like the other
IRR repos) and added to the integrated build as an optional component.
Documentation impact
====================
The plugins should be documented and there should be developer guides that
explain how to modify existing plugins or add new ones. The monitoring tool
stack role will need documentation that explains the new variables and
functionality available.
References
==========
Notes from the OpenStack PTG in Atlanta (Feb 2017):
* https://etherpad.openstack.org/p/osa-ptg-pike-monitoring