WIP: Monitoring guide

This commit adds Mirantis OpenStack Monitoring Guide structure.

Co-Authored-By: Swann Croiset <swann@oopss.org>
Co-Authored-By: Patrick Petit <ppetit@mirantis.com>
Co-Authored-By: Nikita Konovalov <nkonovalov@mirantis.com>
Co-Authored-By: Alexander Tivelkov <ativelkov@mirantis.com>
Co-Authored-By: Alexander Adamov <aadamov@mirantis.com>
Co-Authored-By: Maria Zlatkova <mzlatkova@mirantis.com>
Co-Authored-By: Olena Logvinova <ologvinova@mirantis.com>

Closes-Bug: #1414970
Change-Id: I420764863a393f0ccd759b9aa30ed9fc577152d4
Maria Zlatkova 2015-04-28 11:41:07 +03:00 committed by Irina Povolotskaya
parent 78d9113c57
commit a502524b46
52 changed files with 547 additions and 1 deletions

View File

@ -11,6 +11,7 @@ Documentation
planning-guide
user-guide
operations
monitoring-guide
virtualbox
reference-architecture
terminology

View File

@ -0,0 +1,9 @@
.. include:: /pages/monitoring-guide/introduction.rst
.. include:: /pages/monitoring-guide/assumptions.rst
.. include:: /pages/monitoring-guide/intended-audience.rst
.. include:: /pages/monitoring-guide/document-scope.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/common-monitoring-practices.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/monitoring-activities-details.rst
.. include:: /pages/monitoring-guide/storage-clusters/storage-clusters.rst
.. include:: /pages/monitoring-guide/hardware-and-system-monitoring/hardware-and-system-monitoring.rst
.. include:: /pages/monitoring-guide/appendix/appendix.rst

View File

@ -52,6 +52,11 @@ using Fuel.
A collection of useful procedures for using and managing
your Mirantis OpenStack environment.
:ref:`monitoring-guide` `(pdf) <pdf/Mirantis-OpenStack-6.0-MonitoringGuide.pdf>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A general OpenStack monitoring strategy.
:ref:`virtualbox` `(pdf) <pdf/Mirantis-OpenStack-6.0-Running-Mirantis-OpenStack-on-VirtualBox.pdf>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

11
monitoring-guide.rst Normal file
View File

@ -0,0 +1,11 @@
.. index:: Monitoring Guide
.. _monitoring-guide:
================
Monitoring Guide
================
.. contents:: :local:
.. include:: /contents/contents-monitoring-guide.rst

View File

@ -0,0 +1,10 @@
.. _mg-appendix:
Appendix
========
This appendix contains additional material that is outside the scope of
this guide but may nonetheless be of interest.
.. include:: /pages/monitoring-guide/appendix/virtual-machine-monitoring.rst
.. include:: /pages/monitoring-guide/appendix/vm-network-traffic.rst

View File

@ -0,0 +1,4 @@
.. _mg-virtual-machine-monitoring:
Virtual Machine Monitoring
--------------------------

View File

@ -0,0 +1,4 @@
.. _mg-vm-network-traffic:
VM Network Traffic
------------------

View File

@ -0,0 +1,12 @@
.. _mg-assumptions:
Assumptions
===========
We assume that the reader is already familiar with the concepts, architecture
principles, and day-to-day administration tasks of Mirantis OpenStack. We
further assume that you have deployed Mirantis OpenStack following the
recommendations of :ref:`the Mirantis OpenStack Planning Guide<planning-guide>`
with :ref:`high availability<ha-term>` support using Neutron with
:ref:`VLAN<neutron-vlan-ovs-arch>` or :ref:`GRE<neutron-gre-ovs-arch>`
network segmentation.

View File

@ -0,0 +1,11 @@
.. _mg-common-monitoring-practices:
Common Monitoring Practices
===========================
This chapter describes the common monitoring practices we recommend
that you use to design and implement an effective monitoring solution
for OpenStack.
.. include:: /pages/monitoring-guide/common-monitoring-practices/monitoring-domains.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/monitoring-activities.rst

View File

@ -0,0 +1,4 @@
.. _mg-diagnosing-versus-alerting:
Diagnosing versus Alerting
++++++++++++++++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-logs-indexing:
Logs Indexing
+++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-logs-processing:
Logs Processing
+++++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-metering:
Metering
++++++++

View File

@ -0,0 +1,19 @@
.. _mg-common-monitoring-activities:
Monitoring Activities
---------------------
As stated earlier, this document does not prescribe a particular
monitoring system or solution. Instead, it describes common monitoring
practices, the problem domains they address, and the activities that are
key to gaining clear operational insight so that you can act when
problems occur. This chapter describes those activities as clearly as
possible.
.. include:: /pages/monitoring-guide/common-monitoring-practices/services-processes-cluster-checks.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/metering.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/logs-processing.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/logs-indexing.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/notifications-processing.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/diagnosing-versus-alerting.rst
.. include:: /pages/monitoring-guide/common-monitoring-practices/time-sync.rst

View File

@ -0,0 +1,100 @@
.. _mg-monitoring-domains:
Monitoring Domains
------------------
An effective monitoring solution consists of distinct activities
aimed at addressing the different problem domains that the operations
staff has to handle. These activities are summarized below.
Availability Monitoring
Availability monitoring, in its broadest sense, is the monitoring activity
that ensures that the compute, storage, and networking resources, as well
as the services mediating access to them (via the service API endpoints),
are effectively available for end users to consume while meeting the
performance requirements of the SLA. Availability monitoring relies on
indicators (or metrics) that show how many resources are currently
available in the cloud infrastructure, together with process checks that
verify that the services providing access are up and running. These
indicators are obtained by running synthetic transactions, parsing the
logs, deploying metrics collectors throughout the system, and so forth.
Performance Monitoring
Performance monitoring involves measuring how fast a particular
resource can be served by the cloud infrastructure in response to a user
request, for example, how long it takes to create an instance or a
volume. Key metrics for performance monitoring can be
obtained not only from synthetic transactions simulating an end-user
interaction with a service endpoint but also from analysing the logs,
instrumenting the code, and extracting performance metrics from the
OpenStack notifications.
OpenStack performance and availability monitoring are the two main
monitoring topics developed in this document since they directly
relate to the SLA.
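As a rough illustration of a synthetic transaction, the sketch below times an
arbitrary operation and records both availability (did it complete?) and
performance (how long did it take?). The names ``CheckResult``,
``run_synthetic_check``, ``fake_create_volume``, and ``broken_transaction`` are
hypothetical and introduced only for this example; a real check would call a
service API endpoint instead of the stand-in functions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    name: str
    available: bool               # True if the transaction completed without error
    duration_s: float             # wall-clock time spent in the transaction
    error: Optional[str] = None   # description of the failure, if any


def run_synthetic_check(name: str, transaction: Callable[[], object]) -> CheckResult:
    """Run one synthetic transaction and record availability and latency."""
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:  # any failure counts as "not available"
        return CheckResult(name, False, time.perf_counter() - start, repr(exc))
    return CheckResult(name, True, time.perf_counter() - start)


def fake_create_volume() -> None:
    """Hypothetical stand-in for a real API call such as creating a volume."""
    time.sleep(0.01)


def broken_transaction() -> None:
    """Hypothetical stand-in for an API call that fails."""
    raise ConnectionError("endpoint unreachable")


ok = run_synthetic_check("create-volume", fake_create_volume)
failed = run_synthetic_check("create-volume", broken_transaction)
```

The same result object carries both indicators, so one probe can feed an
availability check and a latency time series.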
Resource Usage Monitoring
Resource usage monitoring is only partially addressed here. We view it as
a derivative activity by which a cloud operator can determine how many
resources were consumed by a particular user or tenant during a particular
time period for chargeback. Resource usage monitoring involves measuring
the consumable resources of the cloud via the APIs. A key difference
between resource usage monitoring and availability monitoring is that
resource usage monitoring does not have to be performed in real time.
Readers interested in resource usage monitoring for OpenStack
should take a look at the :ref:`Ceilometer<ceilometer-term>` project.
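To make the idea of per-tenant usage over a time period concrete, here is a
minimal sketch that aggregates usage samples offline. It does not use the
Ceilometer API; the ``UsageSample`` type, the ``usage_by_tenant`` function, and
the resource names are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageSample:
    tenant: str
    resource: str        # illustrative resource name, for example "vcpu_hours"
    amount: float
    timestamp: datetime


def usage_by_tenant(
    samples: Iterable[UsageSample], start: datetime, end: datetime
) -> Dict[str, Dict[str, float]]:
    """Sum usage per tenant and resource for samples in [start, end)."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for sample in samples:
        if start <= sample.timestamp < end:
            totals[sample.tenant][sample.resource] += sample.amount
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {tenant: dict(resources) for tenant, resources in totals.items()}


# Hypothetical samples; the last one falls outside the billing period.
samples = [
    UsageSample("alpha", "vcpu_hours", 10.0, datetime(2015, 4, 2)),
    UsageSample("alpha", "vcpu_hours", 5.0, datetime(2015, 4, 20)),
    UsageSample("beta", "volume_gb_hours", 100.0, datetime(2015, 4, 15)),
    UsageSample("alpha", "vcpu_hours", 7.0, datetime(2015, 5, 1)),
]
april = usage_by_tenant(samples, datetime(2015, 4, 1), datetime(2015, 5, 1))
```

Because the aggregation runs over stored samples, it can be scheduled as a
batch job rather than evaluated in real time.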
Alerting
Alerting is the process by which the monitoring system notifies the cloud
operator about an undesirable situation. The situation is typically described
in an alarm-like manner, for example, when the value of a key indicator
crosses a threshold or unexpectedly changes state from OK to NOT OK.
An unexpected change of state, if not the direct manifestation of a problem,
is often a precursor of it. In addition, alerting should have the following properties:
- Provide a comprehensive description of the problem.
- Provide information about which service is affected.
- Provide a severity level.
- Provide the ability to be disabled to avoid false positives during
maintenance.
- Provide the ability to combine alarms to express more complex situations.
- Provide the ability to refer to time-series statistics like median,
standard deviation and percentiles.
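The properties listed above can be sketched as a small data structure. This is
only one possible shape, under stated assumptions: the ``Alarm`` class and the
helpers ``p95_above``, ``median_above``, and ``all_of`` are hypothetical names,
and the threshold values are placeholders rather than recommendations. The
statistics come from Python's standard ``statistics`` module.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Condition = Callable[[Sequence[float]], bool]


@dataclass
class Alarm:
    name: str
    service: str             # which service is affected
    severity: str            # severity level, for example "critical"
    description: str         # human-readable description of the problem
    condition: Condition     # predicate over a window of samples
    enabled: bool = True     # set to False during maintenance

    def evaluate(self, samples: Sequence[float]) -> Optional[str]:
        """Return an alert message if the alarm fires, otherwise None."""
        # quantiles() needs at least two data points on older Python versions.
        if not self.enabled or len(samples) < 2:
            return None
        if self.condition(samples):
            return f"[{self.severity}] {self.service}: {self.description}"
        return None


def p95_above(threshold: float) -> Condition:
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    return lambda samples: statistics.quantiles(samples, n=20)[-1] > threshold


def median_above(threshold: float) -> Condition:
    return lambda samples: statistics.median(samples) > threshold


def all_of(*conditions: Condition) -> Condition:
    """Combine several conditions into one that requires all of them."""
    return lambda samples: all(cond(samples) for cond in conditions)


# Hypothetical API latencies in seconds and placeholder thresholds.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 6.0, 7.5, 8.0]
alarm = Alarm(
    name="api-latency",
    service="nova-api",
    severity="critical",
    description="request latency is abnormally high",
    condition=all_of(p95_above(5.0), median_above(1.0)),
)
alert = alarm.evaluate(latencies)

alarm.enabled = False                    # maintenance window
muted = alarm.evaluate(latencies)
```

Keeping the condition as a plain callable is a design choice that makes alarms
easy to combine; a real monitoring system would typically express the same
idea in its own rule language.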
Furthermore, we recommend that the health status of any OpenStack service
be expressed using one of three values:
- **Healthy** - when the HA functions of the controller cluster are
still being ensured and no critical errors are being reported by the
monitoring system for a service.
- **Degraded** - when one or more critical errors are reported by the
monitoring system for a service but the HA functions of the controller
cluster are still being ensured.
- **Failed** - when the HA functions of the controller cluster are
no longer being ensured and one or more critical errors are being
reported by the monitoring system for a service.
**A critical error should always be reported in an alert.**
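The three health values can be derived from two inputs: whether the HA
functions are still ensured and how many critical errors are reported. The
sketch below is illustrative only; ``ServiceHealth`` and ``service_health`` are
hypothetical names. The text does not define the case where HA is lost but no
critical error is reported, so treating it as failed is an assumption of this
example.

```python
from enum import Enum


class ServiceHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


def service_health(ha_ensured: bool, critical_errors: int) -> ServiceHealth:
    """Map the HA state and the critical error count to a health value."""
    if critical_errors < 0:
        raise ValueError("critical_errors must not be negative")
    if ha_ensured:
        if critical_errors == 0:
            return ServiceHealth.HEALTHY
        return ServiceHealth.DEGRADED
    # Assumption: HA lost with zero critical errors is not covered by the
    # definitions above; this sketch conservatively reports FAILED.
    return ServiceHealth.FAILED
```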
The immediacy of the operations staff's response to an alert depends on
the actual status of the HA cluster. It can be either of the following:
- **Immediate** - when a service has failed. This is a critical situation,
so the alert should be sent to the operations staff for human
intervention.
- **Deferred** - when a service is degraded. While a degraded service
may have a negative impact on the quality of service, the nominal
function of the cloud service should continue to be ensured by the
system, so the handling of the alert can be safely prioritized
through a ticketing system.
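The mapping from service status to response type can be sketched as a simple
lookup. The ``Response`` enum and the ``response_for`` function are
hypothetical names, and the status strings are assumed to be ``"healthy"``,
``"degraded"``, and ``"failed"``.

```python
from enum import Enum


class Response(Enum):
    IMMEDIATE = "notify the operations staff for human intervention"
    DEFERRED = "open a ticket and prioritize it"
    NONE = "no action required"


# Assumed status names; adapt them to what your monitoring system reports.
_RESPONSES = {
    "failed": Response.IMMEDIATE,
    "degraded": Response.DEFERRED,
    "healthy": Response.NONE,
}


def response_for(status: str) -> Response:
    """Return the kind of response an alert with this service status needs."""
    try:
        return _RESPONSES[status.lower()]
    except KeyError:
        raise ValueError(f"unknown service status: {status!r}") from None
```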
Obviously, not all errors are critical. An effective monitoring solution
should take great care in defining the proper level of alerting
(smart alerting) in order to avoid flooding the operations staff with
benign notifications that do not reflect a critical situation.
This document strives to provide some hints about how to set your alarms
with threshold values and status checks, but your mileage may vary depending
on your particular OpenStack environment. :ref:`Rally<rally-term>` is a load
generator for OpenStack that you could use to calibrate the alarms of your
monitoring system.

View File

@ -0,0 +1,4 @@
.. _mg-notifications-processing:
OpenStack Notifications Processing
++++++++++++++++++++++++++++++++++

View File

@ -0,0 +1,5 @@
.. _mg-services-processes-cluster-checks:
Services, Processes and Clusters Checks
+++++++++++++++++++++++++++++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-time-sync:
Time Synchronization
++++++++++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-corosync-pacemaker:
Corosync/Pacemaker
------------------

View File

@ -0,0 +1,46 @@
.. _mg-document-scope:
Document Scope
==============
This guide is about how to monitor an OpenStack cloud from the perspective
of the operations staff with a focus on the infrastructure. As a result,
this guide is not directly intended to serve the monitoring needs of
cloud users, whether or not they have access to the administrator role,
because cloud users do not have root access to the servers and host
operating systems. The scope therefore includes some hardware monitoring
through IPMI, monitoring of the host operating system, and monitoring of the
cloud management system and the processes that are part of its ecosystem.
The processes supporting the cloud management system are of roughly two kinds:
* The OpenStack service API endpoints, like *nova-api*, which receive the user
requests.
* The OpenStack service workers connected to the AMQP bus, like *nova-scheduler*,
which process the user requests.
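For the first kind of process, a very basic availability probe is to test
whether the API endpoint accepts TCP connections. The sketch below uses only
the Python standard library; ``endpoint_reachable`` is a hypothetical helper,
and the host and port in the usage comment are assumptions about your
environment. It does not cover the workers attached to the AMQP bus, which do
not listen on a port and need a different kind of check.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Usage (assumed address and port; adjust to where nova-api listens):
#   endpoint_reachable("192.0.2.10", 8774)
```

A successful connection only shows that something is listening; it does not
prove that the service answers requests correctly.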
The OpenStack services depend on a number of additional programs that are
not part of the OpenStack code base itself but which nonetheless are
critically important to monitor as we will see below. This includes but is
not limited to Libvirt, :ref:`MySQL<mysql-term>`, :ref:`RabbitMQ<rabbitmq-term>`,
:ref:`Memcached<memcached-term>`, :ref:`HAProxy<haproxy-term>`, :ref:`Corosync<corosync-term>`
and :ref:`Pacemaker<pacemaker-term>`.
The scope also includes the host operating systems, the servers, and devices such
as the disks and network interface cards. Some hardware health
checks are performed via IPMI to monitor the status of equipment such as the
fans and CPU temperature, in order to help anticipate hardware
failures.
The scope of this document does not include the monitoring of end-user
applications or the monitoring of hardware equipment that is
vendor-specific or too complex to be practically addressed in this document.
This includes but is not limited to the following equipment categories.
The network gear
The monitoring of network gear such as switches and routers is
vendor-specific and too large a topic to be addressed here.
The storage gear
The monitoring of storage gear such as SANs and NASs is vendor-specific
and too large a topic to be addressed here.

View File

@ -0,0 +1,6 @@
.. _mg-ha-cluster:
HA Cluster
==========
.. include:: /pages/monitoring-guide/corosync-pacemaker.rst

View File

@ -0,0 +1,6 @@
.. _mg-disks-monitoring:
Disks Monitoring
----------------

View File

@ -0,0 +1,8 @@
.. _mg-hardware-and-system-monitoring:
Hardware and System Monitoring
==============================
.. include:: /pages/monitoring-guide/hardware-and-system-monitoring/ipmi.rst
.. include:: /pages/monitoring-guide/hardware-and-system-monitoring/disks-monitoring.rst
.. include:: /pages/monitoring-guide/hardware-and-system-monitoring/operating-system-monitoring.rst

View File

@ -0,0 +1,4 @@
.. _mg-ipmi:
IPMI
----

View File

@ -0,0 +1,43 @@
.. _mg-operating-system-monitoring:
Operating System Monitoring
---------------------------
Host Monitoring
+++++++++++++++
Disk Usage Monitoring
+++++++++++++++++++++
Soft RAID Monitoring
++++++++++++++++++++
Filesystem Usage Monitoring
+++++++++++++++++++++++++++
CPU Usage Monitoring
++++++++++++++++++++
RAM Usage Monitoring
++++++++++++++++++++
Swap Usage Monitoring
+++++++++++++++++++++
Process Statistics Monitoring
+++++++++++++++++++++++++++++
Network Interface Card (NIC) Monitoring
+++++++++++++++++++++++++++++++++++++++
Firewall (iptables) Monitoring
++++++++++++++++++++++++++++++

View File

@ -0,0 +1,27 @@
.. _mg-intended-audience:
Intended Audience
=================
The primary audience of this document is the architects and technical staff
involved in the design and deployment of an OpenStack cloud. The secondary
audience is the members of the operations staff who are in charge of managing
and maintaining the OpenStack cloud in a healthy state on a daily basis.
This includes:
Line of Business Owner
The Line of Business Owner needs to know how "things" are running and whether there
are any problems that may affect the SLA. This person focuses on marketing and
business, not IT, and is therefore interested in top-level indicators of
service health.
Operational Support
Provides support to customers encountering issues. It is generally organized as a
service desk with two support levels for problem escalation. The support staff relies on
monitoring solutions to perform diagnostics and should also benefit from preventive
alerts.
Subject Matter Expert
Investigates and resolves a domain-specific problem. Validates the resolution.
Uses the monitoring system to troubleshoot and observe the cloud infrastructure
behaviour.

View File

@ -0,0 +1,36 @@
.. _mg-introduction:
Introduction
============
This document does not attempt to tout a particular solution or monitoring
system for OpenStack. Instead, it strives to provide best practices and
specific guidelines about how to monitor OpenStack effectively,
irrespective of the technology being used. This includes specific examples
of how to collect and process key metrics to increase your operational
visibility, check various health indicators to detect critical failure
conditions, and index and search the logs for root cause analysis and
troubleshooting. It must also be highlighted from the start that this
document provides guidelines for monitoring the OpenStack **infrastructure**
and host services. It is not a guide for monitoring the virtual machines
or the applications running on top of them.
The expected outcome is two-fold:
* Gain insights into what is critically important to watch in OpenStack so that
operators can be alerted in near real-time to anticipate and react to
undesirable situations.
* Provide a comprehensive set of guidelines to implement
your own monitoring system. In that sense, this document can also be viewed as
a specification you can use to implement your own solution using technologies
like Zabbix or the LMA Toolchain that are provided as `Fuel plugins
<https://software.mirantis.com/fuel-plugins/>`_ for Mirantis
OpenStack 6.1 onward.
In addition, we think that an effective monitoring solution for OpenStack should
have the following main characteristics:
* Provide near real-time insights and alerting.
* Support discovery and configuration management automation so that
error-prone manual setup can be completely avoided.
* Support its own self-monitoring and high availability.

View File

@ -0,0 +1,4 @@
.. _mg-ceilometer:
Ceilometer
----------

View File

@ -0,0 +1,4 @@
.. _mg-cinder:
Cinder
------

View File

@ -0,0 +1,4 @@
.. _mg-dhcp-agent:
DHCP Agent
++++++++++

View File

@ -0,0 +1,5 @@
.. _mg-glance:
Glance
------

View File

@ -0,0 +1,4 @@
.. _mg-haproxy:
HAProxy
-------

View File

@ -0,0 +1,4 @@
.. _mg-heat:
Heat
----

View File

@ -0,0 +1,4 @@
.. _mg-horizon:
Horizon
-------

View File

@ -0,0 +1,4 @@
.. _mg-keystone:
Keystone
--------

View File

@ -0,0 +1,4 @@
.. _mg-libvirt:
Libvirt
-------

View File

@ -0,0 +1,4 @@
.. _mg-memcached:
Memcached
---------

View File

@ -0,0 +1,21 @@
.. _mg-monitoring-activities-details:
Monitoring Activities Details
=============================
.. include:: /pages/monitoring-guide/monitoring-activities-details/keystone.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/nova.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/network.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/glance.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/cinder.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/horizon.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/heat.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/ceilometer.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/sahara.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/murano.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/libvirt.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/haproxy.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/rabbitmq.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/mysql.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/memcached.rst

View File

@ -0,0 +1,7 @@
.. _mg-murano:
Murano
------
Murano RabbitMQ Instance
++++++++++++++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-mysql:
MySQL
-----

View File

@ -0,0 +1,10 @@
.. _mg-network:
Network
-------
.. include:: /pages/monitoring-guide/monitoring-activities-details/neutron.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/dhcp-agent.rst
.. include:: /pages/monitoring-guide/monitoring-activities-details/open-vswitch.rst

View File

@ -0,0 +1,4 @@
.. _mg-neutron:
Neutron
+++++++

View File

@ -0,0 +1,4 @@
.. _mg-nova:
Nova
----

View File

@ -0,0 +1,4 @@
.. _mg-open-vswitch:
Open vSwitch
++++++++++++

View File

@ -0,0 +1,4 @@
.. _mg-rabbitmq:
RabbitMQ
--------

View File

@ -0,0 +1,4 @@
.. _mg-sahara:
Sahara
------

View File

@ -0,0 +1,4 @@
.. _mg-ceph:
Ceph
----

View File

@ -0,0 +1,7 @@
.. _mg-storage-clusters:
Storage Clusters
================
.. include:: /pages/monitoring-guide/storage-clusters/swift.rst
.. include:: /pages/monitoring-guide/storage-clusters/ceph.rst

View File

@ -0,0 +1,4 @@
.. _mg-swift:
Swift
-----

View File

@ -21,6 +21,7 @@ pdf_documents = [
('pdf/pdf_planning-guide', u'Mirantis-OpenStack-6.0-PlanningGuide', u'Planning Guide', u'2014, Mirantis Inc.'),
('pdf/pdf_user', u'Mirantis-OpenStack-6.0-UserGuide', u'User Guide', u'2014, Mirantis Inc.'),
('pdf/pdf_operations', u'Mirantis-OpenStack-6.0-OperationsGuide', u'Operations Guide', u'2014, Mirantis Inc.'),
('pdf/pdf_monitoring-guide', u'Mirantis-OpenStack-6.0-MonitoringGuide', u'Monitoring Guide', u'2014, Mirantis Inc.'),
('pdf/pdf_virtualbox', u'Mirantis-OpenStack-6.0-Running-Mirantis-OpenStack-on-VirtualBox', u'Running Mirantis OpenStack on VirtualBox', u'2014, Mirantis Inc.'),
('pdf/pdf_reference', u'Mirantis-OpenStack-6.0-ReferenceArchitecture', u'Reference Architecture', u'2014, Mirantis Inc.'),
('pdf/pdf_plugin-dev', u'Mirantis-OpenStack-6.0-FuelPluginGuide', u'Fuel Plugin Guide', u'2014, Mirantis Inc.'),

View File

@ -0,0 +1,32 @@
.. header::
.. cssclass:: header-table
+-------------------------------------+-----------------------------------+
| Mirantis OpenStack v6.1 | .. cssclass:: right|
| | |
| Monitoring Guide | ###Section### |
+-------------------------------------+-----------------------------------+
.. footer::
.. cssclass:: footer-table
+--------------------------+----------------------+
| | .. cssclass:: right|
| | |
| ©2014, Mirantis Inc. | Page ###Page### |
+--------------------------+----------------------+
.. raw:: pdf
PageBreak oneColumn
.. toctree::
.. include:: /pages/preface/preface.rst
.. _monitoring-guide:
.. include:: /contents/contents-monitoring-guide.rst

View File

@ -5,7 +5,7 @@
+-------------------------------------+-----------------------------------+
| Mirantis OpenStack v6.1 | .. cssclass:: right|
| | |
| Operations Guide | ###Section### |
| Monitoring Guide | ###Section### |
+-------------------------------------+-----------------------------------+
.. footer::