Merge "Copyedit of logging and monitoring chapter"

Jenkins 2014-03-20 13:33:29 +00:00 committed by Gerrit Code Review
commit 70c62e2166
1 changed files with 77 additions and 76 deletions

@ -13,15 +13,16 @@
<?dbhtml stop-chunking?>
<title>Logging and Monitoring</title>
<para>As an OpenStack cloud is composed of so many different
services, there are a large number of log files. This section
aims to assist you in locating and working with them, and
other ways to track the status of your deployment.</para>
services, there are a large number of log files. This chapter
aims to assist you in locating and working with them and
describes other ways to track the status of your deployment.</para>
<section xml:id="where_are_logs">
<title>Where Are the Logs?</title>
<para>Most services use the convention of writing
their log files to subdirectories of the <code>/var/log</code>
directory.</para>
<informaltable rules="all">
directory, as listed in <link linkend="openstack-log-locations">OpenStack Log Locations</link>.</para>
<table xml:id="openstack-log-locations" rules="all">
<caption>OpenStack Log Locations</caption>
<thead>
<tr>
<th>Node Type</th>
@ -31,7 +32,7 @@
</thead>
<tbody>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>
<code>nova-*</code>
</para></td>
@ -40,7 +41,7 @@
</para></td>
</tr>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>
<code>glance-*</code>
</para></td>
@ -49,7 +50,7 @@
</para></td>
</tr>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>
<code>cinder-*</code>
</para></td>
@ -58,7 +59,7 @@
</para></td>
</tr>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>
<code>keystone-*</code>
</para></td>
@ -67,7 +68,7 @@
</para></td>
</tr>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>
<code>neutron-*</code>
</para></td>
@ -76,7 +77,7 @@
</para></td>
</tr>
<tr>
<td><para>Cloud Controller</para></td>
<td><para>Cloud controller</para></td>
<td><para>horizon</para></td>
<td><para>
<code>/var/log/apache2/</code>
@ -84,21 +85,21 @@
</tr>
<tr>
<td><para>All nodes</para></td>
<td><para>misc (Swift,
<td><para>misc (swift,
dnsmasq)</para></td>
<td><para>
<code>/var/log/syslog</code>
</para></td>
</tr>
<tr>
<td><para>Compute Nodes</para></td>
<td><para>Compute nodes</para></td>
<td><para>libvirt</para></td>
<td><para>
<code>/var/log/libvirt/libvirtd.log</code>
</para></td>
</tr>
<tr>
<td><para>Compute Nodes</para></td>
<td><para>Compute nodes</para></td>
<td><para>Console (boot up messages) for VM instances:</para></td>
<td><para>
<code>/var/lib/nova/instances/instance-&lt;instance
@ -106,36 +107,36 @@
</para></td>
</tr>
<tr>
<td><para>Block Storage Nodes</para></td>
<td><para>Block Storage nodes</para></td>
<td><para>cinder-volume</para></td>
<td><para>
<code>/var/log/cinder/cinder-volume.log</code>
</para></td>
</tr>
</tbody>
</informaltable>
</table>
</section>
<section xml:id="how_to_read_logs">
<title>Reading the Logs</title>
<para>OpenStack services use the standard logging levels, at
increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
CRITICAL, and TRACE. That is, messages only appear in the logs
if they are more "severe" than the particular log level
if they are more "severe" than the particular log level,
with DEBUG allowing all log statements through. For
example, TRACE is logged only if the software has a stack
trace, while INFO is logged for every message including
those that are only for information.</para>
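<para>As a rough illustration of this threshold behavior, the following
sketch uses Python's standard <code>logging</code> module. The logger name
<code>demo</code> and the in-memory stream are assumptions made for the
example, and because AUDIT and TRACE are not standard Python levels, only
the standard ones are shown. A message is emitted when its level is at or
above the configured one.</para>

```python
import io
import logging


def emitted_levels(threshold):
    """Return the level names that pass a logger set to `threshold`."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    logger = logging.getLogger("demo")  # hypothetical logger name
    logger.handlers = [handler]         # replace handlers from earlier calls
    logger.propagate = False
    logger.setLevel(threshold)

    # Emit one message at each standard level, lowest severity first.
    for level in (logging.DEBUG, logging.INFO, logging.WARNING,
                  logging.ERROR, logging.CRITICAL):
        logger.log(level, "sample message")

    return stream.getvalue().split()


print(emitted_levels(logging.DEBUG))
print(emitted_levels(logging.WARNING))
```

<para>With the threshold at DEBUG every level passes through, while
raising it to WARNING drops the DEBUG and INFO messages.</para>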
<para>To disable DEBUG-level logging, edit
<filename>/etc/nova/nova.conf</filename>:</para>
<filename>/etc/nova/nova.conf</filename> as follows:</para>
<programlisting language="ini">debug=false</programlisting>
<para>Keystone is handled a little differently. To modify the
logging level, edit the
<filename>/etc/keystone/logging.conf</filename> file and look
at the <code>logger_root</code> and <code>handler_file</code>
sections.</para>
<para>Logging for Horizon is configured in
<para>Logging for horizon is configured in
<filename>/etc/openstack_dashboard/local_settings.py</filename>.
As Horizon is a Django web application, it follows the
Because horizon is a Django web application, it follows the
<link xlink:title="Django Logging"
xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
>Django Logging</link>
@ -144,7 +145,7 @@
<para>The first step in finding the source of an error is
typically to search for a CRITICAL, TRACE, or ERROR
message in the log starting at the bottom of the log file.</para>
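<para>A minimal sketch of that bottom-up search is shown here. It assumes
the log line layout used in this chapter's examples (date, time, process
ID, level, then the message); the function name and the sample lines are
hypothetical and serve only to illustrate the idea.</para>

```python
SEVERE_LEVELS = ("CRITICAL", "TRACE", "ERROR")


def last_severe_line(lines):
    """Return the last line whose level field is CRITICAL, TRACE, or ERROR.

    Assumes each line starts with: date, time, process ID, level.
    Returns None when no such line exists.
    """
    for line in reversed(lines):
        fields = line.split()
        if len(fields) > 3 and fields[3] in SEVERE_LEVELS:
            return line
    return None


# Hypothetical sample lines; real log content will differ.
sample = [
    "2013-02-25 20:26:30 6619 INFO nova.service [-] starting",
    "2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] "
    "AMQP server on localhost:5672 is unreachable",
    "2013-02-25 20:26:40 6619 INFO nova.service [-] retrying",
]
print(last_severe_line(sample))
```

<para>Matching on the level field rather than on the whole line avoids
false hits when a word such as ERROR happens to appear inside an
informational message.</para>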
<para>An example of a CRITICAL log message, with the
<para>Here is an example of a CRITICAL log message, with the
corresponding TRACE (Python traceback) immediately
following:</para>
<screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
@ -179,10 +180,10 @@
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
<para>In this example, cinder-volumes failed to start and has
provided a stack trace, since its volume back-end has been
unable to setup the storage volume - probably because the
unable to set up the storage volume&mdash;probably because the
LVM volume that is expected from the configuration does
not exist.</para>
<para>An example error log:</para>
<para>Here is an example error log:</para>
<screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
[Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
<para>In this error, a nova service has failed to connect to
@ -209,10 +210,10 @@
<code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If
you search for this string on the cloud controller in the
<filename>/var/log/nova-*.log</filename> files, it appears in
<filename>nova-api.log</filename>, and
<filename>nova-api.log</filename> and
<filename>nova-scheduler.log</filename>. If you search for
this on the compute nodes in
<filename>/var/log/nova-*.log</filename>, it appears
<filename>/var/log/nova-*.log</filename>, it appears in
<filename>nova-network.log</filename> and
<filename>nova-compute.log</filename>. If no ERROR or CRITICAL
messages appear, the most recent log entry that reports
@ -233,11 +234,11 @@
LOG = logging.getLogger(__name__)</programlisting>
<para>To add a DEBUG logging statement, you would do:</para>
<programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
<para>You may notice that all of the existing logging messages
<para>You may notice that all the existing logging messages
are preceded by an underscore and surrounded by
parentheses, for example:</para>
<programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
<para>This is used to support translation of logging messages
<para>This formatting is used to support translation of logging messages
into different languages using the <link
xlink:href="http://docs.python.org/2/library/gettext.html"
>gettext</link>
@ -256,9 +257,7 @@ LOG = logging.getLogger(__name__)</programlisting>
issues. Instead, we recommend you use the RabbitMQ web
management interface. Enable it on your cloud
controller:</para>
<screen><prompt>#</prompt>
<userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable
rabbitmq_management</userinput></screen>
<screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>
<screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
<para>The RabbitMQ web management interface is accessible on
your cloud controller at http://localhost:55672.</para>
@ -271,11 +270,11 @@ LOG = logging.getLogger(__name__)</programlisting>
<screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"</userinput>
<computeroutput>Version: 2.7.1-0ubuntu4</computeroutput></screen>
</note>
<para>An alternative to enabling the RabbitMQ Web Management
Interface is to use the <command>rabbitmqctl</command> commands. For example,
<para>An alternative to enabling the RabbitMQ web management
interface is to use the <command>rabbitmqctl</command> commands. For example,
<command>rabbitmqctl list_queues | grep
cinder</command> displays any messages
left in the queue. If there are, it's a possible sign that
left in the queue. If any messages are there, it's a possible sign that
cinder services didn't connect properly to RabbitMQ and
might have to be restarted.</para>
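<para>If you want to automate this check, the output can be parsed with a
short script. The sketch below assumes that each queue row consists of a
queue name followed by a message count; the function name and the sample
output are hypothetical, so compare them against what your RabbitMQ
version actually prints.</para>

```python
def queues_with_messages(output, keyword):
    """Return {queue_name: message_count} for matching non-empty queues."""
    backlog = {}
    for line in output.splitlines():
        parts = line.split()
        # Queue rows are assumed to look like "name count"; skip the rest.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name, count = parts[0], int(parts[1])
        if keyword in name and count > 0:
            backlog[name] = count
    return backlog


# Hypothetical output; real queue names and counts will differ.
sample_output = """Listing queues ...
cinder-scheduler\t0
cinder-volume\t3
compute\t0
...done."""
print(queues_with_messages(sample_output, "cinder"))
```

<para>In practice the text would come from running
<command>rabbitmqctl list_queues</command> on the cloud controller, and a
non-empty result could be turned into an alert.</para>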
<para>Items to monitor for RabbitMQ include the number of
@ -287,14 +286,14 @@ Version: 2.7.1-0ubuntu4</userinput></screen>
<para>Because your cloud is most likely composed of many
servers, you must check logs on each of those servers to
properly piece an event together. A better solution is to
send the logs of all servers to a central location so they
send the logs of all servers to a central location so that they
can all be accessed from the same area.</para>
<para>Ubuntu uses rsyslog as the default logging service.
Since it is natively able to send logs to a remote
location, you don't have to install anything extra to
enable this feature; just modify the configuration file.
In doing this, consider running your logging over a
management network, or using an encrypted VPN to avoid
management network or using an encrypted VPN to avoid
interception.</para>
<section xml:id="rsyslog_client_config">
<title>rsyslog Client Configuration</title>
@ -327,8 +326,8 @@ syslog_log_facility=LOG_LOCAL3</programlisting>
following line:</para>
<programlisting language="ini">*.* @192.168.1.10</programlisting>
<para>This instructs rsyslog to send all logs to the IP
listed. In this example, the IP points to the Cloud
Controller.</para>
listed. In this example, the IP points to the cloud
controller.</para>
</section>
<section xml:id="rsyslog_server_config">
<title>rsyslog Server Configuration</title>
@ -360,7 +359,7 @@ $template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
local0.* ?NovaFile
local0.* ?NovaAll
&amp; ~</programlisting>
<para>The above example configuration handles the nova service only.
<para>This example configuration handles the nova service only.
It first configures rsyslog to act as a server that runs on port
514. Next, it creates a series of logging templates. Logging
templates control where received logs are stored. Using
@ -378,7 +377,7 @@ local0.* ?NovaAll
</para>
</listitem>
</itemizedlist>
<para>This is useful as logs from c02.example.com go to:</para>
<para>This is useful, as logs from c02.example.com go to:</para>
<itemizedlist>
<listitem>
<para>
@ -397,10 +396,12 @@ local0.* ?NovaAll
</section>
</section>
<section xml:id="stacktach">
<!-- FIXME This section needs updating, especially with the advent of
ceilometer -->
<title>StackTach</title>
<para>StackTach is a tool created by Rackspace to collect and
report the notifications sent by <code>nova</code>.
Notifications are essentially the same as logs, but can be
Notifications are essentially the same as logs but can be
much more detailed. A good overview of notifications can
be found at <link xlink:title="StackTach GitHub repo"
xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
@ -433,7 +434,7 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
capable of executing arbitrary commands to check the
status of server and network services, remotely
executing arbitrary commands directly on servers, and
allow servers to push notifications back in the form
allowing servers to push notifications back in the form
of passive monitoring. Nagios has been around since
1999. Although newer monitoring services are
available, Nagios is a tried-and-true systems
@ -442,9 +443,9 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
<section xml:id="process_monitoring">
<title>Process Monitoring</title>
<para>A basic type of alert monitoring is to simply check
and see if a required process is running. For example,
and see whether a required process is running. For example,
ensure that the <code>nova-api</code> service is
running on the Cloud Controller:</para>
running on the cloud controller:</para>
<screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
@ -477,22 +478,22 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
more resources are critically low. While the
monitoring thresholds should be tuned to your specific
OpenStack environment, monitoring resource usage is
not specific to OpenStack at all any generic type of
not specific to OpenStack at all; any generic type of
alert will work fine.</para>
<para>Some of the resources that you want to monitor
include:</para>
<itemizedlist>
<listitem>
<para>Disk Usage</para>
<para>Disk usage</para>
</listitem>
<listitem>
<para>Server Load</para>
<para>Server load</para>
</listitem>
<listitem>
<para>Memory Usage</para>
<para>Memory usage</para>
</listitem>
<listitem>
<para>Network IO</para>
<para>Network I/O</para>
</listitem>
<listitem>
<para>Available vCPUs</para>
@ -512,8 +513,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
configuration:</para>
<programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
<para>Nagios alerts you with a WARNING when any disk on
the compute node is 80% full and CRITICAL when 90% is
full.</para>
the compute node is 80 percent full and CRITICAL when 90
percent is full.</para>
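<para>The threshold logic itself is simple. The following sketch is not the
Nagios <code>check_disk</code> plugin; it only illustrates the
80 percent and 90 percent thresholds described above. The function names
are hypothetical, and the root path is an assumption chosen for the
example.</para>

```python
import shutil


def classify_usage(percent, warning=80, critical=90):
    """Map a disk usage percentage to a Nagios-style state name."""
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


def disk_percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# "/" is only an example path; check whichever mount points matter to you.
print(classify_usage(disk_percent_used("/")))
```

<para>Keeping the classification separate from the measurement makes it
easy to reuse the same thresholds for other resources, such as memory
usage.</para>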
</section>
<section xml:id="metering_telemetry">
<title>Metering and Telemetry with Ceilometer</title>
@ -530,13 +531,13 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
xlink:href="http://docs.openstack.org/developer/ceilometer/"
>http://docs.openstack.org/developer/ceilometer/</link>.</para></section>
<section xml:id="os_resources">
<title>OpenStack-specific Resources</title>
<title>OpenStack-Specific Resources</title>
<para>Resources such as memory, disk, and CPU are generic
resources that all servers (even non-OpenStack
servers) have and are important to the overall health
of the server. When dealing with OpenStack
specifically, these resources are important for a
second reason: ensuring enough are available in order
second reason: ensuring that enough are available
to launch instances. There are a few ways you can see
OpenStack resource usage.</para>
<para>The first is through the <code>nova</code>
@ -545,14 +546,14 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<para>This command displays a list of how many instances a
tenant has running and some light usage statistics
about the combined instances. This command is useful
for a quick overview of your cloud, but doesn't really
for a quick overview of your cloud, but it doesn't really
get into a lot of details.</para>
<para>Next, the <code>nova</code> database contains three
tables that store usage information.</para>
<para>The <code>nova.quotas</code> and
<code>nova.quota_usages</code> tables store quota
information. If a tenant's quota is different than the
default quota settings, their quota is stored in
information. If a tenant's quota is different from the
default quota settings, its quota is stored in the
<code>nova.quotas</code> table. For
example:</para>
<screen><prompt>mysql&gt;</prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
@ -587,12 +588,12 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<para>By comparing a tenant's hard limit with their
current resource usage, you can see their usage
percentage. For example, if this tenant is using 1
Floating IP out of 10, then they are using 10% of
their Floating IP quota. Rather than doing the
floating IP out of 10, then they are using 10 percent of
their floating IP quota. Rather than doing the
calculation manually, you can use SQL or the scripting
language of your choice and create a formatted
report:</para>
<screen><computeroutput>+----------------------------------+------------+-------------+---------------+
<screen><computeroutput>+----------------------------------+------------+-------------+---------------+
| some_tenant |
+-----------------------------------+------------+------------+---------------+
| Resource | Used | Limit | |
@ -613,8 +614,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
| security_groups | 0 | 10 | 0 % |
| volumes | 2 | 10 | 20 % |
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>
<para>The above was generated using a custom script which
can be found on GitHub
<para>The above information was generated by using a custom script
that can be found on GitHub
(https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
<note>
<para>This script is specific to a certain OpenStack
@ -627,15 +628,15 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<title>Intelligent Alerting</title>
<para>Intelligent alerting can be thought of as a form of
continuous integration for operations. For example,
you can easily check to see if the Image Service is up and
you can easily check to see whether the Image Service is up and
running by ensuring that the <code>glance-api</code>
and <code>glance-registry</code> processes are running
or by seeing if <code>glance-api</code> is responding
or by seeing whether <code>glance-api</code> is responding
on port 9292.</para>
<para>But how can you tell if images are being
<para>But how can you tell whether images are being
successfully uploaded to the Image Service? Maybe the
disk that Image Service is storing the images on is
full or the S3 back-end is down. You could naturally
full or the S3 backend is down. You could naturally
check this by doing a quick image upload:</para>
<programlisting language="bash">#!/bin/bash
#
@ -649,35 +650,35 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
6_64-disk.img</programlisting>
<para>By taking this script and rolling it into an alert
for your monitoring system (such as Nagios), you now
have an automated way of ensuring image uploads to the
have an automated way of ensuring that image uploads to the
Image Catalog are working.</para>
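<para>One possible way to wrap such a script is sketched below. It relies
on the exit codes conventionally used by Nagios plugins (0 for OK, 2 for
CRITICAL, 3 for UNKNOWN) and assumes that the wrapped command exits with a
nonzero status when the upload fails. The function name is hypothetical,
and a stand-in command is used in place of the real upload
script.</para>

```python
import subprocess
import sys

# Exit codes conventionally used by Nagios plugins.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def run_check(command, timeout=60):
    """Run `command` and translate its result into (exit_code, message)."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: check timed out"
    except OSError as exc:
        # The command could not be started at all (for example, not found).
        return UNKNOWN, "UNKNOWN: could not run check: %s" % exc
    if result.returncode == 0:
        return OK, "OK: check succeeded"
    return CRITICAL, "CRITICAL: check exited with %d" % result.returncode


# Stand-in command for illustration; a real check would call the upload script.
code, message = run_check([sys.executable, "-c", "raise SystemExit(0)"])
print(message)
```

<para>A real plugin would finish by calling <code>sys.exit(code)</code> so
that the monitoring system can read the state from the exit
status.</para>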
<note>
<para>You must remove the image after each test. Even
better, test whether you can successfully delete
an image from the Image Service.</para>
</note>
<para>Intelligent alerting takes a considerable more
amount of time to plan and implement than the other
<para>Intelligent alerting takes considerably more
time to plan and implement than the other
alerts described in this chapter. A good outline to
implement intelligent alerting is:</para>
<itemizedlist>
<listitem>
<para>Review common actions in your cloud</para>
<para>Review common actions in your cloud.</para>
</listitem>
<listitem>
<para>Create ways to automatically test these
actions</para>
actions.</para>
</listitem>
<listitem>
<para>Roll these tests into an alerting
system</para>
system.</para>
</listitem>
</itemizedlist>
<para>Some other examples of intelligent alerting
include:</para>
<itemizedlist>
<listitem>
<para>Can instances launch and destroyed?</para>
<para>Can instances launch and be destroyed?</para>
</listitem>
<listitem>
<para>Can users be created?</para>
@ -693,7 +694,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<section xml:id="trending">
<title>Trending</title>
<para>Trending can give you great insight into how your
cloud is performing day to day. For example, if a busy
cloud is performing day to day. You can learn, for example, if a busy
day was simply a rare occurrence or if you should
start adding new compute nodes.</para>
<para>Trending takes a slightly different approach than
@ -733,7 +734,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<para>As an example, recording <code>nova-api</code> usage
can allow you to track the need to scale your cloud
controller. By keeping an eye on <code>nova-api</code>
requests, you can determine if you need to spawn more
requests, you can determine whether you need to spawn more
nova-api processes or go as far as introducing an
entirely new server to run <code>nova-api</code>. To
get an approximate count of the requests, look for
@ -762,10 +763,10 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<title>Summary</title>
<para>For stable operations, you want to detect failure promptly and
determine causes efficiently. With a distributed system, it's even
more important to track the right items to meet a service level target.
more important to track the right items to meet a service-level target.
Learning where these logs are located in the file system or API gives
you an advantage. Plus, we have discussed how to read, interpret, and
manipulate information from OpenStack services so you can monitor
you an advantage. This chapter also showed how to read, interpret, and
manipulate information from OpenStack services so that you can monitor
effectively.</para>
</section>
</chapter>