Merge "Copyedit of logging and monitoring chapter"

2014-03-20 13:33:29 +00:00 · 2014-03-20 13:33:29 +00:00 · 70c62e2166
parent fa7ee68bf3 3864dadb87
commit 70c62e2166
1 changed files with 77 additions and 76 deletions
--- a/doc/openstack-ops/ch_ops_log_monitor.xml
+++ b/doc/openstack-ops/ch_ops_log_monitor.xml
@ -13,15 +13,16 @@
    <?dbhtml stop-chunking?>
    <title>Logging and Monitoring</title>
    <para>As an OpenStack cloud is composed of so many different
-        services, there are a large number of log files. This section
-        aims to assist you in locating and working with them, and
-        other ways to track the status of your deployment.</para>
+        services, there are a large number of log files. This chapter
+        aims to assist you in locating and working with them and
+        describes other ways to track the status of your deployment.</para>
    <section xml:id="where_are_logs">
        <title>Where Are the Logs?</title>
        <para>Most services use the convention of writing
            their log files to subdirectories of the <code>/var/log
-                directory</code>.</para>
-            <informaltable rules="all">
+                directory</code>, as listed in <link linkend="openstack-log-locations">OpenStack Log Locations</link>.</para>
+            <table xml:id="openstack-log-locations" rules="all">
+                <caption>OpenStack Log Locations</caption>
                <thead>
                    <tr>
                        <th>Node Type</th>
@ -31,7 +32,7 @@
                </thead>
                <tbody>
                    <tr>
-                        <td><para>Cloud Controller</para></td>
+                        <td><para>Cloud controller</para></td>
                        <td><para>
                                <code>nova-*</code>
                            </para></td>
@ -40,7 +41,7 @@
                            </para></td>
                    </tr>
                    <tr>
-                         <td><para>Cloud Controller</para></td>
+                         <td><para>Cloud controller</para></td>
                        <td><para>
                                <code>glance-*</code>
                            </para></td>
@ -49,7 +50,7 @@
                            </para></td>
                    </tr>
                    <tr>
-                         <td><para>Cloud Controller</para></td>
+                         <td><para>Cloud controller</para></td>
                        <td><para>
                                <code>cinder-*</code>
                            </para></td>
@ -58,7 +59,7 @@
                            </para></td>
                    </tr>
                    <tr>
-                         <td><para>Cloud Controller</para></td>
+                         <td><para>Cloud controller</para></td>
                        <td><para>
                                <code>keystone-*</code>
                            </para></td>
@ -67,7 +68,7 @@
                            </para></td>
                    </tr>
                    <tr>
-                         <td><para>Cloud Controller</para></td>
+                         <td><para>Cloud controller</para></td>
                        <td><para>
                                <code>neutron-*</code>
                            </para></td>
@ -76,7 +77,7 @@
                            </para></td>
                    </tr>
                    <tr>
-                         <td><para>Cloud Controller</para></td>
+                         <td><para>Cloud controller</para></td>
                        <td><para>horizon</para></td>
                        <td><para>
                                <code>/var/log/apache2/</code>
@ -84,21 +85,21 @@
                    </tr>
                    <tr>
                         <td><para>All nodes</para></td>
-                        <td><para>misc (Swift,
+                        <td><para>misc (swift,
                            dnsmasq)</para></td>
                        <td><para>
                                <code>/var/log/syslog</code>
                            </para></td>
                    </tr>
                    <tr>
-                        <td><para>Compute Nodes</para></td>
+                        <td><para>Compute nodes</para></td>
                        <td><para>libvirt</para></td>
                        <td><para>
                                <code>/var/log/libvirt/libvirtd.log</code>
                            </para></td>
                    </tr>
                    <tr>
-                        <td><para>Compute Nodes</para></td>
+                        <td><para>Compute nodes</para></td>
                        <td><para>Console (boot up messages) for VM instances:</para></td>
                        <td><para>
                                <code>/var/lib/nova/instances/instance-&lt;instance
@ -106,36 +107,36 @@
                            </para></td>
                    </tr>
                    <tr>
-                        <td><para>Block Storage Nodes</para></td>
+                        <td><para>Block Storage nodes</para></td>
                        <td><para>cinder-volume</para></td>
                        <td><para>
                                <code>/var/log/cinder/cinder-volume.log</code>
                            </para></td>
                    </tr>
                </tbody>
-            </informaltable>
+            </table>
    </section>
    <section xml:id="how_to_read_logs">
        <title>Reading the Logs</title>
        <para>OpenStack services use the standard logging levels, at
            increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
            CRITICAL, and TRACE. That is, messages only appear in the logs
-            if they are more "severe" than the particular log level
+            if they are at least as "severe" as the configured log level,
            with DEBUG allowing all log statements through. For
            example, TRACE is logged only if the software has a stack
            trace, while INFO is logged for every message including
            those that are only for information.</para>
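The filtering rule described above can be sketched with Python's standard logging module. This is only an illustration under stated assumptions: the standard library has no AUDIT or TRACE level (OpenStack adds those itself), and the helper name emitted_messages and the sample messages are hypothetical.

```python
import io
import logging


def emitted_messages(threshold):
    """Return the lines that a logger configured at `threshold` lets through."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # A dedicated logger per threshold keeps the demo isolated from the root logger.
    logger = logging.getLogger("level_demo_%s" % threshold)
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(threshold)

    # One message per standard level, in increasing severity.
    logger.debug("debug detail")
    logger.info("informational note")
    logger.warning("something looks wrong")
    logger.error("something failed")
    logger.critical("service cannot continue")

    return stream.getvalue().splitlines()
```

With the threshold set to logging.DEBUG, all five messages are returned; with logging.WARNING, only the WARNING, ERROR, and CRITICAL messages remain.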
        <para>To disable DEBUG-level logging, edit
-                <filename>/etc/nova/nova.conf</filename>:</para>
+                <filename>/etc/nova/nova.conf</filename> as follows:</para>
        <programlisting language="ini">debug=false</programlisting>
        <para>Keystone is handled a little differently. To modify the
            logging level, edit the
                <filename>/etc/keystone/logging.conf</filename> file and look
            at the <code>logger_root</code> and <code>handler_file</code>
            sections.</para>
-        <para>Logging for Horizon is configured in
+        <para>Logging for horizon is configured in
                <filename>/etc/openstack_dashboard/local_settings.py</filename>.
-            As Horizon is a Django web application, it follows the
+            Because horizon is a Django web application, it follows the
                <link xlink:title="Django Logging"
                xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
                >Django Logging</link>
@ -144,7 +145,7 @@
        <para>The first step in finding the source of an error is
            typically to search for a CRITICAL, TRACE, or ERROR
            message in the log starting at the bottom of the log file.</para>
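A minimal sketch of that bottom-up search follows. It assumes each log line has the layout seen in the samples in this chapter (date, time, process ID, level, module); the function name last_severe_lines and the limit parameter are hypothetical.

```python
import re

# Assumed line layout, based on the sample log lines in this chapter:
# "<date> <time> <pid> <LEVEL> <module> ..."
LEVEL_PATTERN = re.compile(r"^\S+ \S+ \d+ (CRITICAL|TRACE|ERROR) ")


def last_severe_lines(lines, limit=5):
    """Return up to `limit` CRITICAL, TRACE, or ERROR lines, most recent first."""
    found = []
    # Walk from the bottom of the log, where the newest entries are.
    for line in reversed(lines):
        if LEVEL_PATTERN.match(line):
            found.append(line.rstrip("\n"))
            if len(found) == limit:
                break
    return found
```

For a real file, you might read it with readlines() and pass the resulting list to last_severe_lines; for very large logs, a tool such as tail is likely a better first step.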
-        <para>An example of a CRITICAL log message, with the
+        <para>Here is an example of a CRITICAL log message, with the
            corresponding TRACE (Python traceback) immediately
            following:</para>
        <screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
@ -179,10 +180,10 @@
 2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
        <para>In this example, cinder-volumes failed to start and has
            provided a stack trace, since its volume back-end has been
-            unable to setup the storage volume - probably because the
+            unable to set up the storage volume&mdash;probably because the
            LVM volume that is expected from the configuration does
            not exist.</para>
-        <para>An example error log:</para>
+        <para>Here is an example error log:</para>
        <screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
 [Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
        <para>In this error, a nova service has failed to connect to
@ -209,10 +210,10 @@
                <code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If
            you search for this string on the cloud controller in the
                <filename>/var/log/nova-*.log</filename> files, it appears in
-                <filename>nova-api.log</filename>, and
+                <filename>nova-api.log</filename> and
                <filename>nova-scheduler.log</filename>. If you search for
            this on the compute nodes in
-                <filename>/var/log/nova-*.log</filename>, it appears
+                <filename>/var/log/nova-*.log</filename>, it appears in
                <filename>nova-network.log</filename> and
                <filename>nova-compute.log</filename>. If no ERROR or CRITICAL
            messages appear, the most recent log entry that reports
@ -233,11 +234,11 @@
 LOG = logging.getLogger(__name__)</programlisting>
        <para>To add a DEBUG logging statement, you would do:</para>
        <programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
-        <para>You may notice that all of the existing logging messages
+        <para>You may notice that all the existing logging messages
            are preceded by an underscore and surrounded by
            parentheses, for example:</para>
        <programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
-        <para>This is used to support translation of logging messages
+        <para>This formatting is used to support translation of logging messages
            into different languages using the <link
                xlink:href="http://docs.python.org/2/library/gettext.html"
                >gettext</link>
@ -256,9 +257,7 @@ LOG = logging.getLogger(__name__)</programlisting>
            issues. Instead, we recommend you use the RabbitMQ web
            management interface. Enable it on your cloud
            controller:</para>
-        <screen><prompt>#</prompt>
-            <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable
-                rabbitmq_management</userinput></screen>
+        <screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>
        <screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
        <para>The RabbitMQ web management interface is accessible on
            your cloud controller at http://localhost:55672.</para>
@ -271,11 +270,11 @@ LOG = logging.getLogger(__name__)</programlisting>
            <screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"
 Version: 2.7.1-0ubuntu4</userinput></screen>
        </note>
-        <para>An alternative to enabling the RabbitMQ Web Management
-            Interface is to use the <command>rabbitmqctl</command> commands. For example,
+        <para>An alternative to enabling the RabbitMQ web management
+            interface is to use the <command>rabbitmqctl</command> commands. For example,
                <command>rabbitmqctl list_queues| grep
                cinder</command> displays any messages
-            left in the queue. If there are, it's a possible sign that
+            left in the queue. If any messages are there, it's a possible sign that
            cinder services didn't connect properly to rabbitmq and
            might have to be restarted.</para>
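If you want to script this check, the following sketch parses text captured from rabbitmqctl list_queues. It assumes the output has one queue per line as a name followed by a message count, with banner lines around it; verify that against your RabbitMQ version. The function name queues_with_backlog is hypothetical.

```python
def queues_with_backlog(list_queues_output, name_filter="cinder"):
    """Return {queue_name: message_count} for matching queues that are not empty."""
    backlog = {}
    for line in list_queues_output.splitlines():
        parts = line.split()
        # Skip banner lines such as "Listing queues ..." that are not "<name> <count>".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name, count = parts[0], int(parts[1])
        if name_filter in name and count > 0:
            backlog[name] = count
    return backlog
```

An empty result means no matching queue holds messages; a non-empty result is the same possible warning sign described above.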
        <para>Items to monitor for RabbitMQ include the number of
@ -287,14 +286,14 @@ Version: 2.7.1-0ubuntu4</userinput></screen>
        <para>Because your cloud is most likely composed of many
            servers, you must check logs on each of those servers to
            properly piece an event together. A better solution is to
-            send the logs of all servers to a central location so they
+            send the logs of all servers to a central location so that they
            can all be accessed from the same area.</para>
        <para>Ubuntu uses rsyslog as the default logging service.
            Since it is natively able to send logs to a remote
            location, you don't have to install anything extra to
            enable this feature, just modify the configuration file.
            In doing this, consider running your logging over a
-            management network, or using an encrypted VPN to avoid
+            management network or using an encrypted VPN to avoid
            interception.</para>
        <section xml:id="rsyslog_client_config">
            <title>rsyslog Client Configuration</title>
@ -327,8 +326,8 @@ syslog_log_facility=LOG_LOCAL3</programlisting>
                following line:</para>
            <programlisting language="ini">*.* @192.168.1.10</programlisting>
            <para>This instructs rsyslog to send all logs to the IP
-                listed. In this example, the IP points to the Cloud
-                Controller.</para>
+                listed. In this example, the IP points to the cloud
+                controller.</para>
        </section>
        <section xml:id="rsyslog_server_config">
            <title>rsyslog Server Configuration</title>
@ -360,7 +359,7 @@ $template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
 local0.* ?NovaFile
 local0.* ?NovaAll
 &amp; ~</programlisting>
-            <para>The above example configuration handles the nova service only.
+            <para>This example configuration handles the nova service only.
                It first configures rsyslog to act as a server that runs on port
                514. Next, it creates a series of logging templates. Logging
                templates control where received logs are stored. Using
@ -378,7 +377,7 @@ local0.* ?NovaAll
                    </para>
                </listitem>
            </itemizedlist>
-            <para>This is useful as logs from c02.example.com go to:</para>
+            <para>This is useful, as logs from c02.example.com go to:</para>
            <itemizedlist>
                <listitem>
                    <para>
@ -397,10 +396,12 @@ local0.* ?NovaAll
        </section>
    </section>
    <section xml:id="stacktach">
+        <!-- FIXME This section needs updating, especially with the advent of
+         ceilometer -->
        <title>StackTach</title>
        <para>StackTach is a tool created by Rackspace to collect and
            report the notifications sent by <code>nova</code>.
-            Notifications are essentially the same as logs, but can be
+            Notifications are essentially the same as logs but can be
            much more detailed. A good overview of notifications can
            be found at <link xlink:title="StackTach GitHub repo"
                xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
@ -433,7 +434,7 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
                capable of executing arbitrary commands to check the
                status of server and network services, remotely
                executing arbitrary commands directly on servers, and
-                allow servers to push notifications back in the form
+                allowing servers to push notifications back in the form
                of passive monitoring. Nagios has been around since
                1999. Although newer monitoring services are
                available, Nagios is a tried-and-true systems
@ -442,9 +443,9 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
        <section xml:id="process_monitoring">
            <title>Process Monitoring</title>
            <para>A basic type of alert monitoring is to simply check
-                and see if a required process is running. For example,
+                and see whether a required process is running. For example,
                ensure that the <code>nova-api</code> service is
-                running on the Cloud Controller:</para>
+                running on the cloud controller:</para>
            <screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
 <computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
 nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
@ -477,22 +478,22 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
                more resources are critically low. While the
                monitoring thresholds should be tuned to your specific
                OpenStack environment, monitoring resource usage is
-                not specific to OpenStack at all – any generic type of
+                not specific to OpenStack at all; any generic type of
                alert will work fine.</para>
            <para>Some of the resources that you want to monitor
                include:</para>
            <itemizedlist>
                <listitem>
-                    <para>Disk Usage</para>
+                    <para>Disk usage</para>
                </listitem>
                <listitem>
-                    <para>Server Load</para>
+                    <para>Server load</para>
                </listitem>
                <listitem>
-                    <para>Memory Usage</para>
+                    <para>Memory usage</para>
                </listitem>
                <listitem>
-                    <para>Network IO</para>
+                    <para>Network I/O</para>
                </listitem>
                <listitem>
                    <para>Available vCPUs</para>
@ -512,8 +513,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
                configuration:</para>
            <programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
            <para>Nagios alerts you with a WARNING when any disk on
-                the compute node is 80% full and CRITICAL when 90% is
-                full.</para>
+                the compute node is 80 percent full and with a CRITICAL when it is
+                90 percent full.</para>
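The threshold logic can be sketched as follows. This is not the check_disk plugin itself, only an approximation of the 80 and 90 percent rule described above, written in terms of space used. The exit codes follow the Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL); the function names and default path are hypothetical.

```python
import shutil

# Exit codes used by Nagios plugins.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(percent_used, warn_at=80.0, crit_at=90.0):
    """Map a usage percentage to a Nagios-style status code."""
    if percent_used >= crit_at:
        return CRITICAL
    if percent_used >= warn_at:
        return WARNING
    return OK


def disk_percent_used(path="/"):
    """Approximate percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

A wrapper script could call classify_usage(disk_percent_used("/")) and exit with the returned code so that a monitoring system can interpret the result.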
        </section>
        <section xml:id="metering_telemetry">
            <title>Metering and Telemetry with Ceilometer</title>
@ -530,13 +531,13 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
                    xlink:href="http://docs.openstack.org/developer/ceilometer/"
                    >http://docs.openstack.org/developer/ceilometer/</link>.</para></section>
        <section xml:id="os_resources">
-            <title>OpenStack-specific Resources</title>
+            <title>OpenStack-Specific Resources</title>
            <para>Resources such as memory, disk, and CPU are generic
                resources that all servers (even non-OpenStack
                servers) have and are important to the overall health
                of the server. When dealing with OpenStack
                specifically, these resources are important for a
-                second reason: ensuring enough are available in order
+                second reason: ensuring that enough are available
                to launch instances. There are a few ways you can see
                OpenStack resource usage.</para>
            <para>The first is through the <code>nova</code>
@ -545,14 +546,14 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
            <para>This command displays a list of how many instances a
                tenant has running and some light usage statistics
                about the combined instances. This command is useful
-                for a quick overview of your cloud, but doesn't really
+                for a quick overview of your cloud, but it doesn't really
                get into a lot of details.</para>
            <para>Next, the <code>nova</code> database contains three
                tables that store usage information.</para>
            <para>The <code>nova.quotas</code> and
                    <code>nova.quota_usages</code> tables store quota
-                information. If a tenant's quota is different than the
-                default quota settings, their quota is stored in
+                information. If a tenant's quota is different from the
+                default quota settings, its quota is stored in the
                    <code>nova.quotas</code> table. For
                example:</para>
            <screen><prompt>mysql&gt;</prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
@ -587,12 +588,12 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
            <para>By comparing a tenant's hard limit with their
                current resource usage, you can see their usage
                percentage. For example, if this tenant is using 1
-                Floating IP out of 10, then they are using 10% of
-                their Floating IP quota. Rather than doing the
+                floating IP out of 10, then they are using 10 percent of
+                their floating IP quota. Rather than doing the
                calculation manually, you can use SQL or the scripting
                language of your choice and create a formatted
                report:</para>
-            <screen><computeroutput>+----------------------------------+------------+-------------+---------------+
+<screen><computeroutput>+----------------------------------+------------+-------------+---------------+
 | some_tenant                                                                 |
 +-----------------------------------+------------+------------+---------------+
 | Resource                          | Used       | Limit      |               |
@ -613,8 +614,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
 | security_groups                   | 0          | 10         |           0 % |
 | volumes                           | 2          | 10         |          20 % |
 +-----------------------------------+------------+------------+---------------+</computeroutput></screen>
-            <para>The above was generated using a custom script which
-                can be found on GitHub
+            <para>The above information was generated by using a custom script
+                that can be found on GitHub
                (https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
            <note>
                <para>This script is specific to a certain OpenStack
@ -627,15 +628,15 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
            <title>Intelligent Alerting</title>
            <para>Intelligent alerting can be thought of as a form of
                continuous integration for operations. For example,
-                you can easily check to see if the Image Service is up and
+                you can easily check to see whether the Image Service is up and
                running by ensuring that the <code>glance-api</code>
                and <code>glance-registry</code> processes are running
-                or by seeing if <code>glace-api</code> is responding
+                or by seeing whether <code>glance-api</code> is responding
                on port 9292.</para>
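A basic port check of this kind might look like the sketch below. It confirms only that something accepts TCP connections on the port, not that the service behind it is healthy. The function name port_is_open and the timeout value are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_is_open("localhost", 9292) would test the port mentioned above, assuming the check runs on the host where the service listens.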
-            <para>But how can you tell if images are being
+            <para>But how can you tell whether images are being
                successfully uploaded to the Image Service? Maybe the
                disk that Image Service is storing the images on is
-                full or the S3 back-end is down. You could naturally
+                full or the S3 backend is down. You could naturally
                check this by doing a quick image upload:</para>
            <programlisting language="bash">#!/bin/bash
 #
@ -649,35 +650,35 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
 6_64-disk.img</programlisting>
            <para>By taking this script and rolling it into an alert
                for your monitoring system (such as Nagios), you now
-                have an automated way of ensuring image uploads to the
+                have an automated way of ensuring that image uploads to the
                Image Catalog are working.</para>
            <note>
                <para>You must remove the image after each test. Even
                    better, test whether you can successfully delete
                    an image from the Image Service.</para>
            </note>
-            <para>Intelligent alerting takes a considerable more
-                amount of time to plan and implement than the other
+            <para>Intelligent alerting takes considerably more
+                time to plan and implement than the other
                alerts described in this chapter. A good outline to
                implement intelligent alerting is:</para>
            <itemizedlist>
                <listitem>
-                    <para>Review common actions in your cloud</para>
+                    <para>Review common actions in your cloud.</para>
                </listitem>
                <listitem>
                    <para>Create ways to automatically test these
-                        actions</para>
+                        actions.</para>
                </listitem>
                <listitem>
                    <para>Roll these tests into an alerting
-                        system</para>
+                        system.</para>
                </listitem>
            </itemizedlist>
            <para>Some other examples for Intelligent Alerting
                include:</para>
            <itemizedlist>
                <listitem>
-                    <para>Can instances launch and destroyed?</para>
+                    <para>Can instances launch and be destroyed?</para>
                </listitem>
                <listitem>
                    <para>Can users be created?</para>
@ -693,7 +694,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
        <section xml:id="trending">
            <title>Trending</title>
            <para>Trending can give you great insight into how your
-                cloud is performing day to day. For example, if a busy
+                cloud is performing day to day. You can learn, for example, whether a busy
                day was simply a rare occurrence or if you should
                start adding new compute nodes.</para>
            <para>Trending takes a slightly different approach than
@ -733,7 +734,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
            <para>As an example, recording <code>nova-api</code> usage
                can allow you to track the need to scale your cloud
                controller. By keeping an eye on <code>nova-api</code>
-                requests, you can determine if you need to spawn more
+                requests, you can determine whether you need to spawn more
                nova-api processes or go as far as introducing an
                entirely new server to run <code>nova-api</code>. To
                get an approximate count of the requests, look for
@ -762,10 +763,10 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
        <title>Summary</title>
        <para>For stable operations, you want to detect failure promptly and
        determine causes efficiently. With a distributed system, it's even
-        more important to track the right items to meet a service level target.
+        more important to track the right items to meet a service-level target.
        Learning where these logs are located in the file system or API gives
-        you an advantage. Plus, we have discussed how to read, interpret, and
-        manipulate information from OpenStack services so you can monitor
+        you an advantage. This chapter also showed how to read, interpret, and
+        manipulate information from OpenStack services so that you can monitor
        effectively.</para>
    </section>
 </chapter>