<?xml version="1.0" encoding="UTF-8"?>
<chapter version="5.0" xml:id="maintenance"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:ns5="http://www.w3.org/2000/svg"
         xmlns:ns4="http://www.w3.org/1998/Math/MathML"
         xmlns:ns3="http://www.w3.org/1999/xhtml"
         xmlns:db="http://docbook.org/ns/docbook">
  <?dbhtml stop-chunking?>

  <title>Maintenance, Failures, and Debugging</title>

  <para>Downtime, whether planned or unscheduled, is a certainty when running
  a cloud. This chapter aims to provide useful information for dealing
  proactively, or reactively, with these occurrences.<indexterm
      class="startofrange" xml:id="maindebug">
      <primary>maintenance/debugging</primary>

      <seealso>troubleshooting</seealso>
    </indexterm></para>

  <section xml:id="cloud_controller_storage">
    <?dbhtml stop-chunking?>

    <title>Cloud Controller and Storage Proxy Failures and Maintenance</title>

    <para>The cloud controller and storage proxy are very similar to each
    other when it comes to expected and unexpected downtime. One of each
    server type typically runs in the cloud, which makes them very noticeable
    when they are not running.</para>

    <para>For the cloud controller, the good news is that if your cloud uses
    the FlatDHCP multi-host HA network mode, existing instances and volumes
    continue to operate while the cloud controller is offline. For the storage
    proxy, however, no storage traffic is possible until it is back up and
    running.</para>

    <section xml:id="planned_maintenance">
      <?dbhtml stop-chunking?>

      <title>Planned Maintenance</title>

      <para>One way to plan for cloud controller or storage proxy maintenance
      is to simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy
      affects fewer users. If your cloud controller or storage proxy is too
      important to have unavailable at any point in time, you must look into
      high-availability options.<indexterm class="singular">
          <primary>cloud controllers</primary>

          <secondary>planned maintenance of</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>cloud controller planned maintenance</secondary>
        </indexterm></para>
    </section>

    <section xml:id="reboot_cloud_controller">
      <?dbhtml stop-chunking?>

      <title>Rebooting a Cloud Controller or Storage Proxy</title>

      <para>In most cases, you can simply issue the <code>reboot</code>
      command. The operating system cleanly shuts down services and then
      automatically reboots. If you want to be very thorough, run your backup
      jobs just before you reboot.<indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>rebooting following</secondary>
        </indexterm><indexterm class="singular">
          <primary>storage</primary>

          <secondary>storage proxy maintenance</secondary>
        </indexterm><indexterm class="singular">
          <primary>reboot</primary>

          <secondary>cloud controller or storage proxy</secondary>
        </indexterm><indexterm class="singular">
          <primary>cloud controllers</primary>

          <secondary>rebooting</secondary>
        </indexterm></para>
    </section>

    <section xml:id="after_a_cc_reboot">
      <?dbhtml stop-chunking?>

      <title>After a Cloud Controller or Storage Proxy Reboots</title>

      <para>After a cloud controller reboots, ensure that all required
      services were successfully started. The following commands use
      <code>ps</code> and <code>grep</code> to determine if nova, glance,
      keystone, and cinder are currently running:</para>

      <programlisting><?db-font-size 65%?># ps aux | grep nova-
# ps aux | grep glance-
# ps aux | grep keystone
# ps aux | grep cinder</programlisting>
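
      <para>The checks above can be wrapped in a small helper. The following
      is a minimal sketch, not part of any OpenStack tooling: the function
      name <code>missing_services</code> is hypothetical, and it reads a
      process listing on standard input so that you can try it without a
      running cloud.</para>

      <programlisting><![CDATA[
```shell
#!/bin/bash
# Hypothetical helper: print each pattern that does NOT appear in the
# process listing supplied on standard input.
missing_services() {
    local listing pattern
    listing=$(cat)                      # read the whole listing once
    for pattern in "$@"; do
        # grep -q exits non-zero when the pattern is absent
        if ! printf '%s\n' "$listing" | grep -q -e "$pattern"; then
            printf '%s\n' "$pattern"
        fi
    done
}

# Example (run on the cloud controller):
#   ps aux | missing_services nova- glance- keystone cinder
```
]]></programlisting>

      <para>An empty result suggests that every pattern matched at least one
      process. It does not prove that the service is healthy, so the
      functional checks that follow still apply.</para>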

      <para>Also check that all services are functioning. The following set of
      commands sources the <code>openrc</code> file, then runs some basic
      glance, nova, and openstack commands. If the commands work as expected,
      you can be confident that those services are in working
      condition:</para>

      <programlisting><?db-font-size 65%?># source openrc
# glance image-list
# nova list
# openstack project list</programlisting>

      <para>For the storage proxy, ensure that the Object Storage service has
      resumed:</para>

      <programlisting><?db-font-size 65%?># ps aux | grep swift</programlisting>

      <para>Also check that it is functioning:</para>

      <programlisting><?db-font-size 65%?># swift stat</programlisting>
    </section>

    <section xml:id="cc_failure">
      <?dbhtml stop-chunking?>

      <title>Total Cloud Controller Failure</title>

      <para>The cloud controller could completely fail if, for example, its
      motherboard goes bad. Users will immediately notice the loss of a cloud
      controller since it provides core functionality to your cloud
      environment. If your infrastructure monitoring does not alert you that
      your cloud controller has failed, your users definitely will.
      Unfortunately, this is a rough situation. The cloud controller is an
      integral part of your cloud. If you have only one controller, you will
      have many missing services if it goes down.<indexterm class="singular">
          <primary>cloud controllers</primary>

          <secondary>total failure of</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>cloud controller total failure</secondary>
        </indexterm></para>

      <para>To avoid this situation, create a highly available cloud
      controller cluster. This is outside the scope of this document, but you
      can read more in the <link
      xlink:href="http://docs.openstack.org/ha-guide/index.html">OpenStack High Availability
      Guide</link>.</para>

      <para>The next best approach is to use a configuration-management tool,
      such as Puppet, to automatically build a cloud controller. This should
      not take more than 15 minutes if you have a spare server available.
      After the controller rebuilds, restore any backups taken (see <xref
      linkend="backup_and_recovery" />).</para>

      <para>Also, in practice, the <literal>nova-compute</literal> services on
      the compute nodes do not always reconnect cleanly to the RabbitMQ server
      hosted on the controller when it comes back up after a long reboot. If
      that happens, restart the nova services on the compute nodes.</para>
    </section>
  </section>

  <section xml:id="compute_node_failures">
    <?dbhtml stop-chunking?>

    <title>Compute Node Failures and Maintenance</title>

    <para>Sometimes a compute node either crashes unexpectedly or requires a
    reboot for maintenance reasons.</para>

    <section xml:id="planned_maintenance_compute_node">
      <?dbhtml stop-chunking?>

      <title>Planned Maintenance</title>

      <para>If you need to reboot a compute node due to planned maintenance
      (such as a software or hardware upgrade), first ensure that all hosted
      instances have been moved off the node. If your cloud is utilizing
      shared storage, use the <code>nova live-migration</code> command. First,
      get a list of instances that need to be moved:<indexterm
          class="singular">
          <primary>compute nodes</primary>

          <secondary>maintenance</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>compute node planned maintenance</secondary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>

      <para>Next, migrate them one by one:</para>

      <programlisting><?db-font-size 65%?># nova live-migration &lt;uuid&gt; c02.example.com</programlisting>
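
      <para>Migrating many instances by hand is tedious, so the two commands
      above can be combined in a loop. This is a minimal sketch under stated
      assumptions: the function names <code>extract_uuids</code> and
      <code>migrate_all</code> and the <code>DRY_RUN</code> variable are
      hypothetical, and the sketch assumes that the only dash-separated UUIDs
      in the <code>nova list</code> table are instance IDs.</para>

      <programlisting><![CDATA[
```shell
#!/bin/bash
# Hypothetical helper: pull instance UUIDs out of the table that
# "nova list" prints, one UUID per line.
extract_uuids() {
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}

# Migrate every instance on one host to a target host, one at a time.
# Set DRY_RUN=1 to print the commands instead of running them.
migrate_all() {
    local source_host=$1 target_host=$2 uuid
    nova list --host "$source_host" --all-tenants | extract_uuids |
    while read -r uuid; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "nova live-migration $uuid $target_host"
        else
            nova live-migration "$uuid" "$target_host"
        fi
    done
}

# Example:
#   DRY_RUN=1 migrate_all c01.example.com c02.example.com
```
]]></programlisting>

      <para>Running with <code>DRY_RUN=1</code> first lets you review the
      list of migrations before any instance is moved.</para>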

      <para>If you are not using shared storage, you can use the
      <code>--block-migrate</code> option:</para>

      <programlisting><?db-font-size 65%?># nova live-migration --block-migrate &lt;uuid&gt; c02.example.com</programlisting>

      <para>After you have migrated all instances, ensure that the
      <code>nova-compute</code> service has <phrase
      role="keep-together">stopped</phrase>:</para>

      <programlisting><?db-font-size 65%?># stop nova-compute</programlisting>

      <para>If you use a configuration-management system, such as Puppet, that
      ensures the <code>nova-compute</code> service is always running, you can
      temporarily move the <literal>init</literal> files:</para>

      <programlisting><?db-font-size 65%?># mkdir /root/tmp
# mv /etc/init/nova-compute.conf /root/tmp
# mv /etc/init.d/nova-compute /root/tmp</programlisting>

      <para>Next, shut down your compute node, perform your maintenance, and
      turn the node back on. You can reenable the <code>nova-compute</code>
      service by undoing the previous commands:</para>

      <programlisting><?db-font-size 65%?># mv /root/tmp/nova-compute.conf /etc/init
# mv /root/tmp/nova-compute /etc/init.d/</programlisting>

      <para>Then start the <code>nova-compute</code> service:</para>

      <programlisting><?db-font-size 65%?># start nova-compute</programlisting>

      <para>You can now optionally migrate the instances back to their
      original compute node.</para>
    </section>

    <section xml:id="after_compute_node_reboot">
      <?dbhtml stop-chunking?>

      <title>After a Compute Node Reboots</title>

      <para>When you reboot a compute node, first verify that it booted
      successfully. This includes ensuring that the <code>nova-compute</code>
      service is running:<indexterm class="singular">
          <primary>reboot</primary>

          <secondary>compute node</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>compute node reboot</secondary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?># ps aux | grep nova-compute
# status nova-compute</programlisting>

      <para>Also ensure that it has successfully connected to the AMQP
      server:</para>

      <programlisting><?db-font-size 65%?># grep AMQP /var/log/nova/nova-compute.log
2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672</programlisting>

      <para>After the compute node is successfully running, you must deal with
      the instances that are hosted on that compute node because none of them
      are running. Depending on your SLA with your users or customers, you
      might have to start each instance and ensure that they start
      correctly.</para>
    </section>

    <section xml:id="maintenance_instances">
      <?dbhtml stop-chunking?>

      <title>Instances</title>

      <para>You can create a list of instances that are hosted on the compute
      node by performing the following command:<indexterm class="singular">
          <primary>instances</primary>

          <secondary>maintenance/debugging</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>instances</secondary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>

      <para>After you have the list, you can use the nova command to start
      each instance:</para>

      <programlisting><?db-font-size 65%?># nova reboot &lt;uuid&gt;</programlisting>

      <note>
        <para>Any time an instance shuts down unexpectedly, it might have
        problems on boot. For example, the instance might require an
        <code>fsck</code> on the root partition. If this happens, the user can
        use the dashboard VNC console to fix this.</para>
      </note>

      <para>If an instance does not boot, meaning <code>virsh list</code>
      never shows the instance as even attempting to boot, do the following on
      the compute node:</para>

      <programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-compute.log</programlisting>

      <para>Try executing the <code>nova reboot</code> command again. You
      should see an error message about why the instance was not able to
      boot.</para>

      <para>In most cases, the error is the result of something in libvirt's
      XML file (<code>/etc/libvirt/qemu/instance-xxxxxxxx.xml</code>) that no
      longer exists. You can enforce re-creation of the XML file as well as
      rebooting the instance by running the following command:</para>

      <programlisting><?db-font-size 65%?># nova reboot --hard &lt;uuid&gt;</programlisting>
    </section>

    <section xml:id="inspect_and_recover_failed_instances">
      <?dbhtml stop-chunking?>

      <title>Inspecting and Recovering Data from Failed Instances</title>

      <para>In some scenarios, instances are running but are inaccessible
      through SSH and do not respond to any command. The VNC console could be
      displaying a boot failure or kernel panic error messages. This could be
      an indication of file system corruption on the VM itself. If you need to
      recover files or inspect the content of the instance, qemu-nbd can be
      used to mount the disk.<indexterm class="singular">
          <primary>data</primary>

          <secondary>inspecting/recovering failed instances</secondary>
        </indexterm></para>

      <warning>
        <para>If you access or view the user's content and data, get approval
        first!<indexterm class="singular">
            <primary>security issues</primary>

            <secondary>failed instance data inspection</secondary>
          </indexterm></para>
      </warning>

      <para>To access the instance's disk
      (<literal>/var/lib/nova/instances/instance-<replaceable>xxxxxx</replaceable>/disk</literal>),
      use the following steps:</para>

      <orderedlist>
        <listitem>
          <para>Suspend the instance using the <literal>virsh</literal>
          command.</para>
        </listitem>

        <listitem>
          <para>Connect the qemu-nbd device to the disk.</para>
        </listitem>

        <listitem>
          <para>Mount the qemu-nbd device.</para>
        </listitem>

        <listitem>
          <para>Unmount the device after inspecting.</para>
        </listitem>

        <listitem>
          <para>Disconnect the qemu-nbd device.</para>
        </listitem>

        <listitem>
          <para>Resume the instance.</para>
        </listitem>
      </orderedlist>

      <para>If you do not follow steps 4 through 6, OpenStack Compute cannot
      manage the instance any longer. It fails to respond to any command
      issued by OpenStack Compute, and it is marked as shut down.</para>

      <para>Once you mount the disk file, you should be able to access it and
      treat it as a collection of normal directories with files and a
      directory structure. However, we do not recommend that you edit or touch
      any files because this could change the access control lists (ACLs) that
      are used to determine which accounts can perform what operations on
      files and directories. Changing ACLs can make the instance unbootable if
      it is not already.<indexterm class="singular">
          <primary>access control list (ACL)</primary>
        </indexterm></para>

      <orderedlist>
        <listitem>
          <para>Suspend the instance using the <literal>virsh</literal>
          command, taking note of the internal ID:</para>

          <programlisting><?db-font-size 65%?># virsh list
Id Name                 State
----------------------------------
1 instance-00000981    running
2 instance-000009f5    running
30 instance-0000274a    running

# virsh suspend 30
Domain 30 suspended</programlisting>
        </listitem>

        <listitem>
          <para>Connect the qemu-nbd device to the disk:</para>

          <programlisting><?db-font-size 65%?># cd /var/lib/nova/instances/instance-0000274a
# ls -lh
total 33M
-rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
-rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
-rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
-rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
# qemu-nbd -c /dev/nbd0 `pwd`/disk</programlisting>
        </listitem>

        <listitem>
          <para>Mount the qemu-nbd device.</para>

          <para>The qemu-nbd device tries to export the instance disk's
          different partitions as separate devices. For example, if vda is the
          disk and vda1 is the root partition, qemu-nbd exports the device as
          <literal>/dev/nbd0</literal> and <literal>/dev/nbd0p1</literal>,
          respectively:</para>

          <programlisting><?db-font-size 65%?># mount /dev/nbd0p1 /mnt/</programlisting>

          <para>You can now access the contents of <code>/mnt</code>, which
          correspond to the first partition of the instance's disk.</para>

          <para>To examine the secondary or ephemeral disk, either unmount the
          primary disk first, as shown here, or use an alternate mount point
          if you want both primary and secondary drives mounted at the same
          time:</para>

          <programlisting><?db-font-size 65%?># umount /mnt
# qemu-nbd -c /dev/nbd1 `pwd`/disk.local
# mount /dev/nbd1 /mnt/</programlisting>

          <programlisting><?db-font-size 65%?># ls -lh /mnt/
total 76K
lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -&gt; usr/bin
dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -&gt; usr/lib
lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -&gt; usr/lib64
drwx------.  2 root root  16K Oct 15 00:42 lost+found
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -&gt; usr/sbin
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var</programlisting>
        </listitem>

        <listitem>
          <para>Once you have completed the inspection, unmount the mount
          point and release the qemu-nbd device:</para>

          <programlisting><?db-font-size 65%?># umount /mnt
# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected</programlisting>
        </listitem>

        <listitem>
          <para>Resume the instance using <literal>virsh</literal>:</para>

          <programlisting><?db-font-size 65%?># virsh list
Id Name                 State
----------------------------------
1 instance-00000981    running
2 instance-000009f5    running
30 instance-0000274a    paused

# virsh resume 30
Domain 30 resumed</programlisting>
        </listitem>
      </orderedlist>
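
      <para>Because skipping the unmount, disconnect, and resume steps leaves
      the instance unmanageable, it can help to script the whole sequence so
      that the cleanup always runs. The following is a minimal sketch, not a
      supported tool: the function names <code>run</code> and
      <code>inspect_disk</code> and the <code>DRY_RUN</code> variable are
      hypothetical, the read-only mount option is an addition that is not
      part of the procedure above, and the sketch assumes that the root file
      system is the first partition, as in the example.</para>

      <programlisting><![CDATA[
```shell
#!/bin/bash
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# inspect_disk DOMAIN_ID DISK_PATH [NBD_DEVICE] [MOUNT_POINT]
# Suspends the domain, exposes the disk through qemu-nbd, mounts the
# first partition, lists it, then undoes every step in reverse order.
inspect_disk() {
    local domain=$1 disk=$2 nbd=${3:-/dev/nbd0} mnt=${4:-/mnt}
    run virsh suspend "$domain" || return 1
    if run qemu-nbd -c "$nbd" "$disk"; then
        # "-o ro" is an assumption of this sketch: mount read-only so
        # the inspection cannot modify files by accident.
        if run mount -o ro "${nbd}p1" "$mnt"; then
            run ls -lh "$mnt"           # replace with the real inspection
            run umount "$mnt"
        fi
        run qemu-nbd -d "$nbd"
    fi
    run virsh resume "$domain"
}

# Example:
#   DRY_RUN=1 inspect_disk 30 /var/lib/nova/instances/instance-0000274a/disk
```
]]></programlisting>

      <para>The nested <code>if</code> blocks mean that a failed connect or
      mount still falls through to the disconnect and resume commands.</para>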
    </section>

    <section xml:id="volumes">
      <?dbhtml stop-chunking?>

      <title>Volumes</title>

      <para>If the affected instances also had attached volumes, first
      generate a list of instance and volume UUIDs:<indexterm class="singular">
          <primary>volume</primary>

          <secondary>maintenance/debugging</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>volumes</secondary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?>mysql&gt; select nova.instances.uuid as instance_uuid,
cinder.volumes.id as volume_uuid, cinder.volumes.status,
cinder.volumes.attach_status, cinder.volumes.mountpoint,
cinder.volumes.display_name from cinder.volumes
inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
 where nova.instances.host = 'c01.example.com';</programlisting>

      <para>You should see a result similar to the following:</para>

      <programlisting><?db-font-size 55%?>
+--------------+------------+-------+--------------+-----------+--------------+
|instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
+--------------+------------+-------+--------------+-----------+--------------+
|9b969a05      |1f0fbf36    |in-use |attached      |/dev/vdc   | test         |
+--------------+------------+-------+--------------+-----------+--------------+
1 row in set (0.00 sec)</programlisting>

      <para>Next, manually detach and reattach the volumes, where X is the
      proper mount point:</para>

      <programlisting><?db-font-size 65%?># nova volume-detach &lt;instance_uuid&gt; &lt;volume_uuid&gt;
# nova volume-attach &lt;instance_uuid&gt; &lt;volume_uuid&gt; /dev/vdX</programlisting>

      <para>Be sure that the instance has successfully booted and is at a
      login screen before doing the above.</para>
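
      <para>When several volumes are affected, the detach and reattach
      commands can be driven from the query result. The following is a
      minimal sketch under stated assumptions: the function name
      <code>reattach_volumes</code> and the <code>DRY_RUN</code> variable are
      hypothetical, and the input is assumed to be one line per volume with
      the instance UUID, volume UUID, and mount point separated by
      whitespace. In practice a detach may take a moment to complete, so a
      real script might need to wait between the two commands.</para>

      <programlisting><![CDATA[
```shell
#!/bin/bash
# reattach_volumes: read "instance_uuid volume_uuid mountpoint" lines on
# standard input and detach, then reattach, each volume.
# Set DRY_RUN=1 to print the commands instead of running them.
reattach_volumes() {
    local instance volume mountpoint
    while read -r instance volume mountpoint; do
        [ -n "$mountpoint" ] || continue     # skip blank or short lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "nova volume-detach $instance $volume"
            echo "nova volume-attach $instance $volume $mountpoint"
        else
            # </dev/null keeps nova from consuming the loop's input
            nova volume-detach "$instance" "$volume" </dev/null &&
            nova volume-attach "$instance" "$volume" "$mountpoint" </dev/null
        fi
    done
}

# Example, using the full UUIDs and mount point from the query result:
#   printf '%s %s %s\n' "$instance_uuid" "$volume_uuid" /dev/vdc |
#       DRY_RUN=1 reattach_volumes
```
]]></programlisting>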
    </section>

    <section xml:id="total_compute_node_failure">
      <?dbhtml stop-chunking?>

      <title>Total Compute Node Failure</title>

      <para>Compute nodes can fail the same way a cloud controller can fail. A
      motherboard failure or some other type of hardware failure can cause an
      entire compute node to go offline. When this happens, all instances
      running on that compute node will not be available. Just like with a
      cloud controller failure, if your infrastructure monitoring does not
      detect a failed compute node, your users will notify you because of
      their lost instances.<indexterm class="singular">
          <primary>compute nodes</primary>

          <secondary>failures</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>compute node total failures</secondary>
        </indexterm></para>

      <para>If a compute node fails and won't be fixed for a few hours (or at
      all), you can relaunch all instances that are hosted on the failed node
      if you use shared storage for
      <code>/var/lib/nova/instances</code>.</para>

      <para>To do this, generate a list of instance UUIDs that are hosted on
      the failed node by running the following query on the nova
      database:</para>

      <programlisting><?db-font-size 65%?>mysql&gt; select uuid from instances where host = \
       'c01.example.com' and deleted = 0;</programlisting>

      <para>Next, update the nova database to indicate that all instances that
      used to be hosted on c01.example.com are now hosted on
      c02.example.com:</para>

      <programlisting><?db-font-size 65%?>mysql&gt; update instances set host = 'c02.example.com' where host = \
       'c01.example.com' and deleted = 0;</programlisting>

      <para>If you're using the Networking service ML2 plug-in, update the
      Networking service database to indicate that all ports that
      used to be hosted on c01.example.com are now hosted on
      c02.example.com:</para>

      <programlisting><?db-font-size 65%?>mysql&gt; update ml2_port_bindings set host = 'c02.example.com' where host = \
       'c01.example.com';</programlisting>

      <programlisting><?db-font-size 65%?>mysql&gt; update ml2_port_binding_levels set host = 'c02.example.com' where host = \
       'c01.example.com';</programlisting>

      <para>After that, use the <literal>nova</literal> command to reboot all
      instances that were on c01.example.com while regenerating their XML
      files at the same time:</para>

      <programlisting><?db-font-size 65%?># nova reboot --hard &lt;uuid&gt;</programlisting>

      <para>Finally, reattach volumes using the same method described in the
      section <link linkend="volumes">Volumes</link>.</para>
    </section>

    <section xml:id="var_lib_nova_instances">
      <?dbhtml stop-chunking?>

      <title>/var/lib/nova/instances</title>

      <para>It's worth mentioning this directory in the context of failed
      compute nodes. This directory contains the libvirt KVM file-based disk
      images for the instances that are hosted on that compute node. If you
      are not running your cloud in a shared storage environment, this
      directory is unique across all compute nodes.<indexterm class="singular">
          <primary>/var/lib/nova/instances directory</primary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>/var/lib/nova/instances</secondary>
        </indexterm></para>

      <para><code>/var/lib/nova/instances</code> contains two types of
      directories.</para>

      <para>The first is the <code>_base</code> directory. This contains all
      the cached base images from glance for each unique image that has been
      launched on that compute node. Files ending in <code>_20</code> (or a
      different number) are the ephemeral base images.</para>

      <para>The other directories are titled <code>instance-xxxxxxxx</code>.
      These directories correspond to instances running on that compute node.
      The files inside are related to one of the files in the
      <code>_base</code> directory. They're essentially differential-based
      files containing only the changes made from the original
      <code>_base</code> directory.</para>

      <para>All files and directories in <code>/var/lib/nova/instances</code>
      are uniquely named. The files in <code>_base</code> are uniquely titled for the
      glance image that they are based on, and the directory names
      <code>instance-xxxxxxxx</code> are uniquely titled for that particular
      instance. For example, if you copy all data from
      <code>/var/lib/nova/instances</code> on one compute node to another, you
      do not overwrite any files or cause any damage to images that have the
      same unique name, because they are essentially the same file.</para>

      <para>Although this method is not documented or supported, you can use
      it when your compute node is permanently offline but you have instances
      locally stored on it.</para>
    </section>
  </section>

  <section xml:id="storage_node_failures">
    <?dbhtml stop-chunking?>

    <title>Storage Node Failures and Maintenance</title>

    <para>Because of the high redundancy of Object Storage, dealing with
    object storage node issues is a lot easier than dealing with compute node
    issues.</para>

    <section xml:id="reboot_storage_node">
      <?dbhtml stop-chunking?>

      <title>Rebooting a Storage Node</title>

      <para>If a storage node requires a reboot, simply reboot it. Requests
      for data hosted on that node are redirected to other copies while the
      server is rebooting.<indexterm class="singular">
          <primary>storage node</primary>
        </indexterm><indexterm class="singular">
          <primary>nodes</primary>

          <secondary>storage nodes</secondary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>storage node reboot</secondary>
        </indexterm></para>
    </section>

    <section xml:id="shut_down_storage_node">
      <?dbhtml stop-chunking?>

      <title>Shutting Down a Storage Node</title>

      <para>If you need to shut down a storage node for an extended period of
      time (one or more days), consider removing the node from the storage
      ring. For example:<indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>storage node shut down</secondary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?># swift-ring-builder account.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder container.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder object.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder account.builder rebalance
# swift-ring-builder container.builder rebalance
# swift-ring-builder object.builder rebalance</programlisting>
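
      <para>The six commands above follow one pattern across the account,
      container, and object rings, so they can be written as two loops. This
      is a minimal sketch: the function names <code>run</code> and
      <code>remove_from_rings</code>, the <code>DRY_RUN</code> variable, and
      the IP address in the example are hypothetical, and the sketch does no
      error handling. It assumes that it runs in the directory that holds the
      builder files.</para>

      <programlisting><![CDATA[
```shell
#!/bin/bash
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remove_from_rings STORAGE_NODE_IP
# Removes the node from all three rings, then rebalances all three,
# in the same order as the manual commands.
remove_from_rings() {
    local ip=$1 ring
    for ring in account container object; do
        run swift-ring-builder "$ring.builder" remove "$ip"
    done
    for ring in account container object; do
        run swift-ring-builder "$ring.builder" rebalance
    done
}

# Example (10.0.0.5 is a placeholder address):
#   DRY_RUN=1 remove_from_rings 10.0.0.5
```
]]></programlisting>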

      <para>Next, redistribute the ring files to the other nodes:</para>

      <programlisting><?db-font-size 65%?># for i in s01.example.com s02.example.com s03.example.com
&gt; do
&gt; scp *.ring.gz $i:/etc/swift
&gt; done</programlisting>

      <para>These actions effectively take the storage node out of the storage
      cluster.</para>

      <para>When the node is able to rejoin the cluster, just add it back to
      the ring. The exact syntax you use to add a node to your swift cluster
      with <code>swift-ring-builder</code> heavily depends on the original
      options used when you originally created your cluster. Please refer back
      to those commands.</para>
    </section>

    <section xml:id="replace_swift_disk">
      <?dbhtml stop-chunking?>

      <title>Replacing a Swift Disk</title>

      <para>If a hard drive fails in an Object Storage node, replacing it is
      relatively easy. This assumes that your Object Storage environment is
      configured correctly, where the data that is stored on the failed drive
      is also replicated to other drives in the Object Storage
      environment.<indexterm class="singular">
          <primary>hard drives, replacing</primary>
        </indexterm><indexterm class="singular">
          <primary>maintenance/debugging</primary>

          <secondary>swift disk replacement</secondary>
        </indexterm></para>

      <para>This example assumes that <code>/dev/sdb</code> has failed.</para>

      <para>First, unmount the disk:</para>

      <programlisting><?db-font-size 65%?># umount /dev/sdb</programlisting>

      <para>Next, physically remove the disk from the server and replace it
      with a working disk.</para>

      <para>Ensure that the operating system has recognized the new
      disk:</para>

      <programlisting><?db-font-size 65%?># dmesg | tail</programlisting>

      <para>You should see a message about <code>/dev/sdb</code>.</para>

      <para>Because using partitions on a swift disk is not recommended,
      simply format the disk as a whole:</para>

      <programlisting><?db-font-size 65%?># mkfs.xfs /dev/sdb</programlisting>

      <para>Finally, mount the disk:</para>

      <programlisting><?db-font-size 65%?># mount -a</programlisting>

      <para>Swift should notice the new disk and that no data exists. It then
      begins replicating the data to the disk from the other existing
      replicas.</para>
    </section>
  </section>

  <section xml:id="complete_failure">
    <?dbhtml stop-chunking?>

    <title>Handling a Complete Failure</title>

    <para>A common way of dealing with the recovery from a full system
    failure, such as a power outage of a data center, is to assign each
    service a priority, and restore in order. <xref
    linkend="restor-prior-table" /> shows an example.<indexterm
        class="singular">
        <primary>service restoration</primary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>complete failures</secondary>
      </indexterm></para>

    <table rules="all" xml:id="restor-prior-table">
      <caption>Example service restoration priority list</caption>

      <thead>
        <tr>
          <th>Priority</th>

          <th>Services</th>
        </tr>
      </thead>

      <tbody>
        <tr>
          <td><para>1</para></td>

          <td><para>Internal network connectivity</para></td>
        </tr>

        <tr>
          <td><para>2</para></td>

          <td><para>Backing storage services</para></td>
        </tr>

        <tr>
          <td><para>3</para></td>

          <td><para>Public network connectivity for user virtual
          machines</para></td>
        </tr>

        <tr>
          <td><para>4</para></td>

          <td><para><literal>nova-compute</literal>,
          <literal>nova-network</literal>, cinder hosts</para></td>
        </tr>

        <tr>
          <td><para>5</para></td>

          <td><para>User virtual machines</para></td>
        </tr>

        <tr>
          <td><para>10</para></td>

          <td><para>Message queue and database services</para></td>
        </tr>

        <tr>
          <td><para>15</para></td>

          <td><para>Keystone services</para></td>
        </tr>

        <tr>
          <td><para>20</para></td>

          <td><para><literal>cinder-scheduler</literal></para></td>
        </tr>

        <tr>
          <td><para>21</para></td>

          <td><para>Image Catalog and Delivery services</para></td>
        </tr>

        <tr>
          <td><para>22</para></td>

          <td><para><literal>nova-scheduler</literal> services</para></td>
        </tr>

        <tr>
          <td><para>98</para></td>

          <td><para><literal>cinder-api</literal></para></td>
        </tr>

        <tr>
          <td><para>99</para></td>

          <td><para><literal>nova-api</literal> services</para></td>
        </tr>

        <tr>
          <td><para>100</para></td>

          <td><para>Dashboard node</para></td>
        </tr>
      </tbody>
    </table>

    <para>Use this example priority list to ensure that user-affecting services
    are restored as soon as possible, but not before a stable environment is
    in place. Of course, despite being listed as a single line item, each step
    requires significant work. For example, just after starting the database,
    you should check its integrity, and after starting the nova services, you
    should verify that the hypervisor matches the database and fix any <phrase
    role="keep-together">mismatches</phrase>.</para>
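
    <para>As a rough illustration, a priority list like this can be kept as
    data and sorted whenever it is needed. The following sketch assumes a
    hypothetical <literal>priority|service</literal> line format and a helper
    named <literal>restore_order</literal>; neither is part of OpenStack, and
    the function only prints the order rather than starting anything:</para>

    <programlisting language="bash"><?db-font-size 65%?># Print services in restoration order.
# Input: "priority|service" lines on stdin (an assumed format for this sketch).
restore_order() {
    sort -t '|' -k1,1n | while IFS='|' read -r priority service; do
        printf '%3d  %s\n' "$priority" "$service"
    done
}

# Example input taken from a few rows of the table above, in arbitrary order.
printf '%s\n' \
    '10|Message queue and database services' \
    '1|Internal network connectivity' \
    '100|Dashboard node' \
    '2|Backing storage services' | restore_order</programlisting>

    <para>Keeping the list in a plain file makes it easy to review and to keep
    under version control alongside the rest of your configuration.</para>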
  </section>

  <section xml:id="config_mgmt">
    <?dbhtml stop-chunking?>

    <title>Configuration Management</title>

    <para>Maintaining an OpenStack cloud requires that you manage multiple
    physical servers, and this number might grow over time. Because managing
    nodes manually is error prone, we strongly recommend that you use a
    configuration-management tool. These tools automate the process of
    ensuring that all your nodes are configured properly and encourage you to
    maintain your configuration information (such as packages and
    configuration options) in a version-controlled repository.<indexterm
        class="singular">
        <primary>configuration management</primary>
      </indexterm><indexterm class="singular">
        <primary>networks</primary>

        <secondary>configuration management</secondary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>configuration management</secondary>
      </indexterm></para>

    <tip>
      <para>Several configuration-management tools are available, and this
      guide does not recommend a specific one. The two most popular ones in
      the OpenStack community are <link
      xlink:href="https://puppetlabs.com/">Puppet</link>, with available
      <link xlink:href="https://github.com/puppetlabs/puppetlabs-openstack">OpenStack Puppet
      modules</link>; and <link
      xlink:href="http://www.getchef.com/chef/">Chef</link>, with available <link
      xlink:href="https://github.com/opscode/openstack-chef-repo">OpenStack Chef recipes</link>.
      Other newer configuration tools include <link
      xlink:href="https://juju.ubuntu.com/">Juju</link>, <link
      xlink:href="https://www.ansible.com/">Ansible</link>, and <link
      xlink:href="http://www.saltstack.com/">Salt</link>; and more mature
      configuration management tools include <link
      xlink:href="http://cfengine.com/">CFEngine</link> and <link
      xlink:href="http://bcfg2.org/">Bcfg2</link>.</para>
    </tip>
  </section>

  <section xml:id="hardware">
    <?dbhtml stop-chunking?>

    <title>Working with Hardware</title>

    <para>As with your initial deployment, you should ensure that all hardware
    is appropriately burned in before adding it to production. Run software
    that uses the hardware to its limits, maxing out RAM, CPU, disk, and
    network. Many such tools are available, and they normally double as
    benchmark software, so you also get a good idea of the performance of your
    system.<indexterm class="singular">
        <primary>hardware</primary>

        <secondary>maintenance/debugging</secondary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>hardware</secondary>
      </indexterm></para>

    <section xml:id="add_new_node">
      <?dbhtml stop-chunking?>

      <title>Adding a Compute Node</title>

      <para>If you find that you have reached or are reaching the capacity
      limit of your computing resources, you should plan to add additional
      compute nodes. Adding more nodes is quite easy. The process for adding
      compute nodes is the same as when the initial compute nodes were
      deployed to your cloud: use an automated deployment system to bootstrap
      the bare-metal server with the operating system and then have a
      configuration-management system install and configure OpenStack Compute.
      Once the Compute service has been installed and configured in the same
      way as the other compute nodes, it automatically attaches itself to the
      cloud. The cloud controller notices the new node(s) and begins
      scheduling instances to launch there.<indexterm class="singular">
          <primary>cloud controllers</primary>

          <secondary>new compute nodes and</secondary>
        </indexterm><indexterm class="singular">
          <primary>nodes</primary>

          <secondary>adding</secondary>
        </indexterm><indexterm class="singular">
          <primary>compute nodes</primary>

          <secondary>adding</secondary>
        </indexterm></para>

      <para>If your OpenStack Block Storage nodes are separate from your
      compute nodes, the same procedure still applies because the same queuing
      and polling system is used in both services.</para>

      <para>We recommend that you use the same hardware for new compute and
      block storage nodes. At the very least, ensure that the CPUs in the
      compute nodes are similar so that live migration does not break.</para>
    </section>

    <section xml:id="add_new_object_node">
      <?dbhtml stop-chunking?>

      <title>Adding an Object Storage Node</title>

      <para>Adding a new object storage node is different from adding compute
      or block storage nodes. You still want to initially configure the server
      by using your automated deployment and configuration-management systems.
      After that is done, you need to add the local disks of the object
      storage node into the object storage ring. The exact command to do this
      is the same command that was used to add the initial disks to the ring.
      Simply rerun this command on the object storage proxy server for all
      disks on the new object storage node. Once this has been done, rebalance
      the ring and copy the resulting ring files to the other storage
      nodes.<indexterm class="singular">
          <primary>Object Storage</primary>

          <secondary>adding nodes</secondary>
        </indexterm></para>

      <note>
        <para>If your new object storage node has a different number of disks
        than the original nodes have, the command to add the new node is
        different from the original commands. These parameters vary from
        environment to environment.</para>
      </note>
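
      <para>The following sketch shows what these commands could look like. It
      only prints <literal>swift-ring-builder</literal> commands so that you
      can review them before running anything. The helper name
      <literal>ring_add_commands</literal> and the region, zone, IP address,
      port, weight, and disk names are placeholder assumptions; use the values
      from your original ring-building commands:</para>

      <programlisting language="bash"><?db-font-size 65%?># Print (do not run) the ring-builder commands for each disk on a new node.
# Arguments: builder file, region, zone, IP, port, weight, then disk names.
ring_add_commands() {
    local builder="$1" region="$2" zone="$3" ip="$4" port="$5" weight="$6"
    shift 6
    local disk
    for disk in "$@"; do
        echo "swift-ring-builder $builder add r${region}z${zone}-${ip}:${port}/${disk} ${weight}"
    done
    # Rebalance once, after all disks have been added.
    echo "swift-ring-builder $builder rebalance"
}

# Hypothetical values: region 1, zone 2, two disks on 10.0.0.5.
ring_add_commands object.builder 1 2 10.0.0.5 6000 100 sdb sdc</programlisting>

      <para>The account and container rings have their own builder files and
      typically use different ports, so the same steps would be repeated for
      each of them.</para>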
    </section>

    <section xml:id="replace_components">
      <?dbhtml stop-chunking?>

      <title>Replacing Components</title>

      <para>Failures of hardware are common in large-scale deployments such as
      an infrastructure cloud. Consider your processes and balance time saving
      against availability. For example, an Object Storage cluster can easily
      live with dead disks in it for some period of time if it has sufficient
      capacity. Or, if your compute installation is not full, you could
      consider live migrating instances off a host with a RAM failure until
      you have time to deal with the problem.</para>
    </section>
  </section>

  <section xml:id="databases">
    <?dbhtml stop-chunking?>

    <title>Databases</title>

    <para>Almost all OpenStack components have an underlying database to store
    persistent information. Usually this database is MySQL. Normal MySQL
    administration is applicable to these databases, because OpenStack does
    not configure them in any unusual way. Basic administration includes
    performance tweaking, high availability, backup, recovery, and repair.
    For more information, see a standard MySQL administration guide.<indexterm
        class="singular">
        <primary>databases</primary>

        <secondary>maintenance/debugging</secondary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>databases</secondary>
      </indexterm></para>

    <para>You can perform a couple of tricks with the database to either more
    quickly retrieve information or fix a data inconsistency error—for
    example, an instance was terminated, but the status was not updated in the
    database. These tricks are discussed throughout this book.</para>

    <section xml:id="database_connect">
      <?dbhtml stop-chunking?>

      <title>Database Connectivity</title>

      <para>Review the component's configuration file to see how each
      OpenStack component accesses its corresponding database. Look for either
      <code>sql_connection</code> or simply <code>connection</code>. The
      following command uses <code>grep</code> to display the SQL connection
      string for nova, glance, cinder, and keystone:</para>

      <programlisting><?db-font-size 65%?># <emphasis role="bold">grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf
/etc/cinder/cinder.conf /etc/keystone/keystone.conf</emphasis>
sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
    connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone</programlisting>

      <para>The connection strings take this format:</para>

      <programlisting><?db-font-size 65%?>mysql+pymysql:// &lt;username&gt; : &lt;password&gt; @ &lt;hostname&gt; / &lt;database name&gt;</programlisting>
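
      <para>If you need the individual parts of a connection string in a
      script, shell parameter expansion is usually enough. The following is a
      minimal sketch; <literal>parse_db_url</literal> is a hypothetical helper
      name, and it assumes that the password contains no <literal>@</literal>
      character. The password is deliberately not printed:</para>

      <programlisting language="bash"><?db-font-size 65%?># Split an SQLAlchemy-style connection URL into user, host, and database.
parse_db_url() {
    local url="$1"
    local rest="${url#*://}"      # drop the scheme, such as "mysql+pymysql://"
    local creds="${rest%%@*}"     # "username:password"
    local location="${rest#*@}"   # "hostname/database"
    echo "user=${creds%%:*}"
    echo "host=${location%%/*}"
    echo "database=${location#*/}"
}

parse_db_url 'mysql+pymysql://glance:password@cloud.example.com/glance'</programlisting>

      <para>For the example URL, this prints <literal>user=glance</literal>,
      <literal>host=cloud.example.com</literal>, and
      <literal>database=glance</literal>, one per line.</para>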
    </section>

    <section xml:id="perf_and_opt">
      <?dbhtml stop-chunking?>

      <title>Performance and Optimizing</title>

      <para>As your cloud grows, MySQL is utilized more and more. If you
      suspect that MySQL might be becoming a bottleneck, you should start
      researching MySQL optimization. The MySQL manual has an entire section
      dedicated to this topic: <link
      xlink:href="http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html">Optimization
      Overview</link>.</para>
    </section>
  </section>

  <section xml:id="hdmy">
    <?dbhtml stop-chunking?>

    <title>HDWMY</title>

    <para>Here's a quick list of various to-do items for each hour, day, week,
    month, quarter, and half year. Please note that these tasks are neither
    required nor definitive, but they are helpful ideas:<indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>schedule of tasks</secondary>
      </indexterm></para>

    <section xml:id="hourly">
      <?dbhtml stop-chunking?>

      <title>Hourly</title>

      <itemizedlist>
        <listitem>
          <para>Check your monitoring system for alerts and act on
          them.</para>
        </listitem>

        <listitem>
          <para>Check your ticket queue for new tickets.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="daily">
      <?dbhtml stop-chunking?>

      <title>Daily</title>

      <itemizedlist>
        <listitem>
          <para>Check for instances in a failed or weird state and investigate
          why.</para>
        </listitem>

        <listitem>
          <para>Check for security patches and apply them as needed.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="weekly">
      <?dbhtml stop-chunking?>

      <title>Weekly</title>

      <itemizedlist>
        <listitem>
          <para>Check cloud usage: <itemizedlist>
              <listitem>
                <para>User quotas</para>
              </listitem>

              <listitem>
                <para>Disk space</para>
              </listitem>

              <listitem>
                <para>Image usage</para>
              </listitem>

              <listitem>
                <para>Large instances</para>
              </listitem>

              <listitem>
                <para>Network usage (bandwidth and IP usage)</para>
              </listitem>
            </itemizedlist></para>
        </listitem>

        <listitem>
          <para>Verify your alert mechanisms are still working.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="monthly">
      <?dbhtml stop-chunking?>

      <title>Monthly</title>

      <itemizedlist>
        <listitem>
          <para>Check usage and trends over the past month.</para>
        </listitem>

        <listitem>
          <para>Check for user accounts that should be removed.</para>
        </listitem>

        <listitem>
          <para>Check for operator accounts that should be removed.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="quarterly">
      <?dbhtml stop-chunking?>

      <title>Quarterly</title>

      <itemizedlist>
        <listitem>
          <para>Review usage and trends over the past quarter.</para>
        </listitem>

        <listitem>
          <para>Prepare any quarterly reports on usage and statistics.</para>
        </listitem>

        <listitem>
          <para>Review and plan any necessary cloud additions.</para>
        </listitem>

        <listitem>
          <para>Review and plan any major OpenStack upgrades.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="semiannual">
      <?dbhtml stop-chunking?>

      <title>Semiannually</title>

      <itemizedlist>
        <listitem>
          <para>Upgrade OpenStack.</para>
        </listitem>

        <listitem>
          <para>Clean up after an OpenStack upgrade (any unused or new
          services to be aware of?).</para>
        </listitem>
      </itemizedlist>
    </section>
  </section>

  <section xml:id="broken_component">
    <?dbhtml stop-chunking?>

    <title>Determining Which Component Is Broken</title>

    <para>OpenStack's components interact with each other strongly. For
    example, uploading an image requires interaction from
    <code>nova-api</code>, <code>glance-api</code>,
    <code>glance-registry</code>, keystone, and potentially
    <code>swift-proxy</code>. As a result, it is sometimes difficult to
    determine exactly where problems lie. This section is intended to help
    you narrow them down.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>tailing logs</secondary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>determining component affected</secondary>
      </indexterm></para>

    <section xml:id="tailing_logs">
      <?dbhtml stop-chunking?>

      <title>Tailing Logs</title>

      <para>The first place to look is the log file related to the command you
      are trying to run. For example, if <code>nova list</code> is failing,
      try tailing a nova log file and running the command again:<indexterm
          class="singular">
          <primary>tailing logs</primary>
        </indexterm></para>

      <para>Terminal 1:</para>

      <programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-api.log</programlisting>

      <para>Terminal 2:</para>

      <programlisting><?db-font-size 65%?># nova list</programlisting>

      <para>Look for any errors or traces in the log file. For more
      information, see <xref linkend="logging_monitoring" />.</para>

      <para>If the error indicates that the problem is with another component,
      switch to tailing that component's log file. For example, if nova cannot
      access glance, look at the <literal>glance-api</literal> log:</para>

      <para>Terminal 1:</para>

      <programlisting><?db-font-size 65%?># tail -f /var/log/glance/api.log</programlisting>

      <para>Terminal 2:</para>

      <programlisting><?db-font-size 65%?># nova list</programlisting>

      <para>Wash, rinse, and repeat until you find the core cause of the
      problem.</para>
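
      <para>When a log is busy, it can help to look only at recent error
      lines. The following sketch assumes that the log level appears as a
      separate word such as <literal>ERROR</literal> or
      <literal>TRACE</literal> on each line; <literal>recent_errors</literal>
      is a hypothetical helper name and the log path is only an
      example:</para>

      <programlisting language="bash"><?db-font-size 65%?># Show the most recent ERROR or TRACE lines from a log file.
# Arguments: log file, optional number of lines (default 20).
recent_errors() {
    local logfile="$1" count="${2:-20}"
    grep -E ' (ERROR|TRACE) ' "$logfile" | tail -n "$count"
}

# Only run the example if the log file exists and is readable.
if [ -r /var/log/nova/nova-api.log ]; then
    recent_errors /var/log/nova/nova-api.log 20
fi</programlisting>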
    </section>

    <section xml:id="daemons_cli">
      <?dbhtml stop-chunking?>

      <title>Running Daemons on the CLI</title>

      <para>Unfortunately, sometimes the error is not apparent from the log
      files. In this case, switch tactics and use a different command; maybe
      run the service directly on the command line. For example, if the
      <code>glance-api</code> service refuses to start and stay running, try
      launching the daemon from the command line:<indexterm class="singular">
          <primary>daemons</primary>

          <secondary>running on CLI</secondary>
        </indexterm><indexterm class="singular">
          <primary>Command-line interface (CLI)</primary>
        </indexterm></para>

      <programlisting><?db-font-size 65%?># sudo -u glance -H glance-api</programlisting>

      <para>This might print the error and cause of the problem.<note>
          <para>The <literal>-H</literal> flag is required when running the
          daemons with sudo because some daemons will write files relative to
          the user's home directory, and this write may fail if
          <literal>-H</literal> is left off.</para>
        </note></para>

      <sidebar>
        <title>Example of Complexity</title>

        <para>One morning, a compute node failed to run any instances. The log
        files were a bit vague, claiming that a certain instance was unable to
        be started. This ended up being a red herring because the instance was
        simply the first instance in alphabetical order, so it was the first
        instance that <literal>nova-compute</literal> would touch.</para>

        <para>Further troubleshooting showed that libvirt was not running at
        all. This made more sense. If libvirt wasn't running, then no instance
        could be virtualized through KVM. Upon trying to start libvirt, it
        would silently die immediately. The libvirt logs did not explain
        why.</para>

        <para>Next, the <code>libvirtd</code> daemon was run on the command
        line. Finally a helpful error message: it could not connect to d-bus.
        As ridiculous as it sounds, libvirt, and thus
        <code>nova-compute</code>, relies on d-bus and somehow d-bus crashed.
        Simply starting d-bus set the entire chain back on track, and soon
        everything was back up and running.</para>
      </sidebar>
    </section>
  </section>

  <?hard-pagebreak ?>

  <section xml:id="runningslow">
    <?dbhtml stop-chunking?>

    <title>What to Do When Things Are Running Slowly</title>

    <para>
      When you are getting slow responses from various services, it can be
      hard to know where to start looking. The first thing to check is the
      extent of the slowness: is it specific to a single service, or varied
      among different services? If your problem is isolated to a specific
      service, it can temporarily be fixed by restarting the service, but that
      is often only a fix for the symptom and not the actual problem.
    </para>

    <para>
      This is a collection of ideas from experienced operators on common
      things to look at that may be the cause of slowness. It is not, however,
      designed to be an exhaustive list.
    </para>

    <section xml:id="runningslow_keystone">
      <?dbhtml stop-chunking?>
      <title>OpenStack Identity service</title>
      <para>
        If OpenStack Identity is responding slowly, it could be due to the
        token table getting large. This can be fixed by running the
        <command>keystone-manage token_flush</command> command.
      </para>
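      <para>
        To see whether flushing made a difference, you can count the rows
        that remain afterward. This is only a sketch: it assumes the SQL
        token back end, a database named <literal>keystone</literal> with a
        table named <literal>token</literal>, MySQL credentials in
        <filename>~/.my.cnf</filename>, and a user that is allowed to run
        <command>keystone-manage</command>. The name
        <literal>flush_tokens</literal> is hypothetical:
      </para>

      <programlisting language="bash"><?db-font-size 65%?># Flush expired tokens, then print how many rows remain in the token table.
# The database and table names are assumptions; adjust them to your setup.
flush_tokens() {
    keystone-manage token_flush
    mysql --batch --skip-column-names -e 'SELECT COUNT(*) FROM token' keystone
}

# Only run the example on a host where keystone-manage is installed.
if command -v keystone-manage > /dev/null; then
    flush_tokens
fi</programlisting>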
      <para>
        Additionally, for Identity-related issues, try the tips in
        <xref linkend="runningslow_sql" />.
      </para>
    </section>

    <section xml:id="runningslow_glance">
      <?dbhtml stop-chunking?>
      <title>OpenStack Image service</title>
      <para>
        OpenStack Image service can be slowed down by things related to the
        Identity service, but the Image service itself can be slowed down if
        connectivity to the back-end storage in use is slow or otherwise
        problematic. For example, your back-end NFS server might have gone
        down.
      </para>
    </section>

    <section xml:id="runningslow_cinder">
      <?dbhtml stop-chunking?>
      <title>OpenStack Block Storage service</title>
      <para>
        OpenStack Block Storage service is similar to the Image service, so
        start by checking Identity-related services, and the back-end storage.
        Additionally, both the Block Storage and Image services rely on AMQP
        and SQL functionality, so consider these when debugging.
      </para>
    </section>

    <section xml:id="runningslow_nova">
      <?dbhtml stop-chunking?>
      <title>OpenStack Compute service</title>
      <para>
        Services related to OpenStack Compute are normally fairly fast and
        rely on a couple of back-end services: Identity for authentication and
        authorization, and AMQP for interoperability. Any slowness in Compute
        services is normally related to one of these. Also, as with all other
        services, SQL is used extensively.
      </para>
    </section>

    <section xml:id="runningslow_neutron">
      <?dbhtml stop-chunking?>
      <title>OpenStack Networking service</title>
      <para>
        Slowness in the OpenStack Networking service can be caused by services
        that it relies upon, but it can also be related to either physical or
        virtual networking. For example: network namespaces that do not exist
        or are not tied to interfaces correctly; DHCP daemons that have hung
        or are not running; a cable being physically disconnected; a switch
        not being configured correctly. When debugging Networking service
        problems, begin by verifying all physical networking functionality
        (switch configuration, physical cabling, etc.). After the physical
        networking is verified, check to be sure all of the Networking
        services are running (neutron-server, neutron-dhcp-agent, etc.), then
        check on AMQP and SQL back ends.
      </para>
    </section>

    <section xml:id="runningslow_amqp">
      <?dbhtml stop-chunking?>
      <title>AMQP broker</title>
      <para>
        Regardless of which AMQP broker you use, such as RabbitMQ, there are
        common issues which not only slow down operations, but can also cause
        real problems. Sometimes messages queued for services stay on the
        queues and are not consumed. This can be due to dead or stagnant
        services and can commonly be cleared up by restarting either the
        AMQP-related services or the OpenStack service in question.
      </para>
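      <para>
        With RabbitMQ, one way to spot such queues is to look for queues that
        hold messages but have no consumers. The sketch below filters rows in
        the form printed by <command>rabbitmqctl list_queues name messages
        consumers</command>; <literal>stuck_queues</literal> is a hypothetical
        helper name, and the sample rows are made up for illustration:
      </para>

      <programlisting language="bash"><?db-font-size 65%?># Print "name messages" for queues with pending messages and no consumers.
# Input: "name messages consumers" rows on stdin; other lines are ignored.
stuck_queues() {
    awk '$2 ~ /^[0-9]+$/ {
        if (NF == 3) if ($2 > 0) if ($3 == 0) print $1, $2
    }'
}

# Sample rows; on a broker host you would pipe in the output of
#   rabbitmqctl list_queues name messages consumers
printf '%s\n' 'compute 0 1' 'scheduler 12 0' 'cert 3 2' | stuck_queues</programlisting>

      <para>
        For the sample rows, only <literal>scheduler 12</literal> is printed,
        because it is the only queue with waiting messages and no consumer.
      </para>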
    </section>

    <section xml:id="runningslow_sql">
      <?dbhtml stop-chunking?>
      <title>SQL back end</title>
      <para>
        Whether you use SQLite or an RDBMS (such as MySQL), SQL
        interoperability is essential to a functioning OpenStack environment.
        A large or fragmented SQLite file can cause slowness when using files
        as a back end. A locked or long-running query can cause delays for
        most RDBMS services. In this case, do not kill the query immediately,
        but look into it to see if it is a problem with something that is
        hung, or something that is just taking a long time to run and needs to
        finish on its own. The administration of an RDBMS is outside the scope
        of this document, but it should be noted that a properly functioning
        RDBMS is essential to most OpenStack services.
      </para>
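      <para>
        For MySQL, long-running queries can be found in the process list
        before deciding what to do about them. The following sketch filters
        tab-separated rows as printed by <command>mysql --batch
        --skip-column-names -e 'SHOW FULL PROCESSLIST'</command>, assuming the
        usual column order in which column 5 is the command, column 6 is the
        time in seconds, and column 8 is the statement.
        <literal>long_queries</literal> is a hypothetical helper name, and
        the sample row is made up:
      </para>

      <programlisting language="bash"><?db-font-size 65%?># Print id, run time, and statement for queries running longer than a
# threshold in seconds (default 60). Input: process list rows on stdin.
long_queries() {
    local threshold="${1:-60}"
    awk -F '\t' -v limit="$threshold" '
        $5 == "Query" { if ($6 + 0 > limit + 0) print $1 "\t" $6 "s\t" $8 }
    '
}

# Sample row; on a database host you would pipe in the mysql output instead.
printf '7\tnova\tlocalhost\tnova\tQuery\t120\tSending data\tSELECT 1\n' | long_queries 60</programlisting>

      <para>
        The function only reports queries; it does not kill anything, which
        leaves the decision about each query to you.
      </para>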
    </section>

  </section>

  <?hard-pagebreak ?>

  <section xml:id="uninstalling">
    <?dbhtml stop-chunking?>

    <title>Uninstalling</title>

    <para>While we'd always recommend using your automated deployment system
    to reinstall systems from scratch, sometimes you do need to remove
    OpenStack from a system the hard way. Here's how:<indexterm
        class="singular">
        <primary>uninstall operation</primary>
      </indexterm><indexterm class="singular">
        <primary>maintenance/debugging</primary>

        <secondary>uninstalling</secondary>
      </indexterm></para>

    <itemizedlist>
      <listitem>
        <para>Remove all packages.</para>
      </listitem>

      <listitem>
        <para>Remove remaining files.</para>
      </listitem>

      <listitem>
        <para>Remove databases.</para>
      </listitem>
    </itemizedlist>

    <para>These steps depend on your underlying distribution, but in general
    you should be looking for "purge" commands in your package manager, like
    <literal>aptitude purge ~c $package</literal>. Following this, you can
    look for orphaned files in the directories referenced throughout this
    guide. To uninstall the database properly, refer to the manual appropriate
    for the product in use.<indexterm class="endofrange"
    startref="maindebug" /></para>
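
    <para>On Debian-based systems, packages that were removed but not purged
    are listed by <command>dpkg -l</command> with the state
    <literal>rc</literal>. The following sketch lists them so that you can
    review what a purge would clean up; <literal>residual_config</literal> is
    a hypothetical helper name, and other distributions need a different
    approach:</para>

    <programlisting language="bash"><?db-font-size 65%?># Print names of packages that are removed but still have configuration
# files (dpkg state "rc"). Input: "dpkg -l" output on stdin.
residual_config() {
    awk '$1 == "rc" { print $2 }'
}

# Only run the example on systems that have dpkg.
if command -v dpkg > /dev/null; then
    dpkg -l | residual_config
fi</programlisting>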
  </section>
</chapter>