<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="operational-considerations-massive-scale">
    <?dbhtml stop-chunking?>
    <title>Operational considerations</title>
    <para>To run efficiently at massive scale, automate
        as many of the operational processes as
        possible. Automation includes the configuration of
        provisioning, monitoring, and alerting systems. Part of the
        automation process is the capability to determine when
        human intervention is required and who should act. The
        objective is to increase the ratio of running systems to
        operational staff as much as possible in order to reduce maintenance
        costs. In a massively scaled environment, it is very difficult
        for staff to give each system individual care.</para>
    <para>Configuration management tools such as Puppet and Chef enable
        operations staff to categorize systems into groups based on
        their roles and thus create configurations and system states
        that the provisioning system enforces. Systems
        that fall out of the defined state due to errors or failures
        are quickly removed from the pool of active nodes and
        replaced.</para>
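    <para>As a rough illustration of the enforcement loop described
        above, the following Python sketch compares each node's reported
        state with the desired state for its role and replaces any node
        that has drifted. It does not show how Puppet or Chef work
        internally. The <literal>Node</literal> class, the
        <literal>DESIRED_STATE</literal> table, the service names, and
        the <literal>provision</literal> function are all hypothetical
        stand-ins chosen for the example.</para>
    <programlisting language="python"><![CDATA[import itertools
from dataclasses import dataclass

# Hypothetical desired state per role (illustrative service names only).
# A real deployment would express this in Puppet manifests or Chef recipes.
DESIRED_STATE = {
    "compute": {"nova-compute": "running", "ntp": "running"},
    "storage": {"swift-object": "running", "ntp": "running"},
}

_new_ids = itertools.count(1)


@dataclass
class Node:
    name: str
    role: str
    state: dict  # service name -> reported status


def has_drifted(node):
    """Return True if the node's reported state differs from its role's desired state."""
    return node.state != DESIRED_STATE[node.role]


def provision(role):
    """Stand-in for an automated provisioning system (assumption for this sketch)."""
    return Node(f"{role}-new-{next(_new_ids)}", role, dict(DESIRED_STATE[role]))


def reconcile(pool, provision_fn=provision):
    """Replace every drifted node with a freshly provisioned one.

    Returns the new pool of active nodes and the list of removed nodes.
    """
    active, removed = [], []
    for node in pool:
        if has_drifted(node):
            removed.append(node)                    # take it out of service
            active.append(provision_fn(node.role))  # replace rather than repair
        else:
            active.append(node)
    return active, removed


pool = [
    Node("compute-01", "compute", {"nova-compute": "running", "ntp": "running"}),
    Node("compute-02", "compute", {"nova-compute": "stopped", "ntp": "running"}),
]
active, removed = reconcile(pool)
print("removed:", [n.name for n in removed])
print("active:", [n.name for n in active])
]]></programlisting>
    <para>In practice the desired state lives in the configuration
        management tool and the replacement step calls the real
        provisioning system. The sketch only shows the policy:
        drifted nodes are replaced rather than repaired in place.</para>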
    <para>At large scale, the resource cost of diagnosing individual
        failed systems is far greater than the cost of
        replacement. It is more economical to replace the failed
        system with a new system, provisioning and configuring it
        automatically and adding it to the pool of active nodes.
        By automating tasks that are labor-intensive,
        repetitive, and critical to operations, cloud operations
        teams can work more
        efficiently because fewer resources are required for these
        common tasks. Administrators are then free to tackle
        tasks that are not easy to automate and that have longer-term
        impacts on the business, for example, capacity planning.</para>
    <section xml:id="the-bleeding-edge">
      <title>The bleeding edge</title>
    <para>Running OpenStack at massive scale requires striking a
        balance between stability and features. For example, it might
        be tempting to run an older stable release branch of OpenStack
        to make deployments easier. However, when running at massive
        scale, known issues that are of minor concern or have only
        minimal impact in smaller deployments can become pain points.
        Recent releases may address well-known issues. The OpenStack
        community can help resolve reported issues by applying
        the collective expertise of the OpenStack developers.</para>
    <para>Organizations running at massive scale make up only a
        small proportion of the OpenStack community. It is therefore
        important to share scale-related issues with the community
        and be a vocal advocate for resolving them. Some issues only
        manifest when operating at large scale, and the number of
        organizations able to duplicate and validate an issue is
        small, so it is important to document these issues and
        dedicate resources to their resolution.</para>
    <para>In some cases, the resolution to the problem is ultimately
        to deploy a more recent version of OpenStack. Alternatively,
        when you must resolve an issue in a production
        environment where rebuilding the entire environment is not an
        option, it is sometimes possible to deploy updates to specific
        underlying components in order to resolve issues or gain
        significant performance improvements. Although this may appear
        to expose the deployment to
        increased risk and instability, in many cases the problem
        being fixed is an issue that had not been discovered before.</para>
    <para>We recommend building a development and operations
        organization that is responsible for creating desired
        features, diagnosing and resolving issues, and building the
        infrastructure for large-scale continuous integration tests
        and continuous deployment. This helps catch bugs early and
        makes deployments faster and easier. In addition to
        development resources, we also recommend the recruitment
        of experts in the fields of message queues, databases, distributed
        systems, networking, cloud, and storage.</para></section>
    <section xml:id="growth-and-capacity-planning">
      <title>Growth and capacity planning</title>
    <para>An important consideration in running at massive scale is
        projecting growth and utilization trends in order to plan capital
        expenditures for the short and long term. Gather utilization
        meters for compute, network, and storage, along with historical
        records of these meters. While securing major
        anchor tenants can lead to rapid jumps in the utilization
        rates of all resources, the steady adoption of the cloud
        inside an organization or by consumers in a public
        offering also creates a steady trend of increased
        utilization.</para></section>
    <section xml:id="skills-and-training">
      <title>Skills and training</title>
    <para>Projecting growth for storage, networking, and compute is
        only one aspect of a growth plan for running OpenStack at
        massive scale. Growing and nurturing development and
        operational staff is an additional consideration. Sending team
        members to OpenStack conferences and meetup events, and
        encouraging active participation in the mailing lists and
        committees, are important ways to maintain skills and
        forge relationships in the community. For a list of OpenStack
        training providers in the marketplace, see: <link
        xlink:href="http://www.openstack.org/marketplace/training/">http://www.openstack.org/marketplace/training/</link>.
    </para>
    </section>
</section>