Merge "Update config management for the Infra Control Plane"

Zuul 2018-07-20 16:08:14 +00:00 committed by Gerrit Code Review
commit 1ac4dfdf9c
2 changed files with 466 additions and 0 deletions


@@ -42,6 +42,7 @@ permits.
specs/translation_check_site
specs/wiki_modernization
specs/project-hosting
specs/update-config-management
Help Wanted
===========


@@ -0,0 +1,465 @@
::
Copyright 2018 Red Hat, Inc.
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
========================
Update Config Management
========================
Puppet 3 has been EOL since December 31, 2016.
We are also increasingly finding ourselves in a position where our
configuration management is a master we serve rather than a force-multiplier
to help us get things done quickly and more repeatably.
Since our current system was designed, we've also grown capabilities in our
CI infrastructure that we are, ironically, poorly positioned to take full
advantage of.
Our CI jobs are now written in Ansible, which means we have a large number
of people who are interfacing with Ansible on a regular basis, so switching
back to Puppet for configuration management is a mental context switch.
We tend to deploy a large amount of our software continuously, often either
from our own sources or from a third party, such that the traditional value
of installing software from the distro channel is not as great as it once
was.
Problem Description
===================
Our current system uses Ansible to orchestrate the running of Puppet. It
has grown organically since it was first put in place back in 2011 and it
is truly amazing it still works as well as it does.
However, with the advent of Ansible-based job content in Zuul v3, we are
regularly writing large amounts of Ansible in service of OpenStack, so the
cognitive shift back to implementing support for services in Puppet feels
onerous. In our current system, we aren't taking advantage of the power
that Zuul's Ansible integration affords us.
We currently have an awkward hybrid system with some information in
Ansible inventory and some information in Puppet hiera with host group
generation based on a template file.
A majority of the services we deploy end up being compiled from source on
production servers with the help of various development toolchains.
While we could start systematically packaging our services and run a
distribution mirror, it incurs a significant overhead we'd prefer to avoid.
More and more of our systems are multi-node distributed systems which are
not well served by Puppet.
We do have a very large corpus of existing Puppet, so an all-in-one change to
anything is unreasonable.
Proposed Change
===============
OpenStack Infra should migrate its control plane from Ansible-orchestrated
Puppet to Ansible-orchestrated, container-based deployments. Doing that has
three fundamental pieces:
* Upgrade the existing Puppet to Puppet 4 (and possibly Puppet 5 depending
on how long other tasks take).
* Config management and orchestration should migrate from Puppet to Ansible.
* Software installation should migrate to container images built in Zuul
jobs.
Puppet 4
--------
Upgrading to Puppet 4 gives us additional time to deal with the other larger
changes while not feeling pressure due to the Puppet 3 EOL.
We need to complete and enhance the puppet module functional tests so that they
are easier to manage and so they are capable of installing and validating
services with puppet 4.
When we are confident that all the modules needed for a given site.pp node work
with puppet 4, we will upgrade that node to puppet 4. We'll track that a node should
be running puppet 4 by adding a new puppet-4 group in groups.txt and adding
nodes to it one at a time. A new playbook will be written that runs an
idempotent upgrade for those nodes.
Ansible
-------
We should update to using at least Ansible 2.5 and the OpenStack Inventory
plugin instead of the existing inventory script. Inventory plugins are
stackable, so we should be able to rework the group membership and emergency
file system to be a collection of Inventory plugins instead of a static
generation script.
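As a rough sketch, the OpenStack inventory plugin is configured with a small
YAML file that can sit in an inventory directory alongside other sources,
such as a static YAML inventory for the emergency and group overrides (the
file location and option values below are assumptions to be validated against
the Ansible 2.5 documentation):

.. code-block:: yaml

   # /etc/ansible/hosts/openstack.yaml (hypothetical location)
   plugin: openstack
   # build hostvars for every host from the cloud metadata
   expand_hostvars: true
   # don't abort the whole inventory if one cloud is unreachable
   fail_on_errors: false
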
We should shift away from the ``run_all.sh`` model and move instead to
per-service ansible playbooks. The first version of the per-service playbooks
can simply call puppet as the existing playbook does. As we write these, we
should make an individual cron job for each playbook, since their runs do not
depend on one another.
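A minimal sketch of a first-pass per-service playbook, assuming we reuse the
existing role that runs puppet on the remote host (the group, role and file
names here are illustrative only):

.. code-block:: yaml

   # playbooks/service_etherpad.yaml (hypothetical)
   - hosts: etherpad
     become: true
     roles:
       # run puppet on the host exactly as run_all does today
       - puppet

Each such playbook then gets its own cron entry on puppetmaster instead of
being serialized behind ``run_all.sh``.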
We should then replace the cron jobs with Zuul jobs in
``openstack-infra/system-config``. Those jobs should use ``add_host`` to
add ``puppetmaster.openstack.org`` with the secret key for ssh in a Zuul
secret. Using a secret and ``add_host`` will ensure the jobs can't be used by
other projects. Since the job will use ``add_host`` for puppetmaster, the job
itself can be nodeless, which should ensure we don't have issues running
deployment jobs while under periods of high build traffic.
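A sketch of what the nodeless deploy playbook might look like; the secret
name, key location and service playbook path are assumptions for
illustration:

.. code-block:: yaml

   # Runs on the Zuul executor with no nodes allocated.
   - hosts: localhost
     tasks:
       - name: Write the bastion ssh key supplied by the Zuul secret
         copy:
           content: "{{ deploy_key.private_key }}"
           dest: "{{ ansible_user_dir }}/.ssh/id_rsa"
           mode: 0600
       - name: Add the bastion host to the inventory
         add_host:
           name: puppetmaster.openstack.org
           groups: bastion
           ansible_user: zuul

   # Run the actual deployment from the bastion host.
   - hosts: bastion
     tasks:
       - name: Run the per-service playbook
         command: ansible-playbook /opt/system-config/playbooks/service_etherpad.yaml
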
We should `add a central ARA instance`_ that ansible is configured to log to
when run on puppetmaster. That way sysadmins running ansible by hand on
puppetmaster and Zuul-driven jobs will both log to a consistent place. The
ssh key for a zuul user account on the puppetmaster host can be stored in
Zuul as a secret. As a future improvement, adding the ability to export
reports from one ARA instance and import them into another could allow us to
log Zuul-driven playbook runs to a per-job ARA as well as send that same
report data to the central ARA.
We should migrate the ``openstack_project::server`` base puppet pieces to
Ansible roles in the ``openstack-infra/system-config`` repo. This currently
involves creating users, setting timezone to UTC, setting up rsyslog,
configuring apt to retry and not pull translations, setting up ntp, setting
up the root ssh account for ansible management, setting up snmp, disabling
cloud-init and uninstalling some things we don't need. There are options for
installing the AFS client, managing exim, enabling unbound and managing
iptables rules that should just be turned into roles and included in the
playbooks for a given service. Similarly, we install pip/virtualenv in
``openstack_project::server``. We should be able to just stop doing that since
we'll be shifting from installing things directly on the system with pip to
installing them in containers, although we still want to put it in the
ansible version of ``openstack_project::server`` so that we can transition
our puppet services one at a time.
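A sketch of what the Ansible replacement for ``openstack_project::server``
could look like as a base playbook; every role name below is a placeholder
for a role we would write, and the optional pieces (AFS client, exim,
unbound, iptables) become roles that individual service playbooks include as
needed:

.. code-block:: yaml

   # playbooks/base.yaml (hypothetical) - applied to every host
   - hosts: all
     become: true
     roles:
       - users               # create and manage infra-root accounts
       - timezone            # UTC everywhere
       - rsyslog
       - apt-config          # retries, no translation downloads
       - ntp
       - snmp
       - disable-cloud-init
       - pip-and-virtualenv  # kept only for the puppet transition period
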
We should keep the roles in the ``roles`` directory of
``openstack-infra/system-config`` for the time being. While we might want to
eventually split things out into role repositories in the future, there is
enough complication related to an in-place CD transition from puppet to ansible
without over-organizing at the outset.
Once we have per-service playbooks and base server roles, we can begin to
rework the services to be native Ansible one service at a time.
As we work on per-service Ansible, we should shift our secrets from hiera
to an Ansible-inventory-based set of host and group vars. They can continue to
be yaml files in a private git repo - and in fact the current structure may
even work directly. Today ansible copies some variables to the remote host so
that the local puppet apply has access to the hiera data; once the code being
run is actual ansible roles and playbooks, Ansible can use the secrets as
variables directly and stop copying secret chunks to remote hosts entirely.
Cleaning up and organizing the existing hiera data first is likely a good
idea.
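For illustration only (the layout and variable names are assumptions), the
private repo would hold plain inventory variable files and roles would
reference the values directly:

.. code-block:: yaml

   # group_vars/etherpad.yaml in the private inventory repo (hypothetical)
   etherpad_mysql_password: not-the-real-value
   etherpad_session_key: not-the-real-value

A template task in the etherpad role can then reference
``{{ etherpad_mysql_password }}`` directly; the only place the secret lands
on the remote host is the rendered config file itself.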
We may want to investigate using Ansible Vault to store the secrets, with
GPG for encrypting/decrypting them. The GPG private key for zuul can be
stored as a Zuul secret, and we can encrypt things for the union of Zuul and
infra-root. However, this would be more than what we're doing currently with
hiera, so it should be considered a future improvement. If we shift from
``add_host`` for adding puppetmaster to using the static driver for
puppetmaster, then we may want to consider protecting the secrets using
vault with a GPG key in a Zuul secret. Doing so would be belt-and-suspenders
protection against the node being used in the wrong context.
On a per-service basis, as we migrate from Puppet to Ansible, we may find
that updating to installing the software via containers at the same time is
more straightforward than breaking it into two steps.
Containers
----------
We should start installing the software for the services we run using
thin per-process containers based on images that we build in the CI system.
We should build and run those containers using Docker. We should install
Docker from the upstream Docker package repository.
Adoption of container technology can happen in phases and in parallel to the
Ansible migration so that we're not biting off too much at one time, nor
blocking progress on a phased approach. There is no need to go overboard. If
a service doesn't make sense in containers, such as potentially AFS, we can
just run those services as we are running them now except using Ansible
instead of Puppet. Services like AFS or exim, where we're installing from
distro packages anyway, are less likely to see a win from bundling the
software into containers first. On the other hand, services where we're
installing from source in production, like Zuul, or building artifacts in CI,
like Gerrit (nearly all of our services), are the most likely to see a win
and should be focused on first.
Building container images in CI allows us to decouple essential dependency
versions from underlying distro releases. Where possible, we should prefer to
use ecosystem-specific base images rather than distro-specific base images.
For instance, we should build container images for each zuul service using the
``python:3.6-slim`` base image with Python 3.6, a container for etherpad using
the ``nodejs`` base image at the correct node version tag, and a container
for Gerrit with the ``openjdk`` base image. For our Python services, a new
tool, `pbrx`_, is in the works; it has a command for making single-process
containers from pbr setup.cfg and bindep.txt files.
The container images we make should be single-process containers and should
use `dumb-init`_ as an Entrypoint so that signals and forking work properly.
This will allow us to start building and using containers of various pieces
of software only by changing the software installation and init scripts even
while config files, data volumes and the like are still managed by puppet.
Config files and data volumes will be exposed to the running container via
normal bind mounts. Something like:
.. code-block:: console
docker run -v /etc/zuul:/etc/zuul -v /var/log/zuul:/var/log/zuul zuul/zuul-scheduler
By doing this, we'll still have config files and log files in locations we
expect.
Our services are all currently designed with the assumption that they exist
in a direct internet networking environment. Network namespacing is not a
feature that provides value to us, so we should run docker with
``--network host`` to disable network namespacing.
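Under Ansible, the long-running container can be managed with the
``docker_container`` module, which supports host networking and bind mounts;
the names and paths here are illustrative:

.. code-block:: yaml

   - name: Run the zuul-scheduler container
     docker_container:
       name: zuul-scheduler
       image: zuul/zuul-scheduler
       network_mode: host      # no network namespacing
       restart_policy: always
       volumes:
         - /etc/zuul:/etc/zuul
         - /var/log/zuul:/var/log/zuul
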
We also have a need to run local commands, such as ``zuul enqueue``. Those
commands should all exist in the containers, so something like:
.. code-block:: console
docker run -it --rm zuul/zuul -- enqueue
would do the trick, but is a bit unwieldy. We should add wrapper scripts to
the surrounding host that allow us to run utility commands from the services
as needed, so that a ``/usr/local/bin/zuul`` script would be:
.. code-block:: console
docker run -it --rm zuul/zuul -- "$@"
Generating those scripts could be a utility that we add to `pbrx`_ - or it
could be an Ansible role we write.
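If we go the Ansible-role route, the wrapper generation could be as simple as
a copy task per command; this is a sketch mirroring the example above, not a
final interface:

.. code-block:: yaml

   - name: Install the zuul command wrapper
     copy:
       dest: /usr/local/bin/zuul
       mode: 0755
       content: |
         #!/bin/bash
         exec docker run -it --rm zuul/zuul -- "$@"
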
Alternatives
------------
- Stay on puppet 3 forever
- Stay on puppet forever but upgrade it
- Migrate to ansible but without containers
- Building distro packages of our software
- Use software other than docker for containers
There are alternative tools for building and running containers that can be
explored. To keep initial adoption simple, starting with Docker seems like the
best bet, but alternate technology can be explored as a follow on. The Docker
daemon is a piece of operational complexity that is not required to use Linux
namespaces.
For building images we (or `pbrx`_) can use Ansible playbooks or `img`_ or
`buildah`_ or `s2i`_. For running containers we can look at `rkt`_ or
`podman`_. Since `rkt`_ and `podman`_ follow a traditional fork/exec model
rather than having a daemon, we'd want to use systemd to ensure services run
on boot or are restarted appropriately. As we start working, it may end up
being an easier transition from systemd-process to systemd-podman-container
than to transition from systemd-process to docker-container to
systemd-podman-container.
If, in the future, we deploy a Container Orchestration Engine such as
Kubernetes, we should consider running it with `cri-o`_ to avoid the Docker
daemon on the backend.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
mordred
Colleen Murphy <colleen@gazlene.net>
infra-root
mordred can help get it going, but in order for it to be successful, we'll
all need to be involved. Colleen has already done much of the Puppet 4 work.
Gerrit Topic
------------
Use Gerrit topic "update-cfg-mgmt" for all patches related to this spec.
.. code-block:: bash
git-review -t update-cfg-mgmt
Work Items
----------
Puppet
~~~~~~
#. Complete and enhance puppet module functional tests.

   #. We need to ensure all modules have proper functional tests that at
      least perform a basic smoke test.
   #. The functional tests need to accept a puppet version parameter.
   #. An experimental functional test job needs to be added that uses the
      puppet version parameter. The job should be graduated to non-voting and
      then to gating.

#. Audit all upstream modules in modules.env for version compatibility and take
steps to upgrade to a cross-compatible version if necessary.
#. Turn on the future parser in puppet.conf on all nodes in production. The
future parser will start interpreting manifests with puppet 4 syntax without
actually having to run the upgrade yet.
#. Enhance the install_puppet.sh script for puppet 4.

   #. The script already installs puppet 4 when PUPPET_VERSION=4 is set.
      Since this script is currently only run during launch-node.py and not
      periodically, we do not need to worry about the script accidentally
      downgrading puppet at some point after the upgrade. However, in the
      event things go wrong and we want to revert a node back to puppet 3, we
      need to be able to manually run the script again to forcefully
      downgrade, so we most likely need to enhance the script to ensure this
      works properly.

#. Write ansible logic to record which nodes should be running puppet 4 and
   run the upgrade (see the sketch following this list).

   #. The playbook will need to run the install_puppet.sh script with
      PUPPET_VERSION=4.
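A minimal sketch of that upgrade playbook, assuming the puppet-4 group from
groups.txt is exposed in the Ansible inventory and that install_puppet.sh
lives at the path shown (both assumptions):

.. code-block:: yaml

   # playbooks/upgrade_puppet.yaml (hypothetical)
   - hosts: puppet4
     become: true
     tasks:
       - name: Run install_puppet.sh with the puppet 4 version selected
         command: /opt/system-config/install_puppet.sh
         environment:
           PUPPET_VERSION: "4"
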
Ansible
~~~~~~~
#. Split run_all.yaml into service-specific playbooks.
#. Rewrite ``openstack_project::server`` in Ansible (infra.server).
#. Add a playbook targeting ``hosts: all`` that runs infra.server.
#. Either add "install docker" to infra.server or make an ansible hostgroup
that contains it.
#. Either rewrite launch_node.py to bootstrap using infra.server or ensure we
can use ansible-cloud-launcher instead.
#. Install a local container registry service as our first docker-based service.
On a service by service basis:
#. Add a Zuul job to build the software into container(s) and publish the
containers into our local container registry (and to dockerhub). A sketch of
such a job follows this list.
#. Translate the puppet for the service into ansible that runs the software from
the container.
#. Add a Zuul job that runs the new ansible with the container for testing.
#. Change the service's playbook to use the ansible/container deployment.
#. Retire the service's puppet.
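A very rough sketch of the image-build job referenced above; the job name,
playbook path and secret are assumptions, and the actual build logic would
use whatever tooling we settle on (for Python services, likely `pbrx`_):

.. code-block:: yaml

   # .zuul.yaml in the service's repo (hypothetical)
   - job:
       name: system-config-build-image-etherpad
       description: Build the etherpad container image and publish it.
       run: playbooks/container-images/build.yaml
       secrets:
         - name: registry_credentials
           secret: infra_registry_credentials
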
Repositories
------------
We may need to create some repositories as a place to put
jobs/roles/Dockerfiles for services where we aren't tracking a git repo locally
already. For instance, etherpad doesn't have a great place for us to put
things.
When we're done, we'll have a LOT of ``puppet-.*`` repos that we will no longer
care about. We should soft-retire them, leaving them in place for people still
depending on them.
Servers
-------
All existing servers will be affected.
DNS Entries
-----------
Not explicitly.
Documentation
-------------
All of the Infra system config documentation on how to manage puppet things
will need to be rewritten. OpenStack developers should not notice any changes
in their daily workflow.
Security
--------
This change should help improve security since we'll be getting security
updates to Puppet.
We are not proposing using containers for any increased isolation at this point
other than as a build step and convenient software installation vehicle.
However, building container images in CI and then deploying them means we will
need to track software manifests of the built images so that we can know if
we need to trigger a container rebuild due to CVEs.
We should make sure we have a mechanism to trigger a rebuild / redeploy.
We could **also** periodically rebuild / redeploy our service containers just
in case we miss a CVE somewhere.
The playbooks will be run by Zuul on puppetmaster using the secrets system to
protect the private ssh key. Normal infra-core reviews in system-config should
be sufficient to protect this.
Testing
-------
Puppet 4 is already validated with the puppet-apply noop tests and this spec
proposes enhancing the module functional tests before proceeding with the
upgrade.
As we shift to Ansible, the functional tests for puppet need to be shifted
as well. We should use `testinfra`_ for our Ansible testing.
We're currently using `serverspec`_ with our Puppet. However, `serverspec`_ is
Ruby-based, which is additional context for admins to deal with, and carrying
that context across a shift to Ansible seems less desirable.
`testinfra`_ is python-based, so fits with the larger majority of our
ecosystem, but will require us to write all new tests. It has ansible, docker
and kubectl backends, so should allow us to plug in to things where we'd like
to. It is implemented as a **py.test** plugin, which has a different
test-writing paradigm than we are used to with **testtools**, but the context
shift there is still likely less than the python to ruby context shift.
On a per-service basis, as we transition a service from Puppet to Ansible, we
should write the deploy playbooks such that they can be run in Zuul. We should
then make jobs that run those playbooks against test nodes and then run
`testinfra`_ tests to validate that the playbooks did the right thing.
Since the rest of our testing is **subunit** based, we may want to pick up
the work on `pytest-subunit`_.
Dependencies
============
Concurrent with `add a central ARA instance`_.
Docker will need to be installed and we'll want to decide if we want to use
the distro-supplied Docker or install it more directly from Docker upstream.
We'll need to run a container registry into which we can publish our container
images so that we are not dependent on hub.docker.com to update our system.
We should still publish our containers to hub.docker.com as well.
References
==========
- `Puppet Inc Upgrade Announcement <https://docs.puppet.com/upgrade>`_
- `Puppet 4 Release notes <https://docs.puppet.com/puppet/4.0/release_notes.html>`_
- `Features of the Puppet 4 Language <https://www.devco.net/archives/2015/07/31/shiny-new-things-in-puppet-4.php>`_
.. _`add a central ARA instance`: https://review.openstack.org/527500/
.. _`Puppet 4 Preliminary Testing spec`: http://specs.openstack.org/openstack-infra/infra-specs/specs/puppet_4_prelim_testing.html
.. _dumb-init: https://github.com/Yelp/dumb-init
.. _openshift-ansible: https://github.com/openshift/openshift-ansible
.. _oc cluster up: https://github.com/openshift/origin/blob/master/docs/cluster_up_down.md
.. _cri-o: http://cri-o.io/
.. _pbrx: http://git.openstack.org/cgit/openstack/pbrx
.. _img: https://github.com/genuinetools/img
.. _buildah: https://github.com/projectatomic/buildah
.. _s2i: https://github.com/openshift/source-to-image
.. _rkt: https://coreos.com/rkt/
.. _podman: https://github.com/projectatomic/libpod
.. _testinfra: https://testinfra.readthedocs.io/en/latest/
.. _serverspec: https://serverspec.org/
.. _pytest-subunit: https://github.com/lukaszo/pytest-subunit