From bb19892728c5b68b7c62bbc2d905eca6327ffd7c Mon Sep 17 00:00:00 2001 From: Ian Wienand Date: Tue, 31 Jul 2018 15:39:14 +1000 Subject: [PATCH] letsencrypt spec A spec for exploring the specifics of letsencrypt support Change-Id: I5475619f86d623418126d27261ecaf8af311bb9a --- doc/source/index.rst | 1 + specs/letsencrypt.rst | 460 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 461 insertions(+) create mode 100644 specs/letsencrypt.rst diff --git a/doc/source/index.rst b/doc/source/index.rst index 8c38bb1..735f8dc 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -44,6 +44,7 @@ permits. specs/survey specs/translation_check_site specs/wiki_modernization + specs/letsencrypt Help Wanted =========== diff --git a/specs/letsencrypt.rst b/specs/letsencrypt.rst new file mode 100644 index 0000000..688e8c2 --- /dev/null +++ b/specs/letsencrypt.rst @@ -0,0 +1,460 @@ +.. code-block:: text + + Copyright 2019 Red Hat, Inc. + + This work is licensed under a Creative Commons Attribution 3.0 + Unported License. + http://creativecommons.org/licenses/by/3.0/legalcode + +.. + This template should be in ReSTructured text. Please do not delete + any of the sections in this template. If you have nothing to say + for a whole section, just write: "None". For help with syntax, see + http://sphinx-doc.org/rest.html To test out your formatting, see + http://www.tele3.cz/jbar/rest/rest.html + +=================================== +Use letsencrypt for infra SSL needs +=================================== + +Incorporate https://letsencrypt.org/ into our configuration +management. + +Problem Description +=================== + +Currently SSL certificates for infrastructure services are manually +purchased at some cost, renewed by hand and provisioned into +configuration management. The process is costly, work-intensive and +does not completely match our usual infrastructure-as-code workflows. +The overheads are a barrier to getting encryption rolled out across +all services. + +We would like to replace manual provision of purchased certificates +with those provided by https://letsencrypt.org/. The solution should +make certificate provision and renewal completely automated and +integrate with the existing configuration management environment. + +What is the existing situation? +------------------------------- + +* Certificates manually renewed. + +* Certificate key data is manually inserted into Ansible variables + on bridge.o.o in the key storage area + +* Key material is then deployed as part of usual periodic + ansible/puppet run on host; usually by puppet as a hiera variable + (e.g. ``ssl_cert => hiera('foo_ssl_cert')``) but possibly by ansible + directly. + +* Self-signed certificates are created for a number of less + "important" services, such as dev servers, or nodepool status pages. + Primarily due to the cost an inconvenience of creating certificates. + +What can we use letsencrypt for? +--------------------------------- + +Domain validation certificates. This means you have proved you +control the domain to letsencrypt. They do not offer higher level +verification such as `Extended Validation +`__ +which prove the identity of the owner. + +Our needs are currently exclusively domain validation certificates. + +What about non-http services? +----------------------------- + +Although not a initial target, non-https services like zookeeper and +gearman should be possible. + +Zookeeper for example uses `JKS +`__ +which it should be possible to `import +`__ + +SMTP can also integrate with letsencrypt for TLS; see +https://starttls-everywhere.org/ and +https://www.jwz.org/blog/2018/06/starttls-everywhere/ + +Is letsencrypt a long term solution? +------------------------------------ + +The `Internet Security Research Group +`__ and by extension the letsencrypt +project appears to have gained a critical mass of support, making it +the defacto solution for domain validation certificates. We can +benefit and contribute to open tooling around letsencrypt. + +How do you validate you own a domain for renewal? +------------------------------------------------- + +* https://letsencrypt.org/how-it-works/ + +You run an agent that communicates to letsencrypt, which challenges it +to either + + * put a http resource on the domain + * add a dns entry + +Once done, it will accept the public key associated with the challenge +and issue certificates. + +Are there rate limit concerns? +------------------------------ + +* https://letsencrypt.org/docs/rate-limits/ + +50 new certificates per registered domain per week. It is possible to +have up to 100 names per certificate. Requests that count as renewals +count towards quota, but are also allowed if you are over quota. + +The last renewal was 17 certificates, with one having 8 names and the +rest only 1 name (per clarkb). Thus we are well within rate limits. + +Any interactions with other TLS infrastructure? +------------------------------------------------ + +There already exists a ``*.openstack.org`` certificate in use for the +caching provider for ``www.openstack.org`` and is thus outside infra +control. There is also a certificate for +``openstackid-resources.openstack.org`` which was not infra issued. + +Can we use the HTTP challenge model? +------------------------------------ + +This model has a chicken-egg problem of needing a webserver to respond +before being able to get a certificate for the webserver you are +trying to secure. There are several facets to this. + +It would be possible for any host to configure it's Apache/webserver +to statically serve the ACME challenge, and have an agent run +periodically on the host to generate the keys. Some common roles may +be useful in the base job repos, but essentially each service would +set this up in a besopke manner for it's environment. + +This probably works fine for completely new services. However, in a +server-replacement scenario, we can not transparently generate new +keys for the server before switching the A/AAAA DNS entries. For +example, if the ``foo.opendev.org`` service is handled by +``foo01.opendev.org`` and we wish to make ``foo02.opendev.org`` +replace it, we can not generate a certificate on ``foo02.opendev.org`` +that covers ``foo.opendev.org,foo2.opendev.org`` before +``foo.opendev.org`` is redirected. Thus we either need longer +downtime, or to start copying keys, etc. This *can* be done with DNS +validation, because all we need to do is prove that we can update +``foo.opendev.org``; not actually switch it's A/AAAA records. + +Incorporating HTTP challenge response into our existing deployment is +problematic. Updating Apache configuration, installing ACME agents +and other setup crosses into a lot of existing puppet which would need +to be heavily reworked. + +Other approaches can could be considered. Certificates could be +initially issued centrally and the remote HTTP server would need to be +either given the token respond to the challenge, or otherwise somehow +forward the challenge back to the issuing host. This is possible with +redirects and such; but again requires much more engineering than a +DNS based approach for no benefits. It helps the extant puppet +situation of distributing keys from central storage, but makes things +more difficult for new deployments where we wish to keep keys on host +only. + +We would also loose some flexibility; port 80/443 doesn't *have* to be +bound to a "standard" webserver, and thus it may require bespoke +solutions to ensure the HTTP challenge URL can be responded to. With +DNS validation we never have to worry about making the host respond +correctly. + +The conclusion of this spec is that HTTP authentication is not +suitable for our requirements. + +DNS Validation Overview +----------------------- + +DNS based validation seems a more reasonable path forward. We have +two DNS environments to consider. + +Firstly, the extant ``openstack.org`` domain is hosted by Rackspace's +Cloud DNS offering. We do not expect to start new services homed +within the ``openstack.org`` domain, but we would ideally like to +support existing services having valid letsencrypt certificates. + +Secondly, infra runs its own authoritative nameservers as part of the +OpenDev initiative. This is the primary focus. + +Are certificates precious? +-------------------------- + +Moving forward, we do not want to continue distributing certificates +from centralised secrets; they should be considered ephemeral and +generated and kept on the host as much as possible. i.e. certificates +should not be precious. + +However, there will be a period of transition as service deployment +moves from extant puppet modules which are taking their key data from +centralised secrets. + +Ideally we can handle both. + +What about load balancing? +-------------------------- + +With DNS based validation, every host can have a certificate that is +valid for itself and the balanced domain. For example, +``git01.openstack.org`` will request a certificate to cover +``git.openstack.org,git01.openstack.org``; upon request the two +relevant TXT authentication records (one for each domain) are placed +into DNS and the authentication is complete. + +What ACME agent/tool should we use? +------------------------------------ + +The underlying protocol for getting certificates is called `ACME +`__. There are many clients; +see ``__. + +Many of these tools are quite extensive, as they perform a range of +operations like automatically deploying and updating +apache/ngnix/other configuration. As we manage configuration via our +own config management, we do not need any of these features. + +The `acme.sh ` agent stands out +as being particularly suitable. + +* It is implemented in ``sh`` (specifically *not* ``bash``) with very + standard UNIX tools (curl, openssl, etc). In contrast, tools like + `certbot `__ need a python + environment. i.e. it basically runs anywhere, including tiny + containers. +* It is currently maintained and an active project, with CI +* We have some experience with large shell-based projects (devstack, + d-g). The code looks good. +* It supports authentication via the two methods of interest to us; a + "manual" mode where putting the TXT records is left up to the user, + and it can operate directly on Rackspace DNS API. +* For future flexibility can operate on many other hosted DNS + environments (like 50+ different ones). +* While it does have certificate deployment options to update + apache/ngnix/exim/sshd/and so on, they are all opt-in and not + necessary. +* It has an "I (heart) Docker" sticker on the page + +For these reasons, ``acme.sh`` is used below. + +Proposed Change +=============== + +The proposal is in two parts: the first covers the legacy +``openstack.org`` environment, whilst the second uses the self-hosted +nameservers behind ``opendev.org``. + +openstack.org +------------- + +Note: this proposal is prototyped at https://review.openstack.org/637456 + +An ansible-playbook can be written to iterate each ``openstack.org`` +host of interest and generate certificate material with ``acme.sh`` +using Rackspace DNS for letsencrypt authentication. This playbook +runs periodically on ``bridge.o.o`` (once a day is sufficient) and +will store any generated certificate material in a private certifcate +storage area. + +Minor post-processing after issuing a certificate combines the +resulting file contents of the key material into a single YAML file +for each host, with pre-defined and unchanging variable names. + +As part of the base run, this generated YAML file with certificate +data is copied into the relevant host's hiera directory (and hiera is +configured to load from this file). At this point, extant puppet has +access to the certificates as normal hiera variables and can deploy +them with very little change. + +OpenDev +------- + +Note: this proposal is prototyped at https://review.openstack.org/636759 + +For projects under the OpenDev umbrella we have two goals: + +* Key material ephemeral and stored on server +* Using infra hosted DNS servers + +We create an ansible role that will generate the certifiate data into +files on a given host. We will consider actually *using* the +certificate data as implementation specific; playbooks for ensuring +service restart, mounting into containers, etc will need to be +incorporated on top of this proposal. + +The proposal is as follows: + +* We utilise a separate signing domain, hereon referred to as + ``acme-opendev.org`` (see note below). This is preconfigured, and + only used for ACME authentication. +* Hosts that want a certificate preconfigure a CNAME record + ``_acme-challenge.hostname.opendev.org`` pointing to + ``_acme-challenge.acme-opendev.org``. This is committed via the + normal review process to the main DNS configuration. +* Each host defines the certificates it wants and the domains that + should be covered by that certificate in a known configuration + variable. +* Provisioning happens via ansible roles running on ``bridge.o.o`` + against the remote host that wants certificates as part of the + standard ``base.yaml`` playbook. The roles: + + 0. Ensure a valid install of ``acme.sh`` on the remote host + 1. For each certificate required on the host, run ``acme.sh`` in + "DNS manual mode" on the remote host with the required domains to + be authenticated. + 2. This generates an authentication token(s) output that needs to be + put into a TXT record on the signing domain. This is stored in a + pre-known host variable for the next step. + 3. Another role then reads the authentication tokens from each host + via the variables, which are collated and put into a TXT records + created on ``adns1.opendev.org`` as + ``_acme-challenge.acme-opendev.org``. Any necessary reloads + etc. are done to make the records visible (see note below on + lifespan). + 4. When records are available, the acme.sh renewal process is run on + the remote host. The remote letsencrypt servers authenticate the + TXT records, and certificate material (cert, private key, chain + file) is generated and put in a well-known location on the host. + +The host now has valid keys and further roles/playbooks can configure +services to use them. + +Notes: + + * we *could* have the signing domain as a sub-domain of + ``opendev.org``. We choose a separate domain for several reasons. + Firstly, a large benefit of the signing domain is that it is + completely separate to your production domain; it is one less thing + to potentially mess up. Secondly, having a separate domain gives + us excellent forward flexibility -- for example; in the future we + could host this domain somewhere that has an API for updating TXT + records (designate instance, say) and use that directly for ACME + authentication. This would be transparent to ``opendev.org`` or + any other hosts. + * letsencyrpt `walks all the TXT records + `__, + and stops when it finds a match. So we don't have any problem with + multiple concurrent updates. + * The TXT records are not explicitly removed. They remain until the + next time a certificate is issued or renewed, when the entire + signing zone file is overwritten. They are not in any way + sensitive data. Their useful life, however, is only within the + single ansible/puppet pulse they were created in. + * The exact renewal process will be defined when we can issue + certificates. The plan is that if the host has all valid + certificates, step 1 above will return an empty list an nothing + further will be done. If any certificate has less than 30 days + remaining, acme.sh should automatically renew it, which is + essentially the same as issuing a new certificate. + * For organisations, letsencrypt has group accounts and such. These + will be implemented once basic issuance and renewal is working. + +Alternatives +------------ + +* Continue with manual renewal +* We could only implement the ``opendev.org`` portion, leaving the RAX + DNS based authentication for ``openstack.org`` alone, assuming it + will not be required after the renewal period of the existing + certificates. + +Implementation +============== + +Assignee(s) +----------- + +Who is leading the writing of the code? Or is this a blueprint where you're +throwing it out there to see who picks it up? + +If more than one person is working on the implementation, please designate the +primary author and contact. + +Primary assignee: + ianw + +Can optionally list additional ids if they intend on doing substantial +implementation work on this blueprint. + +Gerrit Topic +------------ + +Use Gerrit topic "letsencrypt-infra" for all patches related to this spec. + +.. code-block:: bash + + git-review -t letsencrypt-infra + +Work Items +---------- + +* The broad strokes of the proposals are layed out in the prototypes. + Getting those various roles production ready, tested and documented + are the major work items. + + +Repositories +------------ + +Unlikely to require new repositories. Jobs will be in existing repos. + +Servers +------- + +No new servers required. + +DNS Entries +----------- + +``acme-opendev.org`` or whatever separate domain is used for CNAME +redirection will need to be provisioned. + + +Documentation +------------- + +Standard infra documentation (system-config, job documentation, etc) +will need updates, but no external documentation should require +updates. + +Security +-------- + +We are dealing with more private halves of keys. However we are +moving to a model where they are considered ephemeral and can be +rotated at any time. + + +Testing +------- + +* We can stage implementation of keys to hosts, meaning no "flag" day + of switching everything, and we can test in isolation. + +* We do not have a good way to update DNS entries for gate CI testing. + Initially we can generate self-signed certificates. For future + work, there are docker environments that run a LE environment. We + could stand-up one of these either on a per-test basis or as a + service (might be useful if we have a set CI signging key we could + configure hosts under test to trust). + +* Initial production testing can be done with the letsencrypt staging + environment, which doesn't give valid keys but also doesn't have + restrictive limits if things are failing. + +* Ongoing validation with certcheck + + +Dependencies +============ + +This work needs to integrate with ongoing work in the `update config +management spec`_. + +.. _`update config management spec`: https://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html