diff --git a/doc/source/design/k8s_1000_nodes_architecture.rst b/doc/source/design/k8s_1000_nodes_architecture.rst new file mode 100644 index 00000000..fb26ca8f --- /dev/null +++ b/doc/source/design/k8s_1000_nodes_architecture.rst @@ -0,0 +1,1322 @@ +.. _k8s_1000_nodes: + +=========================================== +Kubernetes Master Tier For 1000 Nodes Scale +=========================================== + +.. contents:: Table of Contents + +Introduction +------------ + +This document describes architecture, configuration and installation +workflow of Kubernetes cluster for OpenStack Containerised Control Plane +(CCP) on a set of hosts, either baremetal or virtual. Proposed architecture +should scale up to 1000 nodes. + +Scope of the document +~~~~~~~~~~~~~~~~~~~~~ + +This document does not cover preparation of host nodes and installation +of a CI/CD system. This document covers only Kubernetes and related +services on a preinstalled operating system with configured partitioning +and networking. + +Monitoring related tooling will be installed on ready to use Kubernetes +as Pods, after Kubernetes installer finishes installation. This document +does not cover architecture and implementation details of monitoring and +profiling tools. + +Lifecycle Management section describes only Kubernetes and related +services. It does not cover applications that run in Kubernetes cluster. + +Solution Prerequisites +---------------------- + +Hardware +~~~~~~~~ + +The proposed design was verified on a hardware lab that included 181 +physical hosts of the following configuration: + +- Server model: HP ProLiant DL380 Gen9 + +- CPU: 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz + +- RAM: 264G + +- Storage: 3.0T on RAID on HP Smart Array P840 Controller + +- HDD: 12 x HP EH0600JDYTL + +- Network: 2 x Intel Corporation Ethernet 10G 2P X710 + +3 out of the 181 hosts were used to install Kubernetes Master control +plane services. On every other host, 5 virtual machines were started +to ensure contention of resources and serve as Minion nodes in Kubernetes +cluster. + +Minimal requirements for the control plane services at scale of +1000 nodes are relatively modest. Tests demonstrate that three physical +nodes in the configuration specified above are sufficient to run +all control plane services for cluster of this size, even though +an application running on top of the cluster is rather complex +(i.e. OpenStack control plane + compute cluster). + +Provisioning +~~~~~~~~~~~~ + +Hosts for Kubernetes cluster must be prepared by a provsioning system of +some sort. It is assumed that users might have their own provisioning +system to handle prerequisites for this. + +Provisioning system provides installed and configured operating system, +networking, partitioning. It should operate on its own subset of cluster +metadata. Some elements of that metadata will be used by installer tools +for Kubernetes Master and OpenStack Control tiers. + +The following prerequisites are required from Provisioning system. + +Operating System +^^^^^^^^^^^^^^^^ + +- Ubuntu 16.04 is default choice of operating system. + +- It has to be installed and configured by provisioning system. + +Networking +^^^^^^^^^^ + +Before the deployment starts networking has to be configured and +verified by underlay tooling: + +- Bonding. + +- Bridges (possibly). + +- Multi-tiered networking. + +- IP addresses assignment. + +- SSH access from CI/CD nodes to cluster nodes (is required for + Kubernetes installer). 
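+
+As an illustration only (host preparation itself is out of scope of this
+document), the bonding prerequisite above could be satisfied by the
+provisioning system with an ``ifupdown``/``ifenslave`` configuration
+similar to the following sketch; interface names and addresses here are
+assumptions::
+
+    # /etc/network/interfaces fragment on Ubuntu 16.04 (illustrative)
+    auto bond0
+    iface bond0 inet static
+        address 10.0.0.10
+        netmask 255.255.255.0
+        bond-mode 802.3ad
+        bond-miimon 100
+        bond-slaves ens1f0 ens1f1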
+ +Such things as DPDK and Contrail can be most likely configured in +containers boot in privileged mode, no underlay involvement is required: + +- Load DKMS modules + +- Change runtime kernel parameters + +Partitioning +^^^^^^^^^^^^ + +Nodes should be efficiently pre-partitioned (e.g. separation of ``/``, +``/var/log``, ``/var/lib`` directories). + +Additionally it’s required to have LVM Volume Groups, which further will +be used by: + +- LVM backend for ephemeral storage for Nova. + +- LVM backend for Kubernetes, it + may be required to create several Volume Groups for Kubernetes, + e.g. some of the services require SSD (InfluxDB), other will work + fine on HDD. + +Some customers also require Multipath disks to be configured. + +Additional Ansible packages (optional) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Currently `Kubespray `__ project is +employed for installing Kubernetes. It provides Calico and +Ubuntu/Debian support. + +Kubespray Ansible playbooks (or Kargo) are accepted into `Kubernenes +incubator `__ by the community. + +Ansible requires: + +- ``python2.7`` +- ``python-netaddr`` + +Ansible 2.1.0 or greater is required for Kargo deployment. + +Ansible installs and manages Kubernetes related services (see +Components section) which should be delivered and +installed as containers. Kubernetes has to be installed in HA mode, so +that failure of a single master node does not cause control plane +down-time. + +The long term strategy should be to reduce amount of Ansible playbooks +we have to support and to do initial deployment and Lifecycle Management +with Kubernetes itself and related tools. + +Node Decommissioning +^^^^^^^^^^^^^^^^^^^^ + +Many Lifecycle Management scenarios require nodes decommissioning +procedure. Strategy on decommissioning may depend on the customer and +tightly coupled with Underlay tooling. + +In order to properly remove the node from the cluster, a sequence of +actions has to be performed by overlay tooling, to gracefully remove +services from cluster and migrate workload (depends on the role). + +Possible scenarios of node decommissioning for underlay tooling: + +- Shut the node down. + +- Move node to bootstrap stage. + +- As a common practise we should not erase disks of the node, customers + occasionally delete their production nodes, there should be a way + to recover them (if they were not recycled). + +CI/CD +~~~~~ + +Runs a chain of jobs in predefined order, like deployment and +verification. CI/CD has to provide a way to trigger a chain of jobs (git +push trigger -> deploy -> verify), also there should be a way to share +data between different jobs for example if IP allocation happens on job +execution allocated IP addresses should be available for overlay +installer job to consume. + +Non comprehensive list of functionality: + +- Jobs definitions. + +- Declarative definition of jobs pipelines. + +- Data sharing between jobs. + +- Artifacts (images, configurations, packages etc). + +User experience +^^^^^^^^^^^^^^^ + +1. User should be able to define a mapping of node and high level roles + (master, minion) also there should be a way to define mapping + more granularly (e.g. etcd master on separate nodes). + +2. After the change in pushed CI/CD job for rollout is triggered, + Ansible starts Kubernetes deployment from CI/CD via SSH (the + access from CI/CD to Kubernetes cluster using SSH has to be + provided). + +Updates +^^^^^^^ + +When new package is published (for example libssl) it should trigger a +chain of jobs: + +1. 
Build new container image (Etcd, Calico, Hyperkube, Docker etc) + +2. Rebuild all images which depend on base + +3. Run image specific tests + +4. Deploy current production version on staging + +5. Run verification + +6. Deploy update on staging + +7. Run verification + +8. Send for promotion to production + +Solution Overview +----------------- + +Current implementation considers two high-level groups of services - +Master and Minion. Master nodes should run control-plane related +services. Minion nodes should run user’s workload. In the future, +additional Network node might be added. + +There are few additional requirements which should be addressed: + +- Components placement should be flexible enough to install most of the + services on different nodes, for example it may be required to + install etcd cluster members to dedicated nodes. + +- It should be possible to have a single-node installation, when all + required services to run Kubernetes cluster can be placed on a + single node. Using scale up mechanism it should be possible to + make the cluster HA. It would reduce amount of resources required + for development and testing of simple integration scenarios. + +Common Components +~~~~~~~~~~~~~~~~~ + +- Calico is an SDN controller that provides pure L3 networking to + Kubernetes cluster. It includes the following most important + components that run on every node in the cluster. + + - Felix is an agent component of Calico, responsible for configuring + and managing routing tables, network interfaces and filters on + pariticipating hosts. + + - Bird is a lightweight BGP daemon that allows for exchange of + addressing information between nodes of Calico network. + +- Kubernetes + + - kube-dns provides discovery capabilities for Kubernetes Services. + + - kubelet is an agent service of Kubernetes. It is responsible for + creating and managing Docker containers at the nodes of + Kubernetes cluster. + +Plugins for Kubernetes should be delivered within Kubernetes containers. +The following plugins are required: + +- CNI plugin for integration with Calico SDN. + +- Volume plugins (e.g. Ceph, Cinder) for persistent storage. + +Another option which may be considered in the future, is to deliver +plugins in separate containers, but it would complicate rollout of +containers, since requires to rollout containers in specific order to +mount plugins directory. + +Master Components +~~~~~~~~~~~~~~~~~ + +Master Components of Kubernetes control plane run on Master nodes. +The proposed architecture includes 3 Master nodes with similar set +of components running on every node. + +In addition to Common, the following components run on Master nodes: + +- etcd + +- Kubernetes + + - Kubedns + + - Kube-proxy (iptables mode) + + - Kube-apiserver + + - Kube-scheduler + + - Kube-controller-manager + +Each component runs on container. Some of them are running in static +pods in Kubernetes. Others are running as docker containers under +management of operating system (i.e. as ``systemd`` service). See +details in Installation section below. + +Minion Components +~~~~~~~~~~~~~~~~~ + +Everything from Common plus: + +- etcd-proxy is a mode of operation of etcd which doesn't provide + storage, but rather redirects requests to alive nodes in etcd + clutser. + +Optional Components +~~~~~~~~~~~~~~~~~~~ + +- Contrail SDN is an alternative to Calico in cases when L2 features + required. + +- Flannel is another alternative implementation of CNI plugin for + Kubernetes. As Calico, it creates an L3 overlay network. 
+
+- Tools for debugging (see Troubleshooting below).
+
+Component Versions
+~~~~~~~~~~~~~~~~~~
+
+================ ===============
+Component        Version
+================ ===============
+Kubernetes       1.4
+---------------- ---------------
+Etcd             3.0.12
+---------------- ---------------
+Calico           0.21-dev
+---------------- ---------------
+Docker           1.12.3
+================ ===============
+
+Components Overview
+-------------------
+
+Kubernetes
+~~~~~~~~~~
+
+kube-apiserver
+^^^^^^^^^^^^^^
+
+This server exposes the Kubernetes API to internal and external clients.
+
+The proposed architecture includes 3 API server pods running on 3 different
+nodes for redundancy and load distribution purposes. API servers run as
+static pods, defined by a kubelet manifest
+(``/etc/kubernetes/manifests/kube-apiserver.manifest``). This manifest is
+created and managed by the Kubernetes installer.
+
+kube-scheduler
+^^^^^^^^^^^^^^
+
+The scheduler service of the Kubernetes cluster monitors the API server
+for unallocated pods and automatically assigns every such pod to a node
+based on filters or 'predicates' and weights or 'priority functions'.
+
+Scheduler runs as a single-container pod. Similarly to the API server,
+it is a static pod, defined and managed by the Kubernetes installer.
+Its manifest lives in ``/etc/kubernetes/manifests/kube-scheduler.manifest``.
+
+The proposed architecture suggests that 3 instances of the scheduler
+run on 3 Master nodes. These instances are joined in a cluster with an
+elected leader that is active, and two warm stand-by spares. When the
+leader is lost for some reason, a re-election occurs and one of the
+spares becomes the active leader.
+
+The following parameters control election of the leader and are set
+for the scheduler:
+
+- Leader election parameter for scheduler must be “true”.
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
+
+kube-controller-manager
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Controller manager runs the main loops of all entities (controllers)
+supported by the Kubernetes API. It is similar to the scheduler and API
+server in terms of configuration: it is a static pod defined and managed
+by the Kubernetes installer via the manifest file
+``/etc/kubernetes/manifests/kube-controller-manager.manifest``.
+
+In the proposed architecture, 3 instances of controller manager run
+in the same clustered mode as schedulers, with 1 active leader and
+2 stand-by spares.
+
+The same set of parameters controls election of the leader for controller
+manager as well:
+
+- Leader election parameter for controller manager must be “true”
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
+
+kube-proxy
+^^^^^^^^^^
+
+Kubernetes proxy
+`forwards traffic `__
+to alive Kubernetes Pods. This is an internal component that exposes
+Services created via the Kubernetes API inside the cluster. An
+Ingress/Proxy server is required to expose services outside of the
+cluster via a globally routed virtual IP (see above).
+
+The pod ``kube-proxy`` runs on every node in the cluster. It is a static
+pod defined by the manifest file
+``/etc/kubernetes/manifests/kube-proxy.manifest``. It includes a single
+container that runs the ``hyperkube`` application in proxy mode.
+
+kubedns
+^^^^^^^
+
+Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
+the kubelets to tell individual containers to use the DNS Service’s IP to
+resolve DNS names.
+ +The DNS pod (``kubedns``) includes 3 containers: + +- ``kubedns`` is a resolver that communicates to API server and controls + DNS names resolving + +- ``dnsmasq`` is a relay and cache provider + +- ``healthz`` is a health check service + +In the proposed architecture, ``kubedns`` pod is controller by +ReplicationController with replica factor 1, which means that only +one instance of the pod is working in a cluster at any time. + +Etcd Cluster +~~~~~~~~~~~~ + +Etcd is a distributed, consistent key-value store for shared +configuration and service discovery, with a focus on being: + +- Simple: well-defined, user-facing API (gRPC) + +- Secure: automatic TLS with optional client cert authentication + +- Fast: benchmarked 10,000 writes/sec + +- Reliable: properly distributed using Raft + +``etcd`` is written in Go and uses the Raft consensus algorithm to +manage a highly-available replicated log. + +Every instance of ``etcd`` can operate in one of the two modes: + +- full mode + +- proxy mode + +In *full mode*, the instance participates in Raft consensus and +has persistent storage. + +In *proxy mode*, ``etcd`` acts as a reverse proxy and forwards client +requests to an active etcd cluster. The etcd proxy does not +participate in the consensus replication of the etcd cluster, +thus it neither increases the resilience nor decreases the write +performance of the etcd cluster. + +In proposed architecture, ``etcd`` runs as a static container +under control of host operating system. See details below in +Installation section. The assumed version of ``etcd`` in this +proposal is ``etcdv2``. + +Etcd full daemon +^^^^^^^^^^^^^^^^ + +There are three instances of ``etcd`` running in full mode on Master +nodes in the proposed solution. This ensures the quorum in the cluster +and resiliency of service. + +Etcd native proxy +^^^^^^^^^^^^^^^^^ + +Etcd in proxy mode runs on every node in Kubernetes cluster, including +Masters and Minions. It automatically forwards requests to active Etcd +cluster members. `According to the +documentation `__ +it’s recommended etcd cluster architecture. + +Calico +~~~~~~ + +Calico is an L3 overlay network provider for Kubernetes. It +propagates internal addresses of containers via BGP to all +minions and ensures connectivity between containers. + +Calico uses etcd as a vessel for its configuraiton information. +Separate etcd cluster is recommended for Calico instead of sharing +one with Kubernetes. + +calico-node +^^^^^^^^^^^ + +In the proposed architecture, Calico is integrated with Kubernetes +as Common Network Interface (CNI) plugin. + +The Calico container called ``calico-node`` runs on every node in +Kubernetes cluster, including Masters and Minions. It is controlled +by operating system directly as ``systemd`` service. + +The ``calico-node`` container incorporates 3 main services of Calico: + +- `Felix `__, + the primary Calico agent. It is responsible for programming routes and + ACLs, and anything else required on the host, in order to provide the + desired connectivity for the endpoints on that host. +- `BIRD `__ + is a BGP client that distributes routing information. +- `confd` is a dynamic configuration manager for BIRD, triggered + automatically by updates in the configuration data. + +High Availability Architecture +------------------------------ + +Proxy server +~~~~~~~~~~~~ + +Proxy server should forward traffic to alive backends, health checking +mechanism has to be in place to stop forwarding traffic to unhealthy +backends. 
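+
+As a minimal sketch of such a forwarding proxy, assuming three
+kube-apiserver instances on illustrative addresses and ports, an Nginx
+``stream`` configuration could look roughly as follows (round-robin is
+the default balancing method, and a backend that fails to respond is
+temporarily taken out of rotation, which acts as a passive health
+check)::
+
+    stream {
+        # kube-apiserver backends; addresses and ports are assumptions
+        upstream kube_apiserver {
+            server 10.0.0.1:6443 max_fails=2 fail_timeout=10s;
+            server 10.0.0.2:6443 max_fails=2 fail_timeout=10s;
+            server 10.0.0.3:6443 max_fails=2 fail_timeout=10s;
+        }
+
+        server {
+            listen 443;
+            proxy_pass kube_apiserver;
+            proxy_connect_timeout 1s;
+            # keep long-polling "watch" sessions alive
+            # (see Scaling to 1000 Nodes below)
+            proxy_timeout 10m;
+        }
+    }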
+
+Nginx is used to implement the Proxy service. It is deployed in a static
+pod, one pod per cluster. It provides access to the K8s API on a single
+endpoint by redirecting requests to instances of kube-apiserver in a
+round-robin fashion. It exposes the endpoint both to external clients and
+internal clients (i.e. Kubernetes minions).
+
+SSL termination
+~~~~~~~~~~~~~~~
+
+SSL termination can be optionally configured on the Nginx server. From
+there, traffic to instances of kube-apiserver will go over the internal
+K8s network.
+
+Proxy Resiliency Alternatives
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the Proxy Server is a single point of failure for the
+Kubernetes API and exposed Services, it must run in a highly available
+configuration. The following alternatives were considered for the high
+availability solution:
+
+1. `Keepalived `__
+   Although `Keepalived has problems with split brain
+   detection `__ there is `a
+   subproject in
+   Kubernetes `__
+   which uses Keepalived with an attempt to implement VIP management.
+
+2. `OSPF `__
+   Using the OSPF routing protocol for resilient access and failover
+   between Proxy Servers requires configuration of external routers
+   consistently with internal OSPF configurations.
+
+3. VIP managed by `cluster management
+   tools `__
+   Etcd might serve as a cluster management tool for a virtual IP address
+   on which the Proxy Server is listening. It would allow converging the
+   technology stack of the whole solution.
+
+4. DNS-based reservation
+   Implementing DNS based High Availability is very
+   `problematic `__
+   due to caching on the client side. It also requires additional tools
+   for fencing and failover of faulty Proxy Servers.
+
+Resilient Kubernetes Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the proposed architecture, there is a single static pod with Proxy
+Server running under control of Kubelet on every Minion node.
+
+Each of the 3 Master nodes runs its own instance of ``kube-apiserver``
+on the localhost address. All services working on a Master node address
+the Kubernetes API locally. All services on Minion nodes connect to
+the API via a local instance of the Proxy Server.
+
+Etcd daemons forming the cluster run on Master nodes. Every node in
+the cluster also runs etcd-proxy. This includes both Masters and
+Minions. Any service that requires access to the etcd cluster talks
+to the local instance of etcd-proxy to reach it. External access to
+the etcd cluster is restricted.
+
+The Calico node container runs on every node in the cluster, including
+Masters and Minions.
+
+The following diagram summarizes the proposed architecture.
+
+|image3|
+
+Alternative approaches to the resiliency of Kubernetes cluster were
+considered, researched and summarized in `Appendix A. High Availability
+Alternatives`_.
+
+Next steps in development of this architecture include implementation of
+a Proxy server as an Ingress Controller. It will allow for closer
+integration with K8s in terms of pod mobility and life-cycle management
+operations. For example, an Ingress Controller can be written to only
+relay incoming requests to updated nodes during a rolling update. It also
+allows managing the virtual endpoint using native Kubernetes tools
+(see below).
+
+Logging
+-------
+
+Log collection is performed by a Heka broker running on all nodes in the
+Kubernetes cluster. It uses `Docker
+logging `__
+configured so that all logs are written to a volume. Heka reads files
+from the volume using the `Docker
+plugin `__
+and uploads them to ElasticSearch storage.
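+
+A hedged sketch of the Docker daemon logging settings this setup
+assumes, using the ``json-file`` driver with rotation (the rotation
+values are assumptions), could be placed in ``/etc/docker/daemon.json``::
+
+    {
+        "log-driver": "json-file",
+        "log-opts": {
+            "max-size": "50m",
+            "max-file": "5"
+        }
+    }
+
+With the ``json-file`` driver, container logs end up as files under
+``/var/lib/docker/containers`` on the host, from where Heka can read
+them via the mounted volume.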
+
+Installation
+------------
+
+This section describes the installation of a Kubernetes cluster on
+pre-provisioned nodes.
+
+The following list shows containers that belong to the Kubernetes
+Master Tier and run under control of systemd on Master and/or
+Minion nodes, along with a short explanation of why this is necessary
+in every case:
+
+- Etcd
+
+  - Should have a directory mounted from the host system.
+
+- Calico
+
+  - Depending on the network architecture it may be required to disable
+    node-to-node mesh and configure route reflectors instead. This
+    is especially recommended for large scale deployments (see below).
+
+- Kubelet
+
+  - The certificates directory should be mounted from the host system in
+    Read Only mode.
+
+The following containers are defined as ReplicationController objects
+in the Kubernetes API:
+
+- kubedns
+
+All other containers are started as `static
+pods `__ by Kubelet in
+the 'kube-system' namespace of the Kubernetes cluster. This includes:
+
+- kube-apiserver
+
+- kube-scheduler
+
+- kube-controller-manager
+
+- Proxy Server (nginx)
+
+- dnsmasq
+
+.. note::
+
+   An option to start all other services in Kubelet is being considered.
+   There is a potential chicken-and-egg type issue: Kubelet requires the
+   `CNI `__ plugin to
+   be configured prior to its start, and as a result, when the Calico pod
+   is started by Kubelet, it tries to perform a hook for a plugin and
+   `fails
+   `__.
+   This happens if a pod uses host networking as well.
+   After several attempts Kubelet starts the container, but currently
+   such cases `are not handled
+   explicitly `__.
+
+Common practices
+~~~~~~~~~~~~~~~~
+
+- Manifests for static Pods should be mounted (read only) from the host
+  system; this simplifies the update and reconfiguration procedure.
+
+- SSL certificates and any secrets should be mounted (read only) from
+  the host system and should have appropriate permissions.
+
+Installation workflow
+~~~~~~~~~~~~~~~~~~~~~
+
+#. Ansible retrieves SSL certificates.
+
+#. Ansible installs and configures Docker.
+
+   a. Systemd config
+
+   b. Use external registry
+
+#. All control-plane related Pods must be started in the separate
+   namespace ``kube-system``. This will allow restricting access to
+   control plane pods `in future `__.
+
+#. Ansible generates manifests for static pods and writes them to the
+   ``/etc/kubernetes/manifests`` directory.
+
+#. Ansible generates configuration files, systemd units and services
+   for Etcd, Calico and Kubelet.
+
+#. Ansible starts all systemd-based services listed above.
+
+#. When Kubelet is started, it reads the manifests and starts services
+   defined as static pods (see above).
+
+#. Run health-check.
+
+#. These operations are repeated for every node in the cluster.
+
+Scaling to 1000 Nodes
+---------------------
+
+Scaling a Kubernetes cluster to the magnitude of 1000 nodes requires
+certain changes to configuration and, in a few cases, to the source code
+of components.
+
+The following modifications were made to the default configuration
+deployed by the Kargo installer.
+
+Proxy Server
+~~~~~~~~~~~~
+
+The default value of the ``proxy_timeout`` parameter in Nginx
+caused issues with long-polling "watch" requests from kube-proxy
+and kubelet to the API server. Nginx by default terminates such sessions
+in 3 seconds. Once a session is cut, the Kubernetes client has to restore
+it, including a repeated SSL handshake, and at scale this generates
+high load on the Kube API servers, about 2000% of CPU in the given
+configuration.
+
+This problem was solved by changing the default value (3s) to a
+more appropriate value of 10m::
+
+    proxy_timeout: 10m
+
+As a result, CPU usage of ``kube-apiserver`` processes dropped
+by a factor of 10, to 100-200%.
+
+The `corresponding change `__
+was proposed to upstream Kargo.
+
+kube-apiserver
+~~~~~~~~~~~~~~
+
+The default rate limit of the Kube API server proved to be too low for
+the scale of 1000 nodes. Long before the top load is reached, the API
+server starts to return the ``429 Rate Limit Exceeded`` HTTP code.
+
+Rate limits were adjusted by passing a new value to ``kube-apiserver``
+with the ``--max-requests-inflight`` command line option. While the
+default value for this parameter is 400, it has to be raised to 2000 at
+the given scale to accommodate the actual rate of incoming requests.
+
+kube-scheduler
+~~~~~~~~~~~~~~
+
+Scheduling of so many pods with anti-affinity rules, as required by the
+CCP architecture, puts ``kube-scheduler`` under high load. A few
+optimizations were made to its code to accommodate the 1000
+node scale:
+
+* the scheduling algorithm was improved to reduce the number of expensive
+  operations: `pull request `__;
+
+* a cache eviction/miss bug in the scheduler has to be fixed to improve
+  handling of anti-affinity rules. It was `worked
+  around `__ in
+  Kubernetes, but the root cause still requires effort to fix.
+
+The active scheduler was placed on a dedicated hardware node in order
+to cope with the high load while scheduling a large number of OpenStack
+control plane pods.
+
+kubedns and dnsmasq
+~~~~~~~~~~~~~~~~~~~
+
+The default resource limits for dnsmasq in Kargo do not fit the scale of
+1000 nodes. The following settings must be adjusted to accommodate
+that scale:
+
+- ``dns_replicas: 6``
+
+- ``dns_cpu_limit: 100m``
+
+- ``dns_memory_limit: 512Mi``
+
+- ``dns_cpu_requests: 70m``
+
+- ``dns_memory_requests: 70Mi``
+
+The number of instances of the ``kubedns`` pod was increased to 6 to
+handle the load generated by a cluster of the given size.
+
+The following limits were tuned in the ``dnsmasq`` configuration:
+
+* the number of parallel connections the daemon could handle
+  was increased to 1000::
+
+      --dns-forward-max=1000
+
+* the size of the cache was set to the highest possible value of 10000.
+
+Ansible
+~~~~~~~
+
+Several parameters in the Ansible configuration have to be adjusted to
+improve its robustness in higher scale environments. This includes
+the following:
+
+- ``forks`` defines the number of parallel processes to spawn when
+  communicating with remote hosts.
+
+- ``timeout`` is the default SSH timeout on connection attempts.
+
+- ``download_run_once`` and ``download_localhost`` boolean parameters
+  control how container images are distributed to the nodes.
+
+Calico
+~~~~~~
+
+In the tested architecture Calico was configured without route
+reflectors for the BIRD BGP daemons. Therefore, Calico established
+a full mesh of BGP connections between all nodes in the cluster. This
+operation took significant time during node startup.
+
+It is recommended to configure route reflectors for the BGP daemons
+in all cases at the scale of 1000 nodes. This will reduce the
+number of BGP connections across the cluster and improve
+startup time of the nodes.
+
+Lifecycle Management
+--------------------
+
+Validation
+~~~~~~~~~~
+
+Many LCM use-cases may have destructive consequences for the cluster;
+such use-cases should be covered with static validation, because it is
+easy to make a mistake when a user edits the configuration files.
+
+Examples of such use-cases:
+
+- Check that there are nodes with Master related services.
+
+- Check that quorum for the etcd cluster is satisfied.
+
+- Check that scale down or node decommissioning does not cause data
+  loss.
+
+The validation checks should be implemented at the CI/CD level: when a
+new patch is published, a set of gates is started where the validation
+logic is implemented; based on the gates configuration they may or may
+not block the patch from promotion to staging or production.
+
+Scale up
+~~~~~~~~
+
+The user assigns a role to a new node in the configuration file; after
+the changes are committed to the branch, CI/CD runs the Ansible
+playbooks.
+
+Master
+^^^^^^
+
+1. Deploy an additional master node.
+
+2. Ensure that after the new component is deployed, it is available via
+   endpoints.
+
+Minion
+^^^^^^
+
+1. Deploy an additional minion node.
+
+2. Enable workload scheduling on the new node.
+
+Scale down
+~~~~~~~~~~
+
+Scale down can also be described as node deletion. During scale down the
+user should remove the node from the configuration file and add the node
+for decommissioning.
+
+Master
+^^^^^^
+
+1. Run Ansible against the cluster to make sure that the node being
+   deleted is not present in any service's configuration.
+
+2. Run node decommissioning.
+
+Minion
+^^^^^^
+
+1. Disable scheduling to the minion being deleted.
+
+2. Move workloads away from the minion.
+
+3. Run decommissioning of services managed by Ansible (see section
+   `Installation`_).
+
+4. Run node decommissioning.
+
+Test Plan
+~~~~~~~~~
+
+- Initial deploy
+
+  Tests must verify that the Kubernetes cluster has all required
+  services and is generally functional in terms of standard
+  operations, e.g. adding and removing a pod, a service and other
+  entities.
+
+- Scaleup
+
+  Verify that a Master node and a Minion node can be added to
+  the cluster. The cluster must remain functional in the terms
+  defined above after the scaleup operation.
+
+- Scaledown
+
+  Verify that the cluster retains its functionality after
+  removing a Master or Minion node. This test set is subject
+  to additional limitations on the number of removed nodes,
+  since there is an absolute minimum of nodes required for a
+  Kubernetes cluster to function.
+
+- Update
+
+  Verify that updating a single service or a set thereof
+  doesn't degrade functions of the cluster compared to
+  its initial deploy state.
+
+  - Intrusive
+
+  - Non-intrusive
+
+- Rollback
+
+  Verify that restoring the version of one or more components to a
+  previously working state after they were updated does not
+  lead to degradation of functions of the cluster.
+
+- Rollout abort
+
+  Verify that if a Rollback operation is aborted, the cluster
+  can be reverted to a working state by resuming the operation.
+
+Updating
+--------
+
+Updating is one of the most complex Lifecycle Management use-cases,
+which is why a dedicated section is allocated to it. Update use-cases
+are split into two groups. The first group, “Non-intrusive”, is the
+simplest one: updates of services which do not cause workload downtime.
+The second, “Intrusive”, is more complicated, since it may cause
+workload downtime and has to involve a sequence of actions in order to
+move stateful workload to a different node in the cluster.
+
+The update procedure starts with publishing a new version of an image in
+the Docker repository. Then the service's metadata should be updated to
+the new version by the operator of the cloud in the staging or production
+branch of the configuration repository for the Kubernetes cluster.
+
+Non-intrusive
+~~~~~~~~~~~~~
+
+A non-intrusive update does not cause workload downtime, hence it
+does not require workload migration.
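+
+For services managed by Kubelet, a non-intrusive update boils down to
+changing the image tag in the static pod manifest; Kubelet notices the
+changed file and recreates the pod. A simplified, illustrative fragment
+(the registry, image name and tag are assumptions) might look like::
+
+    # /etc/kubernetes/manifests/kube-apiserver.manifest (fragment)
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: kube-apiserver
+      namespace: kube-system
+    spec:
+      hostNetwork: true
+      containers:
+      - name: kube-apiserver
+        # bumping this tag is the whole non-intrusive update
+        image: registry.example.org/hyperkube:v1.4.1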
+
+Master
+^^^^^^
+
+Update of Master nodes with minimal downtime can be achieved if
+Kubernetes is installed in HA mode, with a minimum of 3 nodes.
+
+Key points in updating Master related services:
+
+- The first action which has to be run prior to the update is a backup
+  of Kubernetes related stateful services (in our case it is etcd).
+
+- Update of services managed by Ansible is done by ensuring the version
+  of the running Docker image and updating it in systemd and related
+  services.
+
+- Update of services managed by Kubelet is done by ensuring the files
+  with the Pod description contain the specific version.
+
+- Nodes have to be updated one-by-one, without restarting services on
+  all nodes simultaneously.
+
+Minion
+^^^^^^
+
+Key points in updating Minion nodes, where the workload runs:
+
+- Prior to restarting Kubelet, Kubernetes has to be notified that
+  Kubelet is under maintenance and
+  its workload must not be rescheduled to a different node.
+
+- Update of Kubelet should be managed by Ansible.
+
+- Update of services managed by Kubelet is done by ensuring the files
+  with the Pod description are up to date.
+
+Intrusive
+~~~~~~~~~
+
+An intrusive update is an update which may cause workload downtime;
+a separate update flow has to be considered for this kind of update. In
+order to provide an update with minimal downtime for the tenant, we want
+to leverage VM Live Migration capabilities. Migration requires starting
+the maintenance procedure in the right order, in batches of specific
+sizes.
+
+Common
+^^^^^^
+
+- Services managed by Ansible are updated using Ansible playbooks,
+  which trigger a pull of the new version and a restart.
+
+- If a service is managed by Kubelet, Ansible only updates the static
+  manifest, and Kubelet automatically updates the services it manages.
+
+Master
+^^^^^^
+
+Since the Master node does not run user workload, the key points for
+its update are the same as for the “Non-intrusive” use-cases.
+
+Minion
+^^^^^^
+
+User’s workload runs on Minion nodes; in order to apply intrusive
+updates, the rollout system has to move the workload to a different
+node. On big clusters, batch updates will be required to achieve faster
+rollout.
+
+Key requirements for the Kubernetes installer and orchestrator:
+
+- The Kubernetes installer is agnostic of which workloads run in the
+  Kubernetes cluster and in VMs on top of OpenStack, which works as a
+  Kubernetes application.
+
+- The Kubernetes installer should receive a rollout plan, where the
+  order and the grouping of nodes whose update can be rolled out in
+  parallel are defined. This update plan will be generated by a
+  different tool, which knows “something” about the types of workload
+  run on the cluster.
+
+- In order to move workload to a different node, the installer has to
+  trigger workload evacuation from the node.
+
+  - Scheduling of new workload to the node should be disabled.
+
+  - The node has to be considered as being in maintenance mode, so that
+    unavailability of kubelet does not cause workload
+    rescheduling.
+
+  - The installer has to trigger workload evacuation in kubelet; kubelet
+    should use hooks defined in Pods to start workload migration.
+
+- In the rollout plan it should be possible to specify when to fail the
+  rollout procedure.
+
+  - If some percentage of nodes failed to update.
+
+  - There may be nodes whose failure is critical; it is important to
+    provide per-node configuration specifying whether to stop the
+    rollout procedure if this node fails to be updated.
+
+Limitations
+~~~~~~~~~~~
+
+Hyperkube
+^^^^^^^^^
+
+The current Kubernetes delivery mechanism relies on the Hyperkube
+distribution.
+Hyperkube is a single binary file which contains all set of core +Kubernetes components, e.g. API, Scheduler, Controller, etc. The problem +with this approach is that bug-fix for API causes update of all core +Kubernetes containers, even if API is installed on few controllers, new +version has to be rolled out to all thousands of minions. + +Possible solutions: + +- For different roles rollout different versions of Hyperkube. This + approach significantly complicates versions and fixes tracking + process. + +- Make split between those roles and create for them different images. + The problem will remain since most of the core components are + developed in a single repository and released together, hence it + is still an issue, if release tag is published on the repo, + rebuild of all core components will be required. + +For now we go with native way of distribution until better solution is +found. + +Update Configuration +~~~~~~~~~~~~~~~~~~~~ + +Update of configurations in most of the cases should not cause downtime. + +- Update of Kubernetes and related services (calico, etcd, etc). + +- Rotation of SSL certificates (e.g. those which are used for Kubelet + authentication) + +Abort Rollout +~~~~~~~~~~~~~ + +Despite the fact that this operation may be dangerous, user should be +able to interrupt update procedure. + +Rollback +~~~~~~~~ + +Some of the operations are impossible to rollback, rollback may require +to have different flow of actions to be executed on the cluster. + +Troubleshooting +--------------- + +There should be a simple way to provide for a developer tooling for +debugging and troubleshooting. These tools should not be installed on +each machine by default, but there should be a simple way to get this +tools installed on demand. + +- Image with all tools required for debugging + +- Container should be run in privileged mode with host networking. + +- User can rollout this container to required nodes using Ansible. + +Example of tools which may be required: + +- Sysdig + +- Tcpdump + +- Strace/Ltrace + +- Clients for etcd, calico etc + +- ... + +Open questions +-------------- + +- Networking node? + +Related links +------------- + +- `Keepalived based VIP managament for Kuberentes + `__ + +- `HA endpoints for K8s in Kargo + `__ + +- `Large deployments in Kargo + `__ + +- `ECMP load balancing for external IPs + `__ + +Contributors +------------ + +- Evgeny Li + +- Matthew Mosesohn + +- Bogdan Dobrelya + +- Jedrzej Nowak + +- Vladimir Eremin + +- Dmytriy Novakovskiy + +- Michael Korolev + +- Alexey Shtokolov + +- Mike Scherbakov + +- Vladimir Kuklin + +- Sergii Golovatiuk + +- Aleksander Didenko + +- Ihor Dvoretskyi + +- Oleg Gelbukh + +Appendix A. High Availability Alternatives +------------------------------------------ + +This section contains some High Availability options that were +considered and researched, but deemed too complicated or too +risky to implement in the first iteration of the project. + +Option #1 VIP for external and internal with native etcd proxy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First approach to Highly Available Kubernetes with Kargo assumes +using VIP for external and internal access to Kubernetes API, etcd proxy +for internal access to etcd cluster. + +- VIP for external and internal access to Kubernetes API. + +- VIP for external access to etcd. + +- Native etcd proxy on each node for internal access to etcd cluster. 
+ +|image1| + +Option #2 VIP for external and Proxy on each node for internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The second considered option is each node that needs to access +Kubernetes API also has Proxy Server installed. Each Proxy forwards +traffic to alive Kubernetes API backends. External clients access +Etcd and Kubernetes API using VIP. + +- Internal access to APIs is done via proxies which are installed + locally. + +- External access is done via Virtual IP address. + +|image2| + +Option #3 VIP for external Kubernetes API on each node +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another similar to “VIP for external and Proxy on each node for +internal” option, may be to install Kubernetes API on each node which +requires access to it instead of installing Proxy which forwards the +traffic to Kubernetes API on master nodes. + +- VIP on top of proxies for external access. + +- Etcd proxy on each node for internal services. + +- Kubernetes API on each node, where access to Kubernetes is required. + +**This option was selected despite potential limitations listed +above.** + +|image3| + +Option #4 VIP for external and internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to achieve High Availability of Kubernetes master proxy server +on every master node can be used, each proxy is configured to forward +traffic to all available backends in the cluster (e.g. etcd, +kubernetes-api), also there has to be a mechanism to achieve High +Availability between these proxies, it can be achieved by VIP managed by +cluster management system (see “High Availability between proxies” +section). + +- Internal and External access to Etcd or Kubernetes cluster is done + via Virtual IP address. + +- Kubernetes API also access to Etcd using VIP. + +|image4| + +Option #5 VIP for external native Kubernetes proxy for internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We considered using native Kubernetes proxy for forwarding traffic +between APIs. Kubernetes proxy cannot work without Kubernetes API, hence +it should be installed on each node, where Kubernetes proxy is +installed. If Kubernetes API is installed on each node, there is no +reason to use Kubernetes proxy to forward traffic with it, internal +client can access the Kubernetes API through localhost. + +.. |image0| image:: media/k8s_1000_nodes/image07.png + :width: 3.36979in + :height: 1.50903in +.. |image1| image:: media/k8s_1000_nodes/image09.png + :width: 6.37500in + :height: 4.01389in +.. |image2| image:: media/k8s_1000_nodes/image08.png + :width: 6.37500in + :height: 4.13889in +.. |image3| image:: media/k8s_1000_nodes/image11.png + :width: 6.37500in + :height: 4.59722in +.. 
|image4| image:: media/k8s_1000_nodes/image03.png + :width: 6.37500in + :height: 4.12500in diff --git a/doc/source/design/media/k8s_1000_nodes/image03.png b/doc/source/design/media/k8s_1000_nodes/image03.png new file mode 100644 index 00000000..7129757e Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image03.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image07.png b/doc/source/design/media/k8s_1000_nodes/image07.png new file mode 100644 index 00000000..94bb30d3 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image07.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image08.png b/doc/source/design/media/k8s_1000_nodes/image08.png new file mode 100644 index 00000000..8190b027 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image08.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image09.png b/doc/source/design/media/k8s_1000_nodes/image09.png new file mode 100644 index 00000000..9bf2753f Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image09.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image10.gif b/doc/source/design/media/k8s_1000_nodes/image10.gif new file mode 100644 index 00000000..66a11fad Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image10.gif differ diff --git a/doc/source/design/media/k8s_1000_nodes/image11.png b/doc/source/design/media/k8s_1000_nodes/image11.png new file mode 100644 index 00000000..37c4a0d1 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image11.png differ diff --git a/doc/source/index.rst b/doc/source/index.rst index 726f4403..baa9ffbf 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -67,6 +67,7 @@ Design docs design/ost_compute_on_k8s design/ref_arch_100_nodes design/ref_arch_1000_nodes + design/k8s_1000_nodes_architecture Indices and tables ------------------