diff --git a/doc/source/design/k8s_1000_nodes_architecture.rst b/doc/source/design/k8s_1000_nodes_architecture.rst
new file mode 100644
index 00000000..fb26ca8f
--- /dev/null
+++ b/doc/source/design/k8s_1000_nodes_architecture.rst
@@ -0,0 +1,1322 @@
+.. _k8s_1000_nodes:
+
+===========================================
+Kubernetes Master Tier For 1000 Nodes Scale
+===========================================
+
+.. contents:: Table of Contents
+
+Introduction
+------------
+
+This document describes the architecture, configuration and installation
+workflow of a Kubernetes cluster for OpenStack Containerised Control Plane
+(CCP) on a set of hosts, either baremetal or virtual. The proposed
+architecture should scale up to 1000 nodes.
+
+Scope of the document
+~~~~~~~~~~~~~~~~~~~~~
+
+This document does not cover preparation of host nodes and installation
+of a CI/CD system. This document covers only Kubernetes and related
+services on a preinstalled operating system with configured partitioning
+and networking.
+
+Monitoring related tooling will be installed as Pods on the ready-to-use
+Kubernetes cluster after the Kubernetes installer finishes installation.
+This document does not cover architecture and implementation details of
+monitoring and profiling tools.
+
+Lifecycle Management section describes only Kubernetes and related
+services. It does not cover applications that run in Kubernetes cluster.
+
+Solution Prerequisites
+----------------------
+
+Hardware
+~~~~~~~~
+
+The proposed design was verified on a hardware lab that included 181
+physical hosts of the following configuration:
+
+- Server model: HP ProLiant DL380 Gen9
+
+- CPU: 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
+
+- RAM: 264G
+
+- Storage: 3.0T on RAID on HP Smart Array P840 Controller
+
+- HDD: 12 x HP EH0600JDYTL
+
+- Network: 2 x Intel Corporation Ethernet 10G 2P X710
+
+3 out of the 181 hosts were used to install Kubernetes Master control
+plane services. On every other host, 5 virtual machines were started
+to ensure contention of resources and serve as Minion nodes in Kubernetes
+cluster.
+
+Minimal requirements for the control plane services at scale of
+1000 nodes are relatively modest. Tests demonstrate that three physical
+nodes in the configuration specified above are sufficient to run
+all control plane services for cluster of this size, even though
+an application running on top of the cluster is rather complex
+(i.e. OpenStack control plane + compute cluster).
+
+Provisioning
+~~~~~~~~~~~~
+
+Hosts for the Kubernetes cluster must be prepared by a provisioning system of
+some sort. It is assumed that users might have their own provisioning
+system to handle these prerequisites.
+
+The provisioning system provides an installed and configured operating system,
+networking, and partitioning. It should operate on its own subset of cluster
+metadata. Some elements of that metadata will be used by installer tools
+for the Kubernetes Master and OpenStack Control tiers.
+
+The following prerequisites are expected from the provisioning system.
+
+Operating System
+^^^^^^^^^^^^^^^^
+
+- Ubuntu 16.04 is the default choice of operating system.
+
+- It has to be installed and configured by the provisioning system.
+
+Networking
+^^^^^^^^^^
+
+Before the deployment starts, networking has to be configured and
+verified by underlay tooling:
+
+- Bonding.
+
+- Bridges (possibly).
+
+- Multi-tiered networking.
+
+- IP address assignment.
+
+- SSH access from CI/CD nodes to cluster nodes (required for the
+  Kubernetes installer).
+
+Features such as DPDK and Contrail can most likely be configured in
+containers booted in privileged mode, with no underlay involvement required:
+
+- Load DKMS modules
+
+- Change runtime kernel parameters
+
+Partitioning
+^^^^^^^^^^^^
+
+Nodes should be efficiently pre-partitioned (e.g. separation of ``/``,
+``/var/log``, ``/var/lib`` directories).
+
+Additionally it is required to have LVM Volume Groups, which will later
+be used by the following (a sketch of how they might be created follows
+the list):
+
+- LVM backend for ephemeral storage for Nova.
+
+- LVM backend for Kubernetes. It
+  may be required to create several Volume Groups for Kubernetes,
+  e.g. some of the services require SSD (InfluxDB), while others will
+  work fine on HDD.
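+
+As an illustration only, the provisioning system might create such volume
+groups along these lines (device paths and group names are hypothetical)::
+
+    # dedicated VG for Nova ephemeral storage
+    pvcreate /dev/sdb
+    vgcreate vg-nova /dev/sdb
+
+    # dedicated VG for Kubernetes services that need fast storage
+    pvcreate /dev/nvme0n1
+    vgcreate vg-k8s-ssd /dev/nvme0n1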
+
+Some customers also require Multipath disks to be configured.
+
+Additional Ansible packages (optional)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Currently the `Kubespray `__ project is
+employed for installing Kubernetes. It provides Calico and
+Ubuntu/Debian support.
+
+Kubespray Ansible playbooks (or Kargo) are accepted into the `Kubernetes
+incubator `__ by the community.
+
+Ansible requires:
+
+- ``python2.7``
+- ``python-netaddr``
+
+Ansible 2.1.0 or greater is required for Kargo deployment.
+
+Ansible installs and manages Kubernetes related services (see
+Components section) which should be delivered and
+installed as containers. Kubernetes has to be installed in HA mode, so
+that failure of a single master node does not cause control plane
+down-time.
+
+The long term strategy should be to reduce the amount of Ansible playbooks
+we have to support and to do initial deployment and Lifecycle Management
+with Kubernetes itself and related tools.
+
+Node Decommissioning
+^^^^^^^^^^^^^^^^^^^^
+
+Many Lifecycle Management scenarios require a node decommissioning
+procedure. The decommissioning strategy may depend on the customer and
+is tightly coupled with the underlay tooling.
+
+In order to properly remove a node from the cluster, a sequence of
+actions has to be performed by overlay tooling to gracefully remove
+services from the cluster and migrate workload (depending on the role).
+
+Possible scenarios of node decommissioning for underlay tooling:
+
+- Shut the node down.
+
+- Move node to bootstrap stage.
+
+- As a common practice we should not erase the disks of the node:
+  customers occasionally delete their production nodes, and there should
+  be a way to recover them (if they were not recycled).
+
+CI/CD
+~~~~~
+
+Runs a chain of jobs in a predefined order, such as deployment and
+verification. CI/CD has to provide a way to trigger a chain of jobs (git
+push trigger -> deploy -> verify). There should also be a way to share
+data between different jobs; for example, if IP allocation happens during
+job execution, the allocated IP addresses should be available for the
+overlay installer job to consume.
+
+A non-comprehensive list of functionality:
+
+- Jobs definitions.
+
+- Declarative definition of jobs pipelines.
+
+- Data sharing between jobs.
+
+- Artifacts (images, configurations, packages etc).
+
+User experience
+^^^^^^^^^^^^^^^
+
+1. The user should be able to define a mapping of nodes to high level roles
+   (master, minion); there should also be a way to define the mapping
+   more granularly (e.g. etcd master on separate nodes).
+
+2. After the change is pushed, the CI/CD job for rollout is triggered and
+   Ansible starts Kubernetes deployment from CI/CD via SSH (the
+   access from CI/CD to the Kubernetes cluster using SSH has to be
+   provided).
+
+Updates
+^^^^^^^
+
+When a new package is published (for example libssl), it should trigger a
+chain of jobs:
+
+1. Build new container image (Etcd, Calico, Hyperkube, Docker etc)
+
+2. Rebuild all images which depend on base
+
+3. Run image specific tests
+
+4. Deploy current production version on staging
+
+5. Run verification
+
+6. Deploy update on staging
+
+7. Run verification
+
+8. Send for promotion to production
+
+Solution Overview
+-----------------
+
+Current implementation considers two high-level groups of services -
+Master and Minion. Master nodes should run control-plane related
+services. Minion nodes should run user’s workload. In the future,
+additional Network node might be added.
+
+There are a few additional requirements which should be addressed:
+
+- Component placement should be flexible enough to install most of the
+  services on different nodes; for example, it may be required to
+  install etcd cluster members on dedicated nodes.
+
+- It should be possible to have a single-node installation, when all
+  services required to run a Kubernetes cluster can be placed on a
+  single node. Using the scale up mechanism it should be possible to
+  make the cluster HA. This would reduce the amount of resources required
+  for development and testing of simple integration scenarios.
+
+Common Components
+~~~~~~~~~~~~~~~~~
+
+- Calico is an SDN controller that provides pure L3 networking to
+ Kubernetes cluster. It includes the following most important
+ components that run on every node in the cluster.
+
+  - Felix is an agent component of Calico, responsible for configuring
+    and managing routing tables, network interfaces and filters on
+    participating hosts.
+
+ - Bird is a lightweight BGP daemon that allows for exchange of
+ addressing information between nodes of Calico network.
+
+- Kubernetes
+
+ - kube-dns provides discovery capabilities for Kubernetes Services.
+
+ - kubelet is an agent service of Kubernetes. It is responsible for
+ creating and managing Docker containers at the nodes of
+ Kubernetes cluster.
+
+Plugins for Kubernetes should be delivered within Kubernetes containers.
+The following plugins are required:
+
+- CNI plugin for integration with Calico SDN.
+
+- Volume plugins (e.g. Ceph, Cinder) for persistent storage.
+
+Another option which may be considered in the future is to deliver
+plugins in separate containers, but that would complicate the rollout of
+containers, since it requires rolling out containers in a specific order
+to mount the plugins directory.
+
+Master Components
+~~~~~~~~~~~~~~~~~
+
+Master Components of Kubernetes control plane run on Master nodes.
+The proposed architecture includes 3 Master nodes with similar set
+of components running on every node.
+
+In addition to Common, the following components run on Master nodes:
+
+- etcd
+
+- Kubernetes
+
+ - Kubedns
+
+ - Kube-proxy (iptables mode)
+
+ - Kube-apiserver
+
+ - Kube-scheduler
+
+ - Kube-controller-manager
+
+Each component runs in a container. Some of them run as static
+pods in Kubernetes. Others run as Docker containers under
+management of the operating system (i.e. as a ``systemd`` service). See
+details in the Installation section below.
+
+Minion Components
+~~~~~~~~~~~~~~~~~
+
+Everything from Common plus:
+
+- etcd-proxy is a mode of operation of etcd which doesn't provide
+  storage, but rather redirects requests to live nodes in the etcd
+  cluster.
+
+Optional Components
+~~~~~~~~~~~~~~~~~~~
+
+- Contrail SDN is an alternative to Calico for cases when L2 features are
+  required.
+
+- Flannel is another alternative implementation of a CNI plugin for
+  Kubernetes. Like Calico, it creates an L3 overlay network.
+
+- Tools for debugging (see Troubleshooting below).
+
+Component Versions
+~~~~~~~~~~~~~~~~~~
+
+================ ===============
+Component Version
+================ ===============
+Kubernetes 1.4
+---------------- ---------------
+Etcd 3.0.12
+---------------- ---------------
+Calico 0.21-dev
+---------------- ---------------
+Docker 1.12.3
+================ ===============
+
+Components Overview
+-------------------
+
+Kubernetes
+~~~~~~~~~~
+
+kube-apiserver
+^^^^^^^^^^^^^^
+
+This server exposes Kubernetes API to internal and external clients.
+
+The proposed architecture includes 3 API server pods running on 3 different
+nodes for redundancy and load distribution purposes. API servers run as
+static pods, defined by a kubelet manifest
+(``/etc/kubernetes/manifests/kube-apiserver.manifest``). This manifest is
+created and managed by the Kubernetes installer.
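+
+A minimal sketch of such a manifest, assuming a Hyperkube-based image (the
+image name, flags and addresses below are illustrative, not the exact
+manifest generated by the installer)::
+
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: kube-apiserver
+      namespace: kube-system
+    spec:
+      hostNetwork: true
+      containers:
+      - name: kube-apiserver
+        image: quay.io/coreos/hyperkube:v1.4.0_coreos.0
+        command:
+        - /hyperkube
+        - apiserver
+        - --etcd-servers=http://127.0.0.1:2379
+        - --service-cluster-ip-range=10.233.0.0/18
+        - --insecure-bind-address=127.0.0.1
+        volumeMounts:
+        - name: kubernetes-config
+          mountPath: /etc/kubernetes
+          readOnly: true
+      volumes:
+      - name: kubernetes-config
+        hostPath:
+          path: /etc/kubernetes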
+
+kube-scheduler
+^^^^^^^^^^^^^^
+
+Scheduler service of Kubernetes cluster monitors API server for
+unallocated pods and automatically assigns every such pod to a node
+based on filters or 'predicates' and weights or 'priority functions'.
+
+Scheduler runs as a single-container pod. Similarly to API server,
+it is a static pod, defined and managed by Kubernetes installer.
+Its manifest lives in ``/etc/kubernetes/manifests/kube-scheduler.manifest``.
+
+The proposed architecture suggests that 3 instances of scheduler
+run on 3 Master nodes. These instances are joined in a cluster with an
+elected leader that is active, and two warm stand-by spares. When the
+leader is lost for some reason, a re-election occurs and one of the
+spares becomes the active leader.
+
+The following parameters control leader election and are set
+for the scheduler (the corresponding command line flags are shown after
+the list):
+
+- Leader election parameter for scheduler must be “true”.
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
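+
+These map to ``kube-scheduler`` command line flags roughly as follows (the
+values shown are the upstream defaults, given only for illustration)::
+
+    --leader-elect=true
+    --leader-elect-lease-duration=15s
+    --leader-elect-renew-deadline=10s
+    --leader-elect-retry-period=2s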
+
+kube-controller-manager
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Controller manager executes the main loops of all entities (controllers)
+supported by Kubernetes API. It is similar to scheduler and API server
+in terms of configuration: it is a static pod defined and managed by
+Kubernetes installer via manifest file
+``/etc/kubernetes/manifests/kube-controller-manager.manifest``.
+
+In the proposed architecture, 3 instances of controller manager run
+in the same clustered mode as schedulers, with 1 active leader and
+2 stand-by spares.
+
+The same set of parameters controls election of leader for controller
+manager as well:
+
+- Leader election parameter for controller manager must be “true”
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
+
+kube-proxy
+^^^^^^^^^^
+
+Kubernetes proxy
+`forwards traffic `__
+to live Kubernetes Pods. This is an internal component that exposes
+Services created via the Kubernetes API inside the cluster. Some
+Ingress/Proxy server is required to expose services outside of the
+cluster via a globally routed virtual IP (see above).
+
+The pod ``kube-proxy`` runs on every node in the cluster. It is a static
+pod defined by the manifest file
+``/etc/kubernetes/manifests/kube-proxy.manifest``. It includes a single
+container that runs the ``hyperkube`` application in proxy mode.
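+
+The container command in that manifest is assumed to look roughly like this
+(the flags and API server address are illustrative)::
+
+    /hyperkube proxy --master=http://127.0.0.1:8080 --proxy-mode=iptables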
+
+kubedns
+^^^^^^^
+
+Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
+the kubelets to tell individual containers to use the DNS Service’s IP to
+resolve DNS names.
+
+The DNS pod (``kubedns``) includes 3 containers:
+
+- ``kubedns`` is a resolver that communicates with the API server and
+  handles DNS name resolution
+
+- ``dnsmasq`` is a relay and cache provider
+
+- ``healthz`` is a health check service
+
+In the proposed architecture, the ``kubedns`` pod is controlled by a
+ReplicationController with replica factor 1, which means that only
+one instance of the pod is working in the cluster at any time.
+
+Etcd Cluster
+~~~~~~~~~~~~
+
+Etcd is a distributed, consistent key-value store for shared
+configuration and service discovery, with a focus on being:
+
+- Simple: well-defined, user-facing API (gRPC)
+
+- Secure: automatic TLS with optional client cert authentication
+
+- Fast: benchmarked 10,000 writes/sec
+
+- Reliable: properly distributed using Raft
+
+``etcd`` is written in Go and uses the Raft consensus algorithm to
+manage a highly-available replicated log.
+
+Every instance of ``etcd`` can operate in one of the two modes:
+
+- full mode
+
+- proxy mode
+
+In *full mode*, the instance participates in Raft consensus and
+has persistent storage.
+
+In *proxy mode*, ``etcd`` acts as a reverse proxy and forwards client
+requests to an active etcd cluster. The etcd proxy does not
+participate in the consensus replication of the etcd cluster,
+thus it neither increases the resilience nor decreases the write
+performance of the etcd cluster.
+
+In proposed architecture, ``etcd`` runs as a static container
+under control of host operating system. See details below in
+Installation section. The assumed version of ``etcd`` in this
+proposal is ``etcdv2``.
+
+Etcd full daemon
+^^^^^^^^^^^^^^^^
+
+There are three instances of ``etcd`` running in full mode on Master
+nodes in the proposed solution. This ensures the quorum in the cluster
+and resiliency of service.
+
+Etcd native proxy
+^^^^^^^^^^^^^^^^^
+
+Etcd in proxy mode runs on every node in the Kubernetes cluster, including
+Masters and Minions. It automatically forwards requests to active Etcd
+cluster members. `According to the
+documentation `__
+this is the recommended etcd cluster architecture.
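+
+For illustration, an etcd proxy instance might be started with flags along
+these lines (member addresses are hypothetical)::
+
+    etcd --proxy on \
+         --listen-client-urls http://127.0.0.1:2379 \
+         --initial-cluster master1=http://10.0.0.1:2380,master2=http://10.0.0.2:2380,master3=http://10.0.0.3:2380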
+
+Calico
+~~~~~~
+
+Calico is an L3 overlay network provider for Kubernetes. It
+propagates internal addresses of containers via BGP to all
+minions and ensures connectivity between containers.
+
+Calico uses etcd as a vessel for its configuration information.
+A separate etcd cluster is recommended for Calico instead of sharing
+one with Kubernetes.
+
+calico-node
+^^^^^^^^^^^
+
+In the proposed architecture, Calico is integrated with Kubernetes
+as a Container Network Interface (CNI) plugin.
+
+The Calico container called ``calico-node`` runs on every node in
+Kubernetes cluster, including Masters and Minions. It is controlled
+by operating system directly as ``systemd`` service.
+
+The ``calico-node`` container incorporates 3 main services of Calico:
+
+- `Felix `__,
+ the primary Calico agent. It is responsible for programming routes and
+ ACLs, and anything else required on the host, in order to provide the
+ desired connectivity for the endpoints on that host.
+- `BIRD `__
+ is a BGP client that distributes routing information.
+- `confd` is a dynamic configuration manager for BIRD, triggered
+ automatically by updates in the configuration data.
+
+High Availability Architecture
+------------------------------
+
+Proxy server
+~~~~~~~~~~~~
+
+The proxy server should forward traffic to live backends; a health checking
+mechanism has to be in place to stop forwarding traffic to unhealthy
+backends.
+
+Nginx is used to implement the Proxy service. It is deployed in a static pod,
+one pod per cluster. It provides access to the K8s API endpoint on a single
+address by redirecting requests to instances of kube-apiserver in a
+round-robin fashion. It exposes the endpoint both to external clients and
+internal clients (i.e. Kubernetes minions).
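+
+A minimal sketch of such a configuration, assuming plain TCP (stream)
+proxying and hypothetical Master addresses (round-robin is the default
+load balancing method in Nginx)::
+
+    stream {
+        upstream kube_apiserver {
+            server 10.0.0.1:6443;
+            server 10.0.0.2:6443;
+            server 10.0.0.3:6443;
+        }
+        server {
+            listen 443;
+            proxy_pass kube_apiserver;
+            proxy_timeout 10m;
+        }
+    }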
+
+SSL termination
+~~~~~~~~~~~~~~~
+
+SSL termination can optionally be configured on the Nginx server. From
+there, traffic to instances of kube-apiserver will go over the internal K8s
+network.
+
+Proxy Resiliency Alternatives
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the Proxy Server is a single point of failure for
+Kubernetes API and exposed Services, it must run in highly available
+configuration. The following alternatives were considered for high
+availability solution:
+
+1. `Keepalived `__
+ Although `Keepalived has problems with split brain
+ detection `__ there is `a
+ subproject in
+ Kubernetes `__
+ which uses Keepalived with an attempt to implement VIP management.
+
+2. `OSPF `__
+ Using OSPF routing protocol for resilient access and failover between
+ Proxy Servers requires configuration of external routers consistently
+ with internal OSPF configurations.
+
+3. VIP managed by `cluster management
+   tools `__
+   Etcd might serve as a cluster management tool for a Virtual IP address
+   where the Proxy Server is listening. It would allow converging the
+   technology stack of the whole solution.
+
+4. DNS-based reservation
+ Implementing DNS based High Availability is very
+ `problematic `__
+ due to caching on client side. It also requires additional tools for
+ fencing and failover of faulty Proxy Servers.
+
+Resilient Kubernetes Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the proposed architecture, there is a single static pod with Proxy
+Server running under control of Kubelet on every Minion node.
+
+Each of the 3 Master nodes runs its own instance of ``kube-apiserver``
+on localhost address. All services working on a Master node address
+the Kubernetes API locally. All services on Minion nodes connect to
+the API via local instance of Proxy Server.
+
+Etcd daemons forming the cluster run on Master nodes. Every node in
+the cluster also runs etcd-proxy. This includes both Masters and
+Minions. Any service that requires access to etcd cluster talks
+to local instance of etcd-proxy to reach it. External access to
+etcd cluster is restricted.
+
+Calico node container runs on every node in the cluster, including
+Masters and Minions.
+
+The following diagram summarizes the proposed architecture.
+
+|image3|
+
+Alternative approaches to the resiliency of Kubernetes cluster were
+considered, researched and summarized in `Appendix A. High Availability
+Alternatives`_.
+
+Next steps in development of this architecture include implementation of
+the Proxy server as an Ingress Controller. It will allow for closer
+integration with K8s in terms of pods mobility and life-cycle management
+operations. For example, an Ingress Controller can be written to only relay
+incoming requests to updated nodes during a rolling update. It also allows
+managing the virtual endpoint using native Kubernetes tools (see below).
+
+Logging
+-------
+
+Log collection is performed by a Heka broker running on all nodes in the
+Kubernetes cluster. It uses `Docker
+logging `__
+in a configuration where all logs are written to a volume. Heka reads files
+from the volume using the `Docker
+plugin `__
+and uploads them to ElasticSearch storage.
+
+Installation
+------------
+
+This section describes the installation of Kubernetes cluster on
+pre-provisioned nodes.
+
+The following list shows containers that belong to the Kubernetes
+Master Tier and run under control of systemd on Master and/or
+Minion nodes, along with a short explanation of why this is necessary
+in every case:
+
+- Etcd
+
+ - Should have directory mounted from host system.
+
+- Calico
+
+ - Depending on network architecture it may be required to disable
+ node-to-node mesh and configure route reflectors instead. This
+ is especially recommended for large scale deployments (see below).
+
+- Kubelet
+
+ - Certificates directory should be mounted from host system in Read
+ Only mode.
+
+The following containers are defined as ReplicationController objects
+in Kubernetes API:
+
+- kubedns
+
+All other containers are started as `static
+pods `__ by Kubelet in
+'kube-system' namespace of Kubernetes cluster. This includes:
+
+- kube-apiserver
+
+- kube-scheduler
+
+- kube-controller-manager
+
+- Proxy Server (nginx)
+
+- dnsmasq
+
+.. note::
+
+   An option to start all other services in Kubelet is being considered.
+   There is a potential chicken-and-egg type issue: Kubelet requires the
+   `CNI `__ plugin to
+   be configured prior to its start; as a result, when the Calico pod is
+   started by Kubelet, it tries to perform a hook for the plugin and
+   `fails `__.
+   This happens even if a pod uses host networking.
+   After several attempts it starts the container, but currently
+   such cases `are not handled
+   explicitly `__.
+
+Common practices
+~~~~~~~~~~~~~~~~
+
+- Manifests for static Pods should be mounted (read only) from the host
+  system; this will simplify the update and reconfiguration procedure.
+
+- SSL certificates and any secrets should be mounted (read only) from the
+  host system, and they should have appropriate permissions.
+
+Installation workflow
+~~~~~~~~~~~~~~~~~~~~~
+
+#. Ansible retrieves SSL certificates.
+
+#. Ansible installs and configures docker.
+
+ a. Systemd config
+
+ b. Use external registry
+
+#. All control-plane related Pods must be started in the separate namespace
+   ``kube-system``. This will allow restricting access to control plane
+   pods `in the future `__.
+
+#. Ansible generates manifests for static pods and writes them to
+ ``/etc/kubernetes/manifests`` directory.
+
+#. Ansible generates configuration files, systemd units and services
+ for Etcd, Calico and Kubelet.
+
+#. Ansible starts all systemd-based services listed above.
+
+#. When Kubelet is started, it reads manifests and starts services
+ defined as static pods (see above).
+
+#. Run health-check.
+
+#. These operations are repeated for every node in the cluster.
+
+Scaling to 1000 Nodes
+---------------------
+
+Scaling a Kubernetes cluster to the magnitude of 1000 nodes requires certain
+changes to configuration and, in a few cases, to the source code of
+components.
+
+The following modifications were made to default configuration
+deployed by Kargo installer.
+
+Proxy Server
+~~~~~~~~~~~~
+
+The default value of the ``proxy_timeout`` parameter in Nginx
+caused issues with long-polling "watch" requests from kube-proxy
+and kubelet to the apiserver. Nginx by default terminates such sessions
+in 3 seconds. Once a session is cut, the Kubernetes client has to restore
+it, including a repeat of the SSL handshake, and at scale this generates
+high load on the Kube API servers, about 2000% of CPU in the given
+configuration.
+
+This problem was solved by changing the default value (3s) to the
+more appropriate value of 10m::
+
+ proxy_timeout: 10m
+
+As a result, CPU usage of ``kube-apiserver`` processes dropped
+tenfold, to 100-200%.
+
+The `corresponding change `__
+was proposed into upstream Kargo.
+
+kube-apiserver
+~~~~~~~~~~~~~~
+
+The default rate limit of the Kube API server proved to be too low for
+the scale of 1000 nodes. Long before reaching the top load, the API server
+starts to return the ``429 Rate Limit Exceeded`` HTTP code.
+
+Rate limits were adjusted by passing a new value to ``kube-apiserver``
+with the ``--max-requests-inflight`` command line option. While the default
+value for this parameter is 400, it had to be adjusted to 2000 at
+the given scale to accommodate the actual rate of incoming requests.
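+
+In the static pod manifest for the API server this presumably translates
+into a flag along these lines::
+
+    --max-requests-inflight=2000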
+
+kube-scheduler
+~~~~~~~~~~~~~~
+
+Scheduling of so many pods with anti-affinity rules, as required by
+the CCP architecture, puts ``kube-scheduler`` under high load. A few
+optimizations were made to its code to accommodate the 1000
+node scale.
+
+* The scheduling algorithm was improved to reduce the number of expensive
+  operations: `pull request `__.
+
+* A cache eviction/miss bug in the scheduler had to be fixed to improve
+  handling of anti-affinity rules. It was `worked
+  around `__ in
+  Kubernetes, but the root cause still requires effort to fix.
+
+The active scheduler was placed on a dedicated hardware node in order
+to cope with high load while scheduling a large number of OpenStack
+control plane pods.
+
+kubedns and dnsmasq
+~~~~~~~~~~~~~~~~~~~
+
+The default settings of resource limits for dnsmasq in Kargo don't fit the
+scale of 1000 nodes. The following settings must be adjusted to accommodate
+that scale:
+
+- ``dns_replicas: 6``
+
+- ``dns_cpu_limit: 100m``
+
+- ``dns_memory_limit: 512Mi``
+
+- ``dns_cpu_requests: 70m``
+
+- ``dns_memory_requests: 70Mi``
+
+The number of instances of the ``kubedns`` pod was increased to 6 to
+handle the load generated by a cluster of the given size.
+
+The following limits were tuned in the ``dnsmasq`` configuration:
+
+* the number of parallel connections the daemon can handle
+  was increased to 1000::
+
+ --dns-forward-max=1000
+
+* the size of the cache was set to the highest possible value of 10000.
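+
+  In ``dnsmasq`` terms this presumably corresponds to::
+
+      --cache-size=10000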
+
+Ansible
+~~~~~~~
+
+Several parameters in the Ansible configuration have to be adjusted to
+improve its robustness in higher scale environments. This includes
+the following (an illustrative sketch follows the list):
+
+- ``forks`` sets the number of parallel processes to spawn when communicating
+  with remote hosts.
+
+- ``timeout`` sets the default SSH timeout on connection attempts.
+
+- ``download_run_once`` and ``download_localhost`` boolean parameters
+  control how container images are distributed to nodes.
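+
+A sketch of how these might be set; the values are illustrative, not the
+ones used in the test environment. ``forks`` and ``timeout`` live in
+``ansible.cfg``, while the download flags are Kargo group variables::
+
+    # ansible.cfg
+    [defaults]
+    forks = 50
+    timeout = 30
+
+    # Kargo group variables (e.g. group_vars/all.yml)
+    download_run_once: true
+    download_localhost: true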
+
+Calico
+~~~~~~
+
+In the tested architecture Calico was configured without route
+reflectors for BIRD BGP daemons. Therefore, Calico established
+a full mesh of connections between all nodes in the cluster. This
+operation took significant time during node startup.
+
+It is recommended to configure route reflectors for BGP daemons
+in all cases at scale of 1000 nodes. This will reduce the
+number of BGP connections across the cluster and improve
+startup time for nodes.
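+
+Depending on the ``calicoctl`` version in use, disabling the full
+node-to-node mesh (so that nodes only peer with the route reflectors)
+might look roughly like this; treat it as an assumption to be verified
+against the deployed Calico release::
+
+    calicoctl config set nodeToNodeMesh off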
+
+Lifecycle Management
+--------------------
+
+Validation
+~~~~~~~~~~
+
+Many LCM use-cases may cause destructive consequences for the cluster.
+We should cover such use-cases with static validation, because it is easy
+to make a mistake when a user edits the configuration files.
+
+Examples of such use-cases:
+
+- Check that there are nodes with Master related services.
+
+- Check that quorum for etcd cluster is satisfied.
+
+- Check that scale down or node decommissioning does not cause data
+  loss.
+
+The validation checks should be implemented on the CI/CD level: when a new
+patch is published, a set of gates should be started, where the validation
+logic is implemented. Based on the gates configuration, they may or may not
+block the patch for promotion to staging or production.
+
+Scale up
+~~~~~~~~
+
+The user assigns a role to a new node in the configuration file; after the
+changes are committed to the branch, CI/CD runs Ansible playbooks.
+
+Master
+^^^^^^
+
+1. Deploy additional master node.
+
+2. Ensure that after the new component is deployed, it is available via
+   endpoints.
+
+Minion
+^^^^^^
+
+1. Deploy additional minion node.
+
+2. Enable workload scheduling on new node.
+
+Scale down
+~~~~~~~~~~
+
+Scaledown can also be described as Node Deletion. During scaledown the user
+should remove the node from the configuration file and add the node for
+decommissioning.
+
+Master
+^^^^^^
+
+1. Run Ansible against the cluster to make sure that the node being
+ deleted is not present in any service's configuration.
+
+2. Run node decommissioning.
+
+Minion
+^^^^^^
+
+1. Disable scheduling to the minion being deleted.
+
+2. Move workloads away from the minion (an illustration of steps 1 and 2
+   is given after this list).
+
+3. Run decommission of services managed by Ansible (see section
+ `Installation`_).
+
+4. Run node decommissioning.
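+
+Steps 1 and 2 above can be illustrated with standard Kubernetes tooling
+(the node name is hypothetical)::
+
+    # stop scheduling new pods to the node
+    kubectl cordon minion-042
+
+    # evict existing workload from the node
+    kubectl drain minion-042 --ignore-daemonsets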
+
+Test Plan
+~~~~~~~~~
+
+- Initial deploy
+
+  Tests must verify that the Kubernetes cluster has all required
+  services and is generally functional in terms of standard
+  operations, e.g. adding and removing a pod, service and other
+  entities.
+
+- Scaleup
+
+  Verify that a Master node and a Minion node can be added to
+  the cluster. The cluster must remain functional, in the terms
+  defined above, after the scaleup operation.
+
+- Scaledown
+
+  Verify that the cluster retains its functionality after
+  removing a Master or Minion node. This test set is subject
+  to additional limitations on the number of removed nodes,
+  since there is an absolute minimum of nodes required for
+  a Kubernetes cluster to function.
+
+- Update
+
+  Verify that updating a single service or a set thereof
+  doesn't degrade functions of the cluster compared to
+  its initial deploy state.
+
+ - Intrusive
+
+ - Non-intrusive
+
+- Rollback
+
+  Verify that restoring the version of one or more components to
+  the previously working state after they were updated does not
+  lead to degradation of functions of the cluster.
+
+- Rollout abort
+
+  Verify that if a rollout operation is aborted, the cluster
+  can be reverted to a working state by resuming the operation.
+
+Updating
+--------
+
+Updating is one of the most complex Lifecycle Management use-cases, which is
+the reason a dedicated section is allocated for it. We
+split update use-cases into two groups. The first group,
+“Non-intrusive”, is the simplest one: updates of services which do not
+cause workload downtime. The second, “Intrusive”, is more complicated,
+since it may cause workload downtime and has to involve a sequence of
+actions in order to move stateful workload to a different node in the
+cluster.
+
+The update procedure starts with publishing a new version of an image in the
+Docker repository. Then the service's metadata should be updated to the new
+version by the operator of the cloud in the staging or production branch of
+the configuration repository for the Kubernetes cluster.
+
+Non-intrusive
+~~~~~~~~~~~~~
+
+Non-intrusive type of update does not cause workload downtime, hence it
+does not require workload migration.
+
+Master
+^^^^^^
+
+Update of Master nodes with minimal downtime can be achieved if
+Kubernetes is installed in HA mode, with a minimum of 3 nodes.
+
+Key points in updating Master related services:
+
+- The first action which has to be run prior to the update is a backup of
+  Kubernetes related stateful services (in our case it is etcd).
+
+- Update of services managed by Ansible is done by ensuring the version of
+  the running docker image and updating it in systemd and related
+  services.
+
+- Update of services managed by Kubelet is done by updating the files
+  with Pod descriptions, which contain the specific version.
+
+- Nodes have to be updated one-by-one, without restarting services on
+  all nodes simultaneously.
+
+Minion
+^^^^^^
+
+Key points in updating Minion nodes, where workload is run:
+
+- Prior to restarting Kubelet, Kubernetes has to be notified that
+  the Kubelet is under maintenance and
+  its workload must not be rescheduled to a different node.
+
+- Update of Kubelet should be managed by Ansible.
+
+- Update of services managed by Kubelet is done by updating the files
+  with Pod descriptions.
+
+Intrusive
+~~~~~~~~~
+
+An intrusive update is an update which may cause workload downtime, so a
+separate update flow for such kinds of updates has to be considered. In
+order to provide an update with minimal downtime for the tenant we want to
+leverage VM Live Migration capabilities. Migration requires starting the
+maintenance procedure in the right order, in batches of specific sizes.
+
+Common
+^^^^^^
+
+- Services managed by Ansible are updated using Ansible playbooks
+  which trigger a pull of the new version and a restart.
+
+- If a service is managed by Kubelet, Ansible only updates the static
+  manifest and Kubelet automatically updates the services it manages.
+
+Master
+^^^^^^
+
+Since Master nodes do not run user workload, the key points for
+update are the same as for the “Non-intrusive” use-cases.
+
+Minion
+^^^^^^
+
+The user's workload runs on Minion nodes; in order to apply intrusive
+updates, the rollout system has to move workload to a different node. On big
+clusters batched updates will be required to achieve faster
+rollout.
+
+Key requirements for Kubernetes installer and orchestrator:
+
+- The Kubernetes installer is agnostic of which workloads run in the
+  Kubernetes cluster and in VMs on top of OpenStack, which works as a
+  Kubernetes application.
+
+- The Kubernetes installer should receive a rollout plan, where the order
+  and grouping of nodes, updates of which can be rolled out in
+  parallel, are defined. This update plan will be generated by a
+  different tool, which knows “something” about the types of workload
+  run on the cluster.
+
+- In order to move workload to a different node, the installer has to
+  trigger workload evacuation from the node.
+
+  - Scheduling of new workload to the node should be disabled.
+
+  - The node has to be considered in maintenance mode, so that
+    unavailability of kubelet does not cause workload
+    rescheduling.
+
+  - The installer has to trigger workload evacuation in kubelet; kubelet
+    should use hooks defined in Pods to start workload migration.
+
+- In the rollout plan it should be possible to specify when to fail the
+  rollout procedure.
+
+  - If some percentage of nodes failed to update.
+
+  - Some nodes may be critical for failure; it is important to
+    provide per node configuration specifying whether to stop the
+    rollout procedure if such a node failed to be updated.
+
+Limitations
+~~~~~~~~~~~
+
+Hyperkube
+^^^^^^^^^
+
+The current Kubernetes delivery mechanism relies on the Hyperkube
+distribution. Hyperkube is a single binary file which contains the whole set
+of core Kubernetes components, e.g. API, Scheduler, Controller, etc. The
+problem with this approach is that a bug-fix for the API causes an update of
+all core Kubernetes containers: even if the API is installed on a few
+controllers, the new version has to be rolled out to all thousands of minions.
+
+Possible solutions:
+
+- Roll out different versions of Hyperkube for different roles. This
+  approach significantly complicates the version and fix tracking
+  process.
+
+- Split between those roles and create different images for them.
+  The problem will remain since most of the core components are
+  developed in a single repository and released together; hence it
+  is still an issue that, if a release tag is published on the repo,
+  a rebuild of all core components will be required.
+
+For now we go with the native way of distribution until a better solution is
+found.
+
+Update Configuration
+~~~~~~~~~~~~~~~~~~~~
+
+In most cases, an update of configurations should not cause downtime.
+
+- Update of Kubernetes and related services (calico, etcd, etc).
+
+- Rotation of SSL certificates (e.g. those which are used for Kubelet
+ authentication)
+
+Abort Rollout
+~~~~~~~~~~~~~
+
+Despite the fact that this operation may be dangerous, the user should be
+able to interrupt the update procedure.
+
+Rollback
+~~~~~~~~
+
+Some of the operations are impossible to roll back; a rollback may require
+a different flow of actions to be executed on the cluster.
+
+Troubleshooting
+---------------
+
+There should be a simple way to provide a developer with tooling for
+debugging and troubleshooting. These tools should not be installed on
+each machine by default, but there should be a simple way to get these
+tools installed on demand.
+
+- Image with all tools required for debugging
+
+- Container should be run in privileged mode with host networking.
+
+- The user can roll out this container to the required nodes using Ansible.
+
+Examples of tools which may be required:
+
+- Sysdig
+
+- Tcpdump
+
+- Strace/Ltrace
+
+- Clients for etcd, calico etc
+
+- ...
+
+Open questions
+--------------
+
+- Networking node?
+
+Related links
+-------------
+
+- `Keepalived based VIP management for Kubernetes
+  `__
+
+- `HA endpoints for K8s in Kargo
+ `__
+
+- `Large deployments in Kargo
+ `__
+
+- `ECMP load balancing for external IPs
+ `__
+
+Contributors
+------------
+
+- Evgeny Li
+
+- Matthew Mosesohn
+
+- Bogdan Dobrelya
+
+- Jedrzej Nowak
+
+- Vladimir Eremin
+
+- Dmytriy Novakovskiy
+
+- Michael Korolev
+
+- Alexey Shtokolov
+
+- Mike Scherbakov
+
+- Vladimir Kuklin
+
+- Sergii Golovatiuk
+
+- Aleksander Didenko
+
+- Ihor Dvoretskyi
+
+- Oleg Gelbukh
+
+Appendix A. High Availability Alternatives
+------------------------------------------
+
+This section contains some High Availability options that were
+considered and researched, but deemed too complicated or too
+risky to implement in the first iteration of the project.
+
+Option #1 VIP for external and internal with native etcd proxy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The first approach to Highly Available Kubernetes with Kargo assumes
+using a VIP for external and internal access to the Kubernetes API, and etcd
+proxy for internal access to the etcd cluster.
+
+- VIP for external and internal access to Kubernetes API.
+
+- VIP for external access to etcd.
+
+- Native etcd proxy on each node for internal access to etcd cluster.
+
+|image1|
+
+Option #2 VIP for external and Proxy on each node for internal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The second considered option is that each node that needs to access the
+Kubernetes API also has a Proxy Server installed. Each Proxy forwards
+traffic to live Kubernetes API backends. External clients access
+Etcd and the Kubernetes API using a VIP.
+
+- Internal access to APIs is done via proxies which are installed
+ locally.
+
+- External access is done via Virtual IP address.
+
+|image2|
+
+Option #3 VIP for external Kubernetes API on each node
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Another option, similar to “VIP for external and Proxy on each node for
+internal”, is to install the Kubernetes API on each node which
+requires access to it, instead of installing a Proxy which forwards the
+traffic to the Kubernetes API on master nodes.
+
+- VIP on top of proxies for external access.
+
+- Etcd proxy on each node for internal services.
+
+- Kubernetes API on each node, where access to Kubernetes is required.
+
+**This option was selected despite potential limitations listed
+above.**
+
+|image3|
+
+Option #4 VIP for external and internal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to achieve High Availability of the Kubernetes master, a proxy
+server on every master node can be used; each proxy is configured to forward
+traffic to all available backends in the cluster (e.g. etcd,
+kubernetes-api). There also has to be a mechanism to achieve High
+Availability between these proxies; it can be achieved by a VIP managed by a
+cluster management system (see the “High Availability between proxies”
+section).
+
+- Internal and External access to Etcd or Kubernetes cluster is done
+ via Virtual IP address.
+
+- The Kubernetes API also accesses Etcd using the VIP.
+
+|image4|
+
+Option #5 VIP for external native Kubernetes proxy for internal
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We considered using the native Kubernetes proxy for forwarding traffic
+between APIs. Kubernetes proxy cannot work without the Kubernetes API, hence
+the API would have to be installed on each node where Kubernetes proxy is
+installed. If the Kubernetes API is installed on each node, there is no
+reason to use Kubernetes proxy to forward traffic to it; internal
+clients can access the Kubernetes API through localhost.
+
+.. |image0| image:: media/k8s_1000_nodes/image07.png
+ :width: 3.36979in
+ :height: 1.50903in
+.. |image1| image:: media/k8s_1000_nodes/image09.png
+ :width: 6.37500in
+ :height: 4.01389in
+.. |image2| image:: media/k8s_1000_nodes/image08.png
+ :width: 6.37500in
+ :height: 4.13889in
+.. |image3| image:: media/k8s_1000_nodes/image11.png
+ :width: 6.37500in
+ :height: 4.59722in
+.. |image4| image:: media/k8s_1000_nodes/image03.png
+ :width: 6.37500in
+ :height: 4.12500in
diff --git a/doc/source/design/media/k8s_1000_nodes/image03.png b/doc/source/design/media/k8s_1000_nodes/image03.png
new file mode 100644
index 00000000..7129757e
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image03.png differ
diff --git a/doc/source/design/media/k8s_1000_nodes/image07.png b/doc/source/design/media/k8s_1000_nodes/image07.png
new file mode 100644
index 00000000..94bb30d3
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image07.png differ
diff --git a/doc/source/design/media/k8s_1000_nodes/image08.png b/doc/source/design/media/k8s_1000_nodes/image08.png
new file mode 100644
index 00000000..8190b027
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image08.png differ
diff --git a/doc/source/design/media/k8s_1000_nodes/image09.png b/doc/source/design/media/k8s_1000_nodes/image09.png
new file mode 100644
index 00000000..9bf2753f
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image09.png differ
diff --git a/doc/source/design/media/k8s_1000_nodes/image10.gif b/doc/source/design/media/k8s_1000_nodes/image10.gif
new file mode 100644
index 00000000..66a11fad
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image10.gif differ
diff --git a/doc/source/design/media/k8s_1000_nodes/image11.png b/doc/source/design/media/k8s_1000_nodes/image11.png
new file mode 100644
index 00000000..37c4a0d1
Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image11.png differ
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 726f4403..baa9ffbf 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -67,6 +67,7 @@ Design docs
design/ost_compute_on_k8s
design/ref_arch_100_nodes
design/ref_arch_1000_nodes
+ design/k8s_1000_nodes_architecture
Indices and tables
------------------