diff --git a/doc/source/design/k8s_1000_nodes_architecture.rst b/doc/source/design/k8s_1000_nodes_architecture.rst new file mode 100644 index 00000000..fb26ca8f --- /dev/null +++ b/doc/source/design/k8s_1000_nodes_architecture.rst @@ -0,0 +1,1322 @@ +.. _k8s_1000_nodes: + +=========================================== +Kubernetes Master Tier For 1000 Nodes Scale +=========================================== + +.. contents:: Table of Contents + +Introduction +------------ + +This document describes architecture, configuration and installation +workflow of Kubernetes cluster for OpenStack Containerised Control Plane +(CCP) on a set of hosts, either baremetal or virtual. Proposed architecture +should scale up to 1000 nodes. + +Scope of the document +~~~~~~~~~~~~~~~~~~~~~ + +This document does not cover preparation of host nodes and installation +of a CI/CD system. This document covers only Kubernetes and related +services on a preinstalled operating system with configured partitioning +and networking. + +Monitoring related tooling will be installed on ready to use Kubernetes +as Pods, after Kubernetes installer finishes installation. This document +does not cover architecture and implementation details of monitoring and +profiling tools. + +Lifecycle Management section describes only Kubernetes and related +services. It does not cover applications that run in Kubernetes cluster. + +Solution Prerequisites +---------------------- + +Hardware +~~~~~~~~ + +The proposed design was verified on a hardware lab that included 181 +physical hosts of the following configuration: + +- Server model: HP ProLiant DL380 Gen9 + +- CPU: 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz + +- RAM: 264G + +- Storage: 3.0T on RAID on HP Smart Array P840 Controller + +- HDD: 12 x HP EH0600JDYTL + +- Network: 2 x Intel Corporation Ethernet 10G 2P X710 + +3 out of the 181 hosts were used to install Kubernetes Master control +plane services. On every other host, 5 virtual machines were started +to ensure contention of resources and serve as Minion nodes in Kubernetes +cluster. + +Minimal requirements for the control plane services at scale of +1000 nodes are relatively modest. Tests demonstrate that three physical +nodes in the configuration specified above are sufficient to run +all control plane services for cluster of this size, even though +an application running on top of the cluster is rather complex +(i.e. OpenStack control plane + compute cluster). + +Provisioning +~~~~~~~~~~~~ + +Hosts for Kubernetes cluster must be prepared by a provsioning system of +some sort. It is assumed that users might have their own provisioning +system to handle prerequisites for this. + +Provisioning system provides installed and configured operating system, +networking, partitioning. It should operate on its own subset of cluster +metadata. Some elements of that metadata will be used by installer tools +for Kubernetes Master and OpenStack Control tiers. + +The following prerequisites are required from Provisioning system. + +Operating System +^^^^^^^^^^^^^^^^ + +- Ubuntu 16.04 is default choice of operating system. + +- It has to be installed and configured by provisioning system. + +Networking +^^^^^^^^^^ + +Before the deployment starts networking has to be configured and +verified by underlay tooling: + +- Bonding. + +- Bridges (possibly). + +- Multi-tiered networking. + +- IP addresses assignment. + +- SSH access from CI/CD nodes to cluster nodes (is required for + Kubernetes installer). 
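+
+As an illustration only (host preparation itself is out of scope of this
+document), the bonding prerequisite above could be satisfied by the
+provisioning system with an ``ifupdown``/``ifenslave`` configuration
+similar to the following sketch; interface names and addresses here are
+assumptions::
+
+    # /etc/network/interfaces fragment on Ubuntu 16.04 (illustrative)
+    auto bond0
+    iface bond0 inet static
+        address 10.0.0.10
+        netmask 255.255.255.0
+        bond-mode 802.3ad
+        bond-miimon 100
+        bond-slaves ens1f0 ens1f1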
+ +Such things as DPDK and Contrail can be most likely configured in +containers boot in privileged mode, no underlay involvement is required: + +- Load DKMS modules + +- Change runtime kernel parameters + +Partitioning +^^^^^^^^^^^^ + +Nodes should be efficiently pre-partitioned (e.g. separation of ``/``, +``/var/log``, ``/var/lib`` directories). + +Additionally it’s required to have LVM Volume Groups, which further will +be used by: + +- LVM backend for ephemeral storage for Nova. + +- LVM backend for Kubernetes, it + may be required to create several Volume Groups for Kubernetes, + e.g. some of the services require SSD (InfluxDB), other will work + fine on HDD. + +Some customers also require Multipath disks to be configured. + +Additional Ansible packages (optional) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Currently `Kubespray `__ project is +employed for installing Kubernetes. It provides Calico and +Ubuntu/Debian support. + +Kubespray Ansible playbooks (or Kargo) are accepted into `Kubernenes +incubator `__ by the community. + +Ansible requires: + +- ``python2.7`` +- ``python-netaddr`` + +Ansible 2.1.0 or greater is required for Kargo deployment. + +Ansible installs and manages Kubernetes related services (see +Components section) which should be delivered and +installed as containers. Kubernetes has to be installed in HA mode, so +that failure of a single master node does not cause control plane +down-time. + +The long term strategy should be to reduce amount of Ansible playbooks +we have to support and to do initial deployment and Lifecycle Management +with Kubernetes itself and related tools. + +Node Decommissioning +^^^^^^^^^^^^^^^^^^^^ + +Many Lifecycle Management scenarios require nodes decommissioning +procedure. Strategy on decommissioning may depend on the customer and +tightly coupled with Underlay tooling. + +In order to properly remove the node from the cluster, a sequence of +actions has to be performed by overlay tooling, to gracefully remove +services from cluster and migrate workload (depends on the role). + +Possible scenarios of node decommissioning for underlay tooling: + +- Shut the node down. + +- Move node to bootstrap stage. + +- As a common practise we should not erase disks of the node, customers + occasionally delete their production nodes, there should be a way + to recover them (if they were not recycled). + +CI/CD +~~~~~ + +Runs a chain of jobs in predefined order, like deployment and +verification. CI/CD has to provide a way to trigger a chain of jobs (git +push trigger -> deploy -> verify), also there should be a way to share +data between different jobs for example if IP allocation happens on job +execution allocated IP addresses should be available for overlay +installer job to consume. + +Non comprehensive list of functionality: + +- Jobs definitions. + +- Declarative definition of jobs pipelines. + +- Data sharing between jobs. + +- Artifacts (images, configurations, packages etc). + +User experience +^^^^^^^^^^^^^^^ + +1. User should be able to define a mapping of node and high level roles + (master, minion) also there should be a way to define mapping + more granularly (e.g. etcd master on separate nodes). + +2. After the change in pushed CI/CD job for rollout is triggered, + Ansible starts Kubernetes deployment from CI/CD via SSH (the + access from CI/CD to Kubernetes cluster using SSH has to be + provided). + +Updates +^^^^^^^ + +When new package is published (for example libssl) it should trigger a +chain of jobs: + +1. 
Build new container image (Etcd, Calico, Hyperkube, Docker etc) + +2. Rebuild all images which depend on base + +3. Run image specific tests + +4. Deploy current production version on staging + +5. Run verification + +6. Deploy update on staging + +7. Run verification + +8. Send for promotion to production + +Solution Overview +----------------- + +Current implementation considers two high-level groups of services - +Master and Minion. Master nodes should run control-plane related +services. Minion nodes should run user’s workload. In the future, +additional Network node might be added. + +There are few additional requirements which should be addressed: + +- Components placement should be flexible enough to install most of the + services on different nodes, for example it may be required to + install etcd cluster members to dedicated nodes. + +- It should be possible to have a single-node installation, when all + required services to run Kubernetes cluster can be placed on a + single node. Using scale up mechanism it should be possible to + make the cluster HA. It would reduce amount of resources required + for development and testing of simple integration scenarios. + +Common Components +~~~~~~~~~~~~~~~~~ + +- Calico is an SDN controller that provides pure L3 networking to + Kubernetes cluster. It includes the following most important + components that run on every node in the cluster. + + - Felix is an agent component of Calico, responsible for configuring + and managing routing tables, network interfaces and filters on + pariticipating hosts. + + - Bird is a lightweight BGP daemon that allows for exchange of + addressing information between nodes of Calico network. + +- Kubernetes + + - kube-dns provides discovery capabilities for Kubernetes Services. + + - kubelet is an agent service of Kubernetes. It is responsible for + creating and managing Docker containers at the nodes of + Kubernetes cluster. + +Plugins for Kubernetes should be delivered within Kubernetes containers. +The following plugins are required: + +- CNI plugin for integration with Calico SDN. + +- Volume plugins (e.g. Ceph, Cinder) for persistent storage. + +Another option which may be considered in the future, is to deliver +plugins in separate containers, but it would complicate rollout of +containers, since requires to rollout containers in specific order to +mount plugins directory. + +Master Components +~~~~~~~~~~~~~~~~~ + +Master Components of Kubernetes control plane run on Master nodes. +The proposed architecture includes 3 Master nodes with similar set +of components running on every node. + +In addition to Common, the following components run on Master nodes: + +- etcd + +- Kubernetes + + - Kubedns + + - Kube-proxy (iptables mode) + + - Kube-apiserver + + - Kube-scheduler + + - Kube-controller-manager + +Each component runs on container. Some of them are running in static +pods in Kubernetes. Others are running as docker containers under +management of operating system (i.e. as ``systemd`` service). See +details in Installation section below. + +Minion Components +~~~~~~~~~~~~~~~~~ + +Everything from Common plus: + +- etcd-proxy is a mode of operation of etcd which doesn't provide + storage, but rather redirects requests to alive nodes in etcd + clutser. + +Optional Components +~~~~~~~~~~~~~~~~~~~ + +- Contrail SDN is an alternative to Calico in cases when L2 features + required. + +- Flannel is another alternative implementation of CNI plugin for + Kubernetes. As Calico, it creates an L3 overlay network. 
+
+- Tools for debugging (see Troubleshooting below).
+
+Component Versions
+~~~~~~~~~~~~~~~~~~
+
+================ ===============
+Component        Version
+================ ===============
+Kubernetes       1.4
+---------------- ---------------
+Etcd             3.0.12
+---------------- ---------------
+Calico           0.21-dev
+---------------- ---------------
+Docker           1.12.3
+================ ===============
+
+Components Overview
+-------------------
+
+Kubernetes
+~~~~~~~~~~
+
+kube-apiserver
+^^^^^^^^^^^^^^
+
+This server exposes the Kubernetes API to internal and external clients.
+
+The proposed architecture includes 3 API server pods running on 3 different
+nodes for redundancy and load distribution purposes. API servers run as
+static pods, defined by a kubelet manifest
+(``/etc/kubernetes/manifests/kube-apiserver.manifest``). This manifest is
+created and managed by the Kubernetes installer.
+
+kube-scheduler
+^^^^^^^^^^^^^^
+
+The scheduler service of the Kubernetes cluster monitors the API server
+for unallocated pods and automatically assigns every such pod to a node
+based on filters or 'predicates' and weights or 'priority functions'.
+
+Scheduler runs as a single-container pod. Similarly to the API server,
+it is a static pod, defined and managed by the Kubernetes installer.
+Its manifest lives in ``/etc/kubernetes/manifests/kube-scheduler.manifest``.
+
+The proposed architecture suggests that 3 instances of the scheduler
+run on 3 Master nodes. These instances are joined in a cluster with an
+elected leader that is active, and two warm stand-by spares. When the
+leader is lost for some reason, a re-election occurs and one of the
+spares becomes the active leader.
+
+The following parameters control election of the leader and are set
+for the scheduler:
+
+- Leader election parameter for scheduler must be “true”.
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
+
+kube-controller-manager
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Controller manager runs the main loops of all entities (controllers)
+supported by the Kubernetes API. It is similar to the scheduler and API
+server in terms of configuration: it is a static pod defined and managed
+by the Kubernetes installer via the manifest file
+``/etc/kubernetes/manifests/kube-controller-manager.manifest``.
+
+In the proposed architecture, 3 instances of controller manager run
+in the same clustered mode as schedulers, with 1 active leader and
+2 stand-by spares.
+
+The same set of parameters controls election of the leader for controller
+manager as well:
+
+- Leader election parameter for controller manager must be “true”
+
+- Leader elect lease duration
+
+- Leader elect renew deadline
+
+- Leader elect retry period
+
+kube-proxy
+^^^^^^^^^^
+
+Kubernetes proxy
+`forwards traffic `__
+to alive Kubernetes Pods. This is an internal component that exposes
+Services created via the Kubernetes API inside the cluster. An
+Ingress/Proxy server is required to expose services outside of the
+cluster via a globally routed virtual IP (see above).
+
+The pod ``kube-proxy`` runs on every node in the cluster. It is a static
+pod defined by the manifest file
+``/etc/kubernetes/manifests/kube-proxy.manifest``. It includes a single
+container that runs the ``hyperkube`` application in proxy mode.
+
+kubedns
+^^^^^^^
+
+Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
+the kubelets to tell individual containers to use the DNS Service’s IP to
+resolve DNS names.
+ +The DNS pod (``kubedns``) includes 3 containers: + +- ``kubedns`` is a resolver that communicates to API server and controls + DNS names resolving + +- ``dnsmasq`` is a relay and cache provider + +- ``healthz`` is a health check service + +In the proposed architecture, ``kubedns`` pod is controller by +ReplicationController with replica factor 1, which means that only +one instance of the pod is working in a cluster at any time. + +Etcd Cluster +~~~~~~~~~~~~ + +Etcd is a distributed, consistent key-value store for shared +configuration and service discovery, with a focus on being: + +- Simple: well-defined, user-facing API (gRPC) + +- Secure: automatic TLS with optional client cert authentication + +- Fast: benchmarked 10,000 writes/sec + +- Reliable: properly distributed using Raft + +``etcd`` is written in Go and uses the Raft consensus algorithm to +manage a highly-available replicated log. + +Every instance of ``etcd`` can operate in one of the two modes: + +- full mode + +- proxy mode + +In *full mode*, the instance participates in Raft consensus and +has persistent storage. + +In *proxy mode*, ``etcd`` acts as a reverse proxy and forwards client +requests to an active etcd cluster. The etcd proxy does not +participate in the consensus replication of the etcd cluster, +thus it neither increases the resilience nor decreases the write +performance of the etcd cluster. + +In proposed architecture, ``etcd`` runs as a static container +under control of host operating system. See details below in +Installation section. The assumed version of ``etcd`` in this +proposal is ``etcdv2``. + +Etcd full daemon +^^^^^^^^^^^^^^^^ + +There are three instances of ``etcd`` running in full mode on Master +nodes in the proposed solution. This ensures the quorum in the cluster +and resiliency of service. + +Etcd native proxy +^^^^^^^^^^^^^^^^^ + +Etcd in proxy mode runs on every node in Kubernetes cluster, including +Masters and Minions. It automatically forwards requests to active Etcd +cluster members. `According to the +documentation `__ +it’s recommended etcd cluster architecture. + +Calico +~~~~~~ + +Calico is an L3 overlay network provider for Kubernetes. It +propagates internal addresses of containers via BGP to all +minions and ensures connectivity between containers. + +Calico uses etcd as a vessel for its configuraiton information. +Separate etcd cluster is recommended for Calico instead of sharing +one with Kubernetes. + +calico-node +^^^^^^^^^^^ + +In the proposed architecture, Calico is integrated with Kubernetes +as Common Network Interface (CNI) plugin. + +The Calico container called ``calico-node`` runs on every node in +Kubernetes cluster, including Masters and Minions. It is controlled +by operating system directly as ``systemd`` service. + +The ``calico-node`` container incorporates 3 main services of Calico: + +- `Felix `__, + the primary Calico agent. It is responsible for programming routes and + ACLs, and anything else required on the host, in order to provide the + desired connectivity for the endpoints on that host. +- `BIRD `__ + is a BGP client that distributes routing information. +- `confd` is a dynamic configuration manager for BIRD, triggered + automatically by updates in the configuration data. + +High Availability Architecture +------------------------------ + +Proxy server +~~~~~~~~~~~~ + +Proxy server should forward traffic to alive backends, health checking +mechanism has to be in place to stop forwarding traffic to unhealthy +backends. 
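+
+As a minimal sketch of such a forwarding proxy, assuming three
+kube-apiserver instances on illustrative addresses and ports, an Nginx
+``stream`` configuration could look roughly as follows (round-robin is
+the default balancing method, and a backend that fails to respond is
+temporarily taken out of rotation, which acts as a passive health
+check)::
+
+    stream {
+        # kube-apiserver backends; addresses and ports are assumptions
+        upstream kube_apiserver {
+            server 10.0.0.1:6443 max_fails=2 fail_timeout=10s;
+            server 10.0.0.2:6443 max_fails=2 fail_timeout=10s;
+            server 10.0.0.3:6443 max_fails=2 fail_timeout=10s;
+        }
+
+        server {
+            listen 443;
+            proxy_pass kube_apiserver;
+            proxy_connect_timeout 1s;
+            # keep long-polling "watch" sessions alive
+            # (see Scaling to 1000 Nodes below)
+            proxy_timeout 10m;
+        }
+    }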
+
+Nginx is used to implement the Proxy service. It is deployed in a static
+pod, one pod per cluster. It provides access to the K8s API on a single
+endpoint by redirecting requests to instances of kube-apiserver in a
+round-robin fashion. It exposes the endpoint both to external clients and
+internal clients (i.e. Kubernetes minions).
+
+SSL termination
+~~~~~~~~~~~~~~~
+
+SSL termination can be optionally configured on the Nginx server. From
+there, traffic to instances of kube-apiserver will go over the internal
+K8s network.
+
+Proxy Resiliency Alternatives
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the Proxy Server is a single point of failure for the
+Kubernetes API and exposed Services, it must run in a highly available
+configuration. The following alternatives were considered for the high
+availability solution:
+
+1. `Keepalived `__
+   Although `Keepalived has problems with split brain
+   detection `__ there is `a
+   subproject in
+   Kubernetes `__
+   which uses Keepalived with an attempt to implement VIP management.
+
+2. `OSPF `__
+   Using the OSPF routing protocol for resilient access and failover
+   between Proxy Servers requires configuration of external routers
+   consistently with internal OSPF configurations.
+
+3. VIP managed by `cluster management
+   tools `__
+   Etcd might serve as a cluster management tool for a virtual IP address
+   on which the Proxy Server is listening. It would allow converging the
+   technology stack of the whole solution.
+
+4. DNS-based reservation
+   Implementing DNS based High Availability is very
+   `problematic `__
+   due to caching on the client side. It also requires additional tools
+   for fencing and failover of faulty Proxy Servers.
+
+Resilient Kubernetes Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the proposed architecture, there is a single static pod with Proxy
+Server running under control of Kubelet on every Minion node.
+
+Each of the 3 Master nodes runs its own instance of ``kube-apiserver``
+on the localhost address. All services working on a Master node address
+the Kubernetes API locally. All services on Minion nodes connect to
+the API via a local instance of the Proxy Server.
+
+Etcd daemons forming the cluster run on Master nodes. Every node in
+the cluster also runs etcd-proxy. This includes both Masters and
+Minions. Any service that requires access to the etcd cluster talks
+to the local instance of etcd-proxy to reach it. External access to
+the etcd cluster is restricted.
+
+The Calico node container runs on every node in the cluster, including
+Masters and Minions.
+
+The following diagram summarizes the proposed architecture.
+
+|image3|
+
+Alternative approaches to the resiliency of Kubernetes cluster were
+considered, researched and summarized in `Appendix A. High Availability
+Alternatives`_.
+
+Next steps in development of this architecture include implementation of
+a Proxy server as an Ingress Controller. It will allow for closer
+integration with K8s in terms of pod mobility and life-cycle management
+operations. For example, an Ingress Controller can be written to only
+relay incoming requests to updated nodes during a rolling update. It also
+allows managing the virtual endpoint using native Kubernetes tools
+(see below).
+
+Logging
+-------
+
+Log collection is performed by a Heka broker running on all nodes in the
+Kubernetes cluster. It uses `Docker
+logging `__
+configured so that all logs are written to a volume. Heka reads files
+from the volume using the `Docker
+plugin `__
+and uploads them to ElasticSearch storage.
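+
+A hedged sketch of the Docker daemon logging settings this setup
+assumes, using the ``json-file`` driver with rotation (the rotation
+values are assumptions), could be placed in ``/etc/docker/daemon.json``::
+
+    {
+        "log-driver": "json-file",
+        "log-opts": {
+            "max-size": "50m",
+            "max-file": "5"
+        }
+    }
+
+With the ``json-file`` driver, container logs end up as files under
+``/var/lib/docker/containers`` on the host, from where Heka can read
+them via the mounted volume.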
+
+Installation
+------------
+
+This section describes the installation of a Kubernetes cluster on
+pre-provisioned nodes.
+
+The following list shows containers that belong to the Kubernetes
+Master Tier and run under control of systemd on Master and/or
+Minion nodes, along with a short explanation of why this is necessary
+in every case:
+
+- Etcd
+
+  - Should have a directory mounted from the host system.
+
+- Calico
+
+  - Depending on the network architecture it may be required to disable
+    node-to-node mesh and configure route reflectors instead. This
+    is especially recommended for large scale deployments (see below).
+
+- Kubelet
+
+  - The certificates directory should be mounted from the host system in
+    Read Only mode.
+
+The following containers are defined as ReplicationController objects
+in the Kubernetes API:
+
+- kubedns
+
+All other containers are started as `static
+pods `__ by Kubelet in
+the 'kube-system' namespace of the Kubernetes cluster. This includes:
+
+- kube-apiserver
+
+- kube-scheduler
+
+- kube-controller-manager
+
+- Proxy Server (nginx)
+
+- dnsmasq
+
+.. note::
+
+   An option to start all other services in Kubelet is being considered.
+   There is a potential chicken-and-egg type issue: Kubelet requires the
+   `CNI `__ plugin to
+   be configured prior to its start, and as a result, when the Calico pod
+   is started by Kubelet, it tries to perform a hook for a plugin and
+   `fails
+   `__.
+   This happens if a pod uses host networking as well.
+   After several attempts Kubelet starts the container, but currently
+   such cases `are not handled
+   explicitly `__.
+
+Common practices
+~~~~~~~~~~~~~~~~
+
+- Manifests for static Pods should be mounted (read only) from the host
+  system; this simplifies the update and reconfiguration procedure.
+
+- SSL certificates and any secrets should be mounted (read only) from
+  the host system and should have appropriate permissions.
+
+Installation workflow
+~~~~~~~~~~~~~~~~~~~~~
+
+#. Ansible retrieves SSL certificates.
+
+#. Ansible installs and configures Docker.
+
+   a. Systemd config
+
+   b. Use external registry
+
+#. All control-plane related Pods must be started in the separate
+   namespace ``kube-system``. This will allow restricting access to
+   control plane pods `in future `__.
+
+#. Ansible generates manifests for static pods and writes them to the
+   ``/etc/kubernetes/manifests`` directory.
+
+#. Ansible generates configuration files, systemd units and services
+   for Etcd, Calico and Kubelet.
+
+#. Ansible starts all systemd-based services listed above.
+
+#. When Kubelet is started, it reads the manifests and starts services
+   defined as static pods (see above).
+
+#. Run health-check.
+
+#. These operations are repeated for every node in the cluster.
+
+Scaling to 1000 Nodes
+---------------------
+
+Scaling a Kubernetes cluster to the magnitude of 1000 nodes requires
+certain changes to configuration and, in a few cases, to the source code
+of components.
+
+The following modifications were made to the default configuration
+deployed by the Kargo installer.
+
+Proxy Server
+~~~~~~~~~~~~
+
+The default value of the ``proxy_timeout`` parameter in Nginx
+caused issues with long-polling "watch" requests from kube-proxy
+and kubelet to the API server. Nginx by default terminates such sessions
+in 3 seconds. Once a session is cut, the Kubernetes client has to restore
+it, including a repeated SSL handshake, and at scale this generates
+high load on the Kube API servers, about 2000% of CPU in the given
+configuration.
+
+This problem was solved by changing the default value (3s) to a
+more appropriate value of 10m::
+
+    proxy_timeout: 10m
+
+As a result, CPU usage of ``kube-apiserver`` processes dropped
+by a factor of 10, to 100-200%.
+
+The `corresponding change `__
+was proposed to upstream Kargo.
+
+kube-apiserver
+~~~~~~~~~~~~~~
+
+The default rate limit of the Kube API server proved to be too low for
+the scale of 1000 nodes. Long before the top load is reached, the API
+server starts to return the ``429 Rate Limit Exceeded`` HTTP code.
+
+Rate limits were adjusted by passing a new value to ``kube-apiserver``
+with the ``--max-requests-inflight`` command line option. While the
+default value for this parameter is 400, it has to be raised to 2000 at
+the given scale to accommodate the actual rate of incoming requests.
+
+kube-scheduler
+~~~~~~~~~~~~~~
+
+Scheduling of so many pods with anti-affinity rules, as required by the
+CCP architecture, puts ``kube-scheduler`` under high load. A few
+optimizations were made to its code to accommodate the 1000
+node scale:
+
+* the scheduling algorithm was improved to reduce the number of expensive
+  operations: `pull request `__;
+
+* a cache eviction/miss bug in the scheduler has to be fixed to improve
+  handling of anti-affinity rules. It was `worked
+  around `__ in
+  Kubernetes, but the root cause still requires effort to fix.
+
+The active scheduler was placed on a dedicated hardware node in order
+to cope with the high load while scheduling a large number of OpenStack
+control plane pods.
+
+kubedns and dnsmasq
+~~~~~~~~~~~~~~~~~~~
+
+The default resource limits for dnsmasq in Kargo do not fit the scale of
+1000 nodes. The following settings must be adjusted to accommodate
+that scale:
+
+- ``dns_replicas: 6``
+
+- ``dns_cpu_limit: 100m``
+
+- ``dns_memory_limit: 512Mi``
+
+- ``dns_cpu_requests: 70m``
+
+- ``dns_memory_requests: 70Mi``
+
+The number of instances of the ``kubedns`` pod was increased to 6 to
+handle the load generated by a cluster of the given size.
+
+The following limits were tuned in the ``dnsmasq`` configuration:
+
+* the number of parallel connections the daemon could handle
+  was increased to 1000::
+
+      --dns-forward-max=1000
+
+* the size of the cache was set to the highest possible value of 10000.
+
+Ansible
+~~~~~~~
+
+Several parameters in the Ansible configuration have to be adjusted to
+improve its robustness in higher scale environments. This includes
+the following:
+
+- ``forks`` defines the number of parallel processes to spawn when
+  communicating with remote hosts.
+
+- ``timeout`` is the default SSH timeout on connection attempts.
+
+- ``download_run_once`` and ``download_localhost`` boolean parameters
+  control how container images are distributed to the nodes.
+
+Calico
+~~~~~~
+
+In the tested architecture Calico was configured without route
+reflectors for the BIRD BGP daemons. Therefore, Calico established
+a full mesh of BGP connections between all nodes in the cluster. This
+operation took significant time during node startup.
+
+It is recommended to configure route reflectors for the BGP daemons
+in all cases at the scale of 1000 nodes. This will reduce the
+number of BGP connections across the cluster and improve
+startup time of the nodes.
+
+Lifecycle Management
+--------------------
+
+Validation
+~~~~~~~~~~
+
+Many LCM use-cases may have destructive consequences for the cluster;
+such use-cases should be covered with static validation, because it is
+easy to make a mistake when a user edits the configuration files.
+
+Examples of such use-cases:
+
+- Check that there are nodes with Master related services.
+
+- Check that quorum for the etcd cluster is satisfied.
+
+- Check that scale down or node decommissioning does not cause data
+  loss.
+
+The validation checks should be implemented at the CI/CD level: when a
+new patch is published, a set of gates is started where the validation
+logic is implemented; based on the gates configuration they may or may
+not block the patch from promotion to staging or production.
+
+Scale up
+~~~~~~~~
+
+The user assigns a role to a new node in the configuration file; after
+the changes are committed to the branch, CI/CD runs the Ansible
+playbooks.
+
+Master
+^^^^^^
+
+1. Deploy an additional master node.
+
+2. Ensure that after the new component is deployed, it is available via
+   endpoints.
+
+Minion
+^^^^^^
+
+1. Deploy an additional minion node.
+
+2. Enable workload scheduling on the new node.
+
+Scale down
+~~~~~~~~~~
+
+Scale down can also be described as node deletion. During scale down the
+user should remove the node from the configuration file and add the node
+for decommissioning.
+
+Master
+^^^^^^
+
+1. Run Ansible against the cluster to make sure that the node being
+   deleted is not present in any service's configuration.
+
+2. Run node decommissioning.
+
+Minion
+^^^^^^
+
+1. Disable scheduling to the minion being deleted.
+
+2. Move workloads away from the minion.
+
+3. Run decommissioning of services managed by Ansible (see section
+   `Installation`_).
+
+4. Run node decommissioning.
+
+Test Plan
+~~~~~~~~~
+
+- Initial deploy
+
+  Tests must verify that the Kubernetes cluster has all required
+  services and is generally functional in terms of standard
+  operations, e.g. adding and removing a pod, a service and other
+  entities.
+
+- Scaleup
+
+  Verify that a Master node and a Minion node can be added to
+  the cluster. The cluster must remain functional in the terms
+  defined above after the scaleup operation.
+
+- Scaledown
+
+  Verify that the cluster retains its functionality after
+  removing a Master or Minion node. This test set is subject
+  to additional limitations on the number of removed nodes,
+  since there is an absolute minimum of nodes required for a
+  Kubernetes cluster to function.
+
+- Update
+
+  Verify that updating a single service or a set thereof
+  doesn't degrade functions of the cluster compared to
+  its initial deploy state.
+
+  - Intrusive
+
+  - Non-intrusive
+
+- Rollback
+
+  Verify that restoring the version of one or more components to a
+  previously working state after they were updated does not
+  lead to degradation of functions of the cluster.
+
+- Rollout abort
+
+  Verify that if a Rollback operation is aborted, the cluster
+  can be reverted to a working state by resuming the operation.
+
+Updating
+--------
+
+Updating is one of the most complex Lifecycle Management use-cases,
+which is why a dedicated section is allocated to it. Update use-cases
+are split into two groups. The first group, “Non-intrusive”, is the
+simplest one: updates of services which do not cause workload downtime.
+The second, “Intrusive”, is more complicated, since it may cause
+workload downtime and has to involve a sequence of actions in order to
+move stateful workload to a different node in the cluster.
+
+The update procedure starts with publishing a new version of an image in
+the Docker repository. Then the service's metadata should be updated to
+the new version by the operator of the cloud in the staging or production
+branch of the configuration repository for the Kubernetes cluster.
+
+Non-intrusive
+~~~~~~~~~~~~~
+
+A non-intrusive update does not cause workload downtime, hence it
+does not require workload migration.
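+
+For services managed by Kubelet, a non-intrusive update boils down to
+changing the image tag in the static pod manifest; Kubelet notices the
+changed file and recreates the pod. A simplified, illustrative fragment
+(the registry, image name and tag are assumptions) might look like::
+
+    # /etc/kubernetes/manifests/kube-apiserver.manifest (fragment)
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: kube-apiserver
+      namespace: kube-system
+    spec:
+      hostNetwork: true
+      containers:
+      - name: kube-apiserver
+        # bumping this tag is the whole non-intrusive update
+        image: registry.example.org/hyperkube:v1.4.1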
+
+Master
+^^^^^^
+
+Update of Master nodes with minimal downtime can be achieved if
+Kubernetes is installed in HA mode, with a minimum of 3 nodes.
+
+Key points in updating Master related services:
+
+- The first action which has to be run prior to the update is a backup
+  of Kubernetes related stateful services (in our case it is etcd).
+
+- Update of services managed by Ansible is done by ensuring the version
+  of the running Docker image and updating it in systemd and related
+  services.
+
+- Update of services managed by Kubelet is done by ensuring the files
+  with the Pod description contain the specific version.
+
+- Nodes have to be updated one-by-one, without restarting services on
+  all nodes simultaneously.
+
+Minion
+^^^^^^
+
+Key points in updating Minion nodes, where the workload runs:
+
+- Prior to restarting Kubelet, Kubernetes has to be notified that
+  Kubelet is under maintenance and
+  its workload must not be rescheduled to a different node.
+
+- Update of Kubelet should be managed by Ansible.
+
+- Update of services managed by Kubelet is done by ensuring the files
+  with the Pod description are up to date.
+
+Intrusive
+~~~~~~~~~
+
+An intrusive update is an update which may cause workload downtime;
+a separate update flow has to be considered for this kind of update. In
+order to provide an update with minimal downtime for the tenant, we want
+to leverage VM Live Migration capabilities. Migration requires starting
+the maintenance procedure in the right order, in batches of specific
+sizes.
+
+Common
+^^^^^^
+
+- Services managed by Ansible are updated using Ansible playbooks,
+  which trigger a pull of the new version and a restart.
+
+- If a service is managed by Kubelet, Ansible only updates the static
+  manifest, and Kubelet automatically updates the services it manages.
+
+Master
+^^^^^^
+
+Since the Master node does not run user workload, the key points for
+its update are the same as for the “Non-intrusive” use-cases.
+
+Minion
+^^^^^^
+
+User’s workload runs on Minion nodes; in order to apply intrusive
+updates, the rollout system has to move the workload to a different
+node. On big clusters, batch updates will be required to achieve faster
+rollout.
+
+Key requirements for the Kubernetes installer and orchestrator:
+
+- The Kubernetes installer is agnostic of which workloads run in the
+  Kubernetes cluster and in VMs on top of OpenStack, which works as a
+  Kubernetes application.
+
+- The Kubernetes installer should receive a rollout plan, where the
+  order and the grouping of nodes whose update can be rolled out in
+  parallel are defined. This update plan will be generated by a
+  different tool, which knows “something” about the types of workload
+  run on the cluster.
+
+- In order to move workload to a different node, the installer has to
+  trigger workload evacuation from the node.
+
+  - Scheduling of new workload to the node should be disabled.
+
+  - The node has to be considered as being in maintenance mode, so that
+    unavailability of kubelet does not cause workload
+    rescheduling.
+
+  - The installer has to trigger workload evacuation in kubelet; kubelet
+    should use hooks defined in Pods to start workload migration.
+
+- In the rollout plan it should be possible to specify when to fail the
+  rollout procedure.
+
+  - If some percentage of nodes failed to update.
+
+  - There may be nodes whose failure is critical; it is important to
+    provide per-node configuration specifying whether to stop the
+    rollout procedure if this node fails to be updated.
+
+Limitations
+~~~~~~~~~~~
+
+Hyperkube
+^^^^^^^^^
+
+The current Kubernetes delivery mechanism relies on the Hyperkube
+distribution.
+Hyperkube is a single binary file which contains all set of core +Kubernetes components, e.g. API, Scheduler, Controller, etc. The problem +with this approach is that bug-fix for API causes update of all core +Kubernetes containers, even if API is installed on few controllers, new +version has to be rolled out to all thousands of minions. + +Possible solutions: + +- For different roles rollout different versions of Hyperkube. This + approach significantly complicates versions and fixes tracking + process. + +- Make split between those roles and create for them different images. + The problem will remain since most of the core components are + developed in a single repository and released together, hence it + is still an issue, if release tag is published on the repo, + rebuild of all core components will be required. + +For now we go with native way of distribution until better solution is +found. + +Update Configuration +~~~~~~~~~~~~~~~~~~~~ + +Update of configurations in most of the cases should not cause downtime. + +- Update of Kubernetes and related services (calico, etcd, etc). + +- Rotation of SSL certificates (e.g. those which are used for Kubelet + authentication) + +Abort Rollout +~~~~~~~~~~~~~ + +Despite the fact that this operation may be dangerous, user should be +able to interrupt update procedure. + +Rollback +~~~~~~~~ + +Some of the operations are impossible to rollback, rollback may require +to have different flow of actions to be executed on the cluster. + +Troubleshooting +--------------- + +There should be a simple way to provide for a developer tooling for +debugging and troubleshooting. These tools should not be installed on +each machine by default, but there should be a simple way to get this +tools installed on demand. + +- Image with all tools required for debugging + +- Container should be run in privileged mode with host networking. + +- User can rollout this container to required nodes using Ansible. + +Example of tools which may be required: + +- Sysdig + +- Tcpdump + +- Strace/Ltrace + +- Clients for etcd, calico etc + +- ... + +Open questions +-------------- + +- Networking node? + +Related links +------------- + +- `Keepalived based VIP managament for Kuberentes + `__ + +- `HA endpoints for K8s in Kargo + `__ + +- `Large deployments in Kargo + `__ + +- `ECMP load balancing for external IPs + `__ + +Contributors +------------ + +- Evgeny Li + +- Matthew Mosesohn + +- Bogdan Dobrelya + +- Jedrzej Nowak + +- Vladimir Eremin + +- Dmytriy Novakovskiy + +- Michael Korolev + +- Alexey Shtokolov + +- Mike Scherbakov + +- Vladimir Kuklin + +- Sergii Golovatiuk + +- Aleksander Didenko + +- Ihor Dvoretskyi + +- Oleg Gelbukh + +Appendix A. High Availability Alternatives +------------------------------------------ + +This section contains some High Availability options that were +considered and researched, but deemed too complicated or too +risky to implement in the first iteration of the project. + +Option #1 VIP for external and internal with native etcd proxy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First approach to Highly Available Kubernetes with Kargo assumes +using VIP for external and internal access to Kubernetes API, etcd proxy +for internal access to etcd cluster. + +- VIP for external and internal access to Kubernetes API. + +- VIP for external access to etcd. + +- Native etcd proxy on each node for internal access to etcd cluster. 
+ +|image1| + +Option #2 VIP for external and Proxy on each node for internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The second considered option is each node that needs to access +Kubernetes API also has Proxy Server installed. Each Proxy forwards +traffic to alive Kubernetes API backends. External clients access +Etcd and Kubernetes API using VIP. + +- Internal access to APIs is done via proxies which are installed + locally. + +- External access is done via Virtual IP address. + +|image2| + +Option #3 VIP for external Kubernetes API on each node +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another similar to “VIP for external and Proxy on each node for +internal” option, may be to install Kubernetes API on each node which +requires access to it instead of installing Proxy which forwards the +traffic to Kubernetes API on master nodes. + +- VIP on top of proxies for external access. + +- Etcd proxy on each node for internal services. + +- Kubernetes API on each node, where access to Kubernetes is required. + +**This option was selected despite potential limitations listed +above.** + +|image3| + +Option #4 VIP for external and internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to achieve High Availability of Kubernetes master proxy server +on every master node can be used, each proxy is configured to forward +traffic to all available backends in the cluster (e.g. etcd, +kubernetes-api), also there has to be a mechanism to achieve High +Availability between these proxies, it can be achieved by VIP managed by +cluster management system (see “High Availability between proxies” +section). + +- Internal and External access to Etcd or Kubernetes cluster is done + via Virtual IP address. + +- Kubernetes API also access to Etcd using VIP. + +|image4| + +Option #5 VIP for external native Kubernetes proxy for internal +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We considered using native Kubernetes proxy for forwarding traffic +between APIs. Kubernetes proxy cannot work without Kubernetes API, hence +it should be installed on each node, where Kubernetes proxy is +installed. If Kubernetes API is installed on each node, there is no +reason to use Kubernetes proxy to forward traffic with it, internal +client can access the Kubernetes API through localhost. + +.. |image0| image:: media/k8s_1000_nodes/image07.png + :width: 3.36979in + :height: 1.50903in +.. |image1| image:: media/k8s_1000_nodes/image09.png + :width: 6.37500in + :height: 4.01389in +.. |image2| image:: media/k8s_1000_nodes/image08.png + :width: 6.37500in + :height: 4.13889in +.. |image3| image:: media/k8s_1000_nodes/image11.png + :width: 6.37500in + :height: 4.59722in +.. 
|image4| image:: media/k8s_1000_nodes/image03.png + :width: 6.37500in + :height: 4.12500in diff --git a/doc/source/design/media/k8s_1000_nodes/image03.png b/doc/source/design/media/k8s_1000_nodes/image03.png new file mode 100644 index 00000000..7129757e Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image03.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image07.png b/doc/source/design/media/k8s_1000_nodes/image07.png new file mode 100644 index 00000000..94bb30d3 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image07.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image08.png b/doc/source/design/media/k8s_1000_nodes/image08.png new file mode 100644 index 00000000..8190b027 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image08.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image09.png b/doc/source/design/media/k8s_1000_nodes/image09.png new file mode 100644 index 00000000..9bf2753f Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image09.png differ diff --git a/doc/source/design/media/k8s_1000_nodes/image10.gif b/doc/source/design/media/k8s_1000_nodes/image10.gif new file mode 100644 index 00000000..66a11fad Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image10.gif differ diff --git a/doc/source/design/media/k8s_1000_nodes/image11.png b/doc/source/design/media/k8s_1000_nodes/image11.png new file mode 100644 index 00000000..37c4a0d1 Binary files /dev/null and b/doc/source/design/media/k8s_1000_nodes/image11.png differ diff --git a/doc/source/index.rst b/doc/source/index.rst index 726f4403..baa9ffbf 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -67,6 +67,7 @@ Design docs design/ost_compute_on_k8s design/ref_arch_100_nodes design/ref_arch_1000_nodes + design/k8s_1000_nodes_architecture Indices and tables ------------------