diff --git a/doc/source/galera.rst b/doc/source/galera.rst
new file mode 100644
index 00000000..bc9605e3
--- /dev/null
+++ b/doc/source/galera.rst
@@ -0,0 +1,286 @@
+.. _galera:
+
+==================
+MySQL Galera Guide
+==================
+
+This guide provides an overview of the Galera implementation in CCP.
+
+Overview
+~~~~~~~~
+
+Galera Cluster is a synchronous multi-master database cluster, based on
+synchronous replication and MySQL/InnoDB. When Galera Cluster is in use, you
+can direct reads and writes to any node, and you can lose any individual node
+without interruption in operations and without the need to handle complex
+failover procedures.
+
+CCP implementation details
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Entrypoint script
+-----------------
+
+To handle all required logic, CCP has a dedicated entrypoint script for
+Galera and its side-containers. Because of that, Galera pods are slightly
+different from the rest of the CCP pods. For example, the Galera container
+still uses the CCP global entrypoint, but it executes the Galera entrypoint,
+which runs MySQL and handles all required logic, such as bootstrapping,
+failure detection, etc.
+
+Galera pod
+----------
+
+Each Galera pod consists of 3 containers:
+
+* galera
+* galera-checker
+* galera-haproxy
+
+**galera** - a container which runs Galera itself.
+
+**galera-checker** - a container with the galera-checker script. It is used
+to check readiness and liveness of the Galera node.
+
+**galera-haproxy** - a container with a haproxy instance.
+
+.. NOTE:: More info about each container is available in the
+   "Galera containers" section.
+
+Etcd usage
+----------
+
+The current implementation uses etcd to store the cluster state. The default
+etcd root directory is ``/galera/k8scluster``.
+
+Additional keys and directories are:
+
+* **leader** - key with the IP address of the current leader. The leader is
+  just a single, randomly chosen Galera node, which haproxy will use as its
+  backend.
+* **nodes/** - directory with the current Galera nodes. Each node key is
+  named after the IP address of the node and its value is the Unix time of
+  the key creation.
+* **queue/** - directory with the Galera nodes currently waiting in the
+  recovery queue. This is needed to ensure that all nodes are ready before
+  looking for the node with the highest seqno. Each node key is named after
+  the IP address of the node and its value is the Unix time of the key
+  creation.
+* **seqno/** - directory with the current Galera nodes' seqnos. Each node key
+  is named after the IP address of the node and its value is the seqno of the
+  node's data.
+* **state** - key with the current cluster state. Can be "STEADY", "BUILDING"
+  or "RECOVERY".
+* **uuid** - key with the current uuid of the Galera cluster. If a new node
+  has a different uuid, this indicates that we have a split brain situation.
+  Nodes with the wrong uuid will be destroyed.
+
+Galera containers
+~~~~~~~~~~~~~~~~~
+
+galera
+------
+
+This container runs the Galera daemon and handles all the bootstrapping,
+reconnection and recovery logic.
+
+At the start of the container, it checks for the ``init.ok`` file in the
+Galera data directory. If this file doesn't exist, it removes all files from
+the data directory and runs the MySQL initialization to create the base MySQL
+data files; after that it starts the mysqld daemon without networking and
+sets the needed permissions for the expected users.
+
+If the ``init.ok`` file is found, it runs ``mysqld_safe --wsrep-recover``
+to recover the Galera-related information and write it to the
+``grastate.dat`` file.
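+
+For illustration, a recovered ``grastate.dat`` usually looks roughly like
+this (the uuid and seqno values below are placeholders, not real data):
+
+::
+
+    # GALERA saved state
+    version: 2.1
+    uuid:    <cluster uuid>
+    seqno:   <recovered seqno>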
+
+After that, it checks the cluster state and, depending on the current state,
+chooses the required scenario.
+
+galera-checker
+--------------
+
+This container is used for liveness and readiness checks of the Galera pod.
+
+To check whether the Galera pod is ready, it verifies the following:
+
+#. wsrep_local_state_comment = "Synced"
+#. wsrep_evs_state = "OPERATIONAL"
+#. wsrep_connected = "ON"
+#. wsrep_ready = "ON"
+#. wsrep_cluster_state_uuid = uuid stored in etcd
+
+To check whether the Galera pod is alive, it checks the following:
+
+#. If the current cluster state is not "STEADY" - it skips the liveness
+   check.
+#. If it detects that an SST sync is in progress - it skips the liveness
+   check.
+#. If it detects that there is no MySQL pid file yet - it skips the liveness
+   check.
+#. If the node's "wsrep_cluster_state_uuid" differs from the one in etcd - it
+   kills the Galera container, since it's a "split brain" situation.
+#. If "wsrep_local_state_comment" is "Joined", and the previous state was
+   "Joined" too - it kills the Galera container, since it can't finish
+   joining the cluster for some reason.
+#. If any exception is caught during the checks - it kills the Galera
+   container.
+
+If all checks pass, the Galera pod is considered alive.
+
+galera-haproxy
+--------------
+
+This container runs the haproxy daemon, which is used to send all traffic
+to a single Galera pod.
+
+This is needed to avoid deadlocks and stale reads. A "leader" is chosen out
+of all available Galera pods, and once the leader is chosen, all haproxy
+instances update their configuration with the new leader.
+
+Supported scenarios
+~~~~~~~~~~~~~~~~~~~
+
+Initial bootstrap
+-----------------
+
+In this scenario, there is no working Galera cluster yet. Each node tries to
+acquire the lock in etcd; the first one to get it starts the cluster
+bootstrapping. After it's done, the next node acquires the lock and connects
+to the existing cluster.
+
+.. NOTE:: During the bootstrap, the state of the cluster will be "BUILDING".
+   It will be changed to "STEADY" after the last node connects.
+
+Re-connecting to the existing cluster
+-------------------------------------
+
+In this scenario, the Galera cluster is already available. In most cases this
+will be a node re-connecting after some failure, such as a node reboot. Each
+node tries to acquire the lock in etcd; once the lock is acquired, the node
+connects to the existing cluster.
+
+.. NOTE:: During this scenario the state of the cluster will be "STEADY".
+
+Recovery
+--------
+
+This scenario can be triggered in two ways:
+
+* The operator manually sets the cluster state in etcd to "RECOVERY".
+* A new node does a few checks before bootstrapping; if it finds that the
+  cluster state is "STEADY", but there are zero nodes in the cluster, it
+  assumes that the cluster has been destroyed somehow and recovery is needed.
+  In that case, it sets the state to "RECOVERY" and starts the recovery
+  scenario.
+
+During the recovery scenario, cluster bootstrapping differs from the
+"Initial bootstrap". In this scenario, each node looks up its "seqno", which
+is basically the number of registered transactions. The node with the highest
+seqno bootstraps the cluster and the other nodes join it, so in the end we
+will have the latest data available before the cluster destruction.
+
+.. NOTE:: During the bootstrap, the state of the cluster will be "RECOVERY".
+   It will be changed to "STEADY" after the last node connects.
+
+There is an option to manually choose the node to recover data from. For
+details, please see the "Force bootstrap" section in "Advanced features".
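+
+As an illustration, during recovery you can watch the seqno values that the
+nodes have reported to etcd (this assumes the default etcd root and the
+**ccp** namespace, as in the "Troubleshooting" section; NODE_IP below is a
+placeholder):
+
+::
+
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -p /galera/k8scluster/seqno
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/seqno/NODE_IP
+
+The node whose key holds the highest seqno is the one that will bootstrap the
+cluster.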
+
+Advanced features
+~~~~~~~~~~~~~~~~~
+
+Cluster size
+------------
+
+By default, the Galera cluster size is 3 nodes, which is optimal for most
+cases. If you want to change it to some custom number, you need to override
+the **cluster_size** variable in the **percona** tree, for example:
+
+::
+
+    configs:
+      percona:
+        cluster_size: 5
+
+.. NOTE:: Cluster size should be an odd number. A cluster size of more than 5
+   nodes will lead to high latency for write operations.
+
+Force bootstrap
+---------------
+
+Sometimes operators may want to manually specify the Galera node which
+recovery should be done from. In that case, you need to override the
+**force_bootstrap** variable in the **percona** tree, for example:
+
+::
+
+    configs:
+      percona:
+        force_bootstrap:
+          enabled: true
+          node: NODE_NAME
+
+**NODE_NAME** should be the name of the k8s node which runs the Galera node
+with the required data.
+
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+Galera operation requires some advanced knowledge of MySQL and of general
+clustering concepts. In most cases, we expect that Galera will "self-heal"
+itself, in the worst case via a restart, full resync and reconnection to the
+cluster.
+
+Our readiness and liveness scripts should cover this and not allow a
+misconfigured or non-operational node to receive production traffic.
+
+Yet it's possible that some failure scenarios are not covered, and fixing
+them may require some manual actions.
+
+Check the logs
+--------------
+
+Each container of the Galera pod writes detailed logs to stdout. You can
+read them via ``kubectl logs POD_NAME -c CONT_NAME``. Make sure you check
+both the ``galera`` container logs and the ``galera-checker`` ones.
+
+Additionally, you should check the MySQL log in
+``/var/log/ccp/mysql/mysql.log``.
+
+Check the etcd state
+--------------------
+
+Galera keeps its state in etcd, and it can be useful to check what is
+going on in etcd right now. Assuming that you're using the **ccp**
+namespace, you can check the etcd state using these commands:
+
+::
+
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p --sort /galera
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/state
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/leader
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/uuid
+
+Node restart
+------------
+
+In most cases, it should be safe to restart a single Galera node. If you need
+to do it for some reason, just delete the pod via kubectl:
+
+::
+
+    kubectl delete pod POD_NAME
+
+Full cluster restart
+--------------------
+
+In some cases, you may need to restart the whole cluster. Make sure you have
+a backup before doing this. To do this, set the cluster state to "RECOVERY":
+
+::
+
+    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 set /galera/k8scluster/state RECOVERY
+
+After that, restart all Galera pods:
+
+::
+
+    kubectl delete pod POD1_NAME POD2_NAME POD3_NAME
+
+Once that is done, the Galera cluster will be rebuilt and should be
+operational.
+
+.. NOTE:: For more info about cluster recovery, please refer to the
+   "Supported scenarios" section.
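+
+As a quick sanity check after the restart, you can query the wsrep status
+variables from inside the ``galera`` container. This is only a sketch: it
+assumes that the ``mysql`` client inside the container can connect locally
+without extra credentials:
+
+::
+
+    kubectl exec POD_NAME -c galera -- mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
+    kubectl exec POD_NAME -c galera -- mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
+
+A cluster size equal to the configured **cluster_size** and a local state of
+"Synced" on every node indicate that the cluster is healthy.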
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 205d76fd..8563b2ee 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -25,6 +25,7 @@ Advanced topics
    :maxdepth: 1
 
    deploying_multiple_parallel_environments
+   galera
    ceph
    ceph_cluster
    using_calico_instead_of_ovs