Cluster recovery sometimes failed because we only looked for the
sequence number in the last 200 lines of the log file.
Fix this by ingesting the complete file and registering only the last
sequence number we find.
Closes-Bug: 1821173
Change-Id: Iea2661c9d5d262cf99edd5f5b567f252607a0003
Signed-off-by: Sven Kieske <kieske@osism.tech>
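A minimal Python sketch of the approach described above: read the complete log and keep only the last sequence number found. The function name is hypothetical, and the line format 'WSREP: Recovered position: <UUID>:<seqno>' is an assumption about the log contents. The actual change is made in the Ansible tasks, not in Python.

```python
import re
from typing import Optional

# Assumed log line format: "WSREP: Recovered position: <UUID>:<seqno>"
_POSITION = re.compile(r"WSREP: Recovered position: [0-9a-fA-F-]+:(-?\d+)")


def last_recovered_seqno(log_text: str) -> Optional[int]:
    """Return the last sequence number found in the whole log, or None."""
    seqno = None
    for line in log_text.splitlines():
        match = _POSITION.search(line)
        if match:
            # Later matches overwrite earlier ones, so the last one wins.
            seqno = int(match.group(1))
    return seqno
```

Because every line is scanned, a recovered position that sits more than 200 lines from the end of the file is still found.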
Rename the Ansible module kolla_docker to
kolla_container.
Change-Id: I13c676ed0378aa721a21a1300f6054658ad12bc7
Signed-off-by: Martin Hiner <m.hiner@partner.samsung.com>
Setting docker_restart_policy to 'no' prevents systemd units from being
created, and we use that setting in CI to disable service restarts.
Introduce a oneshot policy that skips systemd unit creation for oneshot
containers (those that run bootstrap tasks, such as the db bootstrap,
and do not need a systemd unit), while still creating systemd units
with Restart=No for long-lived containers.
Change-Id: I9e0d656f19143ec2fcad7d6d345b2c9387551604
This change adds basic deployment based on the Podman
container manager as an alternative to Docker.
Signed-off-by: Ivan Halomi <i.halomi@partner.samsung.com>
Signed-off-by: Martin Hiner <m.hiner@partner.samsung.com>
Signed-off-by: Petr Tuma <p.tuma@partner.samsung.com>
Change-Id: I2b52964906ba8b19b8b1098717b9423ab954fa3d
Depends-On: Ie4b4c1cf8fe6e7ce41eaa703b423dedcb41e3afc
First part of the patchset
https://review.opendev.org/c/openstack/kolla-ansible/+/799229/
where it was suggested to split the patch into smaller ones.
This uses the kolla_container_engine variable
in docker command calls, so that it can later
be used for podman without further changes.
Signed-off-by: Ivan Halomi <i.halomi@partner.samsung.com>
Change-Id: Ic30b67daa2e215524096ad1f4385c569e3d41b95
Kolla-ansible currently installs a mariadb
cluster on the hosts defined in group['mariadb']
and renders the haproxy configuration for these hosts.
This is not enough if a user wants to have several
service databases in several mariadb clusters (shards).
Spreading service databases across multiple clusters (shards)
is especially useful for databases with high load
(neutron, nova).
How does it work?
It works exactly as before, but the group 'mariadb'
is now the group in which all mariadb clusters (shards)
are located, and each mariadb cluster is installed to a
dynamic group created by group_by and the host variable
'mariadb_shard_id'.
It also adds a special user 'shard_X' which is used
for creating users and databases, but only if haproxy
is not used as the load-balancing solution.
This patch does not affect users who have all databases
on the same db cluster on hosts in group 'mariadb'; the host
variable 'mariadb_shard_id' is set to 0 if not defined.
The mariadb task in loadbalancer.yml (haproxy) configures the
hosts of the default mariadb shard as haproxy backends. If the mariadb
role is used to install several clusters (shards), only the
default one is load-balanced via haproxy.
Mariadb backup works only for the default shard (cluster)
when haproxy is used as the mariadb load balancer; if proxysql
is used, all shards are backed up.
Once this patch is merged, the way is open for proxysql
patches which will implement L7 SQL balancing based on
users and schemas.
Example of inventory:
[mariadb]
server1
server2
server3 mariadb_shard_id=1
server4 mariadb_shard_id=1
server5 mariadb_shard_id=2
server6 mariadb_shard_id=3
Extra:
wait_for_loadbalancer is removed instead of modified as its role
is served by check already. The relevant refactor is applied as
well.
Change-Id: I933067f22ecabc03247ea42baf04f19100dffd08
Co-Authored-By: Radosław Piliszek <radoslaw.piliszek@gmail.com>
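The shard grouping in the example inventory above can be sketched in Python as follows. This is only an illustration: the function name and the shape of the `hostvars` mapping are hypothetical, and the change itself creates the groups with Ansible's group_by rather than with code like this.

```python
from collections import defaultdict
from typing import Dict, List, Mapping

# Per the commit message: hosts without mariadb_shard_id belong to shard 0.
DEFAULT_SHARD_ID = 0


def group_by_shard(hostvars: Mapping[str, Mapping[str, object]]) -> Dict[int, List[str]]:
    """Map each shard id to the hosts that belong to it (hypothetical helper)."""
    shards = defaultdict(list)
    for host, variables in hostvars.items():
        shard_id = int(variables.get("mariadb_shard_id", DEFAULT_SHARD_ID))
        shards[shard_id].append(host)
    return dict(shards)


# The example inventory from the commit message, expressed as host variables.
example_inventory = {
    "server1": {},
    "server2": {},
    "server3": {"mariadb_shard_id": 1},
    "server4": {"mariadb_shard_id": 1},
    "server5": {"mariadb_shard_id": 2},
    "server6": {"mariadb_shard_id": 3},
}
```

Calling `group_by_shard(example_inventory)` puts server1 and server2 in the default shard 0, server3 and server4 in shard 1, server5 in shard 2 and server6 in shard 3.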
In some cases a negative seqno has to be compared, but the task
does not support that, so we need to make it work. For example:
1. We use mariabackup to restore data on control1, delete the
mariadb data on control2 and control3, and then run cluster recovery;
as a result, the seqno of the other two nodes is '-1'.
2. We add one more control node to an existing mariadb cluster
and then run cluster recovery; the seqno of the new node is '-1'.
Change-Id: Ic1ac8656f28c3835e091637014f075ac5479d390
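A small Python sketch of a comparison that tolerates negative values, assuming the seqno of each node has been collected as a string keyed by host name. The function name and the host names in the usage note are illustrative only; the real comparison lives in the recovery playbook.

```python
from typing import Dict


def host_with_highest_seqno(seqnos: Dict[str, str]) -> str:
    """Return the host whose seqno is highest, comparing as signed integers."""
    if not seqnos:
        raise ValueError("no seqno values to compare")
    # int() accepts a leading minus sign, so '-1' sorts below '0'.
    return max(seqnos, key=lambda host: int(seqnos[host]))
```

With `{"control1": "1523", "control2": "-1", "control3": "-1"}` the function selects control1, which matches the first scenario described above.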
Mariadb recovery fails if a cluster has previously been deployed, but any of
the mariadb containers do not exist.
Steps to reproduce
==================
* Deploy a mariadb galera cluster
* Remove the mariadb container from at least one host (docker rm -f mariadb)
* Run kolla-ansible mariadb_recovery
Expected results
================
The cluster is recovered, and a new container deployed where necessary.
Actual results
==============
The task 'Stop MariaDB containers' fails on any host where the container does
not exist.
Solution
========
This change fixes the issue by using the 'ignore_missing' flag for kolla_docker
with the stop_container action. This means the task does not fail when the
container does not exist. It is also necessary to swap some 'docker cp'
commands for 'cp' on the host, using the path to the volume.
Closes-Bug: #1907658
Change-Id: Ibd4a6adeb8443e12c45cbab65f501392ffb16fc7
The mariadb container name is hard-coded in some places, while the
defaults directory defines it through the container_name variable.
If the mariadb container_name variable is changed for a deployment,
those places do not pick it up and keep using the fixed name
'mariadb'.
Change-Id: Ie8efa509953d5efa5c3073c9b550be051a7f4f9b
These affected both deploy (and reconfigure) and upgrade,
resulting in WSREP issues, failed deploys or the need to
recover the cluster.
This patch makes sure k-a does not abruptly terminate
nodes and thereby break the cluster.
This is achieved by cleaner separation between stages
(bootstrap, restart current, deploy new) and 3 phases
for restarts (to keep the quorum).
Upgrade actions, which operate on a healthy cluster,
were moved to their own section.
Service restart was refactored.
We no longer rely on the master/slave distinction as
all nodes are masters in Galera.
Closes-bug: #1857908
Closes-bug: #1859145
Change-Id: I83600c69141714fc412df0976f49019a857655f5
As part of the effort to implement Ansible code linting in CI
(using ansible-lint) - we need to implement recommendations from
ansible-lint output [1].
One of them is to stop using local_action in favor of delegate_to,
to increase readability and match the style of typical Ansible
tasks.
[1]: https://review.opendev.org/694779/
Partially implements: blueprint ansible-lint
Change-Id: I46c259ddad5a6aaf9c7301e6c44cd8a1d5c457d3
After performing a recovery of MariaDB, the mariadb containers are left
without a restart policy. This leaves them unable to recover from the
crash of a single galera node. There is another issue, in that the
'master' node is left in a bootstrap configuration, with the
--wsrep-new-cluster argument configured as BOOTSTRAP_ARGS.
This change fixes these issues by removing the restart policy of 'no'
from the 'slave' containers, and recreating the master container without
the restart policy or bootstrap arguments.
Change-Id: I36c875611931163ca2c29ae93b71d3af64cb197c
Closes-Bug: #1851594
Explicitly wait for the database to be accessible via the load balancer.
Sometimes it can reject connections even when all database services are up,
possibly due to the health check polling in HAProxy.
Closes-Bug: #1840145
Change-Id: I7601bb710097a78f6b29bc4018c71f2c6283eef2
Docker has no restart policy named 'never'. It has 'no'.
This has bitten us already (see [1]) and might bite us again whenever
we want to change the restart policy to 'no'.
This patch makes our docker integration honor all valid restart policies
and only valid restart policies.
All relevant docker restart policy usages are patched as well.
I added some FIXMEs which are relevant to the kolla-ansible docker
integration. They are not fixed here so as not to alter behavior.
[1] https://review.opendev.org/667363
Change-Id: I1c9764fb9bbda08a71186091aced67433ad4e3d6
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
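As an illustration of "all valid restart policies and only valid restart policies", here is a minimal Python sketch of such a check. The four policy names are the ones Docker accepts; the function and constant names are hypothetical and do not reflect the actual code of the kolla-ansible docker integration.

```python
# Restart policy names accepted by Docker. Note that 'never' is not among them.
VALID_RESTART_POLICIES = ("no", "on-failure", "always", "unless-stopped")


def validate_restart_policy(policy: str) -> str:
    """Return the policy unchanged if Docker accepts it, else raise ValueError."""
    if policy not in VALID_RESTART_POLICIES:
        raise ValueError(
            "invalid restart policy %r, expected one of: %s"
            % (policy, ", ".join(VALID_RESTART_POLICIES))
        )
    return policy
```

Rejecting unknown names early means a value such as 'never' fails loudly instead of being passed on unnoticed.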
* Fix wsrep sequence number detection. Log message format is
'WSREP: Recovered position: <UUID>:<seqno>' but we were picking out
the UUID rather than the sequence number. This is as good as random.
* Add become: true to log file reading and removal since
I4a5ebcedaccb9261dbc958ec67e8077d7980e496 added become: true to the
'docker cp' command which creates it.
* Don't run handlers during recovery. If the config files change we
would end up restarting the cluster twice.
* Wait for wsrep recovery container completion (don't detach). This
avoids a potential race between wsrep recovery and the subsequent
'stop_container'.
* Finally, we now wait for the bootstrap host to report that it is in
an OPERATIONAL state. Without this we can see errors where the
MariaDB cluster is not ready when used by other services.
Change-Id: Iaf7862be1affab390f811fc485fd0eb6879fd583
Closes-Bug: #1834467
Many tasks that use Docker have become specified already, but
not all. This change ensures all tasks that use the following
modules have become:
* kolla_docker
* kolla_ceph_keyring
* kolla_toolbox
* kolla_container_facts
It also adds become for 'command' tasks that use docker CLI.
Change-Id: I4a5ebcedaccb9261dbc958ec67e8077d7980e496
Several config file permissions are incorrect on the host. In general,
files should be 0660, and directories and executables 0770.
Change-Id: Id276ac1864f280554e98b937f2845bb424d521de
Closes-Bug: #1821579
With more recent versions of Ansible, tests should be written
with "is" instead of the "|" filter syntax.
This change updates them.
Change-Id: I6fba56fca182349972e8b0ee5452b37aa4090e0c
Add become to all tasks that use the module "kolla_docker"
Change-Id: I4309c4011687b88ec31d739fd8f834fe2326ff10
Partial-Implements: blueprint ansible-specific-task-become
The regex used to find the recovered seqnum partition returns
None instead of the actual number.
The task then fails because seqnum[0] is not iterable.
Change-Id: I1be55b6ebfc17c6d423e638662ec2a9f4b9b49a2
Closes-Bug: #1752128
This patchset applies a yamllint test to all *.yml
files.
It also fixes syntax errors so that the jobs pass.
Change-Id: I3186adf9835b4d0cada272d156b17d1bc9c2b799
The purpose of this change is to improve upon
https://review.openstack.org/#/c/531122/
- Moved vars inside the defaults/main.yml file
- Made the regex for the lineinfile safer
Change-Id: Id581c0b36f3d4bd61d3627b8364b79296b967387
Closes-Bug: 1746567
Related-Bug: 1682153
In the recover_cluster.yaml playbook, the task to find the highest
seqno/Global Transaction ID no longer relies only on grastate.dat.
Instead, it now follows the recommendations from the Galera Cluster website:
http://galeracluster.com/documentation-webpages/restartingcluster.html
Closes-Bug: 1682153
Change-Id: I5fc3eaa8baee659576c4c39aef9cfd351c8e9af7
Added 'executable' argument to the shell action in the
'Comparing seqno value' task in the cluster recovery playbook.
Change-Id: I3e96a4a76b44ffb558b9a41cde16e66a8d0fab1a
Closes-Bug: #1729603
always_run is deprecated and will be removed in Ansible 2.4.
check_mode was introduced in Ansible 2.2 and Kolla-ansible has bumped Ansible
to 2.2.0, so it is safe to replace always_run with check_mode now.
Change-Id: Id1028d38b7bde30a6afe17b319dcdc77907914ab
Closes-Bug: #1643633
Implements: blueprint migrate-to-ansible-2-2-0
The check_mode option was introduced in Ansible 2.2.
Using it in our playbooks means that no version before
Ansible 2.2 can be used.
This reverts commit 529f202d00.
Change-Id: I3af96290443d760346264e6d994fd2a44de65543
Closes-Bug: #1644828
When all mariadb nodes are stopped gracefully, mariadb galera
writes its last executed position into the grastate.dat file. We need to
find the node with the largest seqno in that file and recover from that
node.
Closes-Bug: #1627717
Change-Id: I6e97c190eec99c966bffde0698f783e519ba14bd
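A short Python sketch of reading the seqno from the contents of a grastate.dat file. It assumes the file uses a "key: value" layout with a line that starts with "seqno:"; the function name is hypothetical, and the playbook itself may extract the value differently.

```python
import re
from typing import Optional

# Assumed layout: one "seqno: <number>" line, possibly padded with spaces.
_SEQNO_LINE = re.compile(r"^seqno:\s*(-?\d+)\s*$", re.MULTILINE)


def grastate_seqno(content: str) -> Optional[int]:
    """Return the seqno stored in grastate.dat content, or None if absent."""
    match = _SEQNO_LINE.search(content)
    return int(match.group(1)) if match else None
```

Running this on the file content of every node and taking the node with the largest result gives the node to recover from.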
The lightsout recovery patch broke multinode mysql. Also, the lightsout
recovery didn't properly pass the --wsrep-new-cluster flag. This
updates the mariadb bootstrap to work with multinode again.
Closes-Bug: #1559480
Related-Id: I903c3bcd069af39814bcabcef37684b1f043391f
Change-Id: I1ec91a8b2144930ea8f04cc1c201b53712352e4e
This playbook only matters for multinode since AIO can recover from
power outage without additional configuration.
DocImpact
Implements: blueprint mariadb-lights-out
Change-Id: I903c3bcd069af39814bcabcef37684b1f043391f