Refactor backup and restore docs

This patch refactors the current documentation on running
an Undercloud backup and restore, and
provides new documentation for the Overcloud control
plane backup and restore.

Change-Id: I8a8bbabed007fef5cbe9290db370b65464908475
Carlos Camacho 2018-02-15 15:06:32 +01:00
parent 0d717a47c3
commit c60a6f477a
7 changed files with 436 additions and 23 deletions


@ -0,0 +1,25 @@
TripleO backup and restore (Undercloud and Overcloud control plane)
===================================================================

This section describes a method to back up and restore both the Undercloud and the Overcloud
control plane.

The use cases for creating and restoring these backups are related to the
possible failures of a minor update or major upgrade, for both Undercloud and Overcloud.

The general approach to recovering from failures during the minor update or major upgrade workflow
is to fix the environment and restart services before re-running the last executed step.
There are specific cases in which rolling back to previous steps in the upgrade
workflow can lead to general failures in the system, e.g.
when executing `yum history` to roll back the upgrade of certain packages,
dependency resolution might select to remove critical packages like `systemd`.

.. toctree::
   :maxdepth: 2
   :includehidden:

   01_undercloud_backup
   02_overcloud_backup
   03_undercloud_restore
   04_overcloud_restore


@ -0,0 +1,64 @@
Backing up the Undercloud
=========================

To back up your Undercloud, you need to make sure a set of
files and databases is stored safely, so it can be used in
case of an issue when running the update or upgrade workflows.

The following sections describe how to
execute an Undercloud backup.
CLI driven backups
------------------

There is an automated way of creating an Undercloud backup.
This CLI option allows the operator to run a database and filesystem backup.
By default, all databases are included in the backup, as well as the folder `/home/stack`.

The command usage is::

    openstack undercloud backup [--add-path ADD_FILES_TO_BACKUP]

For example, we can run a full MySQL backup with additional paths as::

    openstack undercloud backup --add-path /etc/hosts \
                                --add-path /var/log/ \
                                --add-path /var/lib/glance/images/ \
                                --add-path /srv/node/ \
                                --add-path /etc/

When executing the Undercloud backup via the OpenStack
CLI, the backup is first stored in the temporary folder
`/var/tmp/`.

After this operation, the result of the backup procedure
is stored in the swift container `undercloud-backups`,
and it will expire 24 hours after its creation.
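
To check or fetch the stored backup before it expires, the standard swift object commands can be used. A sketch; the exact object name depends on the backup run, and the name below is illustrative::

    # List the objects in the undercloud-backups container
    # (assumes the stackrc credentials are sourced)
    source ~/stackrc
    openstack object list undercloud-backups

    # Download a backup tarball to the current directory
    # (the object name below is illustrative)
    openstack object save undercloud-backups UC-backup-20180215.tar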
Manual backups
--------------

If the user needs to run the backup manually,
the following steps must be executed.

Database backups
~~~~~~~~~~~~~~~~

The operator needs to back up all databases on the Undercloud node::

    mysqldump --opt --single-transaction --all-databases > /root/undercloud-all-databases.sql
Filesystem backups
~~~~~~~~~~~~~~~~~~

The following files and directories should be backed up:

* MariaDB configuration file on the undercloud (so we can restore databases accurately).
* All glance image data in `/var/lib/glance/images`.
* All swift data in `/srv/node`.
* All data in the stack user's home directory.
* The database backup created in the previous step.

The following command can be used to perform a backup of all data from the undercloud node::

    tar -czf undercloud-backup-`date +%F`.tar.gz \
        /root/undercloud-all-databases.sql \
        /etc/my.cnf.d/server.cnf \
        /var/lib/glance/images \
        /srv/node \
        /home/stack \
        /etc/pki \
        /opt/stack
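
Before relying on the archive, it is worth checking that it can be read back. A minimal sketch, assuming the tarball created above is in the current directory::

    # List the archive contents without extracting; a non-zero exit
    # status means the archive is missing or corrupt
    tar -tzf undercloud-backup-`date +%F`.tar.gz > /dev/null && echo "archive OK"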


@ -0,0 +1,156 @@
Backing up the Overcloud control plane services
===============================================

This backup guide covers backing up services in an HA + containers deployment.

Prerequisites
-------------

The control plane services in the Overcloud need to be backed up. To do so, we
apply the same approach as for the Undercloud: run a backup of the databases
and create a filesystem backup.

Databases backup
----------------

MySQL backup
~~~~~~~~~~~~

When using HA, the operator can run the database backup on any controller node,
using the `--single-transaction` option when executing `mysqldump`.

If the deployment is using containers, the hieradata file containing the MySQL
root password is located in the folder `/var/lib/config-data/mysql/etc/puppet/hieradata/`.
The file containing the MySQL root password is `service_configs.json` and the key is
`mysql::server::root_password`.

Create a temporary folder to store the backups::

    sudo -i
    mkdir -p /var/tmp/mysql_backup/

Store the MySQL root password to be used in further queries::

    MYSQLDBPASS=$(cat /var/lib/config-data/mysql/etc/puppet/hieradata/service_configs.json | grep mysql | grep root_password | awk -F": " '{print $2}' | awk -F"\"" '{print $2}')
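
The grep/awk pipeline above is sensitive to formatting; since the hieradata is plain JSON, the key can also be extracted directly. A sketch, using an illustrative sample file in place of the real hieradata::

    # Sample file standing in for
    # /var/lib/config-data/mysql/etc/puppet/hieradata/service_configs.json
    # (the password value is illustrative)
    cat > /tmp/service_configs.json <<'EOF'
    {"mysql::server::root_password": "s3cr3t"}
    EOF

    # Pull the value of the mysql::server::root_password key
    MYSQLDBPASS=$(sed -n 's/.*"mysql::server::root_password": *"\([^"]*\)".*/\1/p' /tmp/service_configs.json)
    echo $MYSQLDBPASS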

Execute from any controller::

    mysql -uroot -p$MYSQLDBPASS -e "select distinct table_schema from information_schema.tables where engine='innodb' and table_schema != 'mysql';" \
    -s -N | xargs mysqldump -uroot -p$MYSQLDBPASS --single-transaction --databases > /var/tmp/mysql_backup/openstack_databases-`date +%F`-`date +%T`.sql

This will create a database backup called `/var/tmp/mysql_backup/openstack_databases-<date>.sql`.

Then back up all the user and permission information::

    mysql -uroot -p$MYSQLDBPASS -e "SELECT CONCAT('\"SHOW GRANTS FOR ''',user,'''@''',host,''';\"') FROM mysql.user where (length(user) > 0 and user NOT LIKE 'root')" \
    -s -N | xargs -n1 mysql -uroot -p$MYSQLDBPASS -s -N -e | sed 's/$/;/' > /var/tmp/mysql_backup/openstack_databases_grants-`date +%F`-`date +%T`.sql

This will create a database backup called `/var/tmp/mysql_backup/openstack_databases_grants-<date>.sql`.

MongoDB backup (only needed until Ocata)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since OpenStack Pike, MongoDB is no longer supported, so be sure to back up the data from
your telemetry backend.

If telemetry services are used, the data stored in the MongoDB instance needs to be backed up.

Connect to any controller and get the IP of the MongoDB primary instance::

    MONGOIP=$(cat /etc/mongod.conf | grep bind_ip | awk '{print $3}')

Now, create the backup::

    mkdir -p /var/tmp/mongo_backup/
    mongodump --oplog --host $MONGOIP --out /var/tmp/mongo_backup/

Be sure the files were created successfully.

Redis backup
~~~~~~~~~~~~

If telemetry services are used, the data stored in the Redis instance also needs to be backed up.

First, get the Redis endpoint: open `/var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg`
and find the bind IP in the `listen redis` section; it should be a string of the form
`bind <bind_IP:bind_port> transparent`::

    grep -A1 'listen redis' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg
    REDISIP=$(grep -A1 'listen redis' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | grep bind | awk '{print $2}' | awk -F":" '{print $1}')

Next, store the master auth password used to connect to the Redis cluster. The config file should be
`/var/lib/config-data/redis/etc/redis.conf` and the password is under the `masterauth` parameter.
Let's store it in a variable::

    REDISPASS=$(cat /var/lib/config-data/redis/etc/redis.conf | grep masterauth | grep -v \# | awk '{print $2}')

Check connectivity to the Redis cluster::

    redis-cli -a $REDISPASS -h $REDISIP ping

Now, create a database dump by executing::

    redis-cli -a $REDISPASS -h $REDISIP bgsave

The database backup will be stored in the
default directory, `/var/lib/redis/`.
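
`bgsave` runs in the background, so a way to confirm it finished is to check the `LASTSAVE` timestamp and the dump file itself. A sketch, using the variables set above::

    # LASTSAVE returns the unix timestamp of the last successful save;
    # it increases once the background save triggered above completes
    redis-cli -a $REDISPASS -h $REDISIP lastsave

    # The dump file lands in the default directory under the default name
    ls -l /var/lib/redis/dump.rdb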

Filesystem backup
-----------------

We need to back up all files that can be used to recover
from a possible failure in the Overcloud controllers when
executing a minor update or a major upgrade.

The option `--ignore-failed-read` is added to the `tar`
command because the list of files to back up might be
different in each environment, and the list of
paths to back up is made as general as possible.

The following folders should be backed up::

    mkdir -p /var/tmp/filesystem_backup/
    tar --ignore-failed-read \
        -zcvf /var/tmp/filesystem_backup/fs_backup-`date '+%Y-%m-%d-%H-%M-%S'`.tar.gz \
        /etc/nova \
        /var/log/nova \
        /var/lib/nova \
        --exclude /var/lib/nova/instances \
        /etc/glance \
        /var/log/glance \
        /var/lib/glance \
        /etc/keystone \
        /var/log/keystone \
        /var/lib/keystone \
        /etc/httpd \
        /etc/cinder \
        /var/log/cinder \
        /var/lib/cinder \
        /etc/heat \
        /var/log/heat \
        /var/lib/heat \
        /var/lib/heat-config \
        /var/lib/heat-cfntools \
        /etc/rabbitmq \
        /var/log/rabbitmq \
        /var/lib/rabbitmq \
        /etc/neutron \
        /var/log/neutron \
        /var/lib/neutron \
        /etc/corosync \
        /etc/haproxy \
        /etc/logrotate.d/haproxy \
        /var/lib/haproxy \
        /etc/openvswitch \
        /var/log/openvswitch \
        /var/lib/openvswitch \
        /etc/ceilometer \
        /var/lib/redis \
        /etc/sysconfig/memcached \
        /etc/gnocchi \
        /var/log/gnocchi \
        /etc/aodh \
        /var/log/aodh \
        /etc/panko \
        /var/log/panko \
        /var/log/ceilometer


@ -1,31 +1,17 @@
Restoring the Undercloud
========================

Restoring a backup of your Undercloud on a Fresh Machine
--------------------------------------------------------

The following restore process assumes you are recovering from a failed undercloud node where you have to reinstall it from scratch.
It assumes that the hardware layout is the same, and that the hostname and undercloud settings of the machine will be the same as well.

Once the machine is installed and in a clean state, re-enable all the subscriptions/repositories needed to install and run TripleO.
Note that unless specified otherwise, all commands are run as root.

Install the MariaDB server with::

    yum install -y mariadb-server
@ -59,7 +45,7 @@ We have to now install the swift and glance base packages, and then restore thei
    chown -R swift: /srv/node
    chown -R glance: /var/lib/glance/images
Finally, we rerun the undercloud installation from the stack user, making sure to run it in the stack user's home dir::
    su - stack
    sudo yum install -y python-tripleoclient


@ -0,0 +1,182 @@
Restoring the Overcloud control plane services
==============================================

Restoring the Overcloud control plane from a failed state
depends on the specific issue the operator is facing.
This section provides a restore method for
the backups created in the previous steps.

The general strategy for restoring an Overcloud control plane
is to get the services working again in order to
re-run the update/upgrade tasks.

YUM update rollback
-------------------

Depending on the updated packages, running a yum rollback
based on the `yum history` command might not be a good idea.
In the specific case of an OpenStack minor update or major upgrade,
it will be harder, as there will be several dependencies and packages
to downgrade, depending on the number of transactions yum had to run to upgrade
all the node packages.

Also, using `yum history` to roll back transactions
can end up removing packages needed for the
system to work correctly.
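
Before deciding whether a rollback is feasible at all, the transactions can at least be inspected. A sketch; the transaction ID below is illustrative::

    # List recent yum transactions
    sudo yum history list

    # Show which packages a given transaction installed, updated or erased
    # (42 is an illustrative transaction ID taken from the list above)
    sudo yum history info 42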

Database restore
----------------

In case the packages were updated correctly but there is an
issue with updating the database schemas, we might need to restore the
database cluster.

With all the services stopped on the Overcloud controllers (except MySQL), go through
the following procedure.

On all the controller nodes, drop connections to the database port via the VIP by running::

    MYSQLIP=$(grep -A1 'listen mysql' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | grep bind | awk '{print $2}' | awk -F":" '{print $1}')
    sudo /sbin/iptables -I INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP

This will isolate all the MySQL traffic to the nodes.

On one controller node only, unmanage galera so that it is out of pacemaker's control::

    pcs resource unmanage galera

Remove the `wsrep_cluster_address` option from `/var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf`.
This needs to be executed on all nodes::

    grep wsrep_cluster_address /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf
    vi /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf

On all the controller nodes, stop the MariaDB database::

    mysqladmin -u root shutdown

On all the controller nodes, move the existing MariaDB data directories aside and prepare new data directories::

    sudo -i
    mv /var/lib/mysql/ /var/lib/mysql.old
    mkdir /var/lib/mysql
    chown mysql:mysql /var/lib/mysql
    chmod 0755 /var/lib/mysql
    mysql_install_db --datadir=/var/lib/mysql --user=mysql
    chown -R mysql:mysql /var/lib/mysql/
    restorecon -R /var/lib/mysql

On all the controller nodes, move the root configuration to a backup file::

    sudo mv /root/.my.cnf /root/.my.cnf.old
    sudo mv /etc/sysconfig/clustercheck /etc/sysconfig/clustercheck.old

On the controller node we previously set to `unmanaged`, bring the galera cluster back up with pacemaker::

    pcs resource manage galera
    pcs resource cleanup galera

Wait for the galera cluster to come up properly, and run the following
command until all controller nodes are reported as masters::

    pcs status | grep -C3 galera
    # Master/Slave Set: galera-master [galera]
    # Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

NOTE: If the cleanup does not show all controller nodes as masters, re-run the following command::

    pcs resource cleanup galera

On the controller node we previously set to `unmanaged`, which is now managed again
by pacemaker, restore the OpenStack database that was backed up in a previous section.
This will be replicated to the other controllers by Galera::

    mysql -u root < openstack_database.sql

On the same controller node, restore the users and permissions::

    mysql -u root < grants.sql

`pcs status` will show the galera resource in error, because it's now using the wrong user/password to poll the database status.

On all the controller nodes, restore the root/clustercheck configuration from the backup files::

    sudo mv /root/.my.cnf.old /root/.my.cnf
    sudo mv /etc/sysconfig/clustercheck.old /etc/sysconfig/clustercheck

Test the clustercheck locally on each controller node::

    /bin/clustercheck

Perform a cleanup in pacemaker to reprobe the state of the galera nodes::

    pcs resource cleanup galera

Test clustercheck on each controller node via xinetd.d::

    curl overcloud-controller-0:9200
    # curl overcloud-controller-1:9200
    # curl overcloud-controller-2:9200

Remove the iptables rule from each node to restore service access to the database::

    sudo /sbin/iptables -D INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP

Filesystem restore
------------------

On all overcloud nodes, copy the backup tar file to a temporary
directory and uncompress all the data::

    mkdir /var/tmp/filesystem_backup/data/
    cd /var/tmp/filesystem_backup/data/
    mv <path_to_the_backup_file> .
    tar -xvzf <backup_file>.tar.gz

NOTE: Untarring directly on the `/` directory would
overwrite your current files. It's recommended to
untar the file in a different directory.
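
After reviewing the extracted content, files can then be copied back into place selectively. A minimal sketch for a single service; the path is illustrative::

    # Copy only the keystone configuration back into place, after
    # reviewing the restored files (source path is illustrative)
    sudo rsync -av /var/tmp/filesystem_backup/data/etc/keystone/ /etc/keystone/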

Cleanup the redis resource
--------------------------

Run::

    pcs resource cleanup redis

Start up the services on all the controller nodes
-------------------------------------------------

The operator must check that all services are starting correctly.
The services installed on the controllers depend on the operator's
needs, so the following commands might not apply completely.
The goal of this section is to show that all services must be
started correctly before proceeding to retry an update or upgrade, or to
use the Overcloud on a regular basis.

Non containerized environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Command to start services::

    sudo -i
    systemctl start openstack-ceilometer-central
    systemctl start memcached
    pcs resource enable rabbitmq
    systemctl start openstack-nova-scheduler
    systemctl start openstack-heat-api
    systemctl start mongod
    systemctl start redis
    systemctl start httpd
    systemctl start neutron-ovs-cleanup

Once all the controller nodes are up, start the compute node services on all the compute nodes::

    sudo -i
    systemctl start openstack-ceilometer-compute.service
    systemctl start openstack-nova-compute.service

Containerized environment
~~~~~~~~~~~~~~~~~~~~~~~~~

The operator must check that all containerized services are running correctly. Identify any stopped services by running::

    sudo docker ps

Once the operator finds a stopped service, start it by running::

    sudo docker start <service name>
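
To catch everything at once, the stopped containers can also be listed and started in a loop. A sketch using standard `docker ps` filters::

    # Start every container currently in the Exited state
    for container in $(sudo docker ps -aq --filter status=exited); do
        sudo docker start "$container"
    done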


@ -16,6 +16,7 @@ TripleO Install Guide
advanced_deployment/baremetal_nodes
advanced_deployment/backends
advanced_deployment/custom
controlplane_backup_restore/00_index
troubleshooting/troubleshooting
validations/validations
mistral-api/mistral-api


@ -16,6 +16,5 @@ In this chapter you will find advanced management of various |project| areas.
upgrade
build_single_image
upload_single_image
backup_restore_undercloud
update_undercloud_ssh_keys
fernet_key_rotation