Refactor backup and restore docs

This patch refactors the current documentation on running
an Undercloud backup and restore, and
provides new documentation for the Overcloud control
plane backup and restore.

Change-Id: I8a8bbabed007fef5cbe9290db370b65464908475
Carlos Camacho 2018-02-15 15:06:32 +01:00
parent 0d717a47c3
commit c60a6f477a
7 changed files with 436 additions and 23 deletions


@ -0,0 +1,25 @@
TripleO backup and restore (Undercloud and Overcloud control plane)
===================================================================

This section describes a method to back up and restore both the Undercloud and the Overcloud
control plane.

The use cases for creating and restoring these backups are related to the
possible failures of a minor update or major upgrade, for both Undercloud and Overcloud.

The general approach to recovering from failures during the minor update or major upgrade workflow
is to fix the environment and restart services before re-running the last executed step.
There are specific cases in which rolling back to previous steps in the upgrade
workflow can lead to general failures in the system, e.g.
when executing `yum history` to roll back the upgrade of certain packages,
dependency resolution might select to remove critical packages like `systemd`.

.. toctree::
   :maxdepth: 2
   :includehidden:

   01_undercloud_backup
   02_overcloud_backup
   03_undercloud_restore
   04_overcloud_restore


@ -0,0 +1,64 @@
Backing up the Undercloud
=========================

To back up your Undercloud, you need to make sure a set of
files and databases is stored safely, so it can be used in
case of an issue when running the update or upgrade workflows.

The following sections describe how to
execute an Undercloud backup.
CLI driven backups
------------------

There is an automated way of creating an Undercloud backup.
This CLI option allows the operator to run a database and filesystem backup.
By default, all databases are included in the backup, as well as the folder `/home/stack`.

The command usage is::

    openstack undercloud backup [--add-path ADD_FILES_TO_BACKUP]

For example, we can run a full MySQL backup with additional paths as::

    openstack undercloud backup --add-path /etc/hosts \
                                --add-path /var/log/ \
                                --add-path /var/lib/glance/images/ \
                                --add-path /srv/node/ \
                                --add-path /etc/

When executing the Undercloud backup via the OpenStack
CLI, the backup is first stored in the temporary folder
`/var/tmp/`.

After this operation, the result of the backup procedure
is stored in the swift container `undercloud-backups`,
and it will expire 24 hours after its creation.
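
To check or fetch the stored backup before it expires, the standard swift object commands can be used. A sketch; the exact object name depends on the backup run, and the name below is illustrative::

    # List the objects in the undercloud-backups container
    # (assumes the stackrc credentials are sourced)
    source ~/stackrc
    openstack object list undercloud-backups

    # Download a backup tarball to the current directory
    # (the object name below is illustrative)
    openstack object save undercloud-backups UC-backup-20180215.tar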
Manual backups
--------------

If the user needs to run the backup manually,
the following steps must be executed.

Database backups
~~~~~~~~~~~~~~~~

The operator needs to back up all databases on the Undercloud node::

    mysqldump --opt --single-transaction --all-databases > /root/undercloud-all-databases.sql
Filesystem backups
~~~~~~~~~~~~~~~~~~

The following files and directories should be backed up:

* MariaDB configuration file on the undercloud (so we can restore databases accurately).
* All glance image data in `/var/lib/glance/images`.
* All swift data in `/srv/node`.
* All data in the stack user's home directory.
* The database backup created in the previous step.

The following command can be used to perform a backup of all data from the undercloud node::

    tar -czf undercloud-backup-`date +%F`.tar.gz \
        /root/undercloud-all-databases.sql \
        /etc/my.cnf.d/server.cnf \
        /var/lib/glance/images \
        /srv/node \
        /home/stack \
        /etc/pki \
        /opt/stack
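
Before relying on the archive, it is worth checking that it can be read back. A minimal sketch, assuming the tarball created above is in the current directory::

    # List the archive contents without extracting; a non-zero exit
    # status means the archive is missing or corrupt
    tar -tzf undercloud-backup-`date +%F`.tar.gz > /dev/null && echo "archive OK"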


@ -0,0 +1,156 @@
Backing up the Overcloud control plane services
===============================================

This backup guide covers backing up services in an HA + containers deployment.

Prerequisites
-------------

The control plane services in the Overcloud need to be backed up. To do so, we
apply the same approach as for the Undercloud: run a backup of the databases
and create a filesystem backup.

Databases backup
----------------

MySQL backup
~~~~~~~~~~~~

When using HA, the operator can run the database backup on any controller node,
using the `--single-transaction` option when executing `mysqldump`.

If the deployment is using containers, the hieradata file containing the MySQL
root password is located in the folder `/var/lib/config-data/mysql/etc/puppet/hieradata/`.
The file containing the MySQL root password is `service_configs.json` and the key is
`mysql::server::root_password`.

Create a temporary folder to store the backups::

    sudo -i
    mkdir -p /var/tmp/mysql_backup/

Store the MySQL root password to be used in further queries::

    MYSQLDBPASS=$(cat /var/lib/config-data/mysql/etc/puppet/hieradata/service_configs.json | grep mysql | grep root_password | awk -F": " '{print $2}' | awk -F"\"" '{print $2}')
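
The grep/awk pipeline above is sensitive to formatting; since the hieradata is plain JSON, the key can also be extracted directly. A sketch, using an illustrative sample file in place of the real hieradata::

    # Sample file standing in for
    # /var/lib/config-data/mysql/etc/puppet/hieradata/service_configs.json
    # (the password value is illustrative)
    cat > /tmp/service_configs.json <<'EOF'
    {"mysql::server::root_password": "s3cr3t"}
    EOF

    # Pull the value of the mysql::server::root_password key
    MYSQLDBPASS=$(sed -n 's/.*"mysql::server::root_password": *"\([^"]*\)".*/\1/p' /tmp/service_configs.json)
    echo $MYSQLDBPASS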

Execute from any controller::

    mysql -uroot -p$MYSQLDBPASS -e "select distinct table_schema from information_schema.tables where engine='innodb' and table_schema != 'mysql';" \
    -s -N | xargs mysqldump -uroot -p$MYSQLDBPASS --single-transaction --databases > /var/tmp/mysql_backup/openstack_databases-`date +%F`-`date +%T`.sql

This will create a database backup called `/var/tmp/mysql_backup/openstack_databases-<date>.sql`.

Then back up all the user and permission information::

    mysql -uroot -p$MYSQLDBPASS -e "SELECT CONCAT('\"SHOW GRANTS FOR ''',user,'''@''',host,''';\"') FROM mysql.user where (length(user) > 0 and user NOT LIKE 'root')" \
    -s -N | xargs -n1 mysql -uroot -p$MYSQLDBPASS -s -N -e | sed 's/$/;/' > /var/tmp/mysql_backup/openstack_databases_grants-`date +%F`-`date +%T`.sql

This will create a database backup called `/var/tmp/mysql_backup/openstack_databases_grants-<date>.sql`.

MongoDB backup (only needed until Ocata)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since OpenStack Pike, MongoDB is no longer supported, so be sure to back up the data from
your telemetry backend.

If telemetry services are used, the data stored in the MongoDB instance needs to be backed up.

Connect to any controller and get the IP of the MongoDB primary instance::

    MONGOIP=$(cat /etc/mongod.conf | grep bind_ip | awk '{print $3}')

Now, create the backup::

    mkdir -p /var/tmp/mongo_backup/
    mongodump --oplog --host $MONGOIP --out /var/tmp/mongo_backup/

Be sure the files were created successfully.

Redis backup
~~~~~~~~~~~~

If telemetry services are used, the data stored in the Redis instance also needs to be backed up.

First, get the Redis endpoint: open `/var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg`
and find the bind IP in the `listen redis` section; it should be a string of the form
`bind <bind_IP:bind_port> transparent`::

    grep -A1 'listen redis' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg
    REDISIP=$(grep -A1 'listen redis' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | grep bind | awk '{print $2}' | awk -F":" '{print $1}')

Next, store the master auth password used to connect to the Redis cluster. The config file should be
`/var/lib/config-data/redis/etc/redis.conf` and the password is under the `masterauth` parameter.
Let's store it in a variable::

    REDISPASS=$(cat /var/lib/config-data/redis/etc/redis.conf | grep masterauth | grep -v \# | awk '{print $2}')

Check connectivity to the Redis cluster::

    redis-cli -a $REDISPASS -h $REDISIP ping

Now, create a database dump by executing::

    redis-cli -a $REDISPASS -h $REDISIP bgsave

The database backup will be stored in the
default directory, `/var/lib/redis/`.
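
`bgsave` runs in the background, so a way to confirm it finished is to check the `LASTSAVE` timestamp and the dump file itself. A sketch, using the variables set above::

    # LASTSAVE returns the unix timestamp of the last successful save;
    # it increases once the background save triggered above completes
    redis-cli -a $REDISPASS -h $REDISIP lastsave

    # The dump file lands in the default directory under the default name
    ls -l /var/lib/redis/dump.rdb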

Filesystem backup
-----------------

We need to back up all files that can be used to recover
from a possible failure in the Overcloud controllers when
executing a minor update or a major upgrade.

The option `--ignore-failed-read` is added to the `tar`
command because the list of files to back up might be
different in each environment, and the list of
paths to back up is made as general as possible.

The following folders should be backed up::

    mkdir -p /var/tmp/filesystem_backup/
    tar --ignore-failed-read \
        -zcvf /var/tmp/filesystem_backup/fs_backup-`date '+%Y-%m-%d-%H-%M-%S'`.tar.gz \
        /etc/nova \
        /var/log/nova \
        /var/lib/nova \
        --exclude /var/lib/nova/instances \
        /etc/glance \
        /var/log/glance \
        /var/lib/glance \
        /etc/keystone \
        /var/log/keystone \
        /var/lib/keystone \
        /etc/httpd \
        /etc/cinder \
        /var/log/cinder \
        /var/lib/cinder \
        /etc/heat \
        /var/log/heat \
        /var/lib/heat \
        /var/lib/heat-config \
        /var/lib/heat-cfntools \
        /etc/rabbitmq \
        /var/log/rabbitmq \
        /var/lib/rabbitmq \
        /etc/neutron \
        /var/log/neutron \
        /var/lib/neutron \
        /etc/corosync \
        /etc/haproxy \
        /etc/logrotate.d/haproxy \
        /var/lib/haproxy \
        /etc/openvswitch \
        /var/log/openvswitch \
        /var/lib/openvswitch \
        /etc/ceilometer \
        /var/lib/redis \
        /etc/sysconfig/memcached \
        /etc/gnocchi \
        /var/log/gnocchi \
        /etc/aodh \
        /var/log/aodh \
        /etc/panko \
        /var/log/panko \
        /var/log/ceilometer


@ -1,31 +1,17 @@
Restoring the Undercloud
========================

Restoring a backup of your Undercloud on a Fresh Machine
--------------------------------------------------------

The following restore process assumes you are recovering from a failed undercloud node where you have to reinstall it from scratch.
It assumes that the hardware layout is the same, and that the hostname and undercloud settings of the machine will be the same as well.

Once the machine is installed and in a clean state, re-enable all the subscriptions/repositories needed to install and run TripleO.
Note that unless specified otherwise, all commands are run as root.

Install the MariaDB server with::

    yum install -y mariadb-server
@ -59,7 +45,7 @@ We have to now install the swift and glance base packages, and then restore thei
    chown -R swift: /srv/node
    chown -R glance: /var/lib/glance/images
Finally, we rerun the undercloud installation from the stack user, making sure to run it in the stack user's home dir::
    su - stack
    sudo yum install -y python-tripleoclient


@ -0,0 +1,182 @@
Restoring the Overcloud control plane services
==============================================

Restoring the Overcloud control plane from a failed state
depends on the specific issue the operator is facing.
This section provides a restore method for
the backups created in the previous steps.

The general strategy for restoring an Overcloud control plane
is to get the services working again in order to
re-run the update/upgrade tasks.

YUM update rollback
-------------------

Depending on the updated packages, running a yum rollback
based on the `yum history` command might not be a good idea.
In the specific case of an OpenStack minor update or major upgrade,
it will be harder, as there will be several dependencies and packages
to downgrade, depending on the number of transactions yum had to run to upgrade
all the node packages.

Also, using `yum history` to roll back transactions
can end up removing packages needed for the
system to work correctly.
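
Before deciding whether a rollback is feasible at all, the transactions can at least be inspected. A sketch; the transaction ID below is illustrative::

    # List recent yum transactions
    sudo yum history list

    # Show which packages a given transaction installed, updated or erased
    # (42 is an illustrative transaction ID taken from the list above)
    sudo yum history info 42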

Database restore
----------------

In case the packages were updated correctly but there is an
issue with updating the database schemas, we might need to restore the
database cluster.

With all the services stopped on the Overcloud controllers (except MySQL), go through
the following procedure.

On all the controller nodes, drop connections to the database port via the VIP by running::

    MYSQLIP=$(grep -A1 'listen mysql' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | grep bind | awk '{print $2}' | awk -F":" '{print $1}')
    sudo /sbin/iptables -I INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP

This will isolate all the MySQL traffic to the nodes.

On one controller node only, unmanage galera so that it is out of pacemaker's control::

    pcs resource unmanage galera

Remove the `wsrep_cluster_address` option from `/var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf`.
This needs to be executed on all nodes::

    grep wsrep_cluster_address /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf
    vi /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf

On all the controller nodes, stop the MariaDB database::

    mysqladmin -u root shutdown

On all the controller nodes, move the existing MariaDB data directories aside and prepare new data directories::

    sudo -i
    mv /var/lib/mysql/ /var/lib/mysql.old
    mkdir /var/lib/mysql
    chown mysql:mysql /var/lib/mysql
    chmod 0755 /var/lib/mysql
    mysql_install_db --datadir=/var/lib/mysql --user=mysql
    chown -R mysql:mysql /var/lib/mysql/
    restorecon -R /var/lib/mysql

On all the controller nodes, move the root configuration to a backup file::

    sudo mv /root/.my.cnf /root/.my.cnf.old
    sudo mv /etc/sysconfig/clustercheck /etc/sysconfig/clustercheck.old

On the controller node we previously set to `unmanaged`, bring the galera cluster back up with pacemaker::

    pcs resource manage galera
    pcs resource cleanup galera

Wait for the galera cluster to come up properly, and run the following
command until all controller nodes are reported as masters::

    pcs status | grep -C3 galera
    # Master/Slave Set: galera-master [galera]
    # Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

NOTE: If the cleanup does not show all controller nodes as masters, re-run the following command::

    pcs resource cleanup galera

On the controller node we previously set to `unmanaged`, which is now managed again
by pacemaker, restore the OpenStack database that was backed up in a previous section.
This will be replicated to the other controllers by Galera::

    mysql -u root < openstack_database.sql

On the same controller node, restore the users and permissions::

    mysql -u root < grants.sql

`pcs status` will show the galera resource in error, because it's now using the wrong user/password to poll the database status.

On all the controller nodes, restore the root/clustercheck configuration from the backup files::

    sudo mv /root/.my.cnf.old /root/.my.cnf
    sudo mv /etc/sysconfig/clustercheck.old /etc/sysconfig/clustercheck

Test the clustercheck locally on each controller node::

    /bin/clustercheck

Perform a cleanup in pacemaker to reprobe the state of the galera nodes::

    pcs resource cleanup galera

Test clustercheck on each controller node via xinetd.d::

    curl overcloud-controller-0:9200
    # curl overcloud-controller-1:9200
    # curl overcloud-controller-2:9200

Remove the iptables rule from each node to restore service access to the database::

    sudo /sbin/iptables -D INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP

Filesystem restore
------------------

On all overcloud nodes, copy the backup tar file to a temporary
directory and uncompress all the data::

    mkdir /var/tmp/filesystem_backup/data/
    cd /var/tmp/filesystem_backup/data/
    mv <path_to_the_backup_file> .
    tar -xvzf <backup_file>.tar.gz

NOTE: Untarring directly on the `/` directory would
overwrite your current files. It's recommended to
untar the file in a different directory.
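
After reviewing the extracted content, files can then be copied back into place selectively. A minimal sketch for a single service; the path is illustrative::

    # Copy only the keystone configuration back into place, after
    # reviewing the restored files (source path is illustrative)
    sudo rsync -av /var/tmp/filesystem_backup/data/etc/keystone/ /etc/keystone/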

Cleanup the redis resource
--------------------------

Run::

    pcs resource cleanup redis

Start up the services on all the controller nodes
-------------------------------------------------

The operator must check that all services are starting correctly.
The services installed on the controllers depend on the operator's
needs, so the following commands might not apply completely.
The goal of this section is to show that all services must be
started correctly before proceeding to retry an update or upgrade, or to
use the Overcloud on a regular basis.

Non containerized environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Command to start services::

    sudo -i
    systemctl start openstack-ceilometer-central
    systemctl start memcached
    pcs resource enable rabbitmq
    systemctl start openstack-nova-scheduler
    systemctl start openstack-heat-api
    systemctl start mongod
    systemctl start redis
    systemctl start httpd
    systemctl start neutron-ovs-cleanup

Once all the controller nodes are up, start the compute node services on all the compute nodes::

    sudo -i
    systemctl start openstack-ceilometer-compute.service
    systemctl start openstack-nova-compute.service

Containerized environment
~~~~~~~~~~~~~~~~~~~~~~~~~

The operator must check that all containerized services are running correctly. Identify any stopped services by running::

    sudo docker ps

Once the operator finds a stopped service, start it by running::

    sudo docker start <service name>
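
To catch everything at once, the stopped containers can also be listed and started in a loop. A sketch using standard `docker ps` filters::

    # Start every container currently in the Exited state
    for container in $(sudo docker ps -aq --filter status=exited); do
        sudo docker start "$container"
    done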


@ -16,6 +16,7 @@ TripleO Install Guide
advanced_deployment/baremetal_nodes
advanced_deployment/backends
advanced_deployment/custom
controlplane_backup_restore/00_index
troubleshooting/troubleshooting
validations/validations
mistral-api/mistral-api


@ -16,6 +16,5 @@ In this chapter you will find advanced management of various |project| areas.
upgrade
build_single_image
upload_single_image
backup_restore_undercloud
update_undercloud_ssh_keys
fernet_key_rotation