New notes on advanced upgrade/fallback for cluster

For clustered Keystone nodes it's possible to optimize the upgrade
process to minimize downtime. Additionally, a process to quickly restore
the cluster after a failed upgrade is described.

Change-Id: I746d46a968dd887b0559644f91ae207055146390
This commit is contained in:
Matt Fischer 2016-08-25 14:02:07 -04:00 committed by Steve Martinelli
parent 5caf5212d8
commit 9418f2609f
1 changed file with 77 additions and 2 deletions


@@ -34,7 +34,7 @@ Before you begin
Plan your upgrade:
* Read and ensure you understand the `release notes
-  http://docs.openstack.org/releasenotes/keystone/`_ for the next release.
+  <http://docs.openstack.org/releasenotes/keystone/>`_ for the next release.
* Resolve any outstanding deprecation warnings in your logs. Some deprecation
cycles are as short as a single release, so it's possible to break a
@@ -77,6 +77,81 @@ from functioning normally.
#. Start all keystone processes.
Upgrading with minimal downtime
-------------------------------
If you run a multi-node keystone cluster that uses a replicated database, like
a Galera cluster, it is possible to upgrade with minimal downtime. This method
also optimizes recovery time from a failed upgrade. This section assumes
familiarity with the base case (`Upgrading with downtime`_) outlined above.
In these steps the nodes will be divided into ``first`` and ``other`` nodes.
#. Back up your database; a minimal example is sketched after this list. There
is no way to roll back the upgrade of keystone, and this backup is your
worst-case fallback option.
#. Disable keystone on all nodes but the ``first`` node. The exact mechanism
depends on your deployment. If you are unable to disable a service or place a
service into maintenance mode in your load balancer, you can stop the keystone
processes instead (see the sketch after this list).
#. Stop the database service on one of the ``other`` nodes in the cluster. This
isolates the old dataset on a single node in the cluster. In the event of a
failed upgrade, this data can be used to rebuild the cluster without having to
restore from backup.
#. Update the configuration files on the ``first`` node.
#. Upgrade keystone on the ``first`` node. keystone is now down for your cloud.
#. Run ``keystone-manage db_sync`` on the ``first`` node (see the sketch after
this list). As soon as this finishes, keystone is working again on a single
node in the cluster.
#. keystone is now upgraded on a single node. Your load balancers will be
sending all traffic to this single node. This is your chance to verify that
keystone is up and running and not broken (a quick check is sketched after
this list). If keystone is broken, see the `Rollback after a failed upgrade`_
section below.
#. Once you have verified that keystone is up and running, begin the upgrade on
the ``other`` nodes. This entails updating configuration files and upgrading
the code. The ``db_sync`` does not need to be run again.
#. On the node where you stopped the database service, restart it and verify
that it properly rejoins the cluster (see the sketch after this list).
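The backup referenced in the first step might look like the following. This is
a minimal sketch assuming a MySQL or MariaDB backed Galera cluster and the
default ``keystone`` database name; adjust credentials, host, and output path
for your deployment.

.. code-block:: console

   # Dump the keystone database without blocking Galera replication.
   $ mysqldump -u root -p --single-transaction keystone > keystone-backup.sql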
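Disabling keystone on the ``other`` nodes and isolating the old dataset might
look like this, assuming systemd and keystone running under Apache
``mod_wsgi``; unit names such as ``apache2`` and ``mariadb`` vary between
distributions and deployments.

.. code-block:: console

   # On every node except the ``first`` node: stop the keystone processes
   # (or disable the node in your load balancer instead).
   $ sudo systemctl stop apache2

   # On exactly one of the ``other`` nodes: also stop the database service so
   # that node keeps an untouched copy of the old dataset.
   $ sudo systemctl stop mariadb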
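Upgrading the ``first`` node and running the migrations could then look like
the following; the package manager invocation is only illustrative, as package
names and upgrade tooling differ between distributions.

.. code-block:: console

   # Upgrade the keystone code on the ``first`` node (illustrative only).
   $ sudo apt-get install --only-upgrade keystone

   # Apply the database migrations. keystone is unavailable until this
   # finishes. Depending on packaging, this may need to run as the keystone
   # service user.
   $ keystone-manage db_sync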
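To verify the upgraded node and, later, the rejoined database node, something
like the following works, assuming admin credentials are loaded in the
environment and a Galera-based cluster:

.. code-block:: console

   # Confirm that authentication against the upgraded node works.
   $ openstack token issue

   # After restarting the database on the isolated node, confirm that it has
   # rejoined the cluster (the value should match your total node count).
   $ mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"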
Using this model, the outage window is minimized because the only time
when your cluster is totally offline is between loading the newer version
of keystone and running the ``db_sync`` command. Typically, the outage with
this method can be measured in tens of seconds, especially if automation is
used.
Rollback after a failed upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the upgrade fails, only a single node has been affected. This makes the
recovery simpler and quicker. If issues are not discovered until the entire
cluster is upgraded, a full shutdown and restore from backup will be required.
That will take much longer than just fixing a single node with an old copy of
the database still available. This process depends on your architecture, and
it is highly recommended that you practice it in a development environment
before relying on it for the first time.
#. Isolate the bad node. Shut down keystone and the database service
on the upgraded "bad" node.
#. Bootstrap the database cluster from the node holding the old data
(see the sketch after this list). This may require wiping the data first
on any nodes that are not holding the old data.
#. Enable keystone on the old nodes in your load balancer, or, if
the processes were stopped, restart them.
#. Validate that keystone is working.
#. Downgrade the code and config files on the bad node.
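As a rough sketch of the rollback, assuming systemd, a MariaDB/Galera cluster,
and keystone under Apache ``mod_wsgi`` (commands and unit names vary by
deployment):

.. code-block:: console

   # On the upgraded "bad" node: take keystone and the database out of service.
   $ sudo systemctl stop apache2 mariadb

   # On the node holding the old data: bootstrap a new cluster from it.
   $ sudo galera_new_cluster

   # On the remaining old nodes: rejoin the bootstrapped cluster (they will
   # resync from the node holding the old data).
   $ sudo systemctl start mariadb

   # Restart keystone on the old nodes, or re-enable them in the load
   # balancer, then confirm that authentication works.
   $ sudo systemctl start apache2
   $ openstack token issue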
This process should take only a matter of minutes and will minimize cloud
downtime when it is required.
Upgrading without downtime
--------------------------
@@ -88,7 +163,7 @@ this upgrade process, end users will still be able to authenticate to receive
tokens normally, and other OpenStack services will still be able to
authenticate requests normally.
-#. Make a backup of your database. Keystone does not support downgrading the
+#. Make a backup of your database. keystone does not support downgrading the
database, so restoring from a full backup is your only option for recovery
in the event of an upgrade failure.