rolling upgrades

This spec proposes a workflow to address the requirement
of enabling users to upgrade Ceilometer with minimal/zero
downtime.

See content for details.

Change-Id: I94ef8cc0e705b79ccccc8428820c8cf09c8eca78
gordon chung 2015-11-05 15:46:21 -05:00
parent b0167711aa
commit 988a1d284c

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================
Rolling Upgrades
================
https://blueprints.launchpad.net/ceilometer/+spec/rolling-upgrades
As Ceilometer matures through each iteration and adds new features, users
will be required to upgrade their existing environments while minimising
the potential downtime required.
Problem description
===================
Ceilometer currently provides four discrete services: polling agents, notification
agents, a collector service, and an API service. Each provides its own
functionality, and the flow of data and the purpose of each component can often
be lost on operators when upgrading services.
To ensure a smooth upgrade experience, we need to properly describe the upgrade
path of the components.
Proposed change
===============
Fortunately, due to the one-way design of Ceilometer, where each service does
its work and hands it off without needing a response, upgrading is actually a
simple procedure and only requires that the services be upgraded in the proper
order.
Using the simple mantra of 'never remove, never alter, only add', we can ensure
a new schema change is understandable by both old and new consumers.
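
As a purely illustrative sketch of the mantra, an additive schema migration
might look like the following. The table and column names are hypothetical and
the migration tool shown (Alembic) is an assumption rather than necessarily
what Ceilometer uses; the point is only that columns are added, never dropped
or altered:

.. code-block:: python

    # Hypothetical additive migration (sketch only; 'sample' and 'new_attr'
    # are illustrative names, not Ceilometer's actual schema).
    import sqlalchemy as sa
    from alembic import op


    def upgrade():
        # Old consumers keep using the columns they already know; new
        # consumers can additionally populate the new attribute.
        op.add_column('sample',
                      sa.Column('new_attr', sa.String(255), nullable=True))

    # There is deliberately no step that drops or alters an existing column:
    # 'never remove, never alter, only add'.
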
There are two upgrade paths to handle -- neither requires any code change:
1. Full upgrade of services
1. The database is upgraded using the above mantra.
2. The collector must be taken offline first. The new collector, which knows
how to interpret the new payload, can then be started. It will
disregard any historical attributes (a sketch of this tolerant consumption
follows the note below).
3. The notification agent can then be taken offline and upgraded under the
same conditions.
4. The polling agents can be taken offline and upgraded. In this path, you'll
want to take down the agents on all hosts before starting. After starting
the first agent, you should verify that data is again being polled.
5. The API service can be taken offline and upgraded at any point.
2. Partial upgrade of services
1. The database is upgraded using the above mantra.
2. The new collector services can be started alongside the old collectors and
must know how to interpret the new payload. They will disregard any
historical attributes.
3. The new notification agent can be started alongside the old agent if
workload_partitioning is not enabled OR if it has the same pipeline
configuration. If not, the old agents must be loaded with the same
pipeline configuration first to ensure that all notification agents work
against the same pipeline sets.
4. The new polling agent can be started alongside the old agent only if
no new pollsters were added. If new pollsters were added, the new polling
agents must start in their own partitioning group and poll only the new
pollsters. After all old agents are upgraded, the polling agents can be
changed to poll both the new pollsters AND the old ones.
5. API service management is handled by WSGI, so there is only ever one
version of the API service running.
.. note::
   Upgrade ordering does not matter in the partial upgrade path. The only
   requirement is that the database be upgraded first. It is advisable to
   upgrade in the same order as described above: database, collector,
   notification agent, polling agent, API.
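
The following is a minimal, purely illustrative sketch of the tolerant
consumption mentioned in the collector steps above: the consumer keeps only
the attributes it understands and silently disregards everything else. The
field names are hypothetical and this is not Ceilometer's actual dispatcher
code:

.. code-block:: python

    # Sketch of a consumer that tolerates additive payload changes.
    # KNOWN_FIELDS and the payload layout are hypothetical.
    KNOWN_FIELDS = {'counter_name', 'counter_volume', 'counter_unit',
                    'resource_id', 'timestamp'}


    def normalise(payload):
        """Keep only the attributes this consumer understands.

        Because schema changes only ever add attributes, a newer payload
        simply carries extra keys, which are dropped here exactly the same
        way an older consumer would ignore them.
        """
        return {key: value for key, value in payload.items()
                if key in KNOWN_FIELDS}
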
New models will be pushed onto their own dedicated queue, similar to how
Events and Samples are processed on completely separate queues today.
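
As a rough sketch of that separation (the topic names, model types, and
notifier setup are assumptions for illustration, not the project's actual
configuration), new model types could simply be published to their own topic
so that existing consumers never see them:

.. code-block:: python

    # Sketch only: topic and model names here are hypothetical.
    import oslo_messaging
    from oslo_config import cfg

    TOPIC_BY_MODEL = {
        'sample': 'metering',      # existing queue, unchanged
        'event': 'event',          # existing queue, unchanged
        'new_model': 'new_model',  # new models get their own queue
    }


    def publish(model_type, payload):
        # In real code the transport/notifier would be created once and
        # reused rather than rebuilt on every call.
        transport = oslo_messaging.get_notification_transport(cfg.CONF)
        notifier = oslo_messaging.Notifier(
            transport, publisher_id='telemetry.publisher',
            topics=[TOPIC_BY_MODEL[model_type]])
        notifier.sample({}, model_type, payload)
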
The above procedures will be added to OpenStack documentation.
Alternatives
------------
1. oslo.versionedobjects - this seems like overkill and would add processing
overhead to each and every sample (a sketch follows this list).
2. versioned queues - this does not have the o.vo overhead but will require
more memory to handle the additional queues.
3. versioned payloads - this may simplify the payload, as we would not need
to carry historical fields, but requires agents to understand each unique
version.
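
For context on alternative 1, a versioned sample object would look roughly
like the following (a sketch with made-up fields; this spec does not propose
adopting it):

.. code-block:: python

    # Sketch of alternative 1 (oslo.versionedobjects); fields are made up.
    from oslo_versionedobjects import base
    from oslo_versionedobjects import fields


    @base.VersionedObjectRegistry.register
    class Sample(base.VersionedObject):
        # Every additive change bumps the version, and every single sample
        # pays the serialisation/validation overhead noted above.
        VERSION = '1.1'

        fields = {
            'counter_name': fields.StringField(),
            'counter_volume': fields.FloatField(),
            'new_attr': fields.StringField(nullable=True),
        }
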
Data model impact
-----------------
Depending on the number of changes, this may increase the size of the model,
as we need to retain old attributes. If this becomes an issue, we can choose
to define a drop period after which attributes from EOL builds are no longer
supported.
REST API impact
---------------
None.
Security impact
---------------
None.
Pipeline impact
---------------
None.
Other end user impact
---------------------
None.
Performance/Scalability Impacts
-------------------------------
Same amount of processing but **potentially** larger payloads.
Other deployer impact
---------------------
Deployers who do not consume data from the API will need to be aware of
model changes if they want the latest and greatest.
Developer impact
----------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
gordc
Ongoing maintainer:
everyone
Work Items
----------
* Add the above conditions to docs
* Add testing support
Future lifecycle
================
If service requirements change, the above assumptions may not be enough.
Dependencies
============
None.
Testing
=======
* multi-node grenade testing
* migration testing - https://review.openstack.org/#/c/234686/
Documentation Impact
====================
This proposal is nothing but documentation.
References
==========
[1] https://etherpad.openstack.org/p/mitaka-telemetry-upgrades