rolling upgrades
this spec proposes workflow to address the requirement of enabling users to upgrade ceilometer with minimal/zero downtime. see content for details. Change-Id: I94ef8cc0e705b79ccccc8428820c8cf09c8eca78
This commit is contained in:
parent
b0167711aa
commit
988a1d284c
|
@ -0,0 +1,187 @@
|
||||||
|
..
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||||
|
License.
|
||||||
|
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||||
|
|
||||||
|
================
|
||||||
|
Rolling Upgrades
|
||||||
|
================
|
||||||
|
|
||||||
|
https://blueprints.launchpad.net/ceilometer/+spec/rolling-upgrades
|
||||||
|
|
||||||
|
As Ceilometer matures through each iteration and adds new features, users
|
||||||
|
will be required to upgrade their existing environments while minimising
|
||||||
|
the potential downtime required.
|
||||||
|
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
Ceilometer currently provides 4 discrete services: polling agents, notification
|
||||||
|
agents, collector service, and an api service. Each of them provide their
|
||||||
|
own functionality and the flow of data and purpose of each component can often
|
||||||
|
be lost on operators when upgrading services.
|
||||||
|
|
||||||
|
To ensure a smooth upgrade experience, we need to properly describe the upgrade
|
||||||
|
path of the components.
|
||||||
|
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
Fortunately, due to the unilateral design of Ceilometer where work is done and
|
||||||
|
handed off without worry, upgrading is actually a simple procedure and only
|
||||||
|
requires proper ordering of upgrades.
|
||||||
|
|
||||||
|
Using the simple mantra of 'never remove, never alter, only add', we can ensure
|
||||||
|
a new schema change is understandable by both old and new consumers.
|
||||||
|
|
||||||
|
There are two upgrade paths to handle -- both require no code change:
|
||||||
|
|
||||||
|
1. Full upgrade of services
|
||||||
|
|
||||||
|
1. The database is upgraded using the above mantra.
|
||||||
|
2. The collector must be first taken offline. The new collector, that knows
|
||||||
|
how to interpret the new payload, can then be started. It will
|
||||||
|
disregard any historical attributes.
|
||||||
|
3. The notification agent can then be taken offline and upgraded with the
|
||||||
|
same conditions.
|
||||||
|
4. The polling agents can be taken offline and upgraded. In this path, you'll
|
||||||
|
want to take down agents on all hosts before starting. After starting
|
||||||
|
first agent, you should verify that data is again being polled.
|
||||||
|
5. The api service can be taken offline and upgraded at any point.
|
||||||
|
|
||||||
|
2. Partial upgrade of services
|
||||||
|
|
||||||
|
1. The database is upgraded using the above mantra.
|
||||||
|
2. The new collector services can be started alongside the old collectors and
|
||||||
|
must know how to interpret the new payload. It will disregard any
|
||||||
|
historical attributes.
|
||||||
|
3. The new notification agent can be started alongside the old agent if no
|
||||||
|
workload_partioning is enabled OR if it has the same pipeline
|
||||||
|
configuration. If not, the old agents must be loaded with the same
|
||||||
|
pipeline configuration first to ensure the notification agents all work
|
||||||
|
against same pipeline sets.
|
||||||
|
4. The new polling agent can be started alongside the old agent only if
|
||||||
|
no new pollsters were added. If not, new polling agents must start only
|
||||||
|
in it's own partitioning group and poll only the new pollsters. After
|
||||||
|
all old agents are upgraded, the polling agents can be changed to poll
|
||||||
|
both new pollsters AND the old ones.
|
||||||
|
5. API service management is handled by WSGI so there is only ever one
|
||||||
|
version of API service running
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Upgrade ordering does not matter in partial upgrade path. The only
|
||||||
|
requirement is that the database be upgraded first. It is advisable to
|
||||||
|
upgrade following the same ordering as currently described: database,
|
||||||
|
collector, notification agent, polling agent, api.
|
||||||
|
|
||||||
|
Regarding new models, they will be pushed into their own unique queue similar
|
||||||
|
to how Events and Samples are processed on completely separate queues today.
|
||||||
|
|
||||||
|
The above procedures will be added to OpenStack documentation.
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
1. oslo.versionedobjects - this seems like overkill and will add processing
|
||||||
|
overhead to each and every sample
|
||||||
|
2. versioned queues - this does not have o.vo overhead but will require more
|
||||||
|
memory to handle additional queues.
|
||||||
|
3. versioned payloads - this may simplify payload as we don't need to carry
|
||||||
|
historical fields but requires agents to understand each unique version
|
||||||
|
|
||||||
|
Data model impact
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Depending on the amount of changes, this may increase the size of the model
|
||||||
|
as we need to capture old attributes. We can choose to define a drop period
|
||||||
|
if this becomes an issue where we stop support attributes from EOL builds.
|
||||||
|
|
||||||
|
REST API impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
Security impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
Pipeline impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
Other end user impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
Performance/Scalability Impacts
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
Same amount of processing but **potentially** larger payloads.
|
||||||
|
|
||||||
|
Other deployer impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
For those not consuming data from API, they will need to be aware of
|
||||||
|
model changes if they want the latest and greatest.
|
||||||
|
|
||||||
|
Developer impact
|
||||||
|
----------------
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Primary assignee:
|
||||||
|
gordc
|
||||||
|
|
||||||
|
Ongoing maintainer:
|
||||||
|
everyone
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
* Add the above conditions to docs
|
||||||
|
* Add testing support
|
||||||
|
|
||||||
|
|
||||||
|
Future lifecycle
|
||||||
|
================
|
||||||
|
|
||||||
|
If service requirements change, the above assumptions may not be enough.
|
||||||
|
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
|
||||||
|
Testing
|
||||||
|
=======
|
||||||
|
|
||||||
|
* multi-node grenade testing
|
||||||
|
* migration testing - https://review.openstack.org/#/c/234686/
|
||||||
|
|
||||||
|
|
||||||
|
Documentation Impact
|
||||||
|
====================
|
||||||
|
|
||||||
|
This proposal is nothing but documentation.
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
[1] https://etherpad.openstack.org/p/mitaka-telemetry-upgrades
|
Loading…
Reference in New Issue