Metrics retention policy enhancement
Support differentiable metrics retention policy based on metrics type. Also outline alternatives. Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0 story: 2001576
Commit: a3f82ddd4a (parent: 4aa92c0caa)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Metric Retention Policy
================================================

StoryBoard: https://storyboard.openstack.org/#!/story/2001576

A metric retention policy must be in place to keep disks from filling up.
The retention period should be adjustable for different types of metrics, e.g.,
monitoring vs. metering or aggregate vs. raw meters.

Problem description
===================

In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up within months, if not
weeks, unless old metric data is purged regularly. The retention
requirement can differ based on the type of metric and the usage
model. For example, a customer may want to preserve metering metrics
for months or years while having no interest in monitoring metrics that are
more than a week old. Some customers' billing systems may pull the metering
data on a daily basis, which could eliminate the need for longer retention of
metering metrics. Monasca needs to support a metric retention policy that can
be tailored per metric or metric type.

Use Cases
---------

- Use case 1:
  The installer sets a default TTL value in configuration. At installation
  time, a default TTL (time to live) value is specified in the configuration
  for monasca-api and is used as the default retention policy.

  The default retention policy is applied if a metric doesn't match another
  retention policy. This default retention is generally a shorter period of
  time and may be used for the common monitoring metrics.

- Use case 2:
  The installer loads a set of metric-to-TTL mappings (retention policies),
  which is stored in the Monasca API data store (MySQL database). These
  mappings may be provided in a JSON structure. This is intended to be useful
  for bootstrapping or restoring from backup.

- Use case 3:
  Monasca API receives a new metric (regardless of source). The metric is
  looked up in the dictionary of mappings to determine its TTL (the default
  value is used if there is no match). The TTL is passed with the metric value
  on to the Persister for storage in the TSDB.

  Note that the use cases for monasca-agent to post metrics are unchanged;
  only the processing at Monasca API and the API-to-Persister message change.

  The Monasca Persister then stores the metric and specifies the TTL to the
  configured TSDB (i.e. InfluxDB or Cassandra).

- Use case 4:
  The operator uses the Monasca CLI to specify (or modify) a TTL value for a
  metric match string. The match string could be specific, such as
  "cpu.user_perc", or a wildcard string, such as "image.*". The CLI sends the
  request to the Monasca TTL API, where it is validated and then stored in the
  database.

- Use case 5:
  The operator uses the Monasca CLI to GET the dictionary of metric:TTL
  mappings. This can be used to export the list for backup or verification.

- Use case 6 (optional):
  The operator uses the Monasca UI to accomplish use case 4 or 5.
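
As a rough illustration of use cases 3 and 4, the lookup of a TTL for an
incoming metric might be sketched as below. This is only a sketch under stated
assumptions: ``lookup_ttl``, ``TTL_MAPPINGS`` and ``DEFAULT_TTL_SECONDS`` are
hypothetical names, the values are invented, and match strings are treated as
shell-style wildcards, which this spec does not define. Dimensions are ignored
here.

.. code-block:: python

    from fnmatch import fnmatchcase

    # Hypothetical default (7 days, in seconds); the real value would come
    # from the monasca-api configuration.
    DEFAULT_TTL_SECONDS = 7 * 24 * 3600

    # Hypothetical mappings of match string -> TTL in seconds.
    TTL_MAPPINGS = {
        "cpu.user_perc": 30 * 24 * 3600,
        "image.*": 365 * 24 * 3600,
    }


    def lookup_ttl(metric_name, mappings=TTL_MAPPINGS,
                   default=DEFAULT_TTL_SECONDS):
        """Return the TTL for metric_name, or the default if nothing matches."""
        # In this sketch an exact name is checked before any wildcard.
        if metric_name in mappings:
            return mappings[metric_name]
        for pattern, ttl in mappings.items():
            if fnmatchcase(metric_name, pattern):
                return ttl
        return default

With these sample mappings, ``lookup_ttl("image.size")`` returns the one-year
value and ``lookup_ttl("mem.free_mb")`` falls back to the default.
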

Proposed change
===============

1. Monasca API:
   Add a new API for managing the mapping of metrics to TTL values.
   See the `REST API impact`_ section below.

   Add storage for the mapping in the MySQL database. This allows
   all instances of Monasca API to share the configuration dynamically.
   *TBD* - Create a schema for storing the metric:TTL dictionary.

   A policy precedence needs to be defined. More than one retention policy
   may apply to a given metric, so a clear precedence is needed to determine
   which TTL value to apply.
   *TBD* - a few concrete examples.

2. Monasca Persister:
   The Persister reads the default retention policy setting from the service
   configuration file, in the influxDbConfiguration and
   cassandraDbConfiguration sections.

   ::

     # Retention policy may be left blank to indicate default policy.
     # Unit is days
     retentionPolicy: 7

   It may be convenient to allow specifying a unit with the policy value, for
   example "7d" for 7 days or "3m" for 3 months.

   The Persister will retrieve the TTL property from the incoming metric
   message. If it is not set, the TTL value from the default retention policy
   will be used instead.

   With the addition of this metrics retention feature, it is expected that
   the default retentionPolicy value would be set to a low value, and that
   metrics that are to be kept longer would be called out specifically through
   the Retention API with appropriate values set.

   The TTL is set in the parameterized database query when persisting the
   metrics into the time series database, for both Cassandra and InfluxDB.
   *TBD* - exact call structures for each TSDB.

   Note that this means each storage back end would need customized code
   in the Persister to support passing the TTL value. This may also
   be possible for Elasticsearch, though that is not part of this initial spec.

3. Monasca CLI (optional):
   A new CLI feature could be created to simplify getting the list of TTL
   mappings or posting an update to a TTL mapping. This would need Keystone
   authentication, and would use the existing 'monasca' CLI authentication.

4. Monasca UI (optional):
   A new feature could be added to the Monasca UI that would allow a Cloud
   Operator to view and edit the list of TTL mappings.
   Bonus points for allowing the UI to have sample metrics and simulate the
   mapping on the page.
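
The Persister behaviour described in item 2 could look roughly like the
following sketch. It is not the Persister's actual code: the ``ttl`` message
field, the function names, and the table and column names are all
hypothetical placeholders, since the call structures are still *TBD*. The
only part taken from Cassandra itself is the CQL ``USING TTL`` clause, which
takes a value in seconds. InfluxDB expresses retention differently (through
retention policies), so its write path would need its own handling.

.. code-block:: python

    SECONDS_PER_DAY = 24 * 3600


    def resolve_ttl_seconds(message, default_retention_days):
        """Use the TTL carried in the message, else the configured default."""
        ttl = message.get("ttl")  # "ttl" is a hypothetical field name
        if ttl is None:
            # retentionPolicy in the config file is expressed in days.
            return default_retention_days * SECONDS_PER_DAY
        return int(ttl)


    def build_cassandra_insert(table="measurements"):
        """Build a parameterized CQL INSERT with the TTL as a bind marker."""
        # Table and column names are placeholders, not the Monasca schema.
        return (
            "INSERT INTO {} (metric_id, time_stamp, value) "
            "VALUES (?, ?, ?) USING TTL ?".format(table)
        )

A message without a TTL and a default of 7 days resolves to 604800 seconds,
which would then be bound as the last parameter of the prepared statement.
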

Alternatives
------------

The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a TTL default setting and have a property to allow
specifying a TTL per metric. This would have also required a change to the
Monasca API to add an optional TTL to the metric POST listener.

While this would have been simpler to implement in the Monasca API, the
additional work to change all the services that originate metrics made this
alternative less appealing.

Another alternative would be to implement a new Monasca Retention API as
outlined, but without dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique match
string and the value the standardized TTL value. While the implementation
would be much simpler, it is felt that the additional power of matching on
dimensions would be beneficial.

Data model impact
-----------------

The Monasca API data model will need to be extended to store the metric to
TTL mappings (retention policies).
*TBD* - schema

REST API impact
---------------

A new metric retention API endpoint would be added to Monasca API.

URL: /v2.0/metrics-retention

Method: GET

A GET request will return the current list of metric retention policies.
Examples::

  Empty list (default retention used for all metrics)
  []

  Simple list
  [
    {
      "match": "cpu.user_perc",
      "dimensions": {"host": "node1"},
      "retentionPolicy": "7d"
    },
    {
      "match": "cpu.stolen_perc",
      "dimensions": {},
      "retentionPolicy": "7d"
    }
  ]

Method: PUT

The PUT method is used for all create/update/delete operations on the metric
retention policy list. Any list of retention policies PUT to the API will be
merged with the existing list. Single entries will also be supported.

JSON structure for PUT/GET to Retention API::

  {
    "match": "cpu.user_perc",
    "dimensions": {},
    "retentionPolicy": "7d"
  }

TBD: do we support adding a character for the time unit? Will it be confusing
to PUT "1d" and GET back "86400"?
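
To make the open question above concrete, one way of normalizing retention
values to seconds is sketched below. The function name and the unit table are
hypothetical. Treating a bare number as days follows the "Unit is days"
configuration comment, and reading "m" as a 30-day month is purely an
assumption, since the spec only gives "7d" and "3m" as examples.

.. code-block:: python

    import re

    # Seconds per unit. "m" as a 30-day month is an assumption.
    UNIT_SECONDS = {"d": 24 * 3600, "m": 30 * 24 * 3600}

    _RETENTION_RE = re.compile(r"(\d+)([dm]?)")


    def retention_to_seconds(value):
        """Convert 7, "7", "7d" (days) or "3m" (months) to seconds."""
        match = _RETENTION_RE.fullmatch(str(value).strip())
        if match is None:
            raise ValueError("invalid retention value: {!r}".format(value))
        amount, unit = match.groups()
        # A bare number is read as days.
        return int(amount) * UNIT_SECONDS[unit or "d"]

Under these assumptions ``retention_to_seconds("1d")`` gives 86400. If the API
stores only seconds, the original unit cannot be recovered for a later GET
unless the submitted string is stored as well.
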

Special case: to delete a retention policy, give a retentionPolicy value of
null (None) and it will be removed from the list.

::

  {
    "match": "cpu.user_time",
    "dimensions": {},
    "retentionPolicy": null
  }

Additionally, a list of retention policy items may be PUT, with the format
matching the response from GET. Each item in the list will be compared to
existing retention policies (match string and dimensions). If an exact match
is found, the retentionPolicy value will be replaced. Otherwise, the new item
is added to the list.
(This is intended to make bootstrapping or restoring from backup easier.)
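
The merge behaviour described for PUT can be sketched as follows. This is an
illustration only: ``policy_key`` and ``merge_policies`` are hypothetical
names, and the sketch works on plain Python lists rather than the MySQL
storage the spec calls for. A retentionPolicy of ``None`` is what a JSON
parser would produce for an empty (null) value.

.. code-block:: python

    def policy_key(item):
        """Identity of a policy: its match string plus its dimensions."""
        dimensions = item.get("dimensions", {})
        return (item["match"], tuple(sorted(dimensions.items())))


    def merge_policies(existing, incoming):
        """Return a new list with incoming items merged into existing ones."""
        if isinstance(incoming, dict):  # a single entry is also accepted
            incoming = [incoming]
        merged = {policy_key(item): dict(item) for item in existing}
        for item in incoming:
            key = policy_key(item)
            if item.get("retentionPolicy") is None:
                merged.pop(key, None)  # the special case: delete the policy
            else:
                merged[key] = dict(item)  # replace on exact match, else add
        return list(merged.values())

Keying on both the match string and the dimensions means that two policies
with the same match string but different dimensions are kept as separate
entries.
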

The communication from Monasca API to Persister would have the TTL value
added as a parameter.

NOTE: Care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.

Security impact
---------------

None. Security measures already in place for the Monasca API would remain.

Other end user impact
---------------------

None for most users, as access to the new metrics retention API is restricted
to Cloud Operators.
A Cloud Operator would have a new responsibility to configure retention for
the metrics.

A future discussion could be had about whether a tenant user should be granted
the ability to set their own retention policies, but generally the Cloud
Operator is responsible for ensuring there are sufficient resources to meet the
retention requirements.

Performance Impact
------------------

This feature has no direct impact on write throughput. However, it allows
the user to enable a shorter retention period for monitoring metrics, which
can potentially improve read performance for queries that involve
search, grouping and filtering, because fewer metrics remain in the table. It
also reduces the storage footprint.

Depending on how complex the metric retention match string gets, there could be
some performance impact. *TBD*

Other deployer impact
---------------------

No change in deployment of the services.
The service could be deployed with simply a default TTL value in configuration.
If the operator desires, a complete list of TTL values could be loaded at
install time as part of the installation process, once the Monasca API is
running.

For planning, the user now has the option to specify a shorter retention period
for monitoring metrics, or even per metric or metric category. The disk size
should be calculated based on the retention policy accordingly.
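
As a back-of-the-envelope aid for that calculation, a steady-state estimate
can be sketched as below. The one billion metrics per day comes from the
problem description; the 50 bytes per stored metric is an invented placeholder
rather than a measured value, and the sketch ignores replication, compression
and index overhead.

.. code-block:: python

    def estimate_disk_bytes(metrics_per_day, retention_days, bytes_per_metric):
        """Steady-state size once data older than retention_days is purged."""
        return metrics_per_day * retention_days * bytes_per_metric


    # 10**9 metrics/day is from the problem description; 50 bytes per metric
    # is a made-up placeholder.
    week = estimate_disk_bytes(10**9, 7, 50)
    quarter = estimate_disk_bytes(10**9, 90, 50)
    print(week / 1e9, "GB for 7 days")      # 350.0 GB for 7 days
    print(quarter / 1e9, "GB for 90 days")  # 4500.0 GB for 90 days

With mixed retention periods, the same function can be applied per metric
category and the results summed.
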

Developer impact
----------------

Monasca agent plugin developers should be aware of the new TTL property
now available to them. It is an optional property that is only needed if a
TTL value different from the default retention policy in the Persister service
is required.

Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:

Other contributors:

Work Items
----------

* Add new metrics-retention API endpoint to Monasca API

* Add code to match all incoming metrics to the Monasca API with the appropriate
  retention policy (or default)

* Add TTL in seconds as a parameter to the request from Monasca API to
  Persister

* Create a CLI

  * PUT of updated retention policy(ies)
  * GET of the list

* Determine correct precedence for retention policies that overlap, and clearly
  document with examples.

Dependencies
============

This feature depends on retention policy support in the TSDB storage. Both
Cassandra and InfluxDB support specifying a retention policy.

Testing
=======

Unit testing

Unit tests in the Monasca API should be written for the scenarios of defining
a TTL for each metric.

* Metric received, no matching retention policy found, default policy used
* Metric received, one exact matching metric retention policy found, matching
  policy parameter passed to Persister call
* Metric received, more than one matching policy, correct precedence determined
  and appropriate policy parameter passed to Persister call

Monasca Persister will also need unit tests to verify that the passed-in value
is passed on to the TSDB retention method call, and to handle the case of a
missing TTL parameter. We may decide that the TTL parameter is optional, in
which case a global default TTL value should be used.
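
The three Monasca API scenarios listed above might be exercised along these
lines. Everything here is hypothetical: ``select_policy`` is a stand-in for
the real matching code, match strings are treated as shell-style wildcards,
and the precedence shown (an exact name beats a wildcard, then the policy with
the most matching dimensions wins) is just one possible rule, since the spec
leaves precedence to be decided.

.. code-block:: python

    import unittest
    from fnmatch import fnmatchcase


    def select_policy(name, dimensions, policies, default="7d"):
        """Pick a retentionPolicy using an assumed precedence rule."""
        candidates = []
        for policy in policies:
            wanted = policy.get("dimensions", {})
            if not fnmatchcase(name, policy["match"]):
                continue
            if any(dimensions.get(k) != v for k, v in wanted.items()):
                continue
            exact = policy["match"] == name
            # Assumed ranking: exact name first, then the most dimensions.
            candidates.append((exact, len(wanted), policy["retentionPolicy"]))
        if not candidates:
            return default
        return max(candidates, key=lambda c: (c[0], c[1]))[2]


    class SelectPolicyTest(unittest.TestCase):
        POLICIES = [
            {"match": "cpu.*", "dimensions": {}, "retentionPolicy": "14d"},
            {"match": "cpu.user_perc", "dimensions": {},
             "retentionPolicy": "30d"},
            {"match": "cpu.user_perc", "dimensions": {"host": "node1"},
             "retentionPolicy": "90d"},
            {"match": "net.in_bytes", "dimensions": {},
             "retentionPolicy": "60d"},
        ]

        def test_no_match_uses_default(self):
            got = select_policy("mem.free_mb", {}, self.POLICIES)
            self.assertEqual(got, "7d")

        def test_single_exact_match(self):
            got = select_policy("net.in_bytes", {}, self.POLICIES)
            self.assertEqual(got, "60d")

        def test_overlap_uses_precedence(self):
            dims = {"host": "node1"}
            got = select_policy("cpu.user_perc", dims, self.POLICIES)
            self.assertEqual(got, "90d")

Saved as a module, the cases can be run with ``python -m unittest``. A real
test would additionally assert that the selected value is passed on in the
call to the Persister.
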

Functional testing

Functional testing is more involved, as one way to test would be to trigger some
metrics, have them stored in the TSDB, then wait for the TTL to expire and
verify that the metrics are removed correctly. More thought and definition are
needed to decide what is appropriate and possible (i.e. to avoid retesting
features of the TSDB).

Documentation Impact
====================

Operators who use Monasca would need documentation to describe the format of
the new API and recommended usage. This may include guidelines on how to set
a low default and to choose which metrics should be kept longer. The default
TTL value as set in a config file should also be documented.

References
==========

* Links

  * Stein PTG discussion - https://etherpad.openstack.org/p/monasca-ptg-stein

* Glossary

  * TTL - short for Time to Live, a setting in the TSDB that defines when an
    item (in this case a metric) will be cleaned out.

  * TSDB - Time Series Database, such as InfluxDB or Cassandra.

History
=======

Optional section intended to be used each time the spec is updated, to describe
a new design, API or database schema update. Useful to let the reader
understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced