Metrics retention policy enhancement, python persister perf

Support a differentiated metrics retention policy based on metric
type.  Also outline alternatives.

This commit also includes a spec for Python Persister performance
metric collection, which is still a work in progress.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
James Gu 2018-02-22 17:49:49 -08:00 committed by Joseph Davis
parent 4aa92c0caa
commit b2ae09065c
2 changed files with 573 additions and 0 deletions


@ -0,0 +1,374 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================================
Metric Retention Policy
================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
A metric retention policy must be in place to keep the time series database
from filling up its disks. The retention period should be adjustable for
different types of metrics, e.g., monitoring vs. metering or aggregate vs.
raw meters.
Problem description
===================
In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up within months, if not
weeks, if old metric data is not purged regularly. The retention requirement
can differ based on the type of the metrics and the usage model. For example,
a customer may want to preserve metering metrics for months or years, while
having no interest in monitoring metrics more than a week old. Some customers'
billing systems may pull the metering data on a daily basis, which could
eliminate the need for longer retention of metering metrics. Monasca needs to
support a metric retention policy that can be tailored per metric or metric
type.
Use Cases
---------
- Use case 1
Installer sets a default TTL value in configuration. At installation time,
a default TTL (time to live) value is specified in the configuration for
monasca-api and is used as the default retention policy.
The default retention policy is applied if a metric doesn't match another
retention policy. This default retention is generally a shorter period of
time and may be used for the common monitoring metrics.
- Use case 2
Installer loads a set of metric to TTL mappings (retention policies), which
is stored in the Monasca API data store (mysql database). These mappings may
be provided in a JSON structure. This is intended to be useful for bootstrap
or restore from backup.
- Use case 3
Monasca API receives new metric (regardless of source). Metric is mapped to
a dictionary to determine TTL (or default value used if no match). TTL is
passed with metric value on to the Persister for storage in TSDB.
Note that the use cases for monasca-agent posting metrics are unchanged; only
the processing at the Monasca API and the API-to-Persister message change.
The Monasca Persister then stores the metric and specifies the TTL to the
TSDB configured (i.e. InfluxDB or Cassandra).
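The API-to-Persister message described above could carry the TTL alongside the metric. A minimal sketch, assuming an illustrative envelope layout (the real Monasca message format carries additional fields such as meta and creation time):

```python
import json

def build_persister_message(metric, ttl_seconds):
    """Attach a TTL (in seconds) to the metric envelope sent on to the
    Persister. The field names here are illustrative, not the actual
    Monasca wire format."""
    message = dict(metric)
    message["ttl"] = ttl_seconds
    return json.dumps(message)

# A one-week TTL attached to a sample metric:
msg = build_persister_message(
    {"name": "cpu.user_perc", "dimensions": {"host": "node1"}, "value": 42.0},
    7 * 24 * 3600)
```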
- Use case 4
Operator uses Monasca CLI to specify (or modify) a TTL value for a metric
match string. Match string could be specific, such as "cpu.user_perc" or a
wildcard string, such as "image.*". CLI posts request to Monasca TTL API,
where it is validated then stored in database.
- Use case 5
Operator uses Monasca CLI to GET the dictionary of metric:TTL mappings.
This can be used to export the list for backup or verification.
- Use case 6 (optional)
Operator uses Monasca UI to accomplish use case 4 or 5
Proposed change
===============
1. Monasca API
A. Add a new API for managing the mapping of metrics to TTL values.
TBD - API structure
B. Add storage for the mapping in the MySQL database. This is to allow
all instances of Monasca API to share the configuration dynamically.
Create a schema for storing the metric:TTL dictionary.
C. A policy precedence needs to be defined. It is possible that more than
one retention policy may apply to a given meter, so a clear precedence
needs to be defined to determine which TTL value to apply.
TBD: examples
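While the precedence rules are still TBD, one possible scheme (an assumption for illustration, not the decided design) is: an exact name-plus-dimensions match beats an exact name match, which beats a wildcard match, falling back to the default. A sketch, with TTLs kept as plain seconds under a `ttl` key:

```python
import fnmatch

DEFAULT_TTL = 7 * 24 * 3600  # default retention in seconds (assumed value)

def resolve_ttl(metric_name, dimensions, policies, default_ttl=DEFAULT_TTL):
    """Pick the TTL for a metric using one possible precedence scheme:
    exact name + dimensions > exact name > wildcard match > default."""
    exact_dims = exact = wildcard = None
    for p in policies:
        if p["match"] == metric_name:
            if p.get("dimensions") and all(
                    dimensions.get(k) == v for k, v in p["dimensions"].items()):
                exact_dims = exact_dims or p
            elif not p.get("dimensions"):
                exact = exact or p
        elif fnmatch.fnmatch(metric_name, p["match"]):
            wildcard = wildcard or p
    chosen = exact_dims or exact or wildcard
    return chosen["ttl"] if chosen else default_ttl
```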
2. Monasca Persister
Persister reads the default retention policy setting from the service
configuration file in the influxDbConfiguration and cassandraDbConfiguration
section.
::

    # Retention policy may be left blank to indicate the default policy.
    # Unit is days.
    retentionPolicy: 7
It may be convenient to allow specifying a unit with the policy value. For
example "7d" for 7 days or "3m" for 3 months.
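Parsing such a unit suffix could look like the sketch below. The supported units are an assumption based only on the examples in this spec: "d" for days, "m" for months (taken here as 30-day months), plus "h" for hours, with a bare number read as days to match the existing `retentionPolicy` setting.

```python
import re

# Unit suffixes are an assumption for illustration; "m" is taken as a
# 30-day month, and a bare number defaults to days.
_UNIT_SECONDS = {"h": 3600, "d": 86400, "m": 30 * 86400}

def retention_to_seconds(value):
    """Convert a retention value like "7d", "12h", "3m", or "7" to seconds."""
    m = re.fullmatch(r"(\d+)([hdm]?)", str(value).strip())
    if not m:
        raise ValueError("bad retention value: %r" % (value,))
    count, unit = int(m.group(1)), m.group(2) or "d"
    return count * _UNIT_SECONDS[unit]
```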
The Persister will read the TTL property from the incoming metric message. If
it is not set, the TTL value from the default retention policy will be used
instead.
It is expected with the addition of this Metrics Retention feature that the
default retentionPolicy value would be set to a low value, and that metrics
that are to be kept longer would be called out specifically through the
Retention API and appropriate values set.
The TTL is set in the parameterized database query when persisting the metrics
into the time series database, including both Cassandra and InfluxDB.
TBD - exact call structures for each TSDB.
Note that this does mean that each storage back end would need to have code
customized in the persister to support passing the TTL value. This may also
be possible for ElasticSearch, though that is not part of this initial spec.
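As a rough sketch of what that per-backend customization involves (table, column, and database names below are placeholders, not the actual persister schema): Cassandra accepts a per-row TTL directly in the INSERT statement via ``USING TTL``, while InfluxDB 1.x attaches duration to a named retention policy on the database, so metrics with different TTLs would be written into different retention policies.

```python
def cassandra_insert_cql(table="measurements"):
    # Cassandra accepts a per-row TTL as a bind marker in the INSERT
    # statement; the table and column names here are placeholders.
    return ("INSERT INTO %s (metric_id, time_stamp, value) "
            "VALUES (?, ?, ?) USING TTL ?" % table)

def influxdb_retention_ddl(name, duration, database="mon"):
    # InfluxDB 1.x scopes retention to a named retention policy rather
    # than to individual points; database name is a placeholder.
    return ('CREATE RETENTION POLICY "%s" ON "%s" '
            "DURATION %s REPLICATION 1" % (name, database, duration))
```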
3. Monasca CLI (optional)
A new CLI feature could be created to simplify getting the list of TTL
mappings or posting an update to a TTL mapping. This would need Keystone
authentication, as does the existing 'monasca' CLI, and could be added to it.
TBD: whether the current monasca CLI could handle ingesting a json structure.
4. Monasca UI (optional)
A new feature could be added to the Monasca UI that would allow a Cloud
Operator to view and edit the list of TTL mappings.
Bonus points for allowing the UI to have sample metrics and simulate the
mapping on the page.
Alternatives
------------
The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a TTL default setting and have a property to allow
specifying a TTL per metric. This would have also required a change to the
Monasca API to add an optional TTL to the metric POST listener.
While this would have been simpler to implement in the Monasca API, the
additional work to change all the services that originate metrics made this
alternative not as appealing.
Another alternative would be to implement a new Monasca Retention API as
outlined, but not include dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique match
string and the value the standardized TTL value. While the implementation
would be much simpler, it is felt that the additional power of having match
dimensions would be beneficial.
Data model impact
-----------------
The Monasca API data model will need to be extended to store the metric to
TTL mappings (retention policies).
TBD - schema
REST API impact
---------------
A new metric retention API endpoint would be added to Monasca API.
URL: /v2.0/metrics-retention
Method: GET
A GET request will return the current list of metric retention policies.
Examples::

    Empty list (default retention used for all metrics):

    []

    Simple list:

    [
        {
            "match": "cpu.user_perc",
            "dimensions": {"host": "node1"},
            "retentionPolicy": "7d"
        },
        {
            "match": "cpu.stolen_perc",
            "dimensions": {},
            "retentionPolicy": "7d"
        }
    ]
Method: PUT
The PUT method is used for all create/update/delete methods on the metric
retention policy list. Any list of metrics PUT to the API will be merged
with the existing list. Single entries will also be supported.
JSON structure for PUT/GET to the Retention API::

    {
        "match": "cpu.user_perc",
        "dimensions": {},
        "retentionPolicy": "7d"
    }
TBD: do we support adding a character for time unit? Will it be confusing to
PUT "1d" and GET back "86400"?
Special case: to delete a retention policy, give a ``retentionPolicy`` value
of null and the entry will be removed from the list.

::

    {
        "match": "cpu.user_time",
        "dimensions": {},
        "retentionPolicy": null
    }
Additionally, a list of retention policy items may be PUT, with the format
matching the response from GET. Each item in the list will be compared to
existing metric policies (match string and dimensions). If an exact match is
found, the retentionPolicy value will be replaced. Otherwise, the new item is
added to the list.
(This is intended to make bootstrap or restore from backup easier)
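The merge semantics above can be sketched as follows, treating the match string plus dimensions as the identity key and a null ``retentionPolicy`` as a delete:

```python
def merge_policies(existing, updates):
    """Merge PUT items into the stored policy list: an item whose match
    string and dimensions equal an existing entry replaces its
    retentionPolicy; a retentionPolicy of None deletes the entry;
    anything else is appended. A sketch of the semantics, not API code."""
    merged = list(existing)
    for item in updates:
        key = (item["match"],
               tuple(sorted(item.get("dimensions", {}).items())))
        for i, cur in enumerate(merged):
            cur_key = (cur["match"],
                       tuple(sorted(cur.get("dimensions", {}).items())))
            if cur_key == key:
                if item["retentionPolicy"] is None:
                    del merged[i]       # explicit delete
                else:
                    merged[i] = item    # replace existing policy
                break
        else:
            if item["retentionPolicy"] is not None:
                merged.append(item)     # new policy
    return merged
```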
The communication from Monasca API to Persister would have the TTL value
added as a parameter.
NOTE: Care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.
Security impact
---------------
None. Security measures already in place for the Monasca API would remain.
Other end user impact
---------------------
None for most users, as access is restricted to Cloud Operators.
A Cloud Operator would have a new responsibility to configure retention for
the metrics.
A future discussion could be had about whether a tenant user should be granted
the ability to set their own retention policies, but generally the Cloud
Operator is responsible for ensuring there are sufficient resources to meet the
retention requirements.
Performance Impact
------------------
This feature has no direct impact on write throughput. However, it allows the
user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve searching,
grouping, and filtering when there are fewer metrics in the table. It also
reduces the storage footprint.
Depending on how complex the metric retention match string gets, there could
be some performance impact on ingestion. TBD
Other deployer impact
---------------------
No change in deployment of the services.
The service could be deployed with simply a default TTL value in configuration.
If the operator desires, at install time a complete list of TTL values could
be loaded as part of the installation process once the Monasca API is running.
For planning, the user now has the option to specify a shorter retention period
for monitoring metrics or even per metric or metric category. The disk size
should be calculated based upon the retention policy accordingly.
Developer impact
----------------
Monasca agent plugin developers should be aware of the new TTL property
now available to them. It is an optional property that is only needed if a
different TTL value than the default retention policy in the Persister service
is needed.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
Other contributors:
Work Items
----------
* Add new metrics-retention API endpoint to Monasca API
* Add code in the Monasca API to match each incoming metric with the
appropriate retention policy (or the default)
* Add TTL in seconds as a parameter to the request from Monasca API to
Persister
* Create a CLI
* PUT of updated retention policy(ies)
* GET of the list
* Determine correct precedence for retention policies that overlap, and clearly
document with examples.
Dependencies
============
Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.
Testing
=======
TBD
Documentation Impact
====================
Operators who use Monasca would need documentation to describe the format of
the new API and recommended usage. This may include guidelines on how to set
a low default and to choose which metrics should be kept longer. The default
TTL value as set in a config file should also be documented.
References
==========
* Links
* Stein PTG discussion - https://etherpad.openstack.org/p/monasca-ptg-stein
* Glossary
* TTL - short for Time to Live, a setting in TSDB that defines when an item
(in this case a metric) will be cleaned out.
* TSDB - Time Series Database, such as InfluxDB or Cassandra.
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced


@ -0,0 +1,199 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
This defines the list of measurements for metric upsert processing time and
throughput in the Python Persister and provides a REST API to retrieve those
measurements.
Problem description
===================
The Java Persister, built on top of the DropWizard framework, provides a list
of internal performance-related metrics, e.g., the total number of metric
messages processed since the last service start-up and the average, min, and
max metric processing times. The Python Persister, on the other hand, lacks
such instrumentation. This presents a challenge to the operator who wants to
monitor, triage, and tune Persister performance, and to the Persister
performance testing tool that was introduced in the Queens release. The
Cassandra Python Persister plugin depends on this feature for performance
tuning.
Use Cases
---------
- Use case 1: The developer instruments the defined performance metrics.
There are two approaches to the internal performance metrics. The first
approach is in-memory metering similar to the Java implementation: data
collection starts when the Persister service starts up and is not persisted
across service restarts. The second approach is to treat such measurements
exactly the same as the "normal" metrics Monasca collects. The advantage is
that such metrics will be persisted, and REST APIs are already available to
retrieve them.
The list of Persister metrics includes:
1. Total number of metric upsert requests received and completed on a given
Persister service instance in a given period of time
2. Total number of metric upsert requests received and completed on a
process or thread in a given period of time (P2)
3. The average, min, and max metric request processing times in a given
period of time for a given Persister service instance and process/thread.
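The first (in-memory) approach could look roughly like this minimal sketch, assuming a simple aggregator that tracks the measurements listed above rather than any particular framework:

```python
import threading

class UpsertMeter:
    """Tracks count, average, min, and max upsert processing time since
    service start (or the last reset). In-memory only: stats are lost on
    restart, matching the first approach described above."""

    def __init__(self):
        self._lock = threading.Lock()
        self.reset()

    def reset(self):
        with self._lock:
            self.count = 0
            self.total = 0.0
            self.min = None
            self.max = None

    def record(self, seconds):
        # Called once per completed upsert with its processing time.
        with self._lock:
            self.count += 1
            self.total += seconds
            self.min = seconds if self.min is None else min(self.min, seconds)
            self.max = seconds if self.max is None else max(self.max, seconds)

    def snapshot(self):
        # What an admin endpoint would report for this instance.
        with self._lock:
            avg = self.total / self.count if self.count else 0.0
            return {"count": self.count, "avg": avg,
                    "min": self.min, "max": self.max}
```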
- Use case 2: Retrieve Persister performance metrics through the REST API.
The performance metrics can be retrieved using the list metrics API in the
Monasca API service.
Proposed change
===============
1. Monasca Persister
- The Python Persister integrates with monasca-statsd to send count and
timer metrics
- The Persister configuration adds properties for statsd
2. The Persister performance benchmark tool adds support to retrieve the
metrics from the Monasca REST API in addition to the DropWizard admin API.
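statsd clients emit a simple UDP text protocol (``name:value|c`` for counters, ``name:value|ms`` for timers), which monasca-statsd builds on. A stdlib-only sketch of emitting the counter and timer around an upsert; the metric names, host, and port are placeholders, not agreed names:

```python
import socket
import time

class MiniStatsd:
    """Tiny statsd-style emitter over UDP using the plain text protocol.
    Illustrative only; the real integration would use monasca-statsd."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format(name, value, type_code):
        # "name:value|c" for counters, "name:value|ms" for timers.
        return "%s:%s|%s" % (name, value, type_code)

    def increment(self, name, value=1):
        self._sock.sendto(self.format(name, value, "c").encode(), self._addr)

    def timing(self, name, ms):
        self._sock.sendto(self.format(name, ms, "ms").encode(), self._addr)

def timed_upsert(client, do_upsert, batch):
    # Wrap one upsert call with a timer and a persisted-count counter;
    # metric names are hypothetical.
    start = time.time()
    do_upsert(batch)
    client.timing("monasca.persister.upsert_time_ms",
                  int((time.time() - start) * 1000))
    client.increment("monasca.persister.metrics_persisted", len(batch))
```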
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
TBD. The statsd calls to update counters and timers are expected to have a
small performance impact.
Other deployer impact
---------------------
No change in deployment of the services.
Developer impact
----------------
None.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
jgu
Other contributors:
Work Items
----------
TBD
Dependencies
============
None
Testing
=======
TBD
Documentation Impact
====================
TBD
References
==========
TBD
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced