Metrics retention policy enhancement

Support a differentiated metrics retention policy based on metric
type.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
Story: 2001576
James Gu 2018-02-22 17:49:49 -08:00 committed by Joseph Davis
parent 4aa92c0caa
commit d3d4f84a0b
2 changed files with 488 additions and 0 deletions


@@ -0,0 +1,289 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================================
Metric Retention Policy
================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
A metric retention policy must be in place to keep the disk from filling up.
The retention period should be adjustable for different types of metrics,
e.g., monitoring vs. metering, or aggregate vs. raw meters.
Problem description
===================
In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up within months, if not
weeks, if old metric data is not purged regularly. The retention requirement
can differ based on the type of the metrics and the usage model. For example,
a customer may want to preserve metering metrics for months or years, while
having no interest in monitoring metrics that are more than a week old. Some
customers' billing systems may pull the metering data on a daily basis, which
could eliminate the need for longer retention of metering metrics. Monasca
needs to support a metric retention policy that can be tailored per metric or
metric type.
Use Cases
---------
- Use case 1
The operator configures the default metric retention in the persister
configuration. The default retention policy is applied if a metric does not
specify its own retention policy. This default retention is generally a
shorter period of time and is targeted at monitoring metrics.
- Use case 2
The operator configures the retention policy for the roll-up metrics in
Monasca Transform. Roll-up metrics generally require a longer retention
period.
- Use case 3
The operator configures the retention policy for Ceilometer metrics in the
pipeline and mapping configuration file. Metering metrics generally require a
longer retention period.
- Use case 4
The metric agent plugin sets the retention policy when it generates a new
metric. This is mostly a means to override the default retention policy for
monitoring metrics.
Proposed change
===============
**Posting to get preliminary feedback on the scope of this spec.**
1. Monasca API
Add an optional metric property "TTL" to the create metrics API. TTL is the
number of seconds until the metric expires. If set, the TTL property will be
included when posting a new metric message to Kafka.
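For illustration, a metric message on the Kafka queue could then carry the
new property alongside the existing fields. The envelope below follows the
general shape of today's Monasca metric messages; the placement of the "TTL"
field is an open design question and all values are made up.

::

  {
    "metric": {
      "name": "cpu.idle_perc",
      "dimensions": {"hostname": "compute-001"},
      "timestamp": 1519344589000,
      "value": 97.5,
      "TTL": 604800
    },
    "meta": {"tenantId": "abc123", "region": "RegionOne"},
    "creation_time": 1519344590
  }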
2. Monasca Persister
The persister reads the default retention policy setting from the service
configuration file, in the influxDbConfiguration and cassandraDbConfiguration
sections.
::

  # Retention policy may be left blank to indicate default policy.
  retentionPolicy: 7
It may make more sense to move this property to the metricConfiguration
section and to use seconds instead of days as the unit.
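If that change were made, the setting might look like the following sketch
(the defaultTTL key name is illustrative and not settled by this spec):

::

  metricConfiguration:
    # Default TTL in seconds, applied when a metric message carries no TTL.
    defaultTTL: 604800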
The persister will read the TTL property from the incoming metric message. If
it is not set, the TTL value from the default retention policy will be used
instead.
The TTL is set in the parameterized database query when persisting the
metrics into the time series database, for both Cassandra and InfluxDB.
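The two back ends express retention differently: Cassandra accepts a per-row
TTL directly in the insert statement, while InfluxDB attaches named retention
policies to the database. A rough sketch of both, with illustrative table,
database and policy names:

::

  -- Cassandra: TTL bound as a parameter of the prepared insert statement
  INSERT INTO measurements (metric_id, time_stamp, value)
  VALUES (?, ?, ?) USING TTL ?;

  -- InfluxDB: define a named retention policy on the database
  CREATE RETENTION POLICY "one_week" ON "mon" DURATION 7d REPLICATION 1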
3. Monasca Ceilometer (aka Ceilosca)
Add a TTL property in pipeline-api.yaml
::
  - name: image_source
    interval: 30
    # expires after 90 days
    TTL: 7776000
    meters:
      - "image"
      - "image.size"
      - "image.update"
      - "image.upload"
      - "image.delete"
    sinks:
      - meter_sink
The Monasca-ceilometer implementation will parse the new property and set the
TTL when posting new metric messages.
4. Monasca Transform
Add a TTL property in transform_specs.json
::
{"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming","usage":"fetch_quantity","setters":["rollup_quantity","set_aggregated_metric_name","set_aggregated_period"],"insert":["prepare_data","insert_data_pre_hourly"]},"aggregated_metric_name":"vm.mem.total_mb_agg","aggregation_period":"hourly", TTL:"7776000", "aggregation_group_by_list": ["host", "metric_id", "tenant_id", "resource_uuid"],"usage_fetch_operation": "avg","filter_by_list": [],"setter_rollup_group_by_list":["tenant_id"],"setter_rollup_operation": "sum","dimension_list":["aggregation_period","host","project_id"],"pre_hourly_operation":"avg","pre_hourly_group_by_list":["default"]},"metric_group":"vm_mem_total_mb_project","metric_id":"vm_mem_total_mb_project"}
The Monasca-transform implementation will parse the new property and set the
TTL when posting new rolled-up metric messages.
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
* Specification change for the create metrics API
* Create metrics
* Method type: POST
* Normal HTTP response code(s): no change
* Expected error HTTP response code(s): no change
* URL: /v2.0/metrics
* Parameters: no change
* Request body: Consists of a single metric object or an array of metric
objects. A metric has the following properties:
* name (string(255), required) - The name of the metric.
* dimensions ({string(255): string(255)}, optional) - A dictionary
consisting of (key, value) pairs used to uniquely identify a metric.
* timestamp (string, required) - The timestamp in milliseconds from the
Epoch.
* value (float, required) - Value of the metric. Values with base-10
exponents greater than 126 or less than -130 are truncated.
* value_meta ({string(255): string}(2048), optional) - A dictionary
consisting of (key, value) pairs used to add information about the value.
Value_meta key/value combinations must be 2048 characters or less when
serialized as JSON, including the 7 characters of '{"":""}'.
* TTL (integer, optional) - The time to live in seconds. If omitted, the
default retention policy configured in the persister applies.
* Example: a typical request body supplied by the caller is shown below; the
response is unchanged.
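A hypothetical request body carrying the proposed TTL property (all values
are illustrative):

::

  POST /v2.0/metrics

  {
    "name": "cpu.idle_perc",
    "dimensions": {
      "hostname": "compute-001"
    },
    "timestamp": 1519344589000,
    "value": 97.5,
    "TTL": 604800
  }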
Security impact
---------------
None. Security measures already in place for the Monasca API would remain.
Other end user impact
---------------------
None
Performance Impact
------------------
This feature has no direct impact on write throughput. However, it allows the
user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve searching,
grouping and filtering, since fewer metrics are kept in the table. It also
reduces the storage footprint.
Other deployer impact
---------------------
No change in deployment of the services.
For capacity planning, the user now has the option to specify a shorter
retention period for monitoring metrics, or even per metric or metric
category. The disk size should be calculated based on the retention policy
accordingly, as sketched below.
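For example, assuming a hypothetical average of 50 bytes per stored data
point, the one billion points per day cited above translate to roughly:

::

  1e9 points/day * 50 bytes * 7 days  ~= 350 GB  (7-day monitoring TTL)
  1e9 points/day * 50 bytes * 90 days ~= 4.5 TB  (90-day metering TTL)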
Developer impact
----------------
Monasca agent plugin developers should be aware of the new TTL property now
available to them. It is an optional property that is only needed when a TTL
value different from the default retention policy in the persister service is
required.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
Other contributors:
Work Items
----------
Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.
Dependencies
============
Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.
Testing
=======
Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.
Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.
Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).
Documentation Impact
====================
Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.
References
==========
Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:
* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
EC2 docs)
* Anything else you feel it is worthwhile to refer to
History
=======
Optional section intended to be used each time the spec is updated to
describe a new design, API change, or database schema update. Useful to let
the reader understand what has happened over time.
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced


@@ -0,0 +1,199 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
This defines the list of measurements for the metric upsert processing time
and throughput in the Python Persister, and provides a REST API to retrieve
those measurements.
Problem description
===================
The Java Persister, built on top of the DropWizard framework, provides a list
of internal performance-related metrics, e.g., the total number of metric
messages that have been processed since the last service startup, the average,
min and max metric processing time, etc. The Python Persister, on the other
hand, lacks such instrumentation. This presents a challenge to the operator
who wants to monitor, triage, and tune Persister performance, and to the
Persister performance testing tool that was introduced in the Queens release.
The Cassandra Python Persister plugin depends on this feature for performance
tuning.
Use Cases
---------
- Use case 1: The developer instruments the defined performance metrics.
There are two approaches to the internal performance metrics. The first
approach is in-memory metering, similar to the Java implementation: data
collection starts when the Persister service starts up and is not persisted
across service restarts. The second approach is to treat such measurements
exactly the same as the "normal" metrics Monasca collects. The advantage is
that such metrics will be persisted, and REST APIs are already available to
retrieve them.
The list of Persister metrics includes:
1. The total number of metric upsert requests received and completed on a
given Persister service instance in a given period of time
2. The total number of metric upsert requests received and completed in a
process or thread in a given period of time (P2)
3. The average, min and max metric request processing time in a given period
of time for a given Persister service instance and process/thread.
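If the second approach is chosen, the emitted measurements might look like
the following sketch (metric names and dimensions are placeholders, not
settled by this spec):

::

  monasca.persister.metrics.received   dimensions: hostname, process
  monasca.persister.metrics.persisted  dimensions: hostname, process
  monasca.persister.flush.time_ms      dimensions: hostname, process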
- Use case 2: The operator retrieves persister performance metrics through
the REST API.
The performance metrics can be retrieved using the list metrics API in the
Monasca API service.
Proposed change
===============
1. Monasca Persister
- The Python Persister integrates with monasca-statsd to send count and timer
metrics (see the sketch after this list)
- The Persister conf adds properties for statsd
2. The Persister performance benchmark tool adds support to retrieve the
metrics from the Monasca REST API source in addition to the DropWizard admin
API.
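A minimal sketch of the statsd integration, assuming the monascastatsd
client library; the metric names, host/port values, and the persist_batch
and _write_to_tsdb helpers are illustrative only:

::

  import time

  import monascastatsd

  # Connection and client setup; host, port and client name are examples.
  conn = monascastatsd.Connection(host='127.0.0.1', port=8125)
  client = monascastatsd.Client(name='monasca.persister', connection=conn)

  counter = client.get_counter('metrics.persisted')
  timer = client.get_timer()

  def _write_to_tsdb(batch):
      pass  # stand-in for the Cassandra/InfluxDB write path

  def persist_batch(batch):
      # Time the upsert and count the number of metrics written.
      start = time.time()
      _write_to_tsdb(batch)
      timer.timing('flush.time_ms', (time.time() - start) * 1000.0)
      counter.increment(len(batch))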
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
TBD. The statsd calls to update the counter and timer are expected to have a
small performance impact.
Other deployer impact
---------------------
No change in deployment of the services.
Developer impact
----------------
None.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
jgu
Other contributors:
Work Items
----------
Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.
Dependencies
============
None
Testing
=======
Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.
Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.
Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).
Documentation Impact
====================
Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.
References
==========
Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:
* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
EC2 docs)
* Anything else you feel it is worthwhile to refer to
History
=======
Optional section intended to be used each time the spec is updated to
describe a new design, API change, or database schema update. Useful to let
the reader understand what has happened over time.
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced