Metrics retention policy enhancement, python persister perf

Support a differentiated metrics retention policy based on metric
type.  Also outline alternatives.

This commit also includes a spec for Python Persister performance
metric collection, which is still a work in progress.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
James Gu 2018-02-22 17:49:49 -08:00 committed by Joseph Davis
parent 4aa92c0caa
commit b2ae09065c
2 changed files with 573 additions and 0 deletions


@ -0,0 +1,374 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================================
Metric Retention Policy
================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
A metric retention policy must be in place to keep the time series database
from filling up its disks. The retention period should be adjustable for
different types of metrics, e.g., monitoring vs. metering or aggregate vs.
raw meters.
Problem description
===================
In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up within months, if not
weeks, if old metric data is not purged regularly. The retention requirement
can differ based on the type of the metrics and the usage model. For example,
a customer may want to preserve metering metrics for months or years, while
having no interest in monitoring metrics more than a week old. Some customers'
billing systems may pull the metering data on a daily basis, which could
eliminate the need for longer retention of metering metrics. Monasca needs to
support a metric retention policy that can be tailored per metric or metric
type.
Use Cases
---------
- Use case 1
Installer sets a default TTL value in configuration. At installation time,
a default TTL (time to live) value is specified in the configuration for
monasca-api and is used as the default retention policy.
The default retention policy is applied if a metric doesn't match another
retention policy. This default retention is generally a shorter period of
time and may be used for the common monitoring metrics.
- Use case 2
Installer loads a set of metric to TTL mappings (retention policies), which
is stored in the Monasca API data store (mysql database). These mappings may
be provided in a JSON structure. This is intended to be useful for bootstrap
or restore from backup.
- Use case 3
Monasca API receives new metric (regardless of source). Metric is mapped to
a dictionary to determine TTL (or default value used if no match). TTL is
passed with metric value on to the Persister for storage in TSDB.
Note that the use cases for monasca-agent posting metrics are unchanged; only
the processing at the Monasca API and the API-to-Persister message change.
The Monasca Persister then stores the metric and specifies the TTL to the
TSDB configured (i.e. InfluxDB or Cassandra).
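The API-to-Persister message described above could carry the TTL alongside the metric. A minimal sketch, assuming an illustrative envelope layout (the real Monasca message format carries additional fields such as meta and creation time):

```python
import json

def build_persister_message(metric, ttl_seconds):
    """Attach a TTL (in seconds) to the metric envelope sent on to the
    Persister. The field names here are illustrative, not the actual
    Monasca wire format."""
    message = dict(metric)
    message["ttl"] = ttl_seconds
    return json.dumps(message)

# A one-week TTL attached to a sample metric:
msg = build_persister_message(
    {"name": "cpu.user_perc", "dimensions": {"host": "node1"}, "value": 42.0},
    7 * 24 * 3600)
```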
- Use case 4
Operator uses Monasca CLI to specify (or modify) a TTL value for a metric
match string. Match string could be specific, such as "cpu.user_perc" or a
wildcard string, such as "image.*". CLI posts request to Monasca TTL API,
where it is validated then stored in database.
- Use case 5
Operator uses Monasca CLI to GET the dictionary of metric:TTL mappings.
This can be used to export the list for backup or verification.
- Use case 6 (optional)
Operator uses Monasca UI to accomplish use case 4 or 5
Proposed change
===============
1. Monasca API
A. Add a new API for managing the mapping of metrics to TTL values.
TBD - API structure
B. Add storage for the mapping in the MySQL database. This is to allow
all instances of Monasca API to share the configuration dynamically.
Create a schema for storing the metric:TTL dictionary.
C. A policy precedence needs to be defined. It is possible that more than
one retention policy may apply to a given meter, so a clear precedence
needs to be defined to determine which TTL value to apply.
TBD: examples
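While the precedence rules are still TBD, one possible scheme (an assumption for illustration, not the decided design) is: an exact name-plus-dimensions match beats an exact name match, which beats a wildcard match, falling back to the default. A sketch, with TTLs kept as plain seconds under a `ttl` key:

```python
import fnmatch

DEFAULT_TTL = 7 * 24 * 3600  # default retention in seconds (assumed value)

def resolve_ttl(metric_name, dimensions, policies, default_ttl=DEFAULT_TTL):
    """Pick the TTL for a metric using one possible precedence scheme:
    exact name + dimensions > exact name > wildcard match > default."""
    exact_dims = exact = wildcard = None
    for p in policies:
        if p["match"] == metric_name:
            if p.get("dimensions") and all(
                    dimensions.get(k) == v for k, v in p["dimensions"].items()):
                exact_dims = exact_dims or p
            elif not p.get("dimensions"):
                exact = exact or p
        elif fnmatch.fnmatch(metric_name, p["match"]):
            wildcard = wildcard or p
    chosen = exact_dims or exact or wildcard
    return chosen["ttl"] if chosen else default_ttl
```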
2. Monasca Persister
Persister reads the default retention policy setting from the service
configuration file in the influxDbConfiguration and cassandraDbConfiguration
section.
::

    # Retention policy may be left blank to indicate the default policy.
    # Unit is days.
    retentionPolicy: 7
It may be convenient to allow specifying a unit with the policy value. For
example "7d" for 7 days or "3m" for 3 months.
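Parsing such a unit suffix could look like the sketch below. The supported units are an assumption based only on the examples in this spec: "d" for days, "m" for months (taken here as 30-day months), plus "h" for hours, with a bare number read as days to match the existing `retentionPolicy` setting.

```python
import re

# Unit suffixes are an assumption for illustration; "m" is taken as a
# 30-day month, and a bare number defaults to days.
_UNIT_SECONDS = {"h": 3600, "d": 86400, "m": 30 * 86400}

def retention_to_seconds(value):
    """Convert a retention value like "7d", "12h", "3m", or "7" to seconds."""
    m = re.fullmatch(r"(\d+)([hdm]?)", str(value).strip())
    if not m:
        raise ValueError("bad retention value: %r" % (value,))
    count, unit = int(m.group(1)), m.group(2) or "d"
    return count * _UNIT_SECONDS[unit]
```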
The Persister will read the TTL property from the incoming metric message. If
it is not set, the TTL value from the default retention policy will be used
instead.
It is expected with the addition of this Metrics Retention feature that the
default retentionPolicy value would be set to a low value, and that metrics
that are to be kept longer would be called out specifically through the
Retention API and appropriate values set.
The TTL is set in the parameterized database query when persisting the metrics
into the time series database, including both Cassandra and InfluxDB.
TBD - exact call structures for each TSDB.
Note that this does mean that each storage back end would need to have code
customized in the persister to support passing the TTL value. This may also
be possible for ElasticSearch, though that is not part of this initial spec.
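As a rough sketch of what that per-backend customization involves (table, column, and database names below are placeholders, not the actual persister schema): Cassandra accepts a per-row TTL directly in the INSERT statement via ``USING TTL``, while InfluxDB 1.x attaches duration to a named retention policy on the database, so metrics with different TTLs would be written into different retention policies.

```python
def cassandra_insert_cql(table="measurements"):
    # Cassandra accepts a per-row TTL as a bind marker in the INSERT
    # statement; the table and column names here are placeholders.
    return ("INSERT INTO %s (metric_id, time_stamp, value) "
            "VALUES (?, ?, ?) USING TTL ?" % table)

def influxdb_retention_ddl(name, duration, database="mon"):
    # InfluxDB 1.x scopes retention to a named retention policy rather
    # than to individual points; database name is a placeholder.
    return ('CREATE RETENTION POLICY "%s" ON "%s" '
            "DURATION %s REPLICATION 1" % (name, database, duration))
```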
3. Monasca CLI (optional)
A new CLI feature could be created to simplify getting the list of TTL
mappings or posting an update to a TTL mapping. This would need Keystone
authentication, as does the existing 'monasca' CLI, and could be added to it.
TBD: whether the current monasca CLI could handle ingesting a json structure.
4. Monasca UI (optional)
A new feature could be added to the Monasca UI that would allow a Cloud
Operator to view and edit the list of TTL mappings.
Bonus points for allowing the UI to have sample metrics and simulate the
mapping on the page.
Alternatives
------------
The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a TTL default setting and have a property to allow
specifying a TTL per metric. This would have also required a change to the
Monasca API to add an optional TTL to the metric POST listener.
While this would have been simpler to implement in the Monasca API, the
additional work to change all the services that originate metrics made this
alternative not as appealing.
Another alternative would be to implement a new Monasca Retention API as
outlined, but not include dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique match
string and the value the standardized TTL value. While the implementation
would be much simpler, it is felt that the additional power of having match
dimensions would be beneficial.
Data model impact
-----------------
The Monasca API data model will need to be extended to store the metric to
TTL mappings (retention policies).
TBD - schema
REST API impact
---------------
A new metric retention API endpoint would be added to Monasca API.
URL: /v2.0/metrics-retention
Method: GET
A GET request will return the current list of metric retention policies.
Examples::

    Empty list (default retention used for all metrics):

    []

    Simple list:

    [
        {
            "match": "cpu.user_perc",
            "dimensions": {"host": "node1"},
            "retentionPolicy": "7d"
        },
        {
            "match": "cpu.stolen_perc",
            "dimensions": {},
            "retentionPolicy": "7d"
        }
    ]
Method: PUT
The PUT method is used for all create/update/delete methods on the metric
retention policy list. Any list of metrics PUT to the API will be merged
with the existing list. Single entries will also be supported.
JSON structure for PUT/GET to the Retention API::

    {
        "match": "cpu.user_perc",
        "dimensions": {},
        "retentionPolicy": "7d"
    }
TBD: do we support adding a character for time unit? Will it be confusing to
PUT "1d" and GET back "86400"?
Special case: to delete a retention policy, give a ``retentionPolicy`` value
of null and the entry will be removed from the list.

::

    {
        "match": "cpu.user_time",
        "dimensions": {},
        "retentionPolicy": null
    }
Additionally, a list of retention policy items may be PUT, with the format
matching the response from GET. Each item in the list will be compared to
existing metric policies (match string and dimensions). If an exact match is
found, the retentionPolicy value will be replaced. Otherwise, the new item is
added to the list.
(This is intended to make bootstrap or restore from backup easier)
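The merge semantics above can be sketched as follows, treating the match string plus dimensions as the identity key and a null ``retentionPolicy`` as a delete:

```python
def merge_policies(existing, updates):
    """Merge PUT items into the stored policy list: an item whose match
    string and dimensions equal an existing entry replaces its
    retentionPolicy; a retentionPolicy of None deletes the entry;
    anything else is appended. A sketch of the semantics, not API code."""
    merged = list(existing)
    for item in updates:
        key = (item["match"],
               tuple(sorted(item.get("dimensions", {}).items())))
        for i, cur in enumerate(merged):
            cur_key = (cur["match"],
                       tuple(sorted(cur.get("dimensions", {}).items())))
            if cur_key == key:
                if item["retentionPolicy"] is None:
                    del merged[i]       # explicit delete
                else:
                    merged[i] = item    # replace existing policy
                break
        else:
            if item["retentionPolicy"] is not None:
                merged.append(item)     # new policy
    return merged
```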
The communication from Monasca API to Persister would have the TTL value
added as a parameter.
NOTE: Care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.
Security impact
---------------
None. Security measures already in place for the Monasca API would remain.
Other end user impact
---------------------
None for most users, as access is restricted to Cloud Operators.
A Cloud Operator would have a new responsibility to configure retention for
the metrics.
A future discussion could be had about whether a tenant user should be granted
the ability to set their own retention policies, but generally the Cloud
Operator is responsible for ensuring there are sufficient resources to meet the
retention requirements.
Performance Impact
------------------
This feature has no direct impact on write throughput. However, it allows the
user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve searching,
grouping, and filtering when there are fewer metrics in the table. It also
reduces the storage footprint.
Depending on how complex the metric retention match string gets, there could
be some performance impact on ingestion. TBD
Other deployer impact
---------------------
No change in deployment of the services.
The service could be deployed with simply a default TTL value in configuration.
If the operator desires, at install time a complete list of TTL values could
be loaded as part of the installation process once the Monasca API is running.
For planning, the user now has the option to specify a shorter retention period
for monitoring metrics or even per metric or metric category. The disk size
should be calculated based upon the retention policy accordingly.
Developer impact
----------------
Monasca agent plugin developers should be aware of the new TTL property
now available to them. It is an optional property that is only needed if a
different TTL value than the default retention policy in the Persister service
is needed.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
Other contributors:
Work Items
----------
* Add new metrics-retention API endpoint to Monasca API
* Add code in the Monasca API to match each incoming metric with the
appropriate retention policy (or the default)
* Add TTL in seconds as a parameter to the request from Monasca API to
Persister
* Create a CLI
* PUT of updated retention policy(ies)
* GET of the list
* Determine correct precedence for retention policies that overlap, and clearly
document with examples.
Dependencies
============
Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.
Testing
=======
TBD
Documentation Impact
====================
Operators who use Monasca would need documentation to describe the format of
the new API and recommended usage. This may include guidelines on how to set
a low default and to choose which metrics should be kept longer. The default
TTL value as set in a config file should also be documented.
References
==========
* Links
* Stein PTG discussion - https://etherpad.openstack.org/p/monasca-ptg-stein
* Glossary
* TTL - short for Time to Live, a setting in TSDB that defines when an item
(in this case a metric) will be cleaned out.
* TSDB - Time Series Database, such as InfluxDB or Cassandra.
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced


@ -0,0 +1,199 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================
Story board: https://storyboard.openstack.org/#!/story/2001576
This defines the list of measurements for metric upsert processing time and
throughput in the Python Persister and provides a REST API to retrieve those
measurements.
Problem description
===================
The Java Persister, built on top of the DropWizard framework, provides a list
of internal performance-related metrics, e.g., the total number of metric
messages processed since the last service start-up and the average, min, and
max metric processing times. The Python Persister, on the other hand, lacks
such instrumentation. This presents a challenge to the operator who wants to
monitor, triage, and tune Persister performance, and to the Persister
performance testing tool that was introduced in the Queens release. The
Cassandra Python Persister plugin depends on this feature for performance
tuning.
Use Cases
---------
- Use case 1: The developer instruments the defined performance metrics.
There are two approaches to the internal performance metrics. The first
approach is in-memory metering similar to the Java implementation: data
collection starts when the Persister service starts up and is not persisted
across service restarts. The second approach is to treat such measurements
exactly the same as the "normal" metrics Monasca collects. The advantage is
that such metrics will be persisted, and REST APIs are already available to
retrieve them.
The list of Persister metrics includes:
1. Total number of metric upsert requests received and completed on a given
Persister service instance in a given period of time
2. Total number of metric upsert requests received and completed on a
process or thread in a given period of time (P2)
3. The average, min, and max metric request processing times in a given
period of time for a given Persister service instance and process/thread.
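The first (in-memory) approach could look roughly like this minimal sketch, assuming a simple aggregator that tracks the measurements listed above rather than any particular framework:

```python
import threading

class UpsertMeter:
    """Tracks count, average, min, and max upsert processing time since
    service start (or the last reset). In-memory only: stats are lost on
    restart, matching the first approach described above."""

    def __init__(self):
        self._lock = threading.Lock()
        self.reset()

    def reset(self):
        with self._lock:
            self.count = 0
            self.total = 0.0
            self.min = None
            self.max = None

    def record(self, seconds):
        # Called once per completed upsert with its processing time.
        with self._lock:
            self.count += 1
            self.total += seconds
            self.min = seconds if self.min is None else min(self.min, seconds)
            self.max = seconds if self.max is None else max(self.max, seconds)

    def snapshot(self):
        # What an admin endpoint would report for this instance.
        with self._lock:
            avg = self.total / self.count if self.count else 0.0
            return {"count": self.count, "avg": avg,
                    "min": self.min, "max": self.max}
```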
- Use case 2: Retrieve Persister performance metrics through the REST API.
The performance metrics can be retrieved using the list metrics API in the
Monasca API service.
Proposed change
===============
1. Monasca Persister
- The Python Persister integrates with monasca-statsd to send count and
timer metrics
- The Persister configuration adds properties for statsd
2. The Persister performance benchmark tool adds support to retrieve the
metrics from the Monasca REST API in addition to the DropWizard admin API.
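statsd clients emit a simple UDP text protocol (``name:value|c`` for counters, ``name:value|ms`` for timers), which monasca-statsd builds on. A stdlib-only sketch of emitting the counter and timer around an upsert; the metric names, host, and port are placeholders, not agreed names:

```python
import socket
import time

class MiniStatsd:
    """Tiny statsd-style emitter over UDP using the plain text protocol.
    Illustrative only; the real integration would use monasca-statsd."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format(name, value, type_code):
        # "name:value|c" for counters, "name:value|ms" for timers.
        return "%s:%s|%s" % (name, value, type_code)

    def increment(self, name, value=1):
        self._sock.sendto(self.format(name, value, "c").encode(), self._addr)

    def timing(self, name, ms):
        self._sock.sendto(self.format(name, ms, "ms").encode(), self._addr)

def timed_upsert(client, do_upsert, batch):
    # Wrap one upsert call with a timer and a persisted-count counter;
    # metric names are hypothetical.
    start = time.time()
    do_upsert(batch)
    client.timing("monasca.persister.upsert_time_ms",
                  int((time.time() - start) * 1000))
    client.increment("monasca.persister.metrics_persisted", len(batch))
```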
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
TBD. The statsd calls to update counters and timers are expected to have a
small performance impact.
Other deployer impact
---------------------
No change in deployment of the services.
Developer impact
----------------
None.
Implementation
==============
Assignee(s)
-----------
Contributors are welcome!
Primary assignee:
jgu
Other contributors:
Work Items
----------
TBD
Dependencies
============
None
Testing
=======
TBD
Documentation Impact
====================
TBD
References
==========
TBD
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced