.. This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode ================================================ Add pluggable metrics backend for Ironic and IPA ================================================ https://bugs.launchpad.net/ironic/+bug/1526219 This proposes the addition of metric data reporting features to Ironic, and Ironic Python Agent (IPA). Initially, this will include a statsd reference implementation, but will be sufficiently generic to permit the creation of alternative backends. Problem description =================== Software metrics are extremely useful to operators for recognizing and diagnosing problems in running software, and can be used to monitor the real time and historical performance of Ironic and IPA in a production environment. Metrics can be used to determine how quickly (or slowly) parts of the system are running, how often errors (such as API error responses or BMC failures) occur, or the performance impact of a given change. Currently, neither Ironic nor IPA report any application metrics. Proposed change =============== * Design a shared pluggable metric reporting system. * Implement a generic MetricsLogger which includes: * Gauges (generic numerical data). * Counters (increment/decrement a counter). * Timers (time something). * Decorators, and context managers for same. * Implement a StatsdMetricsLogger as the reference backend [1]. * Instrument Ironic to report metric data including: * Counting and timing of API requests. This may be accomplished by hooking into Pecan. * Counting and timing of RPCs. * Counting and timing of most worker functions in ConductorManager. * Counting and timing of important driver functions. * Count and time node state changes. By inspecting provision_updated_at during a state change, the time the node spent in that state can be calculated. * Instrument IPA to report metric data including, but not limited to: * Image download/write counts and times. * Deploy/cleaning counts and times. Example code follows (based on Python logging module naming conventions): .. code:: python METRICS = metrics.getLogger(__name__) class Foo(object): def func1(self): # Emit gauge metric with value 1 METRICS.send_gauge("one.fish", 1) # Increment counter metric by two METRICS.send_counter("two.fish", 2) # Decrement counter metric by one METRICS.send_counter("red.fish", -1) # Randomly sample the data (emit metric 10% of the time) METRICS.send_counter("blue.fish", 42, sample_rate=0.1) # Emit a timer metric with value of 125 (milliseconds) METRICS.send_timer("black.fish", 125) # Randomly sample the data (emit metric 1% of the time) METRICS.send_timer("blue.fish", 125, sample_rate=0.01) @METRICS.counter("func2.count") @METRICS.timer("func2.time", sample_rate=0.1) def func2(self): pass # Context managers for counting and timing code blocks def func3(self): with METRICS.counter("func3.thing_one.count", sample_rate=0.25): thing_one() with METRICS.timer("func3.thing_two.time"): thing_two() Metric names follow this convention (optional parts indicated by []): ``[global_prefix.][host_name.]prefix.metric_name`` If `--metrics-agent-prepend-host-reverse` is set, then ``host.example.com`` becomes ``com.example.host`` to assist with hierarchical data representation. For example, using the Statsd backend, and relevant config options, ``METRICS.send_timer("blue.fish", 125, sample_rate=0.25)`` is emitted to statsd as ``globalprefix.com.example.host.moduleprefix.blue.fish:1|ms@0.25``. Alternatives ------------ Alternatively, we could implement a Ceilometer backend. Although Ironic already reports some measurements (such as IPMI sensor data) to Ceilometer, the metrics that are proposed in this spec do not fit with the Ceilometer project mission, which is to "...collect measurements of the utilization of the physical and virtual resources comprising deployed clouds..." [2] Instead, this spec proposes that we instrument parts of the Ironic/IPA codebase itself to report metrics and statistics about how/when the code is run, and the performance of the code thereof. This data is not directly related to "physical and virtual resources comprising deployed clouds." Therefore, we do not propose the addition of a Ceilometer backend, nor do we propose that the existing Ceilometer measurements be converted to this system, as they represent fundamentally different types of data. Data model impact ----------------- None State Machine Impact -------------------- None. REST API impact --------------- To support agent drivers, a config field will be added to the response for the ``/drivers//vendor_passthru/lookup`` endpoint in the Ironic API. This field will contain the agent-related config options that an agent can use to configure itself to report metric data. For example: statsd host and statsd port. Client (CLI) impact ------------------- None. RPC API impact -------------- None. Driver API impact ----------------- None. Nova driver impact ------------------ None. Ramdisk impact -------------- N/A .. NOTE: This section was not present at the time this spec was approved. Security impact --------------- The statsd daemon [3] has no authentication, and consequently anyone who is able to send UDP datagrams to the daemon can send arbitrary metric data. However, the statsd daemon is typically configured to listen only on a local interface, which partially mitigates security concerns. Other end user impact --------------------- None. Scalability impact ------------------ Deployers must ensure that their statsd infrastructure is scaled correctly relative to the size of their deployment. However, even if the statsd daemon is overloaded, Ironic will not be negatively affected (statsd UDP datagrams are non-blocking, and will simply not be processed). Performance Impact ------------------ By default, metrics reporting will be disabled, reducing, but not totally eliminating, the performance impact for users who do not wish to collect metrics. At the very least, a conditional must be checked at each place where a metric could be reported. Furthermore, depending on exactly how and where the conditional checking occurs, arguments may be evaluated even if the metric data aren't actually sent. Reporting metrics via statsd affects performance minimally. The overhead of sending a single piece of metric data is very small--in particular, statsd metrics are sent via UDP (non-blocking) to a daemon [2] that aggregates the metrics before forwarding them to one of its supported backends. Should this backend become unresponsive or overloaded, then metric data will be lost, but without other performance effects. After the metric data are aggregated by a local statsd daemon, they are periodically flushed to one of statsd's configured backends, usually Graphite [4]. Other deployer impact --------------------- There are two different sets of configuration options to be added: These options will be set in the ironic-lib metrics library, and will be used by any ironic service implementing metrics: .. code:: [metrics] # Backend options are "statsd" and "noop" backend="noop" statsd_host="localhost" statsd_port=8125 # See proposed changes section for detailed description of how these are used prepend_host=false prepend_host_reverse=false global_prefix="" Additionally, the following options will be added to the ironic-conductor and used to configure the ironic-python-agent for metrics on lookup: .. code:: # Backend options are "statsd" and "noop" agent_backend="noop" agent_statsd_host="localhost" agent_statsd_port=8125 # See proposed changes section for detailed description of how these are used agent_prepend_host=false agent_prepend_host_reverse=false agent_prepend_uuid=false agent_global_prefix="" If the statsd metrics backend is enabled, then deployers must install and configure statsd, as well as any other metrics software that they wish to use (such as Graphite [3]). Additionally, if deployers wish to emit metrics from ironic-python-agent as well, the statsd backend must be accessible from networks that agents run on. Developer impact ---------------- None. Implementation ============== Assignee(s) ----------- Primary assignee: alineb Other contributors: JayF Work Items ---------- * Design/implement metric reporting into ironic-lib. * Implement statsd backend. * Instrument Ironic code to report metrics. * Instrument IPA code to report metrics. Dependencies ============ None. Testing ======= Additional care may be required to test the statsd network code. Upgrades and Backwards Compatibility ==================================== None. Documentation Impact ==================== Appropriate documentation must be written. References ========== For more on why metrics are useful to operators, and why the statsd project began: https://codeascraft.com/2011/02/15/measure-anything-measure-everything/ [1] https://github.com/etsy/statsd/blob/master/docs/metric_types.md [2] https://wiki.openstack.org/wiki/Ceilometer [3] https://github.com/etsy/statsd/ [4] https://graphite.readthedocs.org/en/latest/faq.html