On controller nodes, the increasing number of AFD filters puts too
much load on the Heka pipeline and can generate "idle packs" errors.
It was observed that a poolsize value of 200 solves the issue.
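For illustration, the pool size is a global hekad setting; the sketch below shows where it would be set (the actual file is generated by the Puppet manifests and may differ):

```toml
# Hypothetical hekad global configuration fragment.
[hekad]
maxprocs = 2
# The default poolsize is 100; raising it avoids "idle packs" errors
# when many AFD filters compete for message packs.
poolsize = 200
```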
Change-Id: I1d5f9fea352e16e15b37828bc525906a06fadd0e
This change adds a filter plugin that monitors the kernel log messages
for hard drive errors and reports the number of errors per second
as 'hdd_errors_rate'. The filter is configured for all nodes,
irrespective of their roles. An alarm is also added that triggers
a CRITICAL alert when the metric value is greater than 0.
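A hedged sketch of what such a filter declaration could look like in the Heka configuration — the section name, Lua file path, and message matcher below are assumptions, not the plugin's exact configuration:

```toml
# Hypothetical SandboxFilter configuration; names and paths are illustrative.
[hdd_errors_filter]
type = "SandboxFilter"
filename = "lua/filters/hdd_errors_counter.lua"
# Only inspect kernel log messages.
message_matcher = "Type == 'log' && Fields[programname] == 'kernel'"
# Emit the hdd_errors_rate metric every 10 seconds.
ticker_interval = 10
preserve_data = false
```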
DocImpact
Change-Id: I485f5692a3e5facf0f7ea019ccdbd70683a7dd4e
In some environments (especially those using slow hard drives), the
Elasticsearch backends may fail to ingest logs fast enough. As a result,
the log_collector service running on the controller nodes is blocked.
To alleviate this issue, this change increases the bulk size for nodes
that generate lots of logs:
- controllers which run OpenStack API services in addition to Pacemaker.
- all nodes when the environment's log level is set to debug.
In such cases, the flush_count parameter is increased to 100 (instead of
the default of 10).
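Assuming Heka's standard ElasticSearchOutput options, the change amounts to something like the following sketch (the section name and matcher are illustrative; the real output section is generated by Puppet):

```toml
# Hypothetical output fragment for a busy node.
[elasticsearch_output]
type = "ElasticSearchOutput"
message_matcher = "Type == 'log'"
# Send larger bulk requests so slow backends receive fewer,
# bigger writes (default flush_count is 10).
flush_count = 100
```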
Change-Id: Ifdfbcb8ff0292f695dee4deab45560f126bde242
Closes-Bug: #1617211
This removes duplication of code and limitations we had to deal with
because the collectd Puppet resources don't play well when they are
created at different times from several manifests.
Change-Id: I52fabb1fb5795a33f552168553a148b1520fc496
This change adds a collectd plugin that gets metrics from the Pacemaker
cluster:
- cluster metrics
- node metrics
- resource metrics
Most of the metrics are collected only from the node that is the
Designated Controller, except pacemaker_resource_local_active and
pacemaker_dc_local_active.
The plugin also supersedes the 'pacemaker_resource' plugin by providing
the exact same metrics and notifications for the other collectd plugins.
Finally the plugin is also installed on the standalone-rabbitmq and
standalone-database nodes if they are present.
Change-Id: I8b5b987704f69c6a60b13e8ea982f27924f488d1
This change adds the RabbitMQ -> log collector ordering constraint that
was dropped during the upgrade to MOS 9.
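Expressed with the crm shell, the constraint could look like the following sketch (the constraint and resource IDs are assumptions; the actual names come from the deployment manifests):

```
# Hypothetical crm configuration fragment.
# Start RabbitMQ before the log collector so startup messages are not lost.
order rabbitmq-before-log-collector inf: master_p_rabbitmq-server clone_log_collector
```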
Change-Id: I0cec5d5ba0b13f3c4cf06bbb007c7fe7dca5b66e
The patch uses the management API to retrieve metrics instead of
executing rabbitmqctl command.
A side effect is that per-queue metrics are no longer collected.
Change-Id: I5dab785321e369ec0e1a69a79e0700b276810925
Closes-bug: #1594337
This change also removes the previous hack that cleans up the collector
resources at the end of the deployment.
Change-Id: I1ca237181d30802035bf6a0526cdd41f83e39acd
Closes-Bug: #1593137
Without this parameter being set, Pacemaker doesn't clean up and restart
the resource if it fails too many times.
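For example, with the crm shell the resource meta attributes could be set as follows (the values are illustrative; the real ones are applied by the Puppet manifests):

```
# Hypothetical crm commands.
# failure-timeout lets Pacemaker forget past failures and restart the
# resource; migration-threshold bounds how many failures a node tolerates.
crm resource meta metric_collector set failure-timeout 120s
crm resource meta metric_collector set migration-threshold 3
```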
Change-Id: I185e317969aec389e883e575c120d3a902d677e7
Closes-Bug: #1593137
This change removes the migration-threshold and failure-timeout
parameters for the collector resources. Otherwise Pacemaker bans the
resource from the node if it fails too many times. The declaration of
the resources in Pacemaker also needs to happen after the services have
been installed.
Change-Id: Ia8d96ccce4a25e4a1919419cba9b415bd06c65d1
Closes-Bug: #1593137
MySQL collectd plugins are configured to use Grafana credentials to
monitor the database when no controller is deployed within the
environment.
This assumes that MySQL is deployed with the detach database plugin.
Change-Id: Ida8d9f1995f87ca16b35ba59c3debd32f4410f97
Fixes-bug: #1578133
This patch removes the default parameters for the InfluxDB/Elasticsearch
HTTP port and address. These parameters are always provided by callers,
and that is the way to go.
Change-Id: I5e346b71a7d639475f2fba92126f8d191f8cd5fd
A script is installed on all nodes to collect various information and perform
basic tests regarding LMA components.
All information gathered is stored locally in /var/lma_diagnostics.
From the Fuel master node, the contrib/tools/diagnostic.sh script launches the
diagnostic script on all nodes and downloads all data into /var/lma_diagnostics.
Fixes-bug: #1547084
Change-Id: I37e36df23bc98109b7a86db63e5243cc264d2f95
This patch avoids collecting logs and notifications when Elasticsearch
and InfluxDB are not (yet) deployed.
Collecting them is useless and leads to losing all logs and
notifications produced before the backends are deployed.
Change-Id: I30a39d65f7a732251def32ccfb8202c34d6408c5
Resource monitoring never happens without an 'interval' value (it
defaults to 0), so the resource (metric_collector) is never restarted if
it stops for some reason.
Change-Id: Ia8bc3d5115a7270038a0f9e7928fe5bca787b599
This allows supporting several deployment scenarios where the backends
are not deployed initially, for instance when using the 'virt' nodes to
deploy the LMA backends.
The patch factors the manifests by moving all the InfluxDB and
Elasticsearch configuration data into hiera.
DocImpact
Fixes-bug: #1570386
Change-Id: I8688bbd10d88bc8ef68b5d31e9edd62a764dc23d
Since HTTP metrics are now aggregated, the root cause of Heka's pipeline
deadlock is gone.
This change reduces the memory footprint of hekad by ~80MB.
Change-Id: Ie1eda9b02dd1f4b01154a4441cbec245712eb8d1
This change separates the processing of the logs/notifications and
metric/alerting into 2 dedicated hekad processes, these services are
named 'log_collector' and 'metric_collector'.
Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.
All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
Elasticsearch output (log_collector) uses full_action='block' and the
TCP output uses full_action='drop'.
All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.
The buffer size configurations are:
* metric_collector:
- influxdb-output buffer size is increased to 1GB.
- aggregator-output (tcp) buffer size is decreased to 256MB (vs 1GB).
- nagios outputs (x3) buffer sizes are decreased to 1MB.
* log_collector:
- elasticsearch-output buffer size is decreased to 256MB (vs 1GB).
- tcp-output buffer size is set to 256MB.
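Using Heka's output buffering options, one of these outputs could be sketched as follows (the section name and matcher are illustrative assumptions):

```toml
# Hypothetical fragment for the metric_collector's InfluxDB output.
[influxdb_output]
type = "HttpOutput"
message_matcher = "Type == 'metric'"

[influxdb_output.buffering]
# 1GB on-disk buffer; drop new messages when the buffer is full so the
# whole pipeline does not block on a slow backend.
max_buffer_size = 1073741824
full_action = "drop"
```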
Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748
Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
This patch aligns the Puppet manifests with the refactoring of
pacemaker resources.
Since MOS 9.0, the old cs_* resources need to be replaced with pcmk_*
resources from the new pacemaker module.
Change-Id: Ic2435618779cdec286f4032c993d62ea80e01ead
On controller nodes, the Heka poolsize must be increased to handle the
load generated by metrics derived from logs; otherwise a deadlock can
happen in the filter plugins and block Heka.
Fixes-bug: #1557388
Change-Id: I74362011d32d413f244c6cdb6e4625ed96759df0
This commit moves the Pacemaker/Corosync Puppet code from the
lma_collector module to the Fuel-specific base.pp manifest.
This involves the following changes:
* Fuel's "pacemaker_wrappers::service" define is now used in base.pp
to configure the LMA service resource to use the "pacemaker"
provider.
* To configure "pacemaker_wrappers::service" we need to know the Heka
user. So, to avoid hacks where we'd use private variables from the
lma_collector and heka modules to determine the Heka user, the
lma_collector and heka modules are changed to make the Heka user
configurable. For this, the "heka" class's "run_as_root" parameter is
removed in favor of a "user" parameter.
* In other manifests we use a resource collector to make sure that
the LMA service resource is not re-configured with the default
provider. This part is a bit hackish, but we haven't been able to
come up with a better way to address the issue.
Change-Id: I0ed0bddb245dc3a65b034e5caec14a65cfa908cb
Implements: blueprint lma-without-fuel
Do not use lma_collector::collectd::controller anymore, and use the
new atomic classes/defines instead.
Change-Id: Id708314180a91cab91d55a57371811c348735b3c
The patch adds a new manifest, executed on both the influxdb and
elasticsearch nodes, to configure collectd specifically for them; it
also moves the related configuration out of base.pp.
Implements: blueprint elasticsearch-clustering
Change-Id: I0e75446dd97e8c7108be87513a2b13e6909fcf44