94 Commits

Author SHA1 Message Date
Andreas Jaeger c929899400 Retire repository
Fuel repositories are all retired in openstack namespace, retire
remaining fuel repos in x namespace since they are unused now.

This change removes all content from the repository and adds the usual
README file to point out that the repository is retired following the
process from
https://docs.openstack.org/infra/manual/drivers.html#retiring-a-project

See also
http://lists.openstack.org/pipermail/openstack-discuss/2019-December/011675.html

A related change is: https://review.opendev.org/699752 .

Change-Id: I8aded54f1b9f3b79f3a4bf8f607d3695b92f528b
2019-12-18 19:39:39 +01:00
Swann Croiset 5b65f279ce Disable Heka "self-monitoring"
Change-Id: If548c132d5847b8223284a2bb0ad288c695d9ec3
Related-bug: #1643280
2017-01-03 16:33:36 +00:00
Swann Croiset 022b8b4b00 Increase Heka poolsize for the metric_collector
On controller nodes the increasing number of the AFD filters puts too
much load on the Heka pipeline and can generate "idle packs" errors.
It was observed that a poolsize value of 200 solves the issue.

Change-Id: I1d5f9fea352e16e15b37828bc525906a06fadd0e
2016-10-03 07:48:45 +00:00
Jenkins ea9338ab8a Merge "Add monitoring of HDD errors" 2016-09-07 13:51:36 +00:00
Ildar Svetlov 99e2863c14 Add monitoring of HDD errors
This change adds a filter plugin that monitors the kernel log messages
for hard drive errors and reports the number of errors per second
as 'hdd_errors_rate'. The filter is configured for all nodes,
irrespective of their roles. An alarm is also added that triggers
a CRITICAL alert when the metric value is greater than 0.

DocImpact

Change-Id: I485f5692a3e5facf0f7ea019ccdbd70683a7dd4e
2016-09-06 11:47:59 +03:00
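The behaviour described in commit 99e2863c14 can be illustrated with a minimal Python sketch. This is not the filter plugin added by the change: the regular expression, the function names and the 'OKAY' label are assumptions made for illustration. Only the 'hdd_errors_rate' metric name, the errors-per-second semantics and the "CRITICAL when greater than 0" rule come from the commit message.

```python
import re

# Hypothetical pattern: the commit message does not list the kernel
# messages that the real filter matches.
HDD_ERROR_RE = re.compile(r"I/O error|medium error", re.IGNORECASE)


def hdd_errors_rate(kernel_log_lines, interval_seconds):
    """Return the number of matching kernel log lines per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    errors = sum(1 for line in kernel_log_lines if HDD_ERROR_RE.search(line))
    return errors / interval_seconds


def alarm_state(rate):
    """CRITICAL when the rate is greater than 0 (per the commit message).

    The 'OKAY' label for the healthy state is an assumption.
    """
    return "CRITICAL" if rate > 0 else "OKAY"
```

For example, one matching line observed over a 10-second window yields a rate of 0.1 errors per second, which is enough to put the alarm in the CRITICAL state.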
Jenkins c02cb15a5b Merge "Increase the Elasticsearch bulk size when required" 2016-08-29 15:35:30 +00:00
Swann Croiset 83db24f549 Increase the Elasticsearch bulk size when required
In some environments (especially using slow HDD drives), the
Elasticsearch backends may fail to ingest logs fast enough. As a result
the log_collector service running on the controller nodes is blocked.

To alleviate this issue, this change increases the bulk size for nodes
that generate lots of logs:
- controllers which run OpenStack API services in addition to Pacemaker.
- all nodes when the environment's log level is set to debug.

In such cases, the flush_count parameter is increased to 100 (instead of
10 by default).

Change-Id: Ifdfbcb8ff0292f695dee4deab45560f126bde242
Closes-Bug: #1617211
2016-08-29 15:17:44 +00:00
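The selection rule from commit 83db24f549 can be sketched in Python as follows. The actual change lives in the plugin's Puppet manifests, so this is only an illustration of the logic. The function name and the role names ("controller", "primary-controller") are assumptions; the values 10 and 100 and the two conditions come from the commit message.

```python
DEFAULT_FLUSH_COUNT = 10     # default stated in the commit message
INCREASED_FLUSH_COUNT = 100  # value used for nodes that generate lots of logs

# Assumed role names; the commit message only says "controllers".
CONTROLLER_ROLES = ("controller", "primary-controller")


def elasticsearch_flush_count(node_roles, debug_enabled):
    """Pick the Elasticsearch bulk size (flush_count) for a node.

    The bulk size is increased on controllers and on every node when the
    environment's log level is set to debug.
    """
    is_controller = any(role in CONTROLLER_ROLES for role in node_roles)
    if is_controller or debug_enabled:
        return INCREASED_FLUSH_COUNT
    return DEFAULT_FLUSH_COUNT
```

With this rule, a compute node keeps the default of 10 unless debug logging is enabled for the environment, while a controller always gets 100.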
Simon Pasquier 38ec02fe46 Add a dedicated manifest to configure collectd
This removes duplication of code and limitations we had to deal with
because the collectd Puppet resources don't play well when they are
created at different times from several manifests.

Change-Id: I52fabb1fb5795a33f552168553a148b1520fc496
2016-08-26 15:59:04 +02:00
Swann Croiset 7f1f3bd59f Configure AFD alarms against 'mysql_check' metric
Change-Id: Ib15fea4ab041243e44a61c9d54d1f154b02d34af
2016-08-26 15:23:07 +02:00
Simon Pasquier 3a3ef6f2e3 Add Pacemaker collectd plugin
This change adds a collectd plugin that gets metrics from the Pacemaker
cluster:

  - cluster's metrics
  - node's metrics
  - resource's metrics

Most of the metrics are only collected from the node that is the
designated controller except pacemaker_resource_local_active and
pacemaker_dc_local_active.

The plugin also replaces the 'pacemaker_resource' plugin by providing the
exact same metrics and notifications for the other collectd plugins.

Finally the plugin is also installed on the standalone-rabbitmq and
standalone-database nodes if they are present.

Change-Id: I8b5b987704f69c6a60b13e8ea982f27924f488d1
2016-08-11 14:53:43 +02:00
Simon Pasquier f62fbb622b Add missing ordering constraint on MOS 9
This change adds the RabbitMQ -> log collector ordering constraint that
has been dropped during the upgrade to MOS 9.

Change-Id: I0cec5d5ba0b13f3c4cf06bbb007c7fe7dca5b66e
2016-08-04 10:09:45 +00:00
Jenkins cb45a312ef Merge "Install the OCF script beforehand" 2016-07-13 14:53:05 +00:00
Swann Croiset 1ae8829823 Use RabbitMQ management API
The patch uses the management API to retrieve metrics instead of
executing rabbitmqctl command.

A side effect is that per-queue metrics are no longer collected.

Change-Id: I5dab785321e369ec0e1a69a79e0700b276810925
Closes-bug: #1594337
2016-07-13 09:19:06 +02:00
Simon Pasquier 882584043f Install the OCF script beforehand
This change also removes the previous hack that cleans up the collector
resources at the end of the deployment.

Change-Id: I1ca237181d30802035bf6a0526cdd41f83e39acd
Closes-Bug: #1593137
2016-07-12 13:50:41 +02:00
Swann Croiset 9ae5de5e43 Set migration-threshold to 3 for collectors
Related-bug: #1593137

Change-Id: I7b3808afdfb43d0dcc74debb0333ae1d1942029f
2016-07-08 14:49:38 +00:00
Simon Pasquier 295e76f457 Restore failure-timeout parameter for collectors
Without this parameter being set, Pacemaker doesn't clean up and restart
the resource if it fails too many times.

Change-Id: I185e317969aec389e883e575c120d3a902d677e7
Closes-Bug: #1593137
2016-06-24 14:02:49 +00:00
Simon Pasquier 1e578546a9 Fix the Pacemaker resources on MOS 9
This change removes the migration-threshold and failure-timeout
parameters for the collector resources. Otherwise Pacemaker will forbid
the resource from the node if it fails too many times. And the
declaration of the resources in Pacemaker needs to happen after the
services have been installed too.

Change-Id: Ia8d96ccce4a25e4a1919419cba9b415bd06c65d1
Closes-Bug: #1593137
2016-06-21 08:22:26 +00:00
Swann Croiset 16fbc107dc Use Grafana credentials to monitor MySQL on dedicated environment
MySQL collectd plugins are configured to use Grafana credentials to
monitor the database when no controller is deployed within the
environment.
This assumes that MySQL is deployed with the detach database plugin.

Change-Id: Ida8d9f1995f87ca16b35ba59c3debd32f4410f97
Fixes-bug: #1578133
2016-06-13 11:34:45 +00:00
Swann Croiset 7a47c34424 Store node's roles into Hiera
Change-Id: Idbd2a353fb90131e77cb9f21820f61b4d6413e64
2016-06-13 11:34:30 +00:00
Swann Croiset c679b05be7 Install explicit package version of Heka
Change-Id: Ica6a6936cfd8f959758988f97af29d6489734484
Fixes-bug: #1590013
2016-06-08 07:51:28 +00:00
Swann Croiset b2bb3f3ea9 Remove some default lma_collector::params
This patch removes default parameters for InfluxDB/Elasticsearch HTTP port
and address. These parameters are always provided by callers, and that is the way
to go.

Change-Id: I5e346b71a7d639475f2fba92126f8d191f8cd5fd
2016-06-01 09:42:28 +02:00
Swann Croiset 5ead1d3e74 Clean base.pp
This removes useless variables

Change-Id: I5b34403d900a009b01f029679a0c9fc2f1fa024a
2016-05-31 09:55:49 +00:00
Jenkins fbf42932f3 Merge "Add a simple diagnostic script" 2016-05-27 12:24:31 +00:00
Swann Croiset 373640672d Add a simple diagnostic script
A script is installed on all nodes to collect various information and perform
basic tests regarding LMA components.
All information gathered is stored locally in /var/lma_diagnostics.

From the Fuel master node, the contrib/tools/diagnostic.sh script launches the
diagnostic script on all nodes and downloads all data into /var/lma_diagnostics.

Fixes-bug: #1547084
Change-Id: I37e36df23bc98109b7a86db63e5243cc264d2f95
2016-05-27 14:10:57 +02:00
Swann Croiset 9d7efe4161 Avoid collecting logs and notifications uselessly
This patch avoids collecting logs and notifications when
both Elasticsearch and InfluxDB are not (yet) deployed.
Collecting them in that case is useless and leads to losing all logs and
notifications produced before the backends are deployed.

Change-Id: I30a39d65f7a732251def32ccfb8202c34d6408c5
2016-05-26 09:27:16 +02:00
Jenkins 41cb740832 Merge "Use hiera_hash for network data" 2016-05-25 14:39:07 +00:00
Simon Pasquier 1f759e7f3d Use hiera_hash for network data
Change-Id: I171d08e974d635d85b391c3bc29366f0f4dd7b59
Closes-Bug: #1585350
2016-05-25 10:44:31 +02:00
Swann Croiset 8213ce490b Monitor metric_collector by Pacemaker on controller nodes
The resource monitoring never happens without an 'interval' value (it defaults to 0)
and the resource (metric_collector) is never restarted if it stops
for some reason.

Change-Id: Ia8bc3d5115a7270038a0f9e7928fe5bca787b599
2016-05-25 08:37:36 +00:00
Swann Croiset debe1883d7 Allow deployment without InfluxDB and Elasticsearch
This supports several deployment scenarios where the backends are not
deployed initially, for instance when using the 'virt' nodes to deploy
LMA backends.

The patch factorizes manifests by moving all the configuration data of
InfluxDB and Elasticsearch into hiera.

DocImpact

Fixes-bug: #1570386
Change-Id: I8688bbd10d88bc8ef68b5d31e9edd62a764dc23d
2016-05-23 13:29:50 +02:00
Swann Croiset d2eafcb750 Remove all lma_collector:params references in manifests
Change-Id: I3af0eb48aca1aeb0f4d3bb6b0798a7a343bd072e
2016-05-12 13:43:49 +02:00
Swann Croiset 0a47c4cc40 Remove deprecated hiera('nodes')
Change-Id: Id562146895cd19a86caf89b0e6791a24cd09f846
Fixes-bug: #1550253
2016-05-10 14:36:04 +02:00
Guillaume Thouvenin 2d72b53784 Support for the detach-database plugin
This patch adds the support when the database is deployed on a dedicated
node [1].

[1] https://github.com/openstack/fuel-plugin-detach-database

Change-Id: If800d9d09204a1456640863a3ed3c5dc66d29017
Closes-Bug: #1547089
2016-05-10 09:00:44 +02:00
Guillaume Thouvenin 6dc87065fb Support for the detach-rabbitmq plugin
This patch adds the support when the RabbitMQ cluster is deployed on
dedicated nodes [1].

[1] https://github.com/openstack/fuel-plugin-detach-rabbitmq

Change-Id: Icc337e48d9a836ccab85dfc0b8ca86ff58c5cd4d
Closes-Bug: #1547086
Closes-Bug: #1575046
2016-05-10 09:00:44 +02:00
Swann Croiset 13d1801c65 Prevent using init script to start Heka on controller nodes
Change-Id: I3b01ac021f9e89ef74fbd82d7abc103a2f34399d
Fixes-bug: #1570839
2016-05-04 14:34:39 +02:00
Swann Croiset d89fa4ad92 Decrease the heka poolsize to its default 100 for log_collector
Since HTTP metrics are now aggregated, the root cause of heka's pipeline
deadlock has gone away.
This change reduces the memory footprint of hekad by ~80MB.

Change-Id: Ie1eda9b02dd1f4b01154a4441cbec245712eb8d1
2016-05-04 14:34:39 +02:00
Swann Croiset ebac150f8a Separate the (L)og of the LMA collector
This change separates the processing of the logs/notifications and
metric/alerting into 2 dedicated hekad processes, these services are
named 'log_collector' and 'metric_collector'.

Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.

All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
Elasticsearch output (log_collector) uses full_action='block' and the
TCP output uses full_action='drop'.

All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.

The buffer size configurations are:
* metric_collector:
  - influxdb-output buffer size is increased to 1Gb.
  - aggregator-output (tcp) buffer size is decreased to 256Mb (vs 1Gb).
  - nagios outputs (x3) buffer sizes are decreased to 1Mb.
* log_collector:
  - elasticsearch-output buffer size is decreased to 256Mb (vs 1Gb).
  - tcp-output buffer size is set to 256Mb.

Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748

Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
2016-05-04 14:34:14 +02:00
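The output settings listed in commit ebac150f8a can be summarized as a small Python data structure. This is a sketch for reading the commit message, not the Heka configuration generated by the manifests. It assumes that "Gb" and "Mb" in the message mean GiB and MiB, and that the three nagios outputs are the HTTP outputs that use full_action='drop'. The dictionary layout, the "nagios-output" key and the helper function are hypothetical.

```python
MIB = 1024 * 1024  # assumption: "Mb" in the commit message means MiB
GIB = 1024 * MIB   # assumption: "Gb" in the commit message means GiB

COLLECTOR_OUTPUTS = {
    "metric_collector": {
        "influxdb-output": {"full_action": "drop", "buffer_size": 1 * GIB},
        "aggregator-output": {"full_action": "drop", "buffer_size": 256 * MIB},
        # One entry stands for each of the three nagios outputs.
        "nagios-output": {"full_action": "drop", "buffer_size": 1 * MIB},
    },
    "log_collector": {
        "elasticsearch-output": {"full_action": "block", "buffer_size": 256 * MIB},
        "tcp-output": {"full_action": "drop", "buffer_size": 256 * MIB},
    },
}


def outputs_that_block(config):
    """Return the sorted (service, output) pairs whose full buffer blocks the pipeline."""
    return sorted(
        (service, name)
        for service, outputs in config.items()
        for name, settings in outputs.items()
        if settings["full_action"] == "block"
    )
```

Laid out this way, the Elasticsearch output of log_collector is the only one that blocks when its buffer is full; every other output drops messages instead.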
Guillaume Thouvenin 9f80252538 Add notice to identify StackLight manifests in puppet logs
Change-Id: Ieeb7d32c77166b234940f4bf0e57202312a62f0c
2016-04-15 15:53:17 +02:00
Guillaume Thouvenin 42b2fbaba0 Use pcmk_* resources from the new pacemaker module
This patch aligns the Puppet manifests with the refactoring of
pacemaker resources.

Since MOS 9.0 we need to replace the old cs_* resources by
pcmk_* resources from the new pacemaker module.

Change-Id: Ic2435618779cdec286f4032c993d62ea80e01ead
2016-04-11 08:18:37 +00:00
Swann Croiset 96df47af73 Increase the Heka poolsize on controllers
On controller nodes, the Heka poolsize must be increased to handle the load
generated by derived metrics from logs otherwise a deadlock
can happen in the filter plugins and block heka.

Fixes-bug: #1557388

Change-Id: I74362011d32d413f244c6cdb6e4625ed96759df0
2016-04-05 18:34:17 +02:00
Vladimir Kuklin 590590b907 Remove usage of deprecated filter_nodes function
filter_nodes function is deprecated - do not use it anymore

Change-Id: Ibacaf4e0aa263d2fe2b98df19a5e7d1e62a14c51
Partial-bug: #1550253
2016-03-21 09:01:20 +00:00
Swann Croiset 98441edea0 Do not purge the collectd package configuration by default
This prevents the last manifest applied from purging the collectd configuration.

Fixes-bug: #1546091

Change-Id: Ib6c22910f4c9259920bb9ce079a0135deff31544
2016-02-17 09:45:41 +01:00
Jenkins 87f8aac174 Merge "Add environment_label tag to the InfluxDB metrics" 2016-02-02 08:36:45 +00:00
Éric Lemoine bb29729252 Remove unused code in base.pp manifest
This commit removes unused Puppet code and variables in the base.pp
manifest.

Change-Id: Ibbed24e65d6d15bf5b88ed04e342fde85271dc47
2016-01-29 12:50:57 +01:00
Éric Lemoine ccdba23158 Move Pacemaker/Corosync code out of lma_collector
This commit moves the Pacemaker/Corosync Puppet code from the
lma_collector module to the Fuel-specific base.pp manifest.

This involves the following changes:

* Fuel's "pacemaker_wrappers::service" define is now used in base.pp
  to configure the LMA service resource to use the "pacemaker"
  provider.

* To configure "pacemaker_wrappers::service" we need to know the Heka
  user. So to avoid hacks where we'd use private variables from the
  lma_collector and heka modules to determine the Heka user the
  lma_collector and heka modules are changed to make the Heka user
  configurable. For this the "heka" class "run_as_root" parameter is
  removed in favor of a "user" parameter.

* In other manifests we use a resource collector to make sure that
  the LMA service resource is not re-configured with the default
  provider. This part is a bit hackish, but we haven't been able to
  come up with a better way to address the issue.

Change-Id: I0ed0bddb245dc3a65b034e5caec14a65cfa908cb
Implements: blueprint lma-without-fuel
2016-01-29 12:50:57 +01:00
Simon Pasquier 19d37573e2 Add environment_label tag to the InfluxDB metrics
Change-Id: I8e9a0540a53a23bde677ff3ab8275f4fc0667ee2
2016-01-29 10:26:45 +01:00
Simon Pasquier ef4c99b199 Set a default environment label if empty
When the operator doesn't define an environment label, it will default
to "env-<environment id>".

Change-Id: Ied260dc7b65c1c08d922858a1ef620cb43d58609
2016-01-28 11:27:24 +01:00
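The defaulting rule from commit ef4c99b199 can be sketched in Python as follows. The real logic is in the plugin's Puppet code; the function name and the choice to treat a whitespace-only label as empty are assumptions. The "env-<environment id>" format comes from the commit message.

```python
def environment_label(operator_label, environment_id):
    """Return the operator's label, or "env-<environment id>" when it is empty.

    Treating None and whitespace-only strings as empty is an assumption.
    """
    label = (operator_label or "").strip()
    return label if label else f"env-{environment_id}"
```

For instance, an environment with ID 3 and no label defined by the operator would be tagged "env-3".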
Éric Lemoine be6575a8a4 Do not use lma_collector::collectd::controller
Do not use lma_collector::collectd::controller anymore, and use the
new atomic classes/defines instead.

Change-Id: Id708314180a91cab91d55a57371811c348735b3c
2016-01-19 08:42:29 +01:00
Guillaume Thouvenin 7a5300ca5d Use the VIP of the InfluxDB cluster
Change-Id: I55f8611142cc2d9262e6c0f949d5ed4c9eb2cf52
Implements: blueprint influxdb-clustering
2016-01-11 17:26:16 +01:00
Swann Croiset 3479f09192 Monitor Elasticsearch cluster
The patch adds a new manifest, executed on both influxdb and elasticsearch
nodes, that configures collectd specifically for them. The related
configuration is also moved there from base.pp.

Implements: blueprint elasticsearch-clustering

Change-Id: I0e75446dd97e8c7108be87513a2b13e6909fcf44
2016-01-11 12:49:17 +00:00
Swann Croiset 2f492e4960 Monitor LMA backends on primary role nodes
Change-Id: I9d96205265f169c48fd5a53b20788bc701d69fc2
2016-01-06 19:50:05 +00:00