On controller nodes, the increasing number of AFD filters puts too
much load on the Heka pipeline and can generate "idle packs" errors.
It was observed that a poolsize value of 200 solves the issue.
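For illustration, the pool size is a global hekad setting; the sketch below shows where it would be set (the actual file is generated by the Puppet manifests and may differ):

```toml
# Hypothetical hekad global configuration fragment.
[hekad]
maxprocs = 2
# The default poolsize is 100; raising it avoids "idle packs" errors
# when many AFD filters compete for message packs.
poolsize = 200
```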
Change-Id: I1d5f9fea352e16e15b37828bc525906a06fadd0e
This change adds a filter plugin that monitors the kernel log messages
for hard drive errors and reports the number of errors per second
as 'hdd_errors_rate'. The filter is configured for all nodes,
irrespective of their roles. An alarm is also added that triggers
a CRITICAL alert when the metric value is greater than 0.
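A hedged sketch of what such a filter declaration could look like in the Heka configuration — the section name, Lua file path, and message matcher below are assumptions, not the plugin's exact configuration:

```toml
# Hypothetical SandboxFilter configuration; names and paths are illustrative.
[hdd_errors_filter]
type = "SandboxFilter"
filename = "lua/filters/hdd_errors_counter.lua"
# Only inspect kernel log messages.
message_matcher = "Type == 'log' && Fields[programname] == 'kernel'"
# Emit the hdd_errors_rate metric every 10 seconds.
ticker_interval = 10
preserve_data = false
```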
DocImpact
Change-Id: I485f5692a3e5facf0f7ea019ccdbd70683a7dd4e
In some environments (especially those using slow hard drives), the
Elasticsearch backends may fail to ingest logs fast enough. As a result,
the log_collector service running on the controller nodes is blocked.
To alleviate this issue, this change increases the bulk size for nodes
that generate lots of logs:
- controllers which run OpenStack API services in addition to Pacemaker.
- all nodes when the environment's log level is set to debug.
In such cases, the flush_count parameter is increased to 100 (instead of
the default of 10).
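Assuming Heka's standard ElasticSearchOutput options, the change amounts to something like the following sketch (the section name and matcher are illustrative; the real output section is generated by Puppet):

```toml
# Hypothetical output fragment for a busy node.
[elasticsearch_output]
type = "ElasticSearchOutput"
message_matcher = "Type == 'log'"
# Send larger bulk requests so slow backends receive fewer,
# bigger writes (default flush_count is 10).
flush_count = 100
```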
Change-Id: Ifdfbcb8ff0292f695dee4deab45560f126bde242
Closes-Bug: #1617211
This removes duplication of code and limitations we had to deal with
because the collectd Puppet resources don't play well when they are
created at different times from several manifests.
Change-Id: I52fabb1fb5795a33f552168553a148b1520fc496
This change adds a collectd plugin that gets metrics from the Pacemaker
cluster:
- cluster metrics
- node metrics
- resource metrics
Most of the metrics are collected only from the node that is the
Designated Controller, except pacemaker_resource_local_active and
pacemaker_dc_local_active.
The plugin also supersedes the 'pacemaker_resource' plugin by providing
the exact same metrics and notifications for the other collectd plugins.
Finally the plugin is also installed on the standalone-rabbitmq and
standalone-database nodes if they are present.
Change-Id: I8b5b987704f69c6a60b13e8ea982f27924f488d1
This change adds the RabbitMQ -> log collector ordering constraint that
was dropped during the upgrade to MOS 9.
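Expressed with the crm shell, the constraint could look like the following sketch (the constraint and resource IDs are assumptions; the actual names come from the deployment manifests):

```
# Hypothetical crm configuration fragment.
# Start RabbitMQ before the log collector so startup messages are not lost.
order rabbitmq-before-log-collector inf: master_p_rabbitmq-server clone_log_collector
```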
Change-Id: I0cec5d5ba0b13f3c4cf06bbb007c7fe7dca5b66e
The patch uses the management API to retrieve metrics instead of
executing rabbitmqctl command.
A side effect is that per-queue metrics are no longer collected.
Change-Id: I5dab785321e369ec0e1a69a79e0700b276810925
Closes-bug: #1594337
This change also removes the previous hack that cleans up the collector
resources at the end of the deployment.
Change-Id: I1ca237181d30802035bf6a0526cdd41f83e39acd
Closes-Bug: #1593137
Without this parameter being set, Pacemaker doesn't clean up and restart
the resource if it fails too many times.
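For example, with the crm shell the resource meta attributes could be set as follows (the values are illustrative; the real ones are applied by the Puppet manifests):

```
# Hypothetical crm commands.
# failure-timeout lets Pacemaker forget past failures and restart the
# resource; migration-threshold bounds how many failures a node tolerates.
crm resource meta metric_collector set failure-timeout 120s
crm resource meta metric_collector set migration-threshold 3
```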
Change-Id: I185e317969aec389e883e575c120d3a902d677e7
Closes-Bug: #1593137
This change removes the migration-threshold and failure-timeout
parameters for the collector resources. Otherwise Pacemaker bans the
resource from the node if it fails too many times. The declaration of
the resources in Pacemaker also needs to happen after the services have
been installed.
Change-Id: Ia8d96ccce4a25e4a1919419cba9b415bd06c65d1
Closes-Bug: #1593137
MySQL collectd plugins are configured to use Grafana credentials to
monitor the database when no controller is deployed within the
environment.
This assumes that MySQL is deployed with the detach database plugin.
Change-Id: Ida8d9f1995f87ca16b35ba59c3debd32f4410f97
Fixes-bug: #1578133
This patch removes the default parameters for the InfluxDB/Elasticsearch
HTTP port and address. These parameters are always provided by callers,
and that is the way to go.
Change-Id: I5e346b71a7d639475f2fba92126f8d191f8cd5fd
A script is installed on all nodes to collect various information and perform
basic tests regarding LMA components.
All information gathered is stored locally in /var/lma_diagnostics.
From the Fuel master node, the contrib/tools/diagnostic.sh script launches the
diagnostic script on all nodes and downloads all data into /var/lma_diagnostics.
Fixes-bug: #1547084
Change-Id: I37e36df23bc98109b7a86db63e5243cc264d2f95
This patch avoids collecting logs and notifications when Elasticsearch
and InfluxDB are not (yet) deployed.
Collecting them is useless and leads to losing all logs and
notifications produced before the backends are deployed.
Change-Id: I30a39d65f7a732251def32ccfb8202c34d6408c5
Resource monitoring never happens without an 'interval' value (it
defaults to 0), so the resource (metric_collector) is never restarted if
it stops for some reason.
Change-Id: Ia8bc3d5115a7270038a0f9e7928fe5bca787b599
This allows supporting several deployment scenarios where the backends
are not deployed initially, for instance when using the 'virt' nodes to
deploy the LMA backends.
The patch factors the manifests by moving all the InfluxDB and
Elasticsearch configuration data into hiera.
DocImpact
Fixes-bug: #1570386
Change-Id: I8688bbd10d88bc8ef68b5d31e9edd62a764dc23d
Since HTTP metrics are now aggregated, the root cause of Heka's pipeline
deadlock is gone.
This change reduces the memory footprint of hekad by ~80MB.
Change-Id: Ie1eda9b02dd1f4b01154a4441cbec245712eb8d1
This change separates the processing of the logs/notifications and
metric/alerting into 2 dedicated hekad processes, these services are
named 'log_collector' and 'metric_collector'.
Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.
All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
Elasticsearch output (log_collector) uses full_action='block' and the
TCP output uses full_action='drop'.
All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.
The buffer size configurations are:
* metric_collector:
- influxdb-output buffer size is increased to 1GB.
- aggregator-output (tcp) buffer size is decreased to 256MB (vs 1GB).
- nagios outputs (x3) buffer sizes are decreased to 1MB.
* log_collector:
- elasticsearch-output buffer size is decreased to 256MB (vs 1GB).
- tcp-output buffer size is set to 256MB.
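Using Heka's output buffering options, one of these outputs could be sketched as follows (the section name and matcher are illustrative assumptions):

```toml
# Hypothetical fragment for the metric_collector's InfluxDB output.
[influxdb_output]
type = "HttpOutput"
message_matcher = "Type == 'metric'"

[influxdb_output.buffering]
# 1GB on-disk buffer; drop new messages when the buffer is full so the
# whole pipeline does not block on a slow backend.
max_buffer_size = 1073741824
full_action = "drop"
```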
Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748
Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
This patch aligns the Puppet manifests with the refactoring of
pacemaker resources.
Since MOS 9.0, the old cs_* resources need to be replaced with pcmk_*
resources from the new pacemaker module.
Change-Id: Ic2435618779cdec286f4032c993d62ea80e01ead
On controller nodes, the Heka poolsize must be increased to handle the
load generated by metrics derived from logs; otherwise a deadlock can
happen in the filter plugins and block Heka.
Fixes-bug: #1557388
Change-Id: I74362011d32d413f244c6cdb6e4625ed96759df0
This commit moves the Pacemaker/Corosync Puppet code from the
lma_collector module to the Fuel-specific base.pp manifest.
This involves the following changes:
* Fuel's "pacemaker_wrappers::service" define is now used in base.pp
to configure the LMA service resource to use the "pacemaker"
provider.
* To configure "pacemaker_wrappers::service" we need to know the Heka
user. So, to avoid hacks where we'd use private variables from the
lma_collector and heka modules to determine the Heka user, the
lma_collector and heka modules are changed to make the Heka user
configurable. For this, the "heka" class's "run_as_root" parameter is
removed in favor of a "user" parameter.
* In other manifests we use a resource collector to make sure that
the LMA service resource is not re-configured with the default
provider. This part is a bit hackish, but we haven't been able to
come up with a better way to address the issue.
Change-Id: I0ed0bddb245dc3a65b034e5caec14a65cfa908cb
Implements: blueprint lma-without-fuel
Do not use lma_collector::collectd::controller anymore, and use the
new atomic classes/defines instead.
Change-Id: Id708314180a91cab91d55a57371811c348735b3c
The patch adds a new manifest, executed on both the influxdb and
elasticsearch nodes, to configure collectd specifically for them; it
also moves the related configuration out of base.pp.
Implements: blueprint elasticsearch-clustering
Change-Id: I0e75446dd97e8c7108be87513a2b13e6909fcf44