This prevents all collectors from being restarted as a side effect when
'/etc/init.d/heka stop' is run.
Change-Id: I4f743e765e5895f3c97505a166140bcd80f7ce34
Partial-bug: #1570850
This change separates the processing of logs/notifications and
metrics/alerting into two dedicated hekad processes. These services are
named 'log_collector' and 'metric_collector'.
Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.
All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
The Elasticsearch output of log_collector uses full_action='block', while
its TCP output uses full_action='drop'.
All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.
The buffer size configurations are:
* metric_collector:
- influxdb-output buffer size is increased to 1GB.
- aggregator-output (tcp) buffer size is decreased to 256MB (vs 1GB).
- buffer sizes of the three nagios outputs are decreased to 1MB.
* log_collector:
- elasticsearch-output buffer size is decreased to 256MB (vs 1GB).
- tcp-output buffer size is set to 256MB.
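The difference between the 'drop' and 'block' behaviours described above
can be illustrated with a minimal Python sketch. It is not Heka code: the
BufferedOutput class is hypothetical, and the buffer is counted in
messages rather than bytes to keep the example short.

```python
import queue


class BufferedOutput:
    """Toy bounded buffer (hypothetical; sized in messages, not bytes)."""

    def __init__(self, max_items, full_action):
        if full_action not in ("drop", "block"):
            raise ValueError("unsupported full_action: %r" % full_action)
        self.full_action = full_action
        self.buffer = queue.Queue(maxsize=max_items)
        self.dropped = 0

    def deliver(self, message, timeout=None):
        """Queue a message and return True if it was buffered."""
        if self.full_action == "block":
            # The caller waits until a slot is free. With a timeout,
            # queue.Full is raised if no slot appears in time.
            self.buffer.put(message, timeout=timeout)
            return True
        try:
            self.buffer.put_nowait(message)
            return True
        except queue.Full:
            # 'drop': the message is lost, but the caller is never stalled.
            self.dropped += 1
            return False
```

With 'block', a slow backend stalls the producer instead of losing data,
which suits logs. With 'drop', the pipeline keeps flowing at the cost of
losing messages once the buffer is full.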
Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748
Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
On controller nodes, the Heka poolsize must be increased to handle the load
generated by the metrics derived from logs. Otherwise, a deadlock can occur
in the filter plugins and block Heka.
Fixes-bug: #1557388
Change-Id: I74362011d32d413f244c6cdb6e4625ed96759df0
This change rotates the hekad logs more frequently. It also rotates the
log file when it reaches a certain size.
Fixes-bug: #1561603
Change-Id: Ic08831b8abadd0e1f846e0f401dc74b15dd46b3c
This commit moves the Pacemaker/Corosync Puppet code from the
lma_collector module to the Fuel-specific base.pp manifest.
This involves the following changes:
* Fuel's "pacemaker_wrappers::service" define is now used in base.pp
to configure the LMA service resource to use the "pacemaker"
provider.
* To configure "pacemaker_wrappers::service" we need to know the Heka
user. To avoid hacks that rely on private variables from the
lma_collector and heka modules to determine that user, both modules
are changed to make the Heka user configurable. For this, the "heka"
class "run_as_root" parameter is removed in favor of a "user"
parameter.
* In other manifests we use a resource collector to make sure that
the LMA service resource is not re-configured with the default
provider. This part is a bit hackish, but we haven't been able to
come up with a better way to address the issue.
Change-Id: I0ed0bddb245dc3a65b034e5caec14a65cfa908cb
Implements: blueprint lma-without-fuel
This change introduces the first Anomaly and Fault Detection (AFD)
filter plugins. These plugins return AFD events on the availability of
the API endpoints, the API backends (as reported by HAProxy), and the
service workers (e.g. nova-scheduler, nova-conductor, ...).
Change-Id: I75bfb433e4e174659900f885040a1c2032efd470
Implements: blueprint alerting-lma-collector
The deployment on CentOS has been broken since 254eda4.
This patch always creates the 'heka' user defined in heka::params:user,
even if the hekad process runs as 'root'.
This approach should work for both MOS 6.1 and 7.
Change-Id: I9ec690735b10f149d4477f0b8a7ca3a7d0cc54c1
This change configures Pacemaker to manage the LMA collector service
with proper ordering regarding the local RabbitMQ service.
This also means that I removed the wrapper script that took care of
checking the RabbitMQ availability before launching the hekad process
on the controllers.
Change-Id: I4e747083fb9876f06fde9914b626970e37d0b429
Implements: blueprint lma-aggregator-in-ha-mode
This change installs the latest version of Heka (0.10.0b0). This version
of Heka is required because it comes with updated Lua plugins and
modules for InfluxDB.
Change-Id: I4cbb65603cc8e49679c1a89c5a3792c977e44b7a
Implements: blueprint upgrade-influxdb-grafana
This change installs the latest version of Heka (0.10.0b0). This
version of Heka is required because it comes with updated Lua plugins
and modules for InfluxDB.
Change-Id: Ibcb51909658d908979c9f13bdec6a754e2698df2
Implements: blueprint upgrade-influxdb-grafana
This adds the following Heka configuration options to global.toml.
If they are not provided, the Heka default values are used, which are
currently:
* max_process_inject = 1
* max_timer_inject = 10
Change-Id: If1995fa505aec6ff3000af33c548730dd06d1046
The maximum message size observed during a load test with 50 nodes is
158KB, whereas the default limit is 64KB.
The increase is required by the buffered Elasticsearch output, which can
otherwise hit the size limit and end up losing messages.
The Heka log:
Plugin 'elasticsearch_output' error: Message too big, requires 161024
(MAX_MESSAGE_SIZE = 65536)
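The numbers in the log line above can be checked with a short Python
sketch. The function names are hypothetical and only illustrate the
arithmetic; nothing here is part of Heka.

```python
# 64 KiB, i.e. the 65536 bytes reported as MAX_MESSAGE_SIZE in the log.
DEFAULT_MAX_MESSAGE_SIZE = 64 * 1024


def fits(message_size, max_message_size=DEFAULT_MAX_MESSAGE_SIZE):
    """Return True if a message of message_size bytes is within the limit."""
    return message_size <= max_message_size


def required_limit_kib(message_size):
    """Smallest limit, in whole KiB, that accepts the message."""
    return -(-message_size // 1024)  # ceiling division
```

For the 161024-byte message from the log, fits() is false with the default
limit and required_limit_kib() rounds up to 158, which matches the maximum
size observed during the load test.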
Change-Id: I8970435e2f710889e4b5d2c55a53572c042ef647
This change adds a cron job that sends SIGUSR1 to the hekad process
every hour. Heka then dumps an internal report, which eventually becomes
available in /var/log/lma_collector.log.
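What the cron job does can be sketched in Python as follows. This is an
illustration only: the commit does not say how the hekad PID is found, so
reading it from a PID file is an assumption, and request_report is a
hypothetical helper.

```python
import os
import signal


def request_report(pid_file, kill=os.kill):
    """Send SIGUSR1 to the process whose PID is stored in pid_file.

    The PID file location is an assumption of this sketch. The kill
    argument exists only so the function can be exercised without
    signalling a real process.
    """
    with open(pid_file) as handle:
        pid = int(handle.read().strip())
    kill(pid, signal.SIGUSR1)
    return pid
```

The signal only triggers the report. The report itself is written
asynchronously by hekad, which is why it shows up in the log file some
time later.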
Change-Id: I7e164a85a8222f60e7a625d1277528b819a17661
If lma_collector is started before the RabbitMQ cluster is available, it
fails to connect to the LMA queues and therefore fails to start.
It may then take several minutes before Pacemaker starts the service again.
So we need to be sure that the RabbitMQ cluster is up and running before
starting lma_collector.
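The "wait for RabbitMQ before starting" idea can be sketched in Python as
a bounded retry loop. This is not the code of the change: the helper names,
the retry counts and the use of a plain TCP connection as the readiness
probe are all assumptions, and an open port does not prove that the
cluster is healthy.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts are exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the last attempt
    return False
```

A wrapper would call something like
wait_until(lambda: port_is_open("127.0.0.1", 5672)) and start the
collector only when it returns True. The address and port shown are
placeholders.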
Change-Id: Ia254b744f4173f64ee3ab8200b2896ecc412d06f
This change moves away from the big monolithic Puppet manifest. Instead,
we introduce separate tasks for each role that the plugin supports.
Change-Id: I370c9e8267f86da742f5cca48f1fec8bc3d9c4a9
This is an import of the initial LMA PoC code. For now, it only covers
the collection of logs (notifications will be added in a subsequent
commit).
The code has been partly rewritten to:
- decouple the Heka configuration from the LMA collector.
- run the Heka service as non-root when possible (Ubuntu only for now
due to file permission issues on CentOS [1]).
- adapt to version 0.9 of Heka.
[1] https://bugs.launchpad.net/fuel/+bug/1425954
Change-Id: I4472b49a25e18e06984b5b29bdce18f917137bc8