Commit Graph

28 Commits

Author SHA1 Message Date
Nobuto Murata fb32621831 Don't expect a static job name
A job name passed via the prometheus_scrape library doesn't end up as a
static job name in the prometheus configuration file in the COS world
even though COS expects a fixed string. Practically we cannot have a
static job name like job=ceph in any of the alert rules in COS since the
charms will convert the string "ceph" into:

> juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N

Let's give up the possibility of the static job name and use "up{}" so
it will be annotated with the model name/ID, etc. without any specific
job-related condition. This will break the alert rules when one unit has
more than one scraping endpoint because there will be no way to
distinguish multiple scraping jobs. Ceph MON only has one prometheus
endpoint for the time being so this change shouldn't cause an immediate
issue. Overall, it's not ideal but at least better than the current
status, which is an alert error out of the box.

The following alert rule:
> up{} == 0
will be converted and annotated as:
> up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0
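
The conversion above can be illustrated with a minimal Python sketch. This
is not the COS implementation; inject_labels and the regex-based approach
are assumptions made purely for illustration, and they only handle simple
selectors like the one shown.

```python
import re


def inject_labels(expr: str, labels: dict[str, str]) -> str:
    """Add label matchers to every `{...}` selector in a simple expression.

    Hypothetical helper: a toy stand-in for the topology annotation step.
    """
    rendered = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))

    def _merge(match: re.Match) -> str:
        existing = match.group(1).strip()
        merged = f"{existing},{rendered}" if existing else rendered
        return "{" + merged + "}"

    return re.sub(r"\{([^}]*)\}", _merge, expr)


topology = {
    "juju_application": "ceph-mon",
    "juju_model": "ceph",
    "juju_model_uuid": "UUID",
}
print(inject_labels("up{} == 0", topology))
```

Because the rule carries no job matcher, the injected topology labels are
the only thing that scopes it to this application.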

Closes-Bug: #2044062

Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
2024-03-18 15:29:49 +09:00
Peter Sabaini 35f9af8c96 Fixup: multisite alert rule help texts
Change-Id: I558804c8bbd162a15bd97a023ac612d32fd96b02
2024-01-19 19:12:40 +01:00
Peter Sabaini 24fccea832 Add alerting rules for RGW multisite deployments
Add default prometheus alerting rules for RadosGW multisite deployments based
on the built-in Ceph RGW multisite metrics.

Note that the prometheus_alerts.yml.default rule file
is included for reference only. The ceph-mon charm will utilize the
resource file from https://charmhub.io/ceph-mon/resources/alert-rules
for deployment so that operators can easily customize these rules.

Change-Id: I5a12162d73686963132a952bddd85ec205964de4
2024-01-17 16:50:37 +01:00
Danny Cocks 8d7c1060aa Add nagios check for radosgw-admin sync status
This duplicates the check performed for ceph status and specialises it for
radosgw-admin sync status instead.

The config options available are:
- nagios_rgw_zones: the zones that are expected to be connected
- nagios_rgw_additional_checks: equivalent to nagios_additional_checks;
allows for a configurable set of strings to grep for as critical alerts.

Change-Id: Ideb35587693feaf1cc0736e981005332e91ca861
2024-01-10 10:42:24 +11:00
Nobuto Murata c9389a8cd0 Revert "Create NRPE check to verify ceph daemons versions"
This reverts commit dfbda68e1a.

Reason for revert:

The Ceph version check does not appear to account for which user
executes the nrpe check. It fails to find the keyrings needed to run the
command because it is run by a non-root user.

$ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions
unit-nrpe-0:
  UnitId: nrpe/0
  id: "20"
  results:
    Stderr: |
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find
      a keyring on
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
      (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1
      AuthRegistry(0x7f467005f540) no keyring found at
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
      disabling cephx
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find
      a keyring on
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
      (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1
      AuthRegistry(0x7f4670064d88) no keyring found at
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
      disabling cephx
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find
      a keyring on
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
      (2) No such file or directory
      2023-02-01T03:03:09.560+0000 7f4677361700 -1
      AuthRegistry(0x7f4677360000) no keyring found at
      /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
      disabling cephx
      [errno 2] RADOS object not found (error connecting to the cluster)
    check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'',
      ''versions'']'' returned non-zero exit status 1.'
  status: completed
  timing:
    completed: 2023-02-01 03:03:10 +0000 UTC
    enqueued: 2023-02-01 03:03:09 +0000 UTC
    started: 2023-02-01 03:03:09 +0000 UTC

Related-Bug: #1943628
Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e
2023-02-01 12:31:16 +09:00
Edin Sarajlic b8af44aefa Add nagios check for expected number of OSDs
This check does not require manually setting the number of expected
OSDs.

Initially, the charm sets the count (per-host) to what is
present in the OSD tree. The count will be updated (on a per-host
basis) when the number of OSDs grows, but not when it shrinks. There
is a charm action to reset the expected count using information from
the OSD tree.
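
The grow-but-never-shrink behaviour can be sketched as follows. This is a
minimal illustration, not the charm's code; the function names and the
plain host-to-count dictionaries are assumptions.

```python
def update_expected(expected: dict[str, int], observed: dict[str, int]) -> dict[str, int]:
    """Raise per-host expected counts to the observed counts; never lower them."""
    updated = dict(expected)
    for host, count in observed.items():
        updated[host] = max(updated.get(host, 0), count)
    return updated


def reset_expected(observed: dict[str, int]) -> dict[str, int]:
    """Stand-in for the reset action: take the OSD tree counts as the new baseline."""
    return dict(observed)


def hosts_missing_osds(expected: dict[str, int], observed: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {host: (observed, expected)} for hosts with fewer OSDs than expected."""
    return {
        host: (observed.get(host, 0), count)
        for host, count in expected.items()
        if observed.get(host, 0) < count
    }


expected = update_expected({"host-a": 3}, {"host-a": 4, "host-b": 2})
# A later, smaller observation does not lower the expectation.
expected = update_expected(expected, {"host-a": 2, "host-b": 2})
print(hosts_missing_osds(expected, {"host-a": 2, "host-b": 2}))
```

Keeping the high-water mark means a lost OSD stays visible until an
operator explicitly resets the baseline.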

Closes-Bug: #1952985
Change-Id: Ia6a060bf151908c1d4159e6bdffa7bfe1f0a7988
2022-10-05 13:02:54 +00:00
Peter Sabaini 9c7101f573 Implement prometheus alert rules
Alert rules can be attached as a resource and will be transmitted via
the metrics-endpoint relation. Default alert rules taken from upstream
ceph have been added for reference.

Change-Id: I6a3c6f06e9b9d911b35c8ced1968becc6471b362
2022-09-23 14:22:06 +02:00
Hicham El Gharbi dfbda68e1a Create NRPE check to verify ceph daemons versions
This NRPE check reports whether the versions of the cluster daemons have diverged.

WARN - any minor version diverged
WARN - any versions are 1 release behind the mon
CRIT - any versions are 2 releases behind the mon
CRIT - any versions are ahead of the mon

A juju action, 'get-versions-report', is also provided,
which gives users a quick way to see the
daemon versions running on cluster hosts.
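
The severity rules above can be sketched in Python. This is an
illustration only, not the check's actual code: versions are modelled as
hypothetical (major, minor) tuples, and treating "2 or more releases
behind" as CRIT is an assumption.

```python
OK, WARN, CRIT = 0, 1, 2


def version_severity(mon: tuple[int, int], daemon: tuple[int, int]) -> int:
    """Compare one daemon's (major, minor) version against the mon's."""
    mon_major, mon_minor = mon
    major, minor = daemon
    if major > mon_major:
        return CRIT  # daemon release is ahead of the mon
    behind = mon_major - major
    if behind >= 2:
        return CRIT  # assumption: two or more releases behind is critical
    if behind == 1:
        return WARN
    # Same release: warn only if the minor versions differ.
    return WARN if minor != mon_minor else OK


def cluster_severity(mon: tuple[int, int], daemons: list[tuple[int, int]]) -> int:
    """The worst severity across all daemons decides the check result."""
    return max((version_severity(mon, d) for d in daemons), default=OK)


print(cluster_severity((17, 2), [(17, 2), (16, 2)]))
```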

Closes-Bug: #1943628
Change-Id: I41b5c8576dc9cf885fa813a93e6d51e8804eb9d8
2022-07-19 12:18:06 +02:00
Chi Wai, Chan 86f2a17a2c Fixed typo in a function comment.
--check_osds_down --> --check_num_osds

Change-Id: Ic5938cc5f12606ff0cc67df988b95ecf673b6c5f
2022-06-22 15:52:22 +08:00
Garrett Thompson 375a1d0056 Change noout to be a CRITICAL alert instead of WARNING.
When the noout flag is set in a Ceph cluster, the Nagios check
currently marks this as a warning (like Ceph itself). However,
setting it to CRITICAL will raise visibility, and indicate to the
operator that this should be a temporary state.

Closes-Bug: 1926551
Change-Id: I9831cfea3f63e82fbc8bfebc938a9795b69111c7
2021-09-07 14:34:33 -06:00
Frode Nordahl c0113217bf Unpin flake8, fix lint
Change-Id: Iab73f1127bfbdf11626727f3044366d2e5745439
2020-08-24 10:54:54 +02:00
Marian Gasparovic caa1cd8d6a Creates nrpe check for number of OSDs
An alert is triggered when the number of known OSDs in the osdmap differs
from the number of "in" or "up" OSDs.
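
A minimal sketch of that comparison, assuming the three counts have
already been read from the osdmap (the function and parameter names are
hypothetical):

```python
def check_osd_counts(num_osds: int, num_up: int, num_in: int) -> tuple[bool, str]:
    """Return (alert, message); alert is True when up or in differs from the known total."""
    alert = num_up != num_osds or num_in != num_osds
    message = f"{num_osds} OSDs known, {num_up} up, {num_in} in"
    return alert, message


print(check_osd_counts(12, 12, 12))
print(check_osd_counts(12, 11, 12))
```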

Change-Id: Id3d43f0146452d0bbd73e1ce98616a994eaee090
Partial-Bug: 1735579
2019-05-03 10:02:31 +02:00
Marian Gasparovic 8e4fe57295 Creates additional nrpe checks which parse warning/error messages
When Ceph is in a warning state for reason1 and, in the meantime, a
new reason2 appears, the operator is not alerted and also cannot mute
alarms selectively (as described in bug #1735579).
This patch allows specifying a dictionary of 'name':'regex' where
'name' will become the ceph-$name check in nrpe and $regex will be
searched for in warning/error messages. It is specified via the
charm's nagios_additional_checks parameter.
There is also a nagios_additional_checks_critical parameter which
specifies whether those checks are reported as warning or error.
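
The 'name':'regex' mechanism can be sketched as below. This is an
illustration under assumptions, not the charm's code:
run_additional_checks is a hypothetical helper, and the message strings
are placeholders rather than real Ceph output.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2


def run_additional_checks(
    checks: dict[str, str], messages: list[str], critical: bool = False
) -> dict[str, tuple[int, str]]:
    """Evaluate each 'name': 'regex' pair against the warning/error messages.

    Returns {"ceph-<name>": (state, text)} so that every pattern becomes its
    own check and can be alerted on or muted independently.
    """
    hit_state = CRITICAL if critical else WARNING
    results = {}
    for name, pattern in checks.items():
        hits = [msg for msg in messages if re.search(pattern, msg)]
        if hits:
            results[f"ceph-{name}"] = (hit_state, "; ".join(hits))
        else:
            results[f"ceph-{name}"] = (OK, "OK")
    return results


# Placeholder messages; real warning text would come from ceph status.
messages = ["reason1: example warning", "reason2: another example warning"]
checks = {"first": r"^reason1", "third": r"^reason3"}
print(run_additional_checks(checks, messages))
```

Since each pattern yields a separate check, a new reason2 raises its own
alert even while reason1 is already acknowledged.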

Change-Id: I73a7c15db88793bb78841d8395535c97ca2af872
Partial-Bug: 1735579
2019-04-02 12:05:22 +02:00
Marian Gasparovic 35c8e40e83 Don't return Critical when ceph is in warning state.
The current implementation returns Critical when Ceph is in a warning
state, checking for some known exceptions which are considered
operational tasks. However, this causes many alarms.
This patch changes the behavior to report Warning when Ceph is
in HEALTH_WARN. If known operational tasks are exceeding
thresholds, Critical is returned.
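
The new mapping can be sketched as follows. This is a simplified
illustration, not the plugin's code; task_percent and max_percent are
hypothetical stand-ins for whatever operational-task measurements and
thresholds the plugin uses, and treating any other health value as
Critical is an assumption.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def nagios_state(health: str, task_percent: float = 0.0, max_percent: float = 5.0) -> int:
    """Map a Ceph health string to a Nagios state.

    task_percent and max_percent are hypothetical: the share of objects
    affected by an operational task and the allowed limit.
    """
    if health == "HEALTH_OK":
        return OK
    if health == "HEALTH_WARN":
        # Warning by default; escalate only when the task exceeds its threshold.
        return CRITICAL if task_percent > max_percent else WARNING
    # Anything else (for example HEALTH_ERR) is treated as critical here.
    return CRITICAL


print(nagios_state("HEALTH_WARN"))
print(nagios_state("HEALTH_WARN", task_percent=12.0))
```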

Change-Id: I7a330189da8f0ba9168cedb534823c5e8f4795ba
2018-11-21 16:49:14 +01:00
Xav Paice b97177b7d6 Update Nagios check for Luminous
This adds a test to see if the ceph status output looks like Luminous or
newer, and if so changes the output used to collect info.

Change-Id: I98d194c329aace3c412701e06632dbfedfadefc7
Closes-Bug: #1756864
2018-05-01 10:29:37 +12:00
Tamas Erdei 5dbafb0b2f Fix race condition in collect_ceph_status.sh
There is a race condition between collect_ceph_status.sh writing
the status file and check_ceph_status.py reading that file.

This patch fixes that by directing ceph output into a temp file,
and then replacing the old state file with the new temp file using
an atomic mv operation.
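
The same write-then-rename pattern, sketched in Python rather than shell
for illustration. The actual fix lives in collect_ceph_status.sh and uses
mv; write_status_atomically is a hypothetical name.

```python
import os
import tempfile


def write_status_atomically(path: str, content: str) -> None:
    """Write content to a temp file, then atomically swap it into place.

    A reader of `path` sees either the old file or the new one, never a
    partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temp file in the destination directory matters: a rename
across filesystems falls back to copy-and-delete in mv and fails outright
in os.replace, so it is not atomic either way.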

Change-Id: If332d187f8dcb9f7fcd8b4a47f791beb8e27eaaa
Closes-Bug: 1755207
2018-03-13 14:30:49 +01:00
Alex Kavanagh 7586815401 Migrate charm to work with Python3 only
Various changes to migrate the charm to work with Python 3.  The tox.ini
has been modified to include py35 and py36 targets for testing against
Python 3.5 (xenial, zesty), and Python 3.6 (artful+).

Change-Id: I009de528428aaca555b49f3fc17704dcf5f2a28c
2017-11-17 10:22:30 +00:00
Xav Paice e93c4a903a Update ceph nagios plugin
Changes ceph plugin so it ignores ceph rebalancing unless there is a
large percentage of misplaced/degraded objects (return warning for
that).

Adds config options to tweak that monitoring, and also just warn if
nodeep-scrub was deliberately set.

Includes some basic unit tests.

Change-Id: I317448cd769597068a706d3944d9d5419e0445c1
2017-11-07 13:34:27 +13:00
Marian Gasparovic ad9c3da7a2 Plugin should return also a reason for warning from ceph.
Change-Id: I9247f374ce88e0c208252b6a37d82fad407cc84a
Signed-off-by: Marian Gasparovic <marian.gasparovic@canonical.com>
2017-10-26 13:58:19 +02:00
James Page e69a771922 Revert "Add AppArmor profile"
ceph-mon is typically deployed under LXC or LXD, where apparmor is
not supported; revert addition of apparmor profile feature as it's
currently breaking MAAS+LXD deployments.

This reverts commit d337fdbfd0.
This reverts commit cf0f18d59a.

Change-Id: I94b7c7f5dc0245d273394aeb352731f7bffb1c91
2016-07-18 16:23:32 +01:00
Chris Holcombe d337fdbfd0 Tweak AppArmor profile
After some testing with aa-complain it was discovered that
one of the apparmor rules was causing aa-complain to fail.
This patch also fixes an indentation typo.

Change-Id: I7a0e7e64f236136cd0f15fed22233cea533cad0c
2016-07-08 09:30:09 -07:00
Chris Holcombe cf0f18d59a AppArmor Profile
This change adds an app armor profile for ceph-mon.  It defaults
to complain mode.  This will log all apparmor failures but not
enforce them.

Change-Id: I8b98580cda84191dff46105f8ce01d4a7a7d414f
2016-07-05 13:21:45 -07:00
James Page 02fd38bb60 Re-license charm as Apache-2.0
All contributions to this charm were made under Canonical
copyright; switch to Apache-2.0 license as agreed so we
can move forward with official project status.

In order to make this change, this commit also drops the
inclusion of upstart configurations for very early versions
of Ceph (argonaut), as they are no longer required.

Change-Id: I3d943dfd78f406ba29f86c51e22a13eab448452e
2016-07-01 13:55:54 +01:00
Chris MacNaughton d09ae363cf fix tests 2016-01-28 18:21:15 +01:00
Chris MacNaughton 90e9865a97 remove osd stuff 2016-01-25 11:10:14 -05:00
Chris MacNaughton 43e73c8250 update to use ceph-disk prepare instead of ceph-disk-prepare 2016-01-08 14:03:51 -05:00
Brad Marshall ee991c7554 [bradm] Initial nrpe checks 2014-10-30 16:57:10 +10:00
Paul Collins e37e6cc9e6 import upstart scripts from argonaut 2012-10-03 21:19:53 +13:00