A job name passed via the prometheus_scrape library doesn't end up as a
static job name in the prometheus configuration file in the COS world
even though COS expects a fixed string. Practically, we cannot have a
static job name like job=ceph in any of the alert rules in COS since the
charms will convert the string "ceph" into:
> juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N
Let's give up the possibility of the static job name and use "up{}" so
it will be annotated with the model name/ID, etc. without any specific
job-related condition. This will break the alert rules when one unit
has more than one scraping endpoint, because there will be no way to
distinguish multiple scraping jobs. Ceph MON only has one prometheus
endpoint for the time being so this change shouldn't cause an immediate
issue. Overall, it's not ideal, but it is at least better than the
current state, where the alerts error out of the box.
The following alert rule:
> up{} == 0
will be converted and annotated as:
> up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0
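For illustration, a job-agnostic rule of that shape might sit in a rule
file like this (group and alert names here are made up, not the shipped
ones):

```yaml
groups:
  - name: ceph-mon            # illustrative group name
    rules:
      - alert: CephUnitDown   # illustrative alert name
        expr: up{} == 0       # no job= matcher; the charm's injected
        for: 5m               # juju_* labels do the scoping instead
        labels:
          severity: critical
```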
Closes-Bug: #2044062
Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
Add default prometheus alerting rules for RadosGW multisite deployments based
on the built-in Ceph RGW multisite metrics.
Note that the prometheus_alerts.yml.default rule file is included
for reference only. The ceph-mon charm will utilize the
resource file from https://charmhub.io/ceph-mon/resources/alert-rules
for deployment so that operators can easily customize these rules.
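As a sketch only (the metric name below is a placeholder, not a real
Ceph metric; the real rules live in prometheus_alerts.yml.default and
the alert-rules resource), such a rule could look like:

```yaml
groups:
  - name: rgw-multisite                  # illustrative
    rules:
      - alert: RGWMultisiteSyncErrors    # illustrative
        # placeholder metric name for illustration only
        expr: rate(ceph_rgw_sync_errors_total[5m]) > 0
        for: 10m
```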
Change-Id: I5a12162d73686963132a952bddd85ec205964de4
This duplicates the check performed for ceph status and specialises it for
radosgw-admin sync status instead.
The config options available are:
- nagios_rgw_zones: the zones that are expected to be connected
- nagios_rgw_additional_checks: this is equivalent to nagios_additional_checks
and allows for a configurable set of strings to grep for as critical alerts.
Change-Id: Ideb35587693feaf1cc0736e981005332e91ca861
This reverts commit dfbda68e1a.
Reason for revert:
The Ceph version check seems to be missing consideration of which
user executes the nrpe check. It fails to read the keyrings needed to
execute the command, as it is run by a non-root user.
$ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions
unit-nrpe-0:
  UnitId: nrpe/0
  id: "20"
  results:
    Stderr: |
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f467005f540) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f4670064d88) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 AuthRegistry(0x7f4677360000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      [errno 2] RADOS object not found (error connecting to the cluster)
    check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'', ''versions'']'' returned non-zero exit status 1.'
  status: completed
  timing:
    completed: 2023-02-01 03:03:10 +0000 UTC
    enqueued: 2023-02-01 03:03:09 +0000 UTC
    started: 2023-02-01 03:03:09 +0000 UTC
Related-Bug: #1943628
Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e
This check does not require manually setting the number of expected
OSDs.
Initially, the charm sets the count (per-host) to that of what's
present in the OSD tree. The count will be updated (on a per-host
basis) when the number of OSDs grows, but not when it shrinks. There
is a charm action to reset the expected count using information from
the OSD tree.
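The per-host counting could be sketched like this (our own illustration
based on `ceph osd tree --format json` output, not the charm's actual
code; the function name is made up):

```python
import json
from collections import Counter

def expected_osd_counts(osd_tree_json):
    """Count OSDs per host from `ceph osd tree --format json` output.

    Returns a mapping of host name -> number of OSD ids listed as
    children of that host bucket.
    """
    tree = json.loads(osd_tree_json)
    nodes = {n["id"]: n for n in tree["nodes"]}
    counts = Counter()
    for node in tree["nodes"]:
        if node.get("type") == "host":
            counts[node["name"]] = sum(
                1 for child in node.get("children", [])
                if nodes.get(child, {}).get("type") == "osd"
            )
    return dict(counts)
```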
Closes-Bug: #1952985
Change-Id: Ia6a060bf151908c1d4159e6bdffa7bfe1f0a7988
Alert rules can be attached as a resource and will be transmitted via
the metrics-endpoint relation. Default alert rules taken from upstream
Ceph have been added for reference.
Change-Id: I6a3c6f06e9b9d911b35c8ced1968becc6471b362
This NRPE check confirms whether the versions of the cluster daemons
have diverged:
WARN - any minor version has diverged
WARN - any versions are 1 release behind the mon
CRIT - any versions are 2 releases behind the mon
CRIT - any versions are ahead of the mon
A juju action, 'get-versions-report', is also provided to give users a
quick way to see the daemon versions running on cluster hosts.
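The WARN/CRIT ladder could be sketched as a toy classifier on major
release numbers (our illustration, with the minor-version handling
omitted; not the charm's implementation):

```python
def classify(mon_release, daemon_release):
    """Compare a daemon's major release against the mon's.

    Returns 'OK', 'WARN' or 'CRIT' following the ladder:
    one release behind -> WARN, two or more behind -> CRIT,
    ahead of the mon -> CRIT.
    """
    gap = mon_release - daemon_release
    if gap == 0:
        return "OK"
    if gap == 1:
        return "WARN"
    if gap >= 2:
        return "CRIT"
    return "CRIT"  # daemon is ahead of the mon
```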
Closes-Bug: #1943628
Change-Id: I41b5c8576dc9cf885fa813a93e6d51e8804eb9d8
When the noout flag is set in a Ceph cluster, the Nagios check
currently marks this as a warning (like Ceph itself). However,
setting it to CRITICAL will raise visibility and indicate to the
operator that this should be a temporary state.
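In effect the change maps the flag straight to a CRITICAL exit code,
roughly like this (a sketch; the real check parses the full
`ceph status` output):

```python
def noout_status(flags):
    """Map the osdmap flags string (e.g. "noout,nodeep-scrub") to a
    Nagios-style exit code: noout is now CRITICAL instead of WARNING,
    to push the operator to clear it promptly."""
    if "noout" in flags.split(","):
        return 2  # CRITICAL
    return 0      # OK
```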
Closes-Bug: 1926551
Change-Id: I9831cfea3f63e82fbc8bfebc938a9795b69111c7
The alert is triggered when the number of known OSDs in the osdmap
differs from the number of "in" or "up" OSDs.
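A sketch of the comparison, assuming the num_osds/num_in_osds/
num_up_osds fields from the osdmap section of `ceph status` JSON:

```python
def osds_missing(osdmap):
    """Return True when the osdmap knows more OSDs than are
    currently 'in' or 'up'."""
    known = osdmap["num_osds"]
    return known != osdmap["num_in_osds"] or known != osdmap["num_up_osds"]
```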
Change-Id: Id3d43f0146452d0bbd73e1ce98616a994eaee090
Partial-Bug: 1735579
When Ceph is in a warning state for reason1 and a new reason2 appears
in the meantime, the operator is not alerted and also cannot mute
alarms selectively (as described in bug #1735579).
This patch allows specifying a dictionary of 'name':'regex', where
'name' will become a ceph-$name check in nrpe and $regex will be
searched for in warning/error messages. It is specified via the
charm's nagios_additional_checks parameter.
There is also a nagios_additional_checks_critical parameter which
specifies whether those checks are reported as warning or critical.
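Applying the 'name':'regex' mapping could look roughly like this (our
illustration, not the charm's code):

```python
import re

def match_additional_checks(checks, health_messages):
    """For each configured check name, report whether its regex matches
    any of the current health warning/error messages.

    checks: dict of name -> regex (as configured via
    nagios_additional_checks); returns dict of name -> bool.
    """
    return {
        name: any(re.search(regex, msg) for msg in health_messages)
        for name, regex in checks.items()
    }
```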
Change-Id: I73a7c15db88793bb78841d8395535c97ca2af872
Partial-Bug: 1735579
The current implementation returns Critical when Ceph is in a warning
state, checking for some known exceptions which are considered
operational tasks. However, this causes many alarms.
This patch changes the behavior to report Warning when Ceph is
in HEALTH_WARN. If known operational tasks are exceeding
thresholds, Critical is returned.
Change-Id: I7a330189da8f0ba9168cedb534823c5e8f4795ba
This adds a test to see if the ceph status output looks like Luminous or
newer, and if so changes the output used to collect info.
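One possible heuristic, since Luminous restructured the health section
of the JSON output (a sketch; the exact key the charm tests may
differ):

```python
def is_luminous_or_newer(status):
    """Heuristic: Luminous restructured the health section to use
    'checks'/'status' instead of 'summary'/'overall_status'."""
    return "checks" in status.get("health", {})
```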
Change-Id: I98d194c329aace3c412701e06632dbfedfadefc7
Closes-Bug: #1756864
There is a race condition between collect_ceph_status.sh writing
the status file and check_ceph_status.py reading that file.
This patch fixes that by directing ceph output into a temp file,
and then replacing the old state file with the new temp file using
an atomic mv operation.
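The same write-then-rename pattern expressed in Python for reference
(the actual fix is in the shell script, using mv):

```python
import os
import tempfile

def write_status_atomically(path, data):
    """Write the status to a temp file in the same directory, then
    atomically rename it over the old file, so a concurrent reader
    always sees either the old or the new complete contents."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic, like mv within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```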
Change-Id: If332d187f8dcb9f7fcd8b4a47f791beb8e27eaaa
Closes-Bug: 1755207
Various changes to migrate the charm to work with Python 3. The tox.ini
has been modified to include py35 and py36 targets for testing against
Python 3.5 (xenial, zesty) and Python 3.6 (artful+).
Change-Id: I009de528428aaca555b49f3fc17704dcf5f2a28c
Changes the ceph plugin so it ignores ceph rebalancing unless there is
a large percentage of misplaced/degraded objects (returning a warning
in that case).
Adds config options to tweak that monitoring, and also to only warn if
nodeep-scrub was deliberately set.
Includes some basic unit tests.
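The thresholding could be sketched as follows (percentages and names
are illustrative; the charm exposes the real thresholds as config
options):

```python
def rebalance_status(misplaced_pct, degraded_pct, warn_threshold=10.0):
    """Ignore routine rebalancing; warn only when the share of
    misplaced or degraded objects is large. The default threshold
    here is illustrative, not the charm's."""
    if max(misplaced_pct, degraded_pct) >= warn_threshold:
        return "WARNING"
    return "OK"
```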
Change-Id: I317448cd769597068a706d3944d9d5419e0445c1
ceph-mon is typically deployed under LXC or LXD, where apparmor is
not supported; revert the addition of the apparmor profile feature as
it's currently breaking MAAS+LXD deployments.
This reverts commit d337fdbfd0.
This reverts commit cf0f18d59a.
Change-Id: I94b7c7f5dc0245d273394aeb352731f7bffb1c91
After some testing with aa-complain it was discovered that
one of the apparmor rules was causing aa-complain to fail.
This patch also fixes an indentation typo.
Change-Id: I7a0e7e64f236136cd0f15fed22233cea533cad0c
This change adds an apparmor profile for ceph-mon. It defaults
to complain mode. This will log all apparmor failures but not
enforce them.
Change-Id: I8b98580cda84191dff46105f8ce01d4a7a7d414f
All contributions to this charm were made under Canonical
copyright; switch to Apache-2.0 license as agreed so we
can move forward with official project status.
In order to make this change, this commit also drops the
inclusion of upstart configurations for very early versions
of Ceph (argonaut), as they are no longer required.
Change-Id: I3d943dfd78f406ba29f86c51e22a13eab448452e