charm-ceph-mon

Commit Graph

Author	SHA1	Message	Date
Nobuto Murata	fb32621831	Don't expect a static job name A job name passed via the prometheus_scrape library doesn't end up as a static job name in the prometheus configuration file in the COS world even though COS expects a fixed string. Practically we cannot have a static job name like job=ceph in any of the alert rules in COS since the charms will convert the string "ceph" into: > juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N Let's give up the possibility of the static job name and use "up{}" so it will be annotated with the model name/ID, etc. without any specific job related condition. It will break the alert rules when one unit has more than one scraping endpoint because there will be no way to distinguish multiple scraping jobs. Ceph MON only has one prometheus endpoint for the time being so this change shouldn't cause an immediate issue. Overall, it's not ideal but at least better than the current status, which is an alert error out of the box. The following alert rule: > up{} == 0 will be converted and annotated as: > up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0 Closes-Bug: #2044062 Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7	2024-03-18 15:29:49 +09:00
Peter Sabaini	35f9af8c96	Fixup: multisite alert rule help texts Change-Id: I558804c8bbd162a15bd97a023ac612d32fd96b02	2024-01-19 19:12:40 +01:00
Peter Sabaini	24fccea832	Add alerting rules for RGW multisite deployments Add default prometheus alerting rules for RadosGW multisite deployments based on the built-in Ceph RGW multisite metrics. Note that the included prometheus_alerts.yml.default rule file is included for reference only. The ceph-mon charm will utilize the resource file from https://charmhub.io/ceph-mon/resources/alert-rules for deployment so that operators can easily customize these rules. Change-Id: I5a12162d73686963132a952bddd85ec205964de4	2024-01-17 16:50:37 +01:00
Danny Cocks	8d7c1060aa	Add nagios check for radosgw-admin sync status This duplicates the check performed for ceph status and specialises it for radosgw-admin sync status instead. The config options available are: - nagios_rgw_zones: this is which zones are expected to be connected - nagios_rgw_additional_checks: this is equivalent to nagios_additional_checks and allows for a configurable set of strings to grep for as critical alerts. Change-Id: Ideb35587693feaf1cc0736e981005332e91ca861	2024-01-10 10:42:24 +11:00
Nobuto Murata	c9389a8cd0	Revert "Create NRPE check to verify ceph daemons versions" This reverts commit `dfbda68e1a`. Reason for revert: The Ceph version check seems to be missing a consideration of users to execute the nrpe check. It actually fails to get keyrings to execute the command as it's run by a non-root user. $ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions unit-nrpe-0: UnitId: nrpe/0 id: "20" results: Stderr: | 2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory 2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f467005f540) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx 2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory 2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f4670064d88) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx 2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory 2023-02-01T03:03:09.560+0000 7f4677361700 -1 AuthRegistry(0x7f4677360000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx [errno 2] RADOS object not found (error connecting to the cluster) check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'', ''versions'']'' returned non-zero exit status 1.' status: completed timing: completed: 2023-02-01 03:03:10 +0000 UTC enqueued: 2023-02-01 03:03:09 +0000 UTC started: 2023-02-01 03:03:09 +0000 UTC Related-Bug: #1943628 Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e	2023-02-01 12:31:16 +09:00
Edin Sarajlic	b8af44aefa	Add nagios check for expected number of OSDs This check does not require manually setting the number of expected OSDs. Initially, the charm sets the count (per-host) to that of what's present in the OSD tree. The count will be updated (on a per-host basis) when the number of OSDs grows, but not when it shrinks. There is a charm action to reset the expected count using information from the OSD tree. Closes-Bug: #1952985 Change-Id: Ia6a060bf151908c1d4159e6bdffa7bfe1f0a7988	2022-10-05 13:02:54 +00:00
Peter Sabaini	9c7101f573	Implement prometheus alert rules Alert rules can be attached as a resource and will be transmitted via the metrics-endpoint relation. Default alert rules taken from upstream ceph have been added for reference. Change-Id: I6a3c6f06e9b9d911b35c8ced1968becc6471b362	2022-09-23 14:22:06 +02:00
Hicham El Gharbi	dfbda68e1a	Create NRPE check to verify ceph daemons versions This NRPE check confirms if the versions of cluster daemons are divergent. WARN - any minor version diverged WARN - any versions are 1 release behind the mon CRIT - any versions are 2 releases behind the mon CRIT - any versions are ahead of the mon A juju action, 'get-versions-report', is also provided; it gives users a quick way to see the daemon versions running on cluster hosts. Closes-Bug: #1943628 Change-Id: I41b5c8576dc9cf885fa813a93e6d51e8804eb9d8	2022-07-19 12:18:06 +02:00
Chi Wai, Chan	86f2a17a2c	Fixed typo in a function comment. --check_osds_down --> --check_num_osds Change-Id: Ic5938cc5f12606ff0cc67df988b95ecf673b6c5f	2022-06-22 15:52:22 +08:00
Garrett Thompson	375a1d0056	Change noout to be a CRITICAL alert instead of WARNING. When the noout flag is set in a Ceph cluster, the Nagios check currently marks this as a warning (like Ceph itself). However, setting it to CRITICAL will raise visibility, and indicate to the operator that this should be a temporary state. Closes-Bug: 1926551 Change-Id: I9831cfea3f63e82fbc8bfebc938a9795b69111c7	2021-09-07 14:34:33 -06:00
Frode Nordahl	c0113217bf	Unpin flake8, fix lint Change-Id: Iab73f1127bfbdf11626727f3044366d2e5745439	2020-08-24 10:54:54 +02:00
Marian Gasparovic	caa1cd8d6a	Creates nrpe check for number of OSDs Alert is triggered when number of known OSDs in osdmap is different than number of "in" or "up" OSDs. Change-Id: Id3d43f0146452d0bbd73e1ce98616a994eaee090 Partial-Bug: 1735579	2019-05-03 10:02:31 +02:00
Marian Gasparovic	8e4fe57295	Creates additional nrpe checks which parse warning/error messages When Ceph is in a warning state for reason1 and in the meantime new reason2 appears, the operator is not alerted and also cannot mute alarms selectively (as described in bug #1735579). This patch allows specifying a dictionary of 'name':'regex' where 'name' will become the ceph-$name check in nrpe and $regex will be searched for in warning/error messages. It is specified via the charm nagios_additional_checks parameter. There is also a nagios_additional_checks_critical parameter which specifies if those checks are reported as warning or error. Change-Id: I73a7c15db88793bb78841d8395535c97ca2af872 Partial-Bug: 1735579	2019-04-02 12:05:22 +02:00
Marian Gasparovic	35c8e40e83	Don't return Critical when ceph is in warning state. Current implementation returns Critical when Ceph is in warning state, checking for some known exceptions which are considered operational tasks. However this causes many Alarms. This patch changes the behavior to report Warning when Ceph is in HEALTH_WARN. If known operational tasks are exceeding thresholds, Critical is returned. Change-Id: I7a330189da8f0ba9168cedb534823c5e8f4795ba	2018-11-21 16:49:14 +01:00
Xav Paice	b97177b7d6	Update Nagios check for Luminous This adds a test to see if the ceph status output looks like Luminous or newer, and if so changes the output used to collect info. Change-Id: I98d194c329aace3c412701e06632dbfedfadefc7 Closes-Bug: #1756864	2018-05-01 10:29:37 +12:00
Tamas Erdei	5dbafb0b2f	Fix race condition in collect_ceph_status.sh There is a race condition between collect_ceph_status.sh writing the status file and check_ceph_status.py reading that file. This patch fixes that by directing ceph output into a temp file, and then replacing the old state file with the new temp file using an atomic mv operation. Change-Id: If332d187f8dcb9f7fcd8b4a47f791beb8e27eaaa Closes-Bug: 1755207	2018-03-13 14:30:49 +01:00
Alex Kavanagh	7586815401	Migrate charm to work with Python3 only Various changes to migrate the charm to work with Python 3. The tox.ini has been modified to include py35 and py36 targets for testing against Python 3.5 (xenial, zesty), and Python 3.6 (artful+). Change-Id: I009de528428aaca555b49f3fc17704dcf5f2a28c	2017-11-17 10:22:30 +00:00
Xav Paice	e93c4a903a	Update ceph nagios plugin Changes ceph plugin so it ignores ceph rebalancing unless there is a large percentage of misplaced/degraded objects (return warning for that). Adds config options to tweak that monitoring, and also just warn if nodeep-scrub was deliberately set. Includes some basic unit tests. Change-Id: I317448cd769597068a706d3944d9d5419e0445c1	2017-11-07 13:34:27 +13:00
Marian Gasparovic	ad9c3da7a2	Plugin should return also a reason for warning from ceph. Change-Id: I9247f374ce88e0c208252b6a37d82fad407cc84a Signed-off-by: Marian Gasparovic <marian.gasparovic@canonical.com>	2017-10-26 13:58:19 +02:00
James Page	e69a771922	Revert "Add AppArmor profile" ceph-mon is typically deployed under LXC or LXD, where apparmor is not supported; revert addition of apparmor profile feature as it's currently breaking MAAS+LXD deployments. This reverts commit `d337fdbfd0`. This reverts commit `cf0f18d59a`. Change-Id: I94b7c7f5dc0245d273394aeb352731f7bffb1c91	2016-07-18 16:23:32 +01:00
Chris Holcombe	d337fdbfd0	Tweak AppArmor profile After some testing with aa-complain it was discovered that one of the apparmor rules was causing aa-complain to fail. This patch also fixes an indentation typo. Change-Id: I7a0e7e64f236136cd0f15fed22233cea533cad0c	2016-07-08 09:30:09 -07:00
Chris Holcombe	cf0f18d59a	AppArmor Profile This change adds an app armor profile for ceph-mon. It defaults to complain mode. This will log all apparmor failures but not enforce them. Change-Id: I8b98580cda84191dff46105f8ce01d4a7a7d414f	2016-07-05 13:21:45 -07:00
James Page	02fd38bb60	Re-license charm as Apache-2.0 All contributions to this charm were made under Canonical copyright; switch to Apache-2.0 license as agreed so we can move forward with official project status. In order to make this change, this commit also drops the inclusion of upstart configurations for very early versions of Ceph (argonaut), as they are no longer required. Change-Id: I3d943dfd78f406ba29f86c51e22a13eab448452e	2016-07-01 13:55:54 +01:00
Chris MacNaughton	d09ae363cf	fix tests	2016-01-28 18:21:15 +01:00
Chris MacNaughton	90e9865a97	remove osd stuff	2016-01-25 11:10:14 -05:00
Chris MacNaughton	43e73c8250	update to use ceph-disk prepare instead of ceph-disk-prepare	2016-01-08 14:03:51 -05:00
Brad Marshall	ee991c7554	[bradm] Initial nrpe checks	2014-10-30 16:57:10 +10:00
Paul Collins	e37e6cc9e6	import upstart scripts from argonaut	2012-10-03 21:19:53 +13:00

28 Commits
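
Commit 5dbafb0b2f above describes fixing a reader/writer race by writing the new status to a temp file and then swapping it into place with an atomic mv. The actual fix lives in the shell script collect_ceph_status.sh; what follows is only a minimal Python sketch of the same write-then-rename pattern, not the charm's code. The function name `write_status_atomically` and the file names used are hypothetical.

```python
import os
import tempfile


def write_status_atomically(status_text: str, status_path: str) -> None:
    """Replace status_path with status_text so a reader never sees a partial file.

    Hypothetical helper for illustration; not part of the charm.
    """
    directory = os.path.dirname(os.path.abspath(status_path))
    # Create the temp file next to the target so the final rename stays on
    # one filesystem, which is what makes the swap atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(status_text)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, status_path)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens the status file at any moment gets either the complete old contents or the complete new contents, which is the property the commit message relies on.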