A job name passed via the prometheus_scrape library doesn't end up as a
static job name in the Prometheus configuration file in the COS world,
even though COS expects a fixed string. In practice we cannot have a
static job name like job=ceph in any of the alert rules in COS, since
the charms will convert the string "ceph" into:
> juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N
Let's give up on a static job name and use "up{}" instead, so the rule
will be annotated with the model name/ID, etc. without any job-specific
condition. This will break the alert rules when one unit has more than
one scraping endpoint, because there will be no way to distinguish the
scraping jobs. Ceph MON only has one Prometheus endpoint for the time
being, so this change shouldn't cause an immediate issue. Overall it's
not ideal, but at least it's better than the current status, which is
an alert error out of the box.
The following alert rule:
> up{} == 0
will be converted and annotated as:
> up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0
Closes-Bug: #2044062
Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
If a COS prometheus changed event is processed but bootstrap hasn't
completed yet, we need to retry the event at a later time.
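A minimal sketch of the idea under the ops framework (the handler and
helper names here are illustrative, not the charm's actual ones):

    import ops

    class CephMonCharm(ops.CharmBase):
        def _on_prometheus_relation_changed(self, event):
            # If bootstrap hasn't completed, defer so the event is
            # re-emitted on a later hook run instead of failing now.
            if not self._is_bootstrapped():      # hypothetical helper
                event.defer()
                return
            self._configure_prometheus_module()  # hypothetical helper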
Closes-bug: #2042891
Change-Id: I3d274c09522f9d7ef56bc66f68d8488150c125d8
Add default Prometheus alerting rules for RadosGW multisite deployments
based on the built-in Ceph RGW multisite metrics.
Note that the prometheus_alerts.yml.default rule file is included for
reference only. The ceph-mon charm will use the resource file from
https://charmhub.io/ceph-mon/resources/alert-rules for deployment, so
that operators can easily customize these rules.
Change-Id: I5a12162d73686963132a952bddd85ec205964de4
Ceph Reef has a behaviour change where it doesn't always return
version keys for all components. In
I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57 an attempt was made to fix
this by retrying; however, this code path can also be hit when a
component such as the OSDs is absent. While a cluster without OSDs
wouldn't be functional, it still should not cause the charm to error.
As a fix, make the OSD component optional when querying for a version
instead of retrying.
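A rough sketch of the intent (not the charm's exact code):

    import json
    import subprocess

    def get_component_versions():
        # 'ceph versions' returns a JSON map of component -> version
        # counts; on Reef some components (e.g. 'osd') can be missing.
        versions = json.loads(subprocess.check_output(['ceph', 'versions']))
        # Treat the OSD entry as optional rather than retrying until it
        # shows up.
        osd_versions = versions.get('osd', {})
        return versions, osd_versions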
Change-Id: I5524896c7ad944f6f22fb1498ab0069397b52418
This duplicates the check performed for ceph status and specialises it
for radosgw-admin sync status instead.
The config options available are:
- nagios_rgw_zones: the zones that are expected to be connected
- nagios_rgw_additional_checks: equivalent to nagios_additional_checks;
  allows a configurable set of strings to grep for as critical alerts.
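A rough sketch of the check logic these options drive (function and
variable names are illustrative, not the charm's):

    import subprocess

    def check_rgw_sync(expected_zones, critical_patterns):
        # Run 'radosgw-admin sync status' and flag a problem when an
        # expected zone is missing or a configured pattern appears.
        out = subprocess.check_output(
            ['radosgw-admin', 'sync', 'status'], text=True)
        problems = []
        for zone in expected_zones:
            if zone not in out:
                problems.append('zone %s missing from sync status' % zone)
        for pattern in critical_patterns:
            if pattern in out:
                problems.append('critical pattern matched: %s' % pattern)
        return problems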
Change-Id: Ideb35587693feaf1cc0736e981005332e91ca861
Setting the 'mgr/prometheus/rbd_stats_pools' option can fail if it is
attempted too early, even if the cluster is bootstrapped. This is
particularly seen in ceph-radosgw test runs. This patchset therefore
adds a retry decorator to work around the issue.
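A sketch of the approach, assuming a tenacity-style retry decorator
(the actual parameters and helper name may differ):

    import subprocess
    import tenacity

    @tenacity.retry(wait=tenacity.wait_fixed(5),
                    stop=tenacity.stop_after_attempt(10),
                    reraise=True)
    def set_rbd_stats_pools(pools):
        # Retried because the mgr module may not accept the setting
        # immediately after bootstrap.
        subprocess.check_call(
            ['ceph', 'config', 'set', 'mgr',
             'mgr/prometheus/rbd_stats_pools', pools])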
Change-Id: Id9b7b903e67154e7d2bb6fecbeef7fac126804a8
The OpenStack libs don't recognize Ceph releases when specifying the
charm source; instead, we have to use an OpenStack release. Since it
was set to quincy, reset it to bobcat.
Closes-Bug: #2026651
Change-Id: Ibac09d2bf77eeba69789434eaa6112c2028fbf64
During cluster deployment a situation can arise where OSD relations
already exist but the OSDs are not yet fully added to the cluster. This
can make version retrieval fail for OSDs. Retry version retrieval to
give the cluster a chance to settle.
Also update tests to install OpenStack from latest/edge
Change-Id: I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57
Instead of returning an empty dict for already processed
broker requests, store the result and return it. This works
around issues in charms like ceph-fs that spin indefinitely
waiting for the response to a request that never arrives.
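A minimal sketch of the idea (names are illustrative):

    # Cache responses for broker requests that were already handled,
    # keyed by request id, so a repeated request gets the original
    # answer back instead of an empty dict.
    _processed_requests = {}

    def handle_broker_request(request_id, request):
        if request_id in _processed_requests:
            return _processed_requests[request_id]
        response = process_request(request)  # hypothetical processing call
        _processed_requests[request_id] = response
        return response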
Closes-Bug: #2031414
Change-Id: Ie86f007d76fe75cc07cf7a973eff3f535a11dbe7
Add the 'docs' key and point it at a Discourse topic previously
populated with the charm's README contents.
When the new charm revision is released to the Charmhub, this
Discourse-based content will be displayed there. In the absence of this
new key, the Charmhub's default behaviour is to display the value of
the charm's 'description' key.
Change-Id: I173cadb5a8208283883e1119dbfc5d661809cc5f
Avoid the unintuitive situation where users deploy from channel=quincy
but get an older Ceph because they deployed series=focal, by explicitly
setting source=quincy, which is what most users want anyway; those that
do not can still set source explicitly.
Change-Id: I9428e93ba6107ba5e2ebcc667995b3d88eb03d27
This PR makes some small changes in the upgrade path logic by
providing a fallback method of fetching the current ceph-mon
version and adding additional checks to see if the upgrade can
be done in a sane way.
Closes-Bug: #2024253
Change-Id: I1ca4316aaf4f0b855a12aa582a8188c88e926fa6
The charm can now set osd_memory_target, but, by the nature of how the
charm works, it is not per device class or type. Always resetting
osd_memory_target when it is not passed over the relation is risky
behaviour, since operators may have set osd_memory_target by hand with
the `ceph config` command outside of the charm. Let's be less
disruptive on charm upgrade.
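A sketch of the less disruptive behaviour (the helper name and call
are illustrative):

    import subprocess

    def maybe_set_osd_memory_target(relation_value):
        # Only touch osd_memory_target when a value is actually passed
        # over the relation; otherwise leave any operator-set value
        # (e.g. set by hand via 'ceph config set') alone.
        if relation_value is None:
            return
        subprocess.check_call(
            ['ceph', 'config', 'set', 'osd', 'osd_memory_target',
             str(relation_value)])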
Closes-Bug: #1934143
Change-Id: I34dd33e54193a9ebdbc9571d153aa6206c85a067
Auth for getting pool details can fail initially if we set up an rbd
mirror relation at cloud bootstrap. Add some retries to give it another
chance.
Change-Id: I2f5ac561120b1abe52ea0621bb472bc78495fa97
Partial-Bug: #2021967
When Ceph does a version upgrade, it checks the previous Ceph release
via the `source` config variable, which is stored in a persistent file.
But updating that persistent file is broken: we use charmhelpers'
hookenv.Config from the ops framework, and hookenv._run_atexit, which
saves the changes to the file, is never called.
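One way to persist the change explicitly (a sketch, not necessarily
the exact fix; it relies on Config.save() from charmhelpers):

    from charmhelpers.core import hookenv

    def save_config_state():
        # Under the ops framework hookenv._run_atexit() is never
        # invoked, so nothing flushes Config changes to disk; save
        # explicitly instead.
        cfg = hookenv.config()
        cfg.save()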
Partial-Bug: #2007976
Change-Id: Ibf12a2b87736cb1d32788672fb390e027f15b936
func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1047
For better stability, use LTS series for rabbitmq and mysql when
testing instead of interim releases.
Also remove xena (non-LTS) from tests, and yoga as a source default.
Change-Id: Ie443c55dc4cc1b7f63eacfee79b28f210f1277e4
- update bundles to include UCA pocket tests
- update test configuration
- update metadata to include kinetic and lunar
- update snapcraft to allow run-on for kinetic and lunar
Change-Id: I6b229b502dd4ee9f1d219240b86f7826abf0c25d
The operator framework and charmhelpers use the same path for the
local K/V store, which causes problems when running certain hooks
like 'pre-series-upgrade'. In order to work around this issue, this
patchset makes the charmhelpers lib use a different path, while
migrating the DB file before doing so.
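A sketch of the migration, assuming charmhelpers' unitdata honours the
UNIT_STATE_DB environment variable (the file names are illustrative):

    import os
    import shutil

    OLD_DB = '.unit-state.db'               # path shared with the ops framework
    NEW_DB = '.charmhelpers-unit-state.db'  # hypothetical new location

    def migrate_kv_store(charm_dir):
        old = os.path.join(charm_dir, OLD_DB)
        new = os.path.join(charm_dir, NEW_DB)
        if os.path.exists(old) and not os.path.exists(new):
            shutil.copy(old, new)
        # charmhelpers consults UNIT_STATE_DB when locating its K/V store.
        os.environ['UNIT_STATE_DB'] = new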
Closes-Bug: #2005137
Change-Id: Ic2e024371ff431888731753d29fff8538232009a
Commit 40b22e3d on the juju/charm-helpers repo introduced shell quoting
of each argument passed to the check, making the double-quoting done
here not only unnecessary but also damaging to the final command.
Closes-Bug: #2008784
Change-Id: Ifedd5875d27e72a857b01a48afcd058476734695
func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1022
A bug was introduced when changing ceph-client to an operator framework
library that caused the fallback application_name handling to present a
class name rather than a remote application name. This change updates
the handling to use `app.name` rather than `app`.
As a drive-by, this also allow-lists the fully-qualified rename.sh.
Closes-Bug: #1995086
Change-Id: I57b685cb78ba5c4930eb0fa73d7ef09d39d73743
func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1022
This reverts commit dfbda68e1a.
Reason for revert:
The Ceph version check doesn't take into account the user that executes
the NRPE check. It actually fails to find a keyring when running the
command, because the check runs as a non-root user.
$ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions
unit-nrpe-0:
  UnitId: nrpe/0
  id: "20"
  results:
    Stderr: |
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f467005f540) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f4670064d88) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 AuthRegistry(0x7f4677360000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      [errno 2] RADOS object not found (error connecting to the cluster)
    check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'', ''versions'']'' returned non-zero exit status 1.'
  status: completed
  timing:
    completed: 2023-02-01 03:03:10 +0000 UTC
    enqueued: 2023-02-01 03:03:09 +0000 UTC
    started: 2023-02-01 03:03:09 +0000 UTC
Related-Bug: #1943628
Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e
Later Ceph releases require that the --test function of crushtool is
called with replica information for validation.
Pass in "--num-rep 3" as a basic check, plus "--show-statistics" to
silence a non-fatal warning message.
This can be cleanly cherry-picked back at least as far as Ceph 12.2.x.
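A sketch of the invocation this amounts to (the input-file handling
here is illustrative):

    import subprocess

    def validate_crushmap(compiled_map_path):
        # Later Ceph releases need replica info ('--num-rep') for
        # --test; '--show-statistics' silences a non-fatal warning.
        subprocess.check_call(
            ['crushtool', '-i', compiled_map_path, '--test',
             '--num-rep', '3', '--show-statistics'])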
Change-Id: I76d21ddd9da79535f68490b4231ae13705e27edb
Closes-Bug: #2003690
Also, drop python-dbus for simplicity, since "check_upstart_job" in
nrpe is no longer enabled. The python-dbus package is also no longer
available on jammy.
[on focal with systemd]
$ ls -1 /etc/nagios/nrpe.d/
check_ceph.cfg
check_conntrack.cfg
check_reboot.cfg
check_systemd_scopes.cfg
Closes-Bug: #1998163
Change-Id: I30bc22ae8509367207004b90eb2c38ad0fae9ffe
This unpinning is meant to solve the issues with tox 4.x breaking
all the virtualenv dependencies.
Change-Id: Ifc3381b2f2e4e41ebf6676080bf1831baffb0d42
The previous (classic) version of the charm initialised a Config object
in the install hook and let it go out of scope. Initialise a Config
object explicitly in the install and upgrade-charm hooks.