The install hooks rsync a set of scripts, and one of the destinations
is /usr/lib/stonith/plugins/external. This directory is created by the
installation of the cluster-glue package, which is pulled in as an
indirect dependency of pacemaker. This changed on >=jammy, where the
intermediate package resource-agents was split into
resource-agents-base and resource-agents-extra; the latter is the one
that depends on cluster-glue, and it does not get installed.
The specific dependency chains are:
focal:
pacemaker -> pacemaker-resource-agents -> resource-agents -> cluster-glue
jammy:
pacemaker -> pacemaker-resource-agents -> resource-agents-base
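A minimal sketch of the kind of guard this implies (path taken from the
message above; having the hook create the directory itself is an
assumption, not necessarily what this change does):

    import os

    PLUGIN_DIR = '/usr/lib/stonith/plugins/external'

    # Don't rely on cluster-glue having created the destination; make sure
    # it exists before rsyncing the stonith plugin scripts into place.
    os.makedirs(PLUGIN_DIR, exist_ok=True)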
Change-Id: Ia00061bff2ebe16d35d52b256c61243935edabba
Closes-Bug: #1971841
Hooks are expected to be idempotent. If the install hook needs to be
re-run for whatever reason and the /etc/corosync directory already
exists, for example because it was created in a previous run, a
FileExistsError exception is raised. This change catches the exception
and moves on.
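A minimal sketch of the pattern described above (path from the message;
the surrounding hook code is illustrative):

    import os

    try:
        os.mkdir('/etc/corosync')
    except FileExistsError:
        # Already created by a previous run of the hook; hooks must be
        # idempotent, so just move on.
        pass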
Change-Id: If43a5c95bb59c9cca7f1a975214a9f013ad6f4d6
Closes-Bug: #1971762
Starting in focal, the Ubuntu version of the corosync package synced from
Debian includes node1 as the default name for the local node, with a nodeid
of 1. This causes the cluster to have knowledge of this extra node1 node,
which affects quorum, etc. Installing the charm's corosync.conf file
before package installation prevents this default configuration from ever
taking effect.
Additionally, this change removes some Xenial bits from the charm and always
includes a nodelist in corosync.conf, as it is compulsory on focal and
newer. It is optional in the bionic packages, so we always just render the
nodelist.
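A sketch of the ordering this relies on (render_corosync_conf is a
hypothetical helper; apt_install is the charmhelpers helper):

    from charmhelpers.fetch import apt_install

    # Render the charm's corosync.conf (including the nodelist) first, so
    # the package post-install scripts never see the Debian default config
    # with its implicit 'node1'.
    render_corosync_conf()  # hypothetical helper
    apt_install(['corosync', 'pacemaker'], fatal=True)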
Change-Id: I06b9c23eb57274f0c99a3a05979c0cabf87c8118
Closes-Bug: #1874719
The mock third party library was needed for mock support in py2
runtimes. Since we now only support py36 and later, we can use the
standard lib unittest.mock module instead.
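The switch is essentially the following (the patch target is illustrative):

    # Before (py2-era third-party package):
    #   import mock
    # After (py3.6+ standard library):
    from unittest import mock

    with mock.patch('pcmk.commit') as commit:
        ...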
Note that https://github.com/openstack/charms.openstack is used during tests
and it needs `mock`; unfortunately it doesn't declare `mock` in its
requirements, so it picks up `mock` from another charm project (a cross
dependency).
So we depend on charms.openstack first, and once
Ib1ed5b598a52375e29e247db9ab4786df5b6d142 is merged, CI will pass
without errors.
Depends-On: Ib1ed5b598a52375e29e247db9ab4786df5b6d142
Change-Id: I631d32e1a330bcd17b53ee873833e8434023958f
Adds a config item for what to do when the cluster does not have quorum.
This is useful with stateless services where, e.g., we only need a VIP
and that can be up on a single host with no problem.
Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have a VIP on multiple hosts,
which causes ARP issues with the switch.
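Under the hood this maps onto pacemaker's no-quorum-policy cluster
property; a sketch (the charm config option name is an assumption):

    import subprocess
    from charmhelpers.core.hookenv import config

    # e.g. one of 'ignore', 'stop', 'freeze', 'suicide'
    policy = config('no_quorum_policy')
    subprocess.check_call(
        ['crm', 'configure', 'property',
         'no-quorum-policy={}'.format(policy)])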
Closes-bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
This was happening because
trigger_corosync_update_from_leader() was being called
not only in `ha` relation hooks but also in
`pacemaker-remote` relation hooks after the implementation
for the related bug landed.
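The fix amounts to guarding the call on the relation actually being
handled; a rough sketch (the exact guard and call site are illustrative):

    from charmhelpers.core.hookenv import relation_type

    # Only propagate the leader's corosync.conf while handling the ha/hanode
    # relations, not the pacemaker-remote relation.
    if relation_type() in ('ha', 'hanode'):
        trigger_corosync_update_from_leader()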
Closes-Bug: #1920124
Related-Bug: #1400481
Change-Id: I4952ef694589de6b72f04b387e30ca2333bc4cbc
Add an `update-ring` action for that purpose.
Also print more on various pacemaker failures.
Also removed some dead code.
Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
In bionic, crmsh was dropped as a dependency of corosync, so it
becomes a candidate for removal on older charm deployments
upgrading from xenial to bionic.
Mark the package as manually installed in the pre-upgrade hook
to ensure that it never becomes a candidate for removal.
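A sketch of the pre-upgrade step (using plain apt-mark; the charm may use
a charmhelpers wrapper instead):

    import subprocess

    # Mark crmsh as manually installed so apt autoremove never considers it
    # a removal candidate after the xenial->bionic upgrade.
    subprocess.check_call(['apt-mark', 'manual', 'crmsh'])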
Change-Id: I675684fee5410f86aace2e42515d3e325d8d12f8
Closes-Bug: 1900206
This patch replaces the sqlite3 kvstore implementation provided by
charmhelpers, and patches the kvstore used in pcmk for both the
test_pcmk and the test_hacluster_hooks tests.
closes-bug: #1908282
Change-Id: I3320735314f0b03aecec6635ef82ddd44eecaff1
This value is hard-coded to 60 seconds in the charm code.
This change adds a charm config option (with 60 secs as the
default value) in order to make the `cluster-recheck-interval`
property configurable.
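A sketch of how the property ends up in the cluster configuration (the
config option name and helper call are illustrative):

    import subprocess
    from charmhelpers.core.hookenv import config

    interval = config('cluster_recheck_interval')  # defaults to 60
    subprocess.check_call(
        ['crm', 'configure', 'property',
         'cluster-recheck-interval={}s'.format(interval)])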
Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
Corosync 2.99 altered the status output for udp/udpu rings to
be hardcoded to 'OK'. This breaks the check_corosync_rings nrpe
check which is looking for 'ring $number active with no faults'.
Since the value has been hardcoded to show 'OK', the check itself
does not provide any real meaningful value.
Change-Id: I642ecf11946b1ea791a27c54f0bec54adbfecb83
Closes-Bug: #1902919
Previously, paused hacluster units showed up as CRITICAL errors
in Nagios even though they were only in 'standby' mode
in corosync.
The hacluster charm now uses the '-s' option of the check_crm
nrpe script to ignore alerts from standby units.
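A sketch of the resulting NRPE check definition (using the charmhelpers
NRPE helper; shortname and description are illustrative):

    from charmhelpers.contrib.charmsupport.nrpe import NRPE

    nrpe = NRPE()
    # '-s' makes check_crm ignore nodes that are merely in standby, so a
    # deliberately paused unit no longer raises a CRITICAL in Nagios.
    nrpe.add_check(shortname='crm_status',
                   description='Check crm status',
                   check_cmd='check_crm -s')
    nrpe.write()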
Change-Id: I976d5ff01d0156fbaa91f9028ac81b44c96881af
Closes-Bug: #1880576
The old check_crm script had separate checks for failcounts and failed
actions, but since failed actions cause failcounts, the two will always be
present together, and expire together.
Furthermore, the previous defaults effectively caused the failed actions
check to shadow the failcount one, because the former used to cause
CRITICALs, while the latter was only causing WARNINGs.
This version of check_crm deprecates failed actions detection in favor of
only failcount alerting, but adds support for separate warn/crit
thresholds.
Default thresholds are set at 3 and 10 for warn and crit, respectively.
Although sending criticals for high fail counter entries may seem
redundant when we already do that for stopped resources, some resources
are configured with infinite migration thresholds and will therefore
never show up as failed in crm_mon. Having separate fail counter
thresholds can therefore still be valuable, even if for most resources
migration-threshold will be set lower than the critical fail-counter threshold.
Closes-Bug: #1864040
Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00
There appears to be a window between a pacemaker remote resource
being added and the location properties for that resource being
added. In this window the resource is down and pacemaker may fence
the node.
The window is present because the charm currently does:
1) Set stonith-enabled=true cluster property
2) Add maas stonith device that controls pacemaker remote node that
has not yet been added.
3) Add pacemaker remote node
4) Add pacemaker location rules.
I think the following two fixes are needed:
1) For initial deploys update the charm so it does not enable stonith
until stonith resources and pacemaker remotes have been added.
2) For scale-out do not add the new pacemaker remote stonith resource
until the corresponding pacemaker resource has been added along
with its location rules.
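A sketch of the first fix, gating the cluster property on the resources
actually existing (helper names are illustrative, not the charm's real API):

    def update_stonith_enabled():
        # Only turn fencing on once every pacemaker-remote node has both its
        # stonith resource and its location rules in place; otherwise a
        # freshly added remote could be fenced before it is fully configured.
        if stonith_resources_configured() and remote_locations_configured():
            set_cluster_property('stonith-enabled', 'true')
        else:
            set_cluster_property('stonith-enabled', 'false')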
Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490
Closes-Bug: #1884284
Use location directives to spread pacemaker remote resources across
cluster. This is to prevent multiple resources being taken down in
the event of a single node failure. This would usually not be a
problem but if the node is being queried by masakari host
monitors at the time the node goes down then the query can hang.
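A sketch of spreading the remote resources with location constraints
(resource/node names and the score are illustrative):

    import itertools
    import subprocess

    def spread_remotes(remote_resources, cluster_nodes):
        # Pin each pacemaker-remote resource to a different cluster node so
        # a single node failure only takes down one remote connection.
        nodes = itertools.cycle(sorted(cluster_nodes))
        for res in sorted(remote_resources):
            node = next(nodes)
            subprocess.check_call(
                ['crm', 'configure', 'location',
                 'loc-{}-{}'.format(res, node), res, '200:', node])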
Change-Id: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Partial-Bug: #1889094
Depends-On: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
If stonith is enabled then when a compute node is detected as failed
it is powered down. This can include a lxd container which is also
part of the cluster. In this case because stonith is enabled at a
global level, pacemaker will try and power off the lxd container
too. But the container does not have a stonith device and this causes
the container to be marked as unclean (but not down). This running
unclean state prevents resources from being moved and causes any
pacemaker-remotes that are associated with the lost container to
lose their connection, which prevents masakari hostmonitor from
ascertaining the cluster health.
The way to work around this is to create a dummy stonith device for
the lxd containers. This allows the cluster to properly mark the lost
container as down and resources are relocated.
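A sketch of such a dummy device, assuming the cluster-glue 'null' stonith
plugin is used (the plugin choice, resource name and parameter are
assumptions):

    import subprocess

    def add_dummy_stonith(container_hostnames):
        # Give the lxd containers a no-op stonith device so pacemaker can
        # mark a lost container as cleanly down instead of leaving it
        # 'unclean' and blocking resource relocation.
        subprocess.check_call(
            ['crm', 'configure', 'primitive', 'st-null', 'stonith:null',
             'params', 'hostlist="{}"'.format(' '.join(container_hostnames))])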
Change-Id: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
Partial-Bug: #1889094
Check whether peers have sent series upgrade notifications before
pausing a unit. If notifications have been sent then HA services
will have been shutdown and pausing will fail.
Similarly, if series upgrade notifications have been sent then
do not try to issue crm commands when assessing status.
Change-Id: I4de0ffe5d5e24578db614c2e8640ebd32b8cd469
Closes-Bug: #1877937
Stop HA services across all units of an application when doing a
series upgrade to avoid the situation where the cluster has some
nodes on LTS N-1 and some on LTS N.
1) In the 'pre-series-upgrade' send a notification to peers informing
them that the unit is doing a series upgrade and to which Ubuntu
version.
2) Peers receive notification. If they are on a later Ubuntu version
than the one in the notification then they do nothing. Otherwise
they shutdown corosync and pacemaker and add an entry to the local
kv store with waiting-unit-upgrade=True.
3) In the 'post-series-upgrade' hook the notification is removed from
the peer relation. waiting-unit-upgrade is set to False and
corosync and pacemaker are started.
The result of this is that when the first unit in the cluster starts
a series upgrade all cluster services are shutdown across all units.
They then rejoin the cluster one at a time when they have been
upgraded to the new version.
I added the waiting-unit-upgrade key to deal with the situation where
the first node clears the notification after it has successfully
upgraded; without the waiting-unit-upgrade key the peers would not know
they were in a mixed Ubuntu version cluster.
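A sketch of steps 1) and 2) above (relation and kv key names are
illustrative):

    from charmhelpers.core import unitdata
    from charmhelpers.core.hookenv import relation_ids, relation_set
    from charmhelpers.core.host import (CompareHostReleases, lsb_release,
                                        service_stop)

    def notify_peers_of_series_upgrade():
        # Step 1: tell peers which Ubuntu version this unit is upgrading to.
        target = lsb_release()['DISTRIB_CODENAME']
        for rid in relation_ids('hanode'):
            relation_set(relation_id=rid,
                         relation_settings={'series-upgrade-to': target})

    def handle_peer_series_upgrade(notified_version, my_version):
        # Step 2: peers not already on a later version stop HA services and
        # remember that they are waiting for the cluster upgrade to finish.
        if CompareHostReleases(my_version) <= notified_version:
            service_stop('pacemaker')
            service_stop('corosync')
            unitdata.kv().set('waiting-unit-upgrade', True)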
Change-Id: Id9167534e8933312c561a6acba40399bca437706
Closes-Bug: 1859150
Pacemaker will refuse to delete a resource that is running, so it
always needs to be stopped before deleting it.
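A sketch of the ordering (the wrapper function is illustrative):

    import subprocess

    def delete_resource(name):
        # Pacemaker refuses to delete a running resource, so stop it first.
        subprocess.check_call(['crm', 'resource', 'stop', name])
        subprocess.check_call(['crm', 'configure', 'delete', name])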
Change-Id: I3c6acdef401e9ec18fedc65e9c77db4719fe60ec
Closes-Bug: #1838528
This commit adds two new options, failed_actions_alert_type and
failed_actions_threshold, which map onto the check_crm options
--failedactions and --failcounts, respectively.
The default option values make check_crm generate critical alerts if
actions failed once.
The actions check can be entirely bypassed if failed_actions_alert_type
is set to 'ignore'.
Closes-Bug: #1796400
Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
As explained here[0], setting failure-timeout means that the cib will 'forget'
that a resource agent action failed by setting failcount to 0:
- if $failure-timeout seconds have elapsed from the last failure
- if an event wakes up the policy engine (i.e. at the global resource
recheck in an idle cluster)
By default the failure-timeout is set to 0, which disables the feature;
however, this change allows it to be tuned.
[0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response
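A sketch of how such a timeout can be expressed, here as a resource
default via crm (the config option name is an assumption, and the charm
may apply it per resource rather than as a default):

    import subprocess
    from charmhelpers.core.hookenv import config

    timeout = config('failure_timeout')  # 0 keeps the feature disabled
    subprocess.check_call(
        ['crm', 'configure', 'rsc_defaults',
         'failure-timeout={}'.format(timeout)])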
Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88
Closes-Bug: #1802310
Stonith is being disabled at the global cluster level despite it being needed
for pacemaker-remote nodes.
The legacy hacluster charm option 'stonith_enable' covers the main 'member'
nodes and if it is set to false then stonith resources are not created for
them and the stonith-enabled cluster parameter is set to false. However, in a
masakari deploy stonith is not required for the member nodes but is for the
remote nodes. In this case the stonith-enabled cluster option should be
set to true.
Change-Id: Ie1affa17dd3cfcd677aa866b6e3d1c1004bb13c9
Closes-Bug: #1824828
The libmaas stonith plugin uses libmaas which is supplied via
python3-libmaas. python3-libmaas is available from bionic onwards
so install it where possible.
Also, a drive-by fix to precreate the /etc/pacemaker dir if needed, to
fix trusty installs.
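A sketch of both parts (CompareHostReleases is the usual charmhelpers way
to gate on the series; exact placement in the hooks is illustrative):

    import os
    from charmhelpers.core.host import CompareHostReleases, lsb_release
    from charmhelpers.fetch import apt_install

    release = lsb_release()['DISTRIB_CODENAME']
    if CompareHostReleases(release) >= 'bionic':
        # python3-libmaas (used by the libmaas stonith plugin) only exists
        # from bionic onwards.
        apt_install(['python3-libmaas'], fatal=True)

    # Drive-by: trusty packaging doesn't always create /etc/pacemaker.
    os.makedirs('/etc/pacemaker', exist_ok=True)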
Closes-Bug: #1823300
Closes-Bug: #1823302
Change-Id: Ib14146f7f667f9c52e11d222f950efcb7cb47a7f
This change adds support for pacemaker-remotes joining the cluster
via the pacemaker-remote relation. The pacemaker-remotes can
advertise whether they should host resources. If the
pacemaker-remotes are only being used for failure detection
(as is the case with masakari host monitors) then they
will not host resources.
Pacemaker remotes are managed in the cluster as resources
which nominally run on a member in the main cluster. The
resource that corresponds to the pacemaker-remote is managed
via configure_pacemaker_remotes and configure_pacemaker_remote
functions.
If the pacemaker-remotes should not run resources then the
cluster needs to be switched to an opt-in cluster. In an
opt-in cluster location rules are needed to explicitly
allow a resource to run on a specific node. This behaviour
is controlled via the global cluster parameter 'symmetric-cluster',
which is set via the new method set_cluster_symmetry. The method
add_location_rules_for_local_nodes is used for creating these
explicit rules.
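A sketch of what set_cluster_symmetry boils down to (the crm invocation
is illustrative):

    import subprocess

    def set_cluster_symmetry(symmetric):
        # An opt-in cluster (symmetric-cluster=false) only runs resources on
        # nodes that a location rule explicitly allows, which keeps resources
        # off pacemaker-remotes used purely for failure detection.
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'symmetric-cluster={}'.format(str(symmetric).lower())])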
Change-Id: I0a66cfb1ecad02c2b185c5e6402e77f713d25f8b
This patch implements support to update parameters of an already
existing resource using "crm configure load update FILE"
The parameters of a resource are hashed using md5 and stored in the kv
store; when the checksum doesn't match the resource is updated,
otherwise the update is skipped.
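A sketch of the checksum guard (kv key naming is illustrative):

    import hashlib
    import subprocess
    from charmhelpers.core import unitdata

    def update_resource_if_changed(name, resource_definition, crm_file):
        kv = unitdata.kv()
        csum = hashlib.md5(resource_definition.encode('utf-8')).hexdigest()
        if kv.get('res-checksum-{}'.format(name)) == csum:
            return  # parameters unchanged, nothing to do
        subprocess.check_call(
            ['crm', 'configure', 'load', 'update', crm_file])
        kv.set('res-checksum-{}'.format(name), csum)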
Change-Id: I5735eaa1309c57e3620b0a6f68ffe13ec8165592
Closes-Bug: 1753432
validate_dns_ha() shouldn't have the side effect of setting the workload
status and message, so now it will raise MAASConfigIncomplete with a
message, and ha_relation_changed() catches the exception, setting the
workload to blocked status and short-circuiting the hook execution until
the user sets maas_url and maas_credentials.
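The resulting pattern in ha_relation_changed() looks roughly like this
(status_set is the charmhelpers helper):

    from charmhelpers.core.hookenv import status_set

    try:
        validate_dns_ha()
    except MAASConfigIncomplete as exc:
        status_set('blocked', str(exc))
        # Short-circuit the hook until maas_url and maas_credentials are set.
        return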
Change-Id: I3f1bc6571d2461cb65f384f042423cfe33f4d2f8
Closes-Bug: 1790559
Use the xml output provided by "crm configure" and parse it to look
for nodes that match the xpath ".//*[@id='$NAME']". The test case added
uses the xml generated when ceph-radosgw has dns-ha enabled, which
creates a group of hostnames that cross-references resources, making the
previous approach give false positives.
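A sketch of the lookup against the xml (using the stdlib ElementTree; the
function name is illustrative):

    import xml.etree.ElementTree as ET

    def resource_exists(crm_configure_xml, name):
        # Match only elements whose id is exactly the resource name, so a
        # hostname group that merely references the resource does not count.
        root = ET.fromstring(crm_configure_xml)
        return len(root.findall(".//*[@id='{}']".format(name))) > 0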
Change-Id: If1c3584c889e7e101f15ed5ba6de89c687667754
Closes-Bug: 1789915
This patch implements support to update parameters of an already
existing resource using "crm configure load update FILE"
Change-Id: I22730091d674145db4a1187b0904d9f88d9d8c6d
Partial-Bug: #1753432
Rework hooks to support network space binding of the hanode
peer relation to a specific network space.
Note that the get_relation_ip function also deals with the
'prefer-ipv6' legacy configuration option handling, so it
was safe to remove some charm specific code in this area.
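The binding-aware lookup is essentially:

    from charmhelpers.contrib.network.ip import get_relation_ip

    # Returns the address of the network space bound to the hanode peer
    # relation, and also honours the legacy prefer-ipv6 option.
    addr = get_relation_ip('hanode')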
Change-Id: Ic69e97debddba42e3d4a140f7f9cfc95768f71c3
Closes-Bug: 1659464
The MAAS DNS ocf resource is created specifying the IP address
from the leader node. When the resource is moved to another node
the IP address is not updated because the DNS record already
points to the leader's IP address. This means that the DNS record
will never be updated with any other unit's IP address when the
resource is moved around the cluster.
The IP address to use for the DNS record should be stored in the
pacemaker resource as the configuration is the same across the
entire cluster. This change makes it so that the IP address that
should be bound is written to a file in /etc/maas_dns/$resource_name
and used by the ocf:maas:dns resource when managing the DNS
resource records.
Migration is handled by checking the current version of the MAAS
OCF resource file and determining if the OCF_RESOURCE_INSTANCE
(the name of the resource) is present in the file.
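A sketch of the per-resource address file the ocf:maas:dns agent reads
(file layout as described above; the helper name is illustrative):

    import os

    def write_maas_dns_address(resource_name, resource_addr):
        # The file content is identical on every node, so the resource
        # behaves the same wherever pacemaker decides to run it.
        os.makedirs('/etc/maas_dns', exist_ok=True)
        with open('/etc/maas_dns/{}'.format(resource_name), 'w') as f:
            f.write(resource_addr)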
Change-Id: If4e07079dd66dac51cd77c2600106b9b562c2483
Closes-Bug: #1711476
This config option allows sysadmins to set pacemaker in maintenance mode,
which stops monitoring of the configured resources, so services
can be stopped/restarted and pacemaker won't start them again or
migrate resources (e.g. virtual IPs).
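A sketch of the property toggle behind the option (the config option name
is an assumption):

    import subprocess
    from charmhelpers.core.hookenv import config

    maintenance = config('maintenance-mode')  # assumed option name
    subprocess.check_call(
        ['crm', 'configure', 'property',
         'maintenance-mode={}'.format(str(bool(maintenance)).lower())])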
Change-Id: I232a043e6d9d45f2cf833d4f7c4d89b079f258bb
Partial-Bug: 1698926
All contributions to this charm were made under Canonical
copyright; switch to Apache-2.0 license as agreed so we
can move forward with official project status.
This charm does include files from a few other projects
which we can't re-license - leave those as is for now.
Change-Id: I4d0ec0cceed05ef6b6153148c8b9fc9333189b77
Allow DNS to be the HA resource in lieu of a VIP when using MAAS 2.0.
Added an OCF resource dns
Added maas_dns.py as the api script to update a MAAS 2.0 DNS resource
record.
Charmhelpers sync to pull in DNS HA helpers
Change-Id: I0b71feec86a77643892fadc08f2954204b541d01
Specify the vote count for multicast quorum to be the number of nodes
which are configured. Additionally, specify the two_node value in the
quorum section when there are 2 nodes configured.
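A sketch of the quorum settings this produces, before being rendered into
corosync.conf by the template (values are illustrative):

    def quorum_settings(node_count):
        # expected_votes pins the vote count to the number of configured
        # nodes; two_node relaxes quorum rules for a two-node cluster.
        settings = {'expected_votes': node_count}
        if node_count == 2:
            settings['two_node'] = 1
        return settings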
Closes-Bug: 1394008