Adds a config item for what to do when the cluster does not have quorum.
This is useful for stateless services where, e.g., only a VIP is needed
and it can be up on a single host with no problem.
Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have the VIP up on multiple
hosts, causing ARP issues with the switch.
Closes-Bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
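For context, this behaviour maps onto pacemaker's no-quorum-policy
cluster property. A minimal sketch of applying it, assuming the charm
shells out to crmsh (illustrative only, not the charm's actual plumbing):

    import subprocess

    def apply_no_quorum_policy(policy):
        # valid pacemaker values include 'stop', 'ignore', 'freeze', 'suicide'
        if policy not in ('stop', 'ignore', 'freeze', 'suicide'):
            raise ValueError('unknown no-quorum-policy: %s' % policy)
        subprocess.check_call(
            ['crm', 'configure', 'property', 'no-quorum-policy=%s' % policy])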
Add an `update-ring` action for removing departed nodes from the
corosync ring.
Also print more diagnostic output on various pacemaker failures, and
remove some dead code.
Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
This has been deprecated since June 2020 via a 'blocked' message in
assess_status_helper(), but this commit:
1. makes the deprecation explicit in config.yaml, and
2. removes the corresponding (already dead) code.
Change-Id: Ia6315273030e31b10125f2dd7a7fb7507d8a10b7
The `cluster-recheck-interval` property is hard-coded to 60 seconds
in the charm code. This change adds a charm config option (with 60
seconds as the default value) in order to make the property
configurable.
Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
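A minimal sketch of how the option could be applied via crmsh
(illustrative; the charm's real code path differs):

    import subprocess

    def set_cluster_recheck_interval(seconds=60):
        # 60 seconds matches the value the charm previously hard-coded
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'cluster-recheck-interval=%ds' % seconds])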
The charm installs systemd overrides of the TimeoutStopSec and
TimeoutStartSec parameters for the corosync and pacemaker services.
The overrides change the default stop timeout to 60s, a significant
departure from the package-level default of 30 minutes. The pacemaker
systemd default is 30 minutes to allow time for resources to safely
move off the node before shutting down; it can take some time for
services to migrate away under a variety of circumstances (node load,
the resource type, etc.).
This change increases the timeout to 10 minutes by default, which
should prevent things like unattended-upgrades from causing outages
due to services not starting because systemd timed out (and an
instance was already running).
Change-Id: Ie88982fe987b742082a978ff2488693d0154123b
Closes-Bug: #1903745
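A sketch of the kind of systemd drop-in this installs (file names and
values are illustrative; the charm renders its own templates):

    from pathlib import Path

    OVERRIDE = "[Service]\nTimeoutStartSec=600\nTimeoutStopSec=600\n"

    for service in ('corosync', 'pacemaker'):
        dropin = Path('/etc/systemd/system/%s.service.d' % service)
        dropin.mkdir(parents=True, exist_ok=True)
        (dropin / 'overrides.conf').write_text(OVERRIDE)
    # a 'systemctl daemon-reload' is needed for the drop-ins to take effect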
The old check_crm script had separate checks for failcounts and failed
actions, but since failed actions cause failcounts, the two will always be
present together, and expire together.
Furthermore, the previous defaults effectively caused the failed actions
check to shadow the failcount one, because the former used to cause
CRITICALs, while the latter was only causing WARNINGs.
This version of check_crm deprecates failed actions detection in favor of
only failcount alerting, but adds support for separate warn/crit
thresholds.
Default thresholds are set at 3 and 10 for warn and crit, respectively.
Although sending CRITICALs for high fail-count entries may seem
redundant when we already do that for stopped resources, some
resources are configured with infinite migration thresholds and will
therefore never show up as failed in crm_mon. Having separate
fail-count thresholds can therefore still be valuable, even if for
most resources migration-threshold will be set lower than the
critical fail-count threshold.
Closes-Bug: #1864040
Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00
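The alerting logic boils down to a Nagios-style threshold comparison;
a sketch in Python (check_crm itself is a Nagios plugin, so this is
illustrative only):

    # Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
    def failcount_status(failcounts, warn=3, crit=10):
        worst = max(failcounts.values(), default=0)
        if worst >= crit:
            return 2
        if worst >= warn:
            return 1
        return 0

    # e.g. failcount_status({'res_nova_vip': 4}) -> 1 (WARNING)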
This commit adds two new options, failed_actions_alert_type and
failed_actions_threshold, which map onto the check_crm options
--failedactions and --failcounts, respectively.
The default option values make check_crm generate a CRITICAL alert if
an action has failed at least once.
The actions check can be entirely bypassed if failed_actions_alert_type
is set to 'ignore'.
Closes-Bug: #1796400
Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
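A sketch of how the two charm options might be rendered into check_crm
arguments (the defaults shown are inferred from this commit message;
the rendering itself is illustrative):

    def check_crm_args(config):
        args = []
        alert_type = config.get('failed_actions_alert_type', 'crit')
        if alert_type != 'ignore':  # 'ignore' bypasses the actions check
            args += ['--failedactions', alert_type,
                     '--failcounts',
                     str(config.get('failed_actions_threshold', 1))]
        return args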
As explained here [0], setting failure-timeout means that the CIB will
'forget' that a resource agent action failed, by resetting the
failcount to 0:
- if $failure-timeout seconds have elapsed since the last failure, and
- if an event wakes up the policy engine (e.g. at the global resource
  recheck in an idle cluster).
By default the failure-timeout is set to 0, which disables the
feature; this change, however, makes it tunable.
[0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response
Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88
Closes-Bug: #1802310
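failure-timeout is a per-resource meta attribute; one way to apply it
cluster-wide is via crmsh resource defaults (a sketch; the charm may
instead set it per resource):

    import subprocess

    def set_failure_timeout(seconds):
        # 0 keeps the feature disabled: failcounts never expire
        subprocess.check_call(
            ['crm', 'configure', 'rsc_defaults',
             'failure-timeout=%ds' % seconds])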
To work with pacemaker remotes, all pacemaker nodes (including the
remotes) need to share a common key, in the same way that corosync
does. This change allows a user to set a pacemaker key via config,
just as for corosync. If the pacemaker key value is unset, then the
corosync key is used.
Change-Id: I75247e7f3af29fc0907a94ae8e1678bdb9ee64e2
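The fallback reduces to the following (a sketch, with hypothetical
config keys):

    def pacemaker_key(config):
        # fall back to the corosync key when no pacemaker key is configured
        return config.get('pacemaker_key') or config.get('corosync_key')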
This config option allows sysadmins to put pacemaker in maintenance
mode, which stops monitoring of the configured resources so that
services can be stopped/restarted without pacemaker starting them
again or migrating resources (e.g. virtual IPs).
Change-Id: I232a043e6d9d45f2cf833d4f7c4d89b079f258bb
Partial-Bug: 1698926
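A sketch of toggling the underlying pacemaker property (illustrative
only):

    import subprocess

    def set_maintenance_mode(enabled):
        # while 'true', pacemaker neither monitors nor manages resources
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'maintenance-mode=%s' % ('true' if enabled else 'false')])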
Most production deployments involve 3 nodes for high availability.
This change modifies the default cluster_count to 3 in order to
align with the typical deployment scenario. Previously, the default
cluster count was 2, which is often left unmodified and then leads
to deployment failures due to a mismatch in the nodelist defined in
each node's corosync.conf file.
Change-Id: I0799d8b880ecdb9c933d0361e7dc843b68fc5c82
On restart, corosync may take longer than a minute to come up, and
the systemd start script times out too soon. Pacemaker, which depends
on corosync, is then immediately started and fails because corosync
is still in the process of starting.
Subsequently the charm would run `crm node list` to validate
pacemaker, and this would loop forever.
This change adds longer timeout values for the systemd scripts and
adds better error handling and communication to the end user.
Change-Id: I7c3d018a03fddfb1f6bfd91fd7aeed4b13879e45
Partial-Bug: #1654403
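A sketch of the bounded-retry shape that replaces the infinite loop
(names and timeouts are illustrative, not the charm's):

    import subprocess
    import time

    def wait_for_pacemaker(timeout=300, interval=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                subprocess.check_call(['crm', 'node', 'list'])
                return
            except subprocess.CalledProcessError:
                time.sleep(interval)
        raise RuntimeError(
            'pacemaker did not answer within %ds; is corosync up?' % timeout)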
Unicast is generally a lot more reliable and, unlike multicast, is
guaranteed to work in all network configurations.
Also optimize the context build for the unicast configuration: it is
possible to build the corosync cluster before the principal charm
presents multicast-only configuration options.
Change-Id: I7c4f559325234401a7b6f7aa26114349d07817ad
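For reference, unicast transport is expressed in corosync.conf roughly
as below (an illustrative excerpt, not the charm's template):

    totem {
        version: 2
        transport: udpu
    }
    nodelist {
        node {
            ring0_addr: 10.0.0.11
            nodeid: 1
        }
        node {
            ring0_addr: 10.0.0.12
            nodeid: 2
        }
    }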
Allow DNS to be the HA resource, in lieu of a VIP, when using MAAS
2.0.
- Added an OCF resource `dns`.
- Added maas_dns.py as the API script to update a MAAS 2.0 DNS
  resource record.
- Charmhelpers sync to pull in the DNS HA helpers.
Change-Id: I0b71feec86a77643892fadc08f2954204b541d01
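Conceptually, the agent repoints the MAAS DNS record at whichever node
holds the resource; a hypothetical sketch (the MAAS CLI verbs and
parameter names here are assumptions, not maas_dns.py's real
interface):

    import subprocess

    def point_record_at(profile, dnsresource_id, ip_address):
        # hypothetical: update a MAAS 2.0 DNS resource record via the MAAS CLI
        subprocess.check_call(
            ['maas', profile, 'dnsresource', 'update',
             str(dnsresource_id), 'ip_addresses=%s' % ip_address])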