Commit Graph

45 Commits

Author SHA1 Message Date
Xav Paice d17fdd276e Add option for no-quorum-policy
Adds a config item for what to do when the cluster does not have quorum.
This is useful with stateless services where, e.g., we only need a VIP
and that can be up on a single host with no problem.

Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have a VIP on multiple hosts,
causing ARP issues with the switch.

Closes-bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
2021-06-25 10:18:14 +12:00
Alvaro Uria 457f88eda6 Adjust quorum after node removal
Add an `update-ring` action for that purpose.
Also print more on various pacemaker failures.
Also removed some dead code.

Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
2021-03-11 17:24:01 +01:00
Aurelien Lourot a9191136dc Fully deprecate stonith_enabled config option
This has been deprecated since June 2020 thanks
to a 'blocked' message in assess_status_helper()
but this commit:
1. makes it clear in config.yaml, and
2. removes the corresponding already dead code.

Change-Id: Ia6315273030e31b10125f2dd7a7fb7507d8a10b7
2021-03-04 11:10:34 +01:00
Zuul 355bbabe65 Merge "Increase default TimeoutStopSec value" 2020-12-17 20:26:40 +00:00
Ionut Balutoiu 4670f0effc Add config option for 'cluster-recheck-interval' property
This value is hard-coded to 60 seconds in the charm code.
This change adds a charm config option (with 60 secs as the
default value) in order to make the `cluster-recheck-interval`
property configurable.

Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
2020-12-16 11:34:33 +00:00
Billy Olsen 9645aefdec Increase default TimeoutStopSec value
The charm installs systemd overrides of the TimeoutStopSec and
TimeoutStartSec parameters for the corosync and pacemaker services.
The default timeout stop parameter is changed to 60s, which is a
significant change from the package level default of 30 minutes. The
pacemaker systemd default is 30 minutes to allow time for resources
to safely move off the node before shutting down. It can take some
time for services to migrate away under a variety of circumstances (node
usage, the resource, etc).

This change increases the timeout to 10 minutes by default, which should
prevent things like unattended-upgrades from causing outages due to
services not starting because systemd timed out (and an instance was
already running).

Change-Id: Ie88982fe987b742082a978ff2488693d0154123b
Closes-Bug: #1903745
2020-12-15 16:44:11 -07:00
Andrea Ieri 0ce34b17be Improve resource failcount detection
The old check_crm script had separate checks for failcounts and failed
actions, but since failed actions cause failcounts, the two will always be
present together, and expire together.
Furthermore, the previous defaults effectively caused the failed actions
check to shadow the failcount one, because the former used to cause
CRITICALs, while the latter was only causing WARNINGs.

This version of check_crm deprecates failed actions detection in favor of
only failcount alerting, but adds support for separate warn/crit
thresholds.
Default thresholds are set at 3 and 10 for warn and crit, respectively.

Although sending criticals for high fail counter entries may seem
redundant when we already do that for stopped resources, some resources
are configured with infinite migration thresholds and will therefore
never show up as failed in crm_mon. Having separate fail counter
thresholds can therefore still be valuable, even if for most resources
migration-threshold will be set lower than the critical fail-counter threshold.

Closes-Bug: #1864040
Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00
2020-11-02 14:07:18 +00:00
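The separate warn/crit thresholds described in commit 0ce34b17be can be illustrated with a minimal Python sketch. The function name `classify_failcount` and the returned status strings are hypothetical and are not taken from check_crm itself; only the default thresholds of 3 and 10 come from the commit message. Treating a threshold as reached when the failcount is greater than or equal to it is also an assumption.

```python
def classify_failcount(failcount, warn=3, crit=10):
    """Map a resource failcount to a Nagios-style status string.

    Hypothetical helper for illustration only. Assumes a threshold is
    reached when the failcount is greater than or equal to it.
    """
    if failcount >= crit:
        return "CRITICAL"
    if failcount >= warn:
        return "WARNING"
    return "OK"
```

With these defaults, a resource that has failed twice stays OK, one that has failed three times raises a warning, and one that has failed ten times raises a critical alert.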
Andrea Ieri 06e1816ed4 Change default expiration of failcounts from never to 180 seconds
Partial-Bug: #1864040
Change-Id: Iabbd26f4505405ee1cac1571bad8452b341e08cb
2020-08-14 12:07:00 -04:00
José Pekkarinen c99eed495c Add support for maas_source_key for offline deployments.
Closes-Bug: #1856148
Change-Id: Id28d4c5c8c711ef53e9ec0422d80d23a6a844291
Signed-off-by: José Pekkarinen <jose.pekkarinen@canonical.com>
2020-02-26 09:59:15 +02:00
Andrea Ieri 4d391e8107 Allow tuning for check_crm failure handling
This commit adds two new options, failed_actions_alert_type and
failed_actions_threshold, which map onto the check_crm options
--failedactions and --failcounts, respectively.
The default option values make check_crm generate critical alerts if
actions failed once.
The actions check can be entirely bypassed if failed_actions_alert_type
is set to 'ignore'.

Closes-Bug: #1796400
Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
2019-06-05 17:13:27 +00:00
Andrea Ieri e28f8a9adc Enable custom failure-timeout configuration
As explained here[0], setting failure-timeout means that the cib will 'forget'
that a resource agent action failed by setting failcount to 0:
- if $failure-timeout seconds have elapsed from the last failure
- if an event wakes up the policy engine (i.e. at the global resource
  recheck in an idle cluster)

By default the failure-timeout will be set to 0, which disables the feature;
however, this change allows for tuning.

[0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response

Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88
Closes-Bug: #1802310
2019-05-31 21:15:13 +00:00
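The expiry rule described in commit e28f8a9adc can be sketched roughly as follows. This is an illustration of the time-based condition only, not Pacemaker's implementation: the function name `failcount_expired` and its parameters are hypothetical, times are assumed to be plain seconds, and the second condition from the commit message (an event must wake the policy engine before the reset actually happens) is not modelled.

```python
def failcount_expired(last_failure, now, failure_timeout):
    """Return True if a failure is old enough to be forgotten.

    Hypothetical helper for illustration only. All arguments are in
    seconds. A failure_timeout of 0 disables expiry, matching the
    default described in the commit message.
    """
    if failure_timeout <= 0:
        return False
    return (now - last_failure) >= failure_timeout
```

Even when this returns True, the failcount is only cleared the next time the policy engine runs, for example at the global resource recheck in an idle cluster.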
Liam Young f3873fe67f Add pacemaker authkey
To work with pacemaker remotes all pacemaker nodes (including the
remotes) need to share a common key in the same way that corosync
does. This change allows a user to set a pacemaker key via config
in the same way as corosync. If the pacemaker key value is unset
then the corosync key is used.

Change-Id: I75247e7f3af29fc0907a94ae8e1678bdb9ee64e2
2019-04-03 10:49:22 +00:00
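The fallback described in commit f3873fe67f amounts to a simple "use the pacemaker key if set, otherwise the corosync key" rule. The sketch below assumes the charm config is available as a plain dict and that the options are named `pacemaker_key` and `corosync_key`; the function name and the `pacemaker_key` option name are assumptions made for illustration, not confirmed by the commit message.

```python
def effective_pacemaker_key(config):
    """Return the key pacemaker should use.

    Hypothetical helper for illustration only. An unset or empty
    'pacemaker_key' falls back to 'corosync_key'.
    """
    return config.get("pacemaker_key") or config.get("corosync_key")
```

For example, `effective_pacemaker_key({"corosync_key": "abc"})` returns `"abc"`, while a config that also sets `pacemaker_key` returns that value instead.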
Felipe Reyes e95488afa0 Add maintenance-mode configuration option
This config option allows sysadmins to set pacemaker in maintenance mode
which will stop monitoring on the configured resources, so services
can be stopped/restarted and pacemaker won't start them again or
migrate resources (e.g. virtual IPs).

Change-Id: I232a043e6d9d45f2cf833d4f7c4d89b079f258bb
Partial-Bug: 1698926
2017-08-16 17:44:44 +00:00
Shane Peters eedc70ccb6 Cleanup config.yaml
Change-Id: Ib2f5729fb83b38b55babbea19c975fde77dc9ee7
2017-06-14 11:39:33 +01:00
Billy Olsen cc1b547ce4 Change the default cluster_count to 3
Most production deployments involve 3 nodes for high availability.
This change modifies the default cluster_count to 3 in order to
align with the typical deployment scenario. Previously, the default
cluster count was 2 which is often unmodified and then leads to
deployment failures due to a mismatch in the nodelist defined in
each node's corosync.conf file.

Change-Id: I0799d8b880ecdb9c933d0361e7dc843b68fc5c82
2017-02-09 22:27:59 +00:00
David Ames fda5176bd5 Fix pacemaker down crm infinite loop
On corosync restart, corosync may take longer than a minute to come
up. The systemd start script times out too soon. Then pacemaker which
is dependent on corosync is immediately started and fails as corosync
is still in the process of starting.

Subsequently the charm would run crm node list to validate pacemaker.
This would become an infinite loop.

This change adds longer timeout values for systemd scripts and adds
better error handling and communication to the end user.

Change-Id: I7c3d018a03fddfb1f6bfd91fd7aeed4b13879e45
Partial-Bug: #1654403
2017-01-24 10:55:29 -08:00
James Page 2edb98b7df Switch default transport to unicast
Unicast is generally a lot more reliable and is guaranteed to work
in all network configurations unlike multicast.

Optimize context build for unicast configuration - it's possible to
build the corosync cluster prior to the principal charm presenting
multicast only configuration options.

Change-Id: I7c4f559325234401a7b6f7aa26114349d07817ad
2016-08-05 09:48:20 +01:00
David Ames 41dc7b3fad DNS HA
Allow DNS to be the HA resource in lieu of a VIP when using MAAS 2.0.
Added an OCF resource dns
Added maas_dns.py as the api script to update a MAAS 2.0 DNS resource
record.

Charmhelpers sync to pull in DNS HA helpers

Change-Id: I0b71feec86a77643892fadc08f2954204b541d01
2016-06-23 09:45:49 +01:00
Edward Hope-Morley 4fd68ee194 synced /next 2015-04-30 21:35:31 +02:00
Liam Young 610b4a3fa5 [bradm, r=gnuoy] Adding nrpe checks to the hacluster to check the status of corosync. 2015-04-20 09:54:49 +01:00
Edward Hope-Morley 4e2906a4f3 [gnuoy,r=hopem]
Allow corosync.conf netmtu to be set regardless of inet
mode (ipv4/ipv6).
2015-04-14 11:16:50 +01:00
Liam Young 7eb49bf999 Allow corosync mtu to be set for ipv4 as well as ipv6 but preserve the default behaviour of leaving it unset for ipv4 and set to 1500 for ipv6 2015-04-14 09:26:34 +00:00
Edward Hope-Morley 323d18cf48 Code cleanup. No functional changes. 2015-03-27 22:14:18 -07:00
Brad Marshall 4e0a063250 [bradm] Add nagios_servicegroups config option 2015-02-19 15:42:39 +10:00
Brad Marshall 88d1b56bc6 [bradm] Add nagios-context to config.yaml 2015-02-12 10:06:56 +10:00
Felipe Reyes feb60ede4a Add 'debug' to config.yaml 2014-12-15 10:28:49 -03:00
Liam Young 550670aa54 Update transport config options to be more user friendly and support backwards compatibility 2014-11-19 15:51:11 +00:00
Liam Young c9c6735a7e Add in unicast support 2014-10-12 07:08:43 +00:00
james.page@ubuntu.com 562d20be90 rebase, resync 2014-10-01 23:20:35 +01:00
Hui Xiang f29527cc81 Add config option netmtu for corosync, refactor code. 2014-09-28 09:28:13 +08:00
Edward Hope-Morley 81d88b4c22 Fixed minor typo in config.yaml 2014-09-25 17:44:04 +01:00
Edward Hope-Morley 741c86e8db [hopem]
Adds ipv6 privacy extensions deploy note to config.yaml
2014-09-25 17:32:54 +01:00
james.page@ubuntu.com 5e7c7bad84 Pull things apart a bit 2014-09-23 12:50:00 +01:00
james.page@ubuntu.com 461b4a9f29 Allow reconfiguration of cluster resources, enforce quorum 2014-09-04 11:09:07 +01:00
Hui Xiang 67f5697951 Support hacluster for IPv6. 2014-08-19 15:06:29 +08:00
James Page f1c107ee21 Rebase on precise charm for pingd and lint tidy 2014-04-11 11:25:09 +01:00
James Page 3b181ba801 Fixup for trusty corosync 2014-03-31 15:34:27 +01:00
Andres Rodriguez bc8b401891 Add monitor_host option with its monitor_interval option to decide whether to configure the Ping service. This service will ping a provided network IP address to check whether connectivity exists and determine whether networking is working. 2014-01-29 15:48:59 -05:00
James Page 2bae6acc75 Revert back to pcmk v0 integration for corosync 2013-04-30 10:37:52 +01:00
James Page 70571421d5 Refactoring to use openstack charm helpers.
Support base64 encoded corosync_key configuration.
2013-03-24 12:01:17 +00:00
Adam Gandelman 1e39d48f1e configure_stonith: Exit 1 on error(s). 2013-01-24 15:39:45 -08:00
Adam Gandelman cba1d3c816 First pass at STONITH support. 2013-01-24 13:24:09 -08:00
James Page 3b6f24bc15 Refactoring to remove hook ordering issues, switch to ver 0 of pacemaker management 2012-12-13 11:21:53 +00:00
Andres Rodriguez f2b2373497 HACluster refactoring 2012-12-11 07:54:36 -05:00
Andres Rodriguez 1c22ba36b4 Initial release 2012-11-20 15:06:11 -05:00