Adds a config item for what to do when the cluster does not have quorum.
This is useful for stateless services where, e.g., only a VIP is needed
and it can be up on a single host with no problem.
Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have the VIP up on multiple
hosts, causing ARP issues with the switch.
Closes-Bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
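For context, this behaviour maps onto pacemaker's no-quorum-policy
cluster property. A minimal sketch of applying it, assuming the charm
shells out to crmsh (illustrative only, not the charm's actual plumbing):

    import subprocess

    def apply_no_quorum_policy(policy):
        # valid pacemaker values include 'stop', 'ignore', 'freeze', 'suicide'
        if policy not in ('stop', 'ignore', 'freeze', 'suicide'):
            raise ValueError('unknown no-quorum-policy: %s' % policy)
        subprocess.check_call(
            ['crm', 'configure', 'property', 'no-quorum-policy=%s' % policy])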
Add an `update-ring` action for removing departed nodes from the
corosync ring.
Also print more diagnostic output on various pacemaker failures, and
remove some dead code.
Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
This has been deprecated since June 2020 via a 'blocked' message in
assess_status_helper(), but this commit:
1. makes the deprecation explicit in config.yaml, and
2. removes the corresponding (already dead) code.
Change-Id: Ia6315273030e31b10125f2dd7a7fb7507d8a10b7
The `cluster-recheck-interval` property is hard-coded to 60 seconds
in the charm code. This change adds a charm config option (with 60
seconds as the default value) in order to make the property
configurable.
Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
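A minimal sketch of how the option could be applied via crmsh
(illustrative; the charm's real code path differs):

    import subprocess

    def set_cluster_recheck_interval(seconds=60):
        # 60 seconds matches the value the charm previously hard-coded
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'cluster-recheck-interval=%ds' % seconds])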
The charm installs systemd overrides of the TimeoutStopSec and
TimeoutStartSec parameters for the corosync and pacemaker services.
The overrides change the default stop timeout to 60s, a significant
departure from the package-level default of 30 minutes. The pacemaker
systemd default is 30 minutes to allow time for resources to safely
move off the node before shutting down; it can take some time for
services to migrate away under a variety of circumstances (node load,
the resource type, etc.).
This change increases the timeout to 10 minutes by default, which
should prevent things like unattended-upgrades from causing outages
due to services not starting because systemd timed out (and an
instance was already running).
Change-Id: Ie88982fe987b742082a978ff2488693d0154123b
Closes-Bug: #1903745
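A sketch of the kind of systemd drop-in this installs (file names and
values are illustrative; the charm renders its own templates):

    from pathlib import Path

    OVERRIDE = "[Service]\nTimeoutStartSec=600\nTimeoutStopSec=600\n"

    for service in ('corosync', 'pacemaker'):
        dropin = Path('/etc/systemd/system/%s.service.d' % service)
        dropin.mkdir(parents=True, exist_ok=True)
        (dropin / 'overrides.conf').write_text(OVERRIDE)
    # a 'systemctl daemon-reload' is needed for the drop-ins to take effect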
The old check_crm script had separate checks for failcounts and failed
actions, but since failed actions cause failcounts, the two will always be
present together, and expire together.
Furthermore, the previous defaults effectively caused the failed actions
check to shadow the failcount one, because the former used to cause
CRITICALs, while the latter was only causing WARNINGs.
This version of check_crm deprecates failed actions detection in favor of
only failcount alerting, but adds support for separate warn/crit
thresholds.
Default thresholds are set at 3 and 10 for warn and crit, respectively.
Although sending CRITICALs for high fail-count entries may seem
redundant when we already do that for stopped resources, some
resources are configured with infinite migration thresholds and will
therefore never show up as failed in crm_mon. Having separate
fail-count thresholds can therefore still be valuable, even if for
most resources migration-threshold will be set lower than the
critical fail-count threshold.
Closes-Bug: #1864040
Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00
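The alerting logic boils down to a Nagios-style threshold comparison;
a sketch in Python (check_crm itself is a Nagios plugin, so this is
illustrative only):

    # Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
    def failcount_status(failcounts, warn=3, crit=10):
        worst = max(failcounts.values(), default=0)
        if worst >= crit:
            return 2
        if worst >= warn:
            return 1
        return 0

    # e.g. failcount_status({'res_nova_vip': 4}) -> 1 (WARNING)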
This commit adds two new options, failed_actions_alert_type and
failed_actions_threshold, which map onto the check_crm options
--failedactions and --failcounts, respectively.
The default option values make check_crm generate a CRITICAL alert if
an action has failed at least once.
The actions check can be entirely bypassed if failed_actions_alert_type
is set to 'ignore'.
Closes-Bug: #1796400
Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
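A sketch of how the two charm options might be rendered into check_crm
arguments (the defaults shown are inferred from this commit message;
the rendering itself is illustrative):

    def check_crm_args(config):
        args = []
        alert_type = config.get('failed_actions_alert_type', 'crit')
        if alert_type != 'ignore':  # 'ignore' bypasses the actions check
            args += ['--failedactions', alert_type,
                     '--failcounts',
                     str(config.get('failed_actions_threshold', 1))]
        return args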
As explained here [0], setting failure-timeout means that the CIB will
'forget' that a resource agent action failed, by resetting the
failcount to 0:
- if $failure-timeout seconds have elapsed since the last failure, and
- if an event wakes up the policy engine (e.g. at the global resource
  recheck in an idle cluster).
By default the failure-timeout is set to 0, which disables the
feature; this change, however, makes it tunable.
[0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response
Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88
Closes-Bug: #1802310
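failure-timeout is a per-resource meta attribute; one way to apply it
cluster-wide is via crmsh resource defaults (a sketch; the charm may
instead set it per resource):

    import subprocess

    def set_failure_timeout(seconds):
        # 0 keeps the feature disabled: failcounts never expire
        subprocess.check_call(
            ['crm', 'configure', 'rsc_defaults',
             'failure-timeout=%ds' % seconds])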
To work with pacemaker remotes, all pacemaker nodes (including the
remotes) need to share a common key, in the same way that corosync
does. This change allows a user to set a pacemaker key via config,
just as for corosync. If the pacemaker key value is unset, then the
corosync key is used.
Change-Id: I75247e7f3af29fc0907a94ae8e1678bdb9ee64e2
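The fallback reduces to the following (a sketch, with hypothetical
config keys):

    def pacemaker_key(config):
        # fall back to the corosync key when no pacemaker key is configured
        return config.get('pacemaker_key') or config.get('corosync_key')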
This config option allows sysadmins to put pacemaker in maintenance
mode, which stops monitoring of the configured resources so that
services can be stopped/restarted without pacemaker starting them
again or migrating resources (e.g. virtual IPs).
Change-Id: I232a043e6d9d45f2cf833d4f7c4d89b079f258bb
Partial-Bug: 1698926
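A sketch of toggling the underlying pacemaker property (illustrative
only):

    import subprocess

    def set_maintenance_mode(enabled):
        # while 'true', pacemaker neither monitors nor manages resources
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'maintenance-mode=%s' % ('true' if enabled else 'false')])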
Most production deployments involve 3 nodes for high availability.
This change modifies the default cluster_count to 3 in order to
align with the typical deployment scenario. Previously, the default
cluster count was 2, which is often left unmodified and then leads
to deployment failures due to a mismatch in the nodelist defined in
each node's corosync.conf file.
Change-Id: I0799d8b880ecdb9c933d0361e7dc843b68fc5c82
On restart, corosync may take longer than a minute to come up, and
the systemd start script times out too soon. Pacemaker, which depends
on corosync, is then immediately started and fails because corosync
is still in the process of starting.
Subsequently the charm would run `crm node list` to validate
pacemaker, and this would loop forever.
This change adds longer timeout values for the systemd scripts and
adds better error handling and communication to the end user.
Change-Id: I7c3d018a03fddfb1f6bfd91fd7aeed4b13879e45
Partial-Bug: #1654403
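A sketch of the bounded-retry shape that replaces the infinite loop
(names and timeouts are illustrative, not the charm's):

    import subprocess
    import time

    def wait_for_pacemaker(timeout=300, interval=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                subprocess.check_call(['crm', 'node', 'list'])
                return
            except subprocess.CalledProcessError:
                time.sleep(interval)
        raise RuntimeError(
            'pacemaker did not answer within %ds; is corosync up?' % timeout)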
Unicast is generally a lot more reliable and, unlike multicast, is
guaranteed to work in all network configurations.
Also optimize the context build for the unicast configuration: it is
possible to build the corosync cluster before the principal charm
presents multicast-only configuration options.
Change-Id: I7c4f559325234401a7b6f7aa26114349d07817ad
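For reference, unicast transport is expressed in corosync.conf roughly
as below (an illustrative excerpt, not the charm's template):

    totem {
        version: 2
        transport: udpu
    }
    nodelist {
        node {
            ring0_addr: 10.0.0.11
            nodeid: 1
        }
        node {
            ring0_addr: 10.0.0.12
            nodeid: 2
        }
    }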
Allow DNS to be the HA resource, in lieu of a VIP, when using MAAS
2.0.
- Added an OCF resource `dns`.
- Added maas_dns.py as the API script to update a MAAS 2.0 DNS
  resource record.
- Charmhelpers sync to pull in the DNS HA helpers.
Change-Id: I0b71feec86a77643892fadc08f2954204b541d01
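Conceptually, the agent repoints the MAAS DNS record at whichever node
holds the resource; a hypothetical sketch (the MAAS CLI verbs and
parameter names here are assumptions, not maas_dns.py's real
interface):

    import subprocess

    def point_record_at(profile, dnsresource_id, ip_address):
        # hypothetical: update a MAAS 2.0 DNS resource record via the MAAS CLI
        subprocess.check_call(
            ['maas', profile, 'dnsresource', 'update',
             str(dnsresource_id), 'ip_addresses=%s' % ip_address])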