charm-hacluster

Commit Graph

Author	SHA1	Message	Date
Felipe Reyes	32ab2d4bda	Install resource-agents-extra on jammy. The install hooks rsync a set of scripts and one of the destinations is /usr/lib/stonith/plugins/external, this directory is created by the installation of the package cluster-glue which is pulled in as an indirect dependency of pacemaker, this changed on >=jammy where an intermediate package named resource-agents was split into resource-agents-base and resource-agents-extra where the latter doesn't get installed and it's the one that depends on cluster-glue. The specific chain of dependencies are: focal: pacemaker -> pacemaker-resource-agents -> resource-agents -> cluster-glue jammy: pacemaker -> pacemaker-resource-agents -> resource-agents-base Change-Id: Ia00061bff2ebe16d35d52b256c61243935edabba Closes-Bug: #1971841	2022-05-06 09:55:15 -04:00
Felipe Reyes	715d31e09f	Catch FileExistsError when creating /etc/corosync dir. Hooks are expected to be idempotent, if the install hook for whatever reason needs to be re-run and the /etc/corosync directory already exists, because for example it was created in a previous run, the exception FileExistsError will be raised, this change captures the exception and moves on. Change-Id: If43a5c95bb59c9cca7f1a975214a9f013ad6f4d6 Closes-Bug: #1971762	2022-05-05 15:03:58 -04:00
Billy Olsen	d1191dbcab	Render corosync.conf file prior to pkg install Starting in focal, the ubuntu version of corosync package synced in from debian includes node1 as the default name for the local node with a nodeid of 1. This causes the cluster to have knowledge of this extra node1 node, which affects quorum, etc. Installing the charm's corosync.conf file before package installation prevents this conditioning from happening. Additionally this change removes some Xenial bits in the charm and always includes a nodelist in corosync.conf as it is compulsory in focal and newer. It is optional in the bionic packages, so we'll always just render the nodelist. Change-Id: I06b9c23eb57274f0c99a3a05979c0cabf87c8118 Closes-Bug: #1874719	2022-03-16 08:13:49 -07:00
Zuul	a5b408ff52	Merge "Safely delete node from ring"	2021-06-28 11:11:15 +00:00
David Ames	102d463aa3	Safely delete node from ring Provide the delete-node-from-ring action to safely remove a known node from the corosync ring. Update the less safe update-ring action to avoid LP Bug #1933223 and provide warnings in actions.yaml on its use. Change-Id: I56cf2360ac41b12fc0a508881897ba63a5e89dbd Closes-Bug: #1933223	2021-06-25 07:38:18 -07:00
Xav Paice	d17fdd276e	Add option for no-quorum-policy Adds a config item for what to do when the cluster does not have quorum. This is useful with stateless services where, e.g., we only need a VIP and that can be up on a single host with no problem. Though this would be a good relation data setting, many sites would prefer to stop the resources rather than have a VIP on multiple hosts, causing arp issues with the switch. Closes-bug: #1850829 Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a	2021-06-25 10:18:14 +12:00
Aurelien Lourot	06796e6518	Fix pacemaker-remote-relation-changed hook error This was happening because trigger_corosync_update_from_leader() was being called not only in `ha` relation hooks but also in `pacemaker-remote` relation hooks after the implementation for the related bug landed. Closes-Bug: #1920124 Related-Bug: #1400481 Change-Id: I4952ef694589de6b72f04b387e30ca2333bc4cbc	2021-03-19 13:29:30 +01:00
Alvaro Uria	457f88eda6	Adjust quorum after node removal Add an `update-ring` action for that purpose. Also print more on various pacemaker failures. Also removed some dead code. Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369 Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4 Closes-Bug: #1400481 Related-Bug: #1874719 Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com> Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>	2021-03-11 17:24:01 +01:00
James Page	6058f9985c	Ensure crmsh is not removed during series upgrade At bionic crmsh was dropped as a dependency of corosync so it becomes a candidate for removal for older charm deployments upgrading from xenial->bionic. Mark the package as manually installed in the pre-upgrade hook to ensure that it never becomes a candidate for removal. Change-Id: I675684fee5410f86aace2e42515d3e325d8d12f8 Closes-Bug: 1900206	2021-01-29 10:57:09 +00:00
Zuul	355bbabe65	Merge "Increase default TimeoutStopSec value"	2020-12-17 20:26:40 +00:00
Ionut Balutoiu	4670f0effc	Add config option for 'cluster-recheck-interval' property This value is hard-coded to 60 seconds into the charm code. This change adds a charm config option (with 60 secs as the default value) in order to make the `cluster-recheck-interval` property configurable. Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b	2020-12-16 11:34:33 +00:00
Billy Olsen	9645aefdec	Increase default TimeoutStopSec value The charm installs systemd overrides of the TimeoutStopSec and TimeoutStartSec parameters for the corosync and pacemaker services. The default timeout stop parameter is changed to 60s, which is a significant change from the package level default of 30 minutes. The pacemaker systemd default is 30 minutes to allow time for resources to safely move off the node before shutting down. It can take some time for services to migrate away under a variety of circumstances (node usage, the resource, etc). This change increases the timeout to 10 minutes by default, which should prevent things like unattended-upgrades from causing outages due to services not starting because systemd timed out (and an instance was already running). Change-Id: Ie88982fe987b742082a978ff2488693d0154123b Closes-Bug: #1903745	2020-12-15 16:44:11 -07:00
Aurelien Lourot	2e799e5cf0	Fix install hook on Groovy Also add Groovy to the test gate and sync libraries Change-Id: If32560a88cfa6735bf5e502a70e6b84b0171f045 Closes-Bug: #1903546	2020-11-10 16:40:06 +01:00
Zuul	a2797151f7	Merge "NRPE: Don't report paused hacluster nodes as CRITICAL error"	2020-11-09 12:21:25 +00:00
Billy Olsen	3080d64281	Remove the corosync_rings check in eoan+ Corosync 2.99 altered the status output for udp/udpu rings to be hardcoded to 'OK'. This breaks the check_corosync_rings nrpe check which is looking for 'ring $number active with no faults'. Since the value has been hardcoded to show 'OK', the check itself does not provide any real meaningful value. Change-Id: I642ecf11946b1ea791a27c54f0bec54adbfecb83 Closes-Bug: #1902919	2020-11-06 14:20:00 -07:00
Martin Kalcok	c385fef7b0	NRPE: Don't report paused hacluster nodes as CRITICAL error Previously, paused hacluster units showed up as a CRITICAL error in nagios even though they were only in the 'standby' mode in corosync. The hacluster charm now uses the '-s' option of the check_crm nrpe script to ignore alerts of the standby units. Change-Id: I976d5ff01d0156fbaa91f9028ac81b44c96881af Closes-Bug: #1880576	2020-11-06 14:19:42 +01:00
Andrea Ieri	0ce34b17be	Improve resource failcount detection The old check_crm script had separate checks for failcounts and failed actions, but since failed actions cause failcounts, the two will always be present together, and expire together. Furthermore, the previous defaults effectively caused the failed actions check to shadow the failcount one, because the former used to cause CRITICALs, while the latter was only causing WARNINGs. This version of check_crm deprecates failed actions detection in favor of only failcount alerting, but adds support for separate warn/crit thresholds. Default thresholds are set at 3 and 10 for warn and crit, respectively. Although sending criticals for high fail counter entries may seem redundant when we already do that for stopped resources, some resources are configured with infinite migration thresholds and will therefore never show up as failed in crm_mon. Having separate fail counter thresholds can therefore still be valuable, even if for most resources migration-threshold will be set lower than the critical fail-counter threshold. Closes-Bug: #1864040 Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00	2020-11-02 14:07:18 +00:00
Alex Kavanagh	b8c9fc66b4	Sync libraries & common files prior to freeze * charm-helpers sync for classic charms * charms.ceph sync for ceph charms * rebuild for reactive charms * sync tox.ini files as needed * sync requirements.txt files to sync to standard Change-Id: I7c643447959cfd82234653fdbd2bab1c0594469c	2020-09-28 10:22:11 +01:00
Liam Young	e02c6257ae	Fix adding of stonith controlled resources. There appears to be a window between a pacemaker remote resource being added and the location properties for that resource being added. In this window the resource is down and pacemaker may fence the node. The window is present because the charm currently does: 1) Set stonith-enabled=true cluster property 2) Add maas stonith device that controls pacemaker remote node that has not yet been added. 3) Add pacemaker remote node 4) Add pacemaker location rules. I think the following two fixes are needed: 1) For initial deploys update the charm so it does not enable stonith until stonith resources and pacemaker remotes have been added. 2) For scale-out do not add the new pacemaker remote stonith resource until the corresponding pacemaker resource has been added along with its location rules. Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43 Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490 Closes-Bug: #1884284	2020-09-09 09:35:30 +00:00
Liam Young	b40a6754b0	Create null stonith resource for lxd containers. If stonith is enabled then when a compute node is detected as failed it is powered down. This can include a lxd container which is also part of the cluster. In this case because stonith is enabled at a global level, pacemaker will try and power off the lxd container too. But the container does not have a stonith device and this causes the container to be marked as unclean (but not down). This running unclean state prevents resources being moved and causes any pacemaker-remotes that are associated with the lost container from losing their connection which prevents masakari hostmonitor from ascertaining the cluster health. The way to work around this is to create a dummy stonith device for the lxd containers. This allows the cluster to properly mark the lost container as down and resources are relocated. Change-Id: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7 Partial-Bug: #1889094	2020-09-03 10:58:48 +00:00
Liam Young	ca34574592	Ensure setup is only run on leader. configure_cluster_global, configure_monitor_host and configure_stonith should only be run by the leader, otherwise there is the risk of the updates happening simultaneously and failing. Change-Id: I495ee093a8395433412d890396cd991c6acd97f3 Closes-Bug: #1884797	2020-07-20 18:06:49 +00:00
Alex Kavanagh	24fa642247	Fix directory /etc/nagios/nrpe.d/ issue Under certain deployment conditions, the charm can attempt to write to the /etc/nagios/nrpe.d/ directory before it exists. This directory is created by the nrpe charm, but if the hacluster (this charm) gets installed first, then it can be triggered to attempt to set up the nrpe entries before the directory can be created by nrpe. This change (and the associated charm-helpers change) ensures that the charm will delay the nrpe config until the directory is available (and thus, the nrpe charm is fully installed) Related charm-helpers: https://github.com/juju/charm-helpers/pull/492 Change-Id: Ibcbb5f56205b72c475807e3c34c64a00844908f4 Closes-Bug: #1882557	2020-07-15 15:13:08 +01:00
Aurelien Lourot	c30f8a8e19	Trace failures of set_unit_status() Closes-Bug: #1878221 Change-Id: I9a741d51fb2bab7be12fd0496af6a18bbddd1709	2020-05-12 16:43:58 +02:00
Liam Young	d860f3406c	Check for peer series upgrade in pause and status Check whether peers have sent series upgrade notifications before pausing a unit. If notifications have been sent then HA services will have been shutdown and pausing will fail. Similarly, if series upgrade notifications have been sent then do not try and issue crm commands when assessing status. Change-Id: I4de0ffe5d5e24578db614c2e8640ebd32b8cd469 Closes-Bug: #1877937	2020-05-11 10:56:22 +00:00
Liam Young	d3512ef320	Stop HA services across units for series upgrade Stop HA services across all units of an application when doing a series upgrade to avoid the situation where the cluster has some nodes on LTS N-1 and some on LTS N. 1) In the 'pre-series-upgrade' send a notification to peers informing them that the unit is doing a series upgrade and to which Ubuntu version. 2) Peers receive notification. If they are on a later Ubuntu version than the one in the notification then they do nothing. Otherwise they shutdown corosync and pacemaker and add an entry to the local kv store with waiting-unit-upgrade=True. 3) In the 'post-series-upgrade' hook the notification is removed from the peer relation. waiting-unit-upgrade is set to False and corosync and pacemaker are started. The result of this is that when the first unit in the cluster starts a series upgrade all cluster services are shutdown across all units. They then rejoin the cluster one at a time when they have been upgraded to the new version. I added the waiting-unit-upgrade key to deal with the situation where the first node clears the notification after it has successfully upgraded, without the waiting-unit-upgrade the peers would not know they were in a mixed Ubuntu version cluster. Change-Id: Id9167534e8933312c561a6acba40399bca437706 Closes-Bug: 1859150	2020-01-31 07:16:21 +00:00
Felipe Reyes	666055844e	Stop resource before deleting it. Pacemaker will refuse to delete a resource that is running, so it always needs to be stopped before deleting it. Change-Id: I3c6acdef401e9ec18fedc65e9c77db4719fe60ec Closes-Bug: #1838528	2019-10-11 17:34:13 -03:00
Alex Kavanagh	f71fe595ce	Fix log update-status error This patch adds a dummy update_status function so that the update-status hook 'has' a function to run and thus silence the log error. Change-Id: Ia0ed9367809fd47c10ad6a57011555589a7d940c Closes-bug: #1837639	2019-08-20 19:39:09 +00:00
Andrea Ieri	4d391e8107	Allow tuning for check_crm failure handling This commit adds two new options, failed_actions_alert_type and failed_actions_threshold, which map onto the check_crm options --failedactions and --failcounts, respectively. The default option values make check_crm generate critical alerts if actions failed once. The actions check can be entirely bypassed if failed_actions_alert_type is set to 'ignore'. Closes-Bug: #1796400 Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a	2019-06-05 17:13:27 +00:00
Andrea Ieri	e28f8a9adc	Enable custom failure-timeout configuration As explained here[0], setting failure-timeout means that the cib will 'forget' that a resource agent action failed by setting failcount to 0: - if $failure-timeout seconds have elapsed from the last failure - if an event wakes up the policy engine (i.e. at the global resource recheck in an idle cluster) By default the failure-timeout will be set to 0, which disables the feature, however this change allows for tuning. [0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88 Closes-Bug: #1802310	2019-05-31 21:15:13 +00:00
Liam Young	ed3ea84126	Install libmaas for stonith plugin The libmaas stonith plugin uses libmaas which is supplied via python3-libmaas. python3-libmaas is available from bionic onwards so install it where possible. Also, drive by fix to precreate /etc/pacemaker dir if needed to fix trusty installs. Closes-Bug: #1823300 Closes-Bug: #1823302 Change-Id: Ib14146f7f667f9c52e11d222f950efcb7cb47a7f	2019-04-05 08:28:56 +00:00
Liam Young	e357f2a1b5	Add support for pacemaker-remotes This change adds support for pacemaker-remotes joining the cluster via the pacemaker-remote relation. The pacemaker-remotes can advertise whether they should host resources. If the pacemaker-remotes are only being used for failure detection (as is the case with masakari host monitors) then they will not host resources. Pacemaker remotes are managed in the cluster as resources which nominally run on a member in the main cluster. The resource that corresponds to the pacemaker-remote is managed via configure_pacemaker_remotes and configure_pacemaker_remote functions. If the pacemaker-remotes should not run resources then the cluster needs to be switched to an opt-in cluster. In an opt-in cluster location rules are needed to explicitly allow a resource to run on a specific node. This behaviour is controlled via the global cluster parameter 'symmetric-cluster', which is set via the new method set_cluster_symmetry. The method add_location_rules_for_local_nodes is used for creating these explicit rules. Change-Id: I0a66cfb1ecad02c2b185c5e6402e77f713d25f8b	2019-04-03 10:50:58 +00:00
Liam Young	d3a16df5a9	New member_ready state in peer relation This change adds a new 'member_ready' state to the peer relation. The purpose of this is to indicate that a unit has rendered its configuration and started its services. This is distinct from the existing 'ready' indicator. 'ready' indicates that the unit can render its config and start services and ensures that all units are present prior to starting services. This seems to safeguard against units starting too early and forming single node clusters or small clusters with a sub-set of units. The 'member_ready' flag is later in the process and if it is set it shows that this unit has joined the cluster and can be referenced explicitly in any resource configuration. Change-Id: I80d42a628a3fe51bc6f8d02610031afd9386d7a4	2019-04-03 10:49:22 +00:00
Liam Young	bca864f33f	Support hacluster using peer-availability relation Add support for the hacluster charm to be related to a principal using the juju-info interface using the peer-availability relation. This is useful in the situation where a cluster without any resources is needed. Change-Id: Ibd03ba7923cfd2c412d5f772cf385a21c423e5af	2019-04-03 10:48:40 +00:00
Alex Kavanagh	02b406b6f3	Convert charm to Python 3 Change-Id: Ib7cc06b3b42f26f725a9ea79f09189cc72952d29	2019-03-14 12:40:07 +00:00
Felipe Reyes	639dadb141	Support update parameters of a resource This patch implements support to update parameters of an already existing resource using "crm configure load update FILE" The parameters of a resource are hashed using md5 and stored in the kv store, when the checksum doesn't match the resource is updated, otherwise discarded. Change-Id: I5735eaa1309c57e3620b0a6f68ffe13ec8165592 Closes-Bug: 1753432	2019-02-13 18:16:35 -03:00
Barry Price	600ba322fa	Add Bionic compatibility for the NRPE scripts via libmonitoring. From Bionic onwards, libnagios is replaced by libmonitoring. Trusty only contains the former, Bionic only the latter. Xenial contains both, but we prefer libmonitoring where it exists, so that we can drop libnagios support entirely once Trusty goes EOL. Change-Id: I613fd0b29b797e8900581f939eda72a1ab72868b Closes-Bug: 1796143	2019-01-21 15:44:57 +07:00
Liam Young	2ca245127e	Reorder colocation creation A colocation constraint may reference a clone resource (which percona-cluster charm does). So, the colocation must be created after the clones. Closes-Bug: #1808166 Change-Id: I304df954b2f81477535fe64687af11055b932c27	2018-12-12 16:40:12 +00:00
Felipe Reyes	6b6a177699	Replace oldest_peer() with is_leader() Change-Id: I9b5e97695185f0c0aafed2618c4edd9497192a3b Closes-Bug: 1794850	2018-09-28 16:47:10 -03:00
David Ames	ea6eb53059	Series Upgrade Implement the series-upgrade feature allowing to move between Ubuntu series. Change-Id: Idcc77b66e65633eb26e485e93ef7928b7f455ca8	2018-09-14 20:56:34 -07:00
Zuul	621ddb11b5	Merge "Catch MAASConfigIncomplete and set workload message"	2018-09-13 12:54:33 +00:00
Chris MacNaughton (icey)	b20a87517d	Revert "Support update parameters of a resource" This reverts commit `02d83b2e4e`. Change-Id: I47db2d4ad1672827e7e5667e9fe9f8b3737b8906	2018-09-05 15:42:47 +00:00
Felipe Reyes	2f4a815856	Catch MAASConfigIncomplete and set workload message validate_dns_ha() shouldn't have the side effect of setting the workload status and message, so now it will raise MAASConfigIncomplete with a message and ha_relation_changed() is catching the exception, setting the workload in blocked status and short circuiting the hook execution until the user sets maas_url and maas_credentials. Change-Id: I3f1bc6571d2461cb65f384f042423cfe33f4d2f8 Closes-Bug: 1790559	2018-09-03 22:56:20 -03:00
Felipe Reyes	02d83b2e4e	Support update parameters of a resource This patch implements support to update parameters of an already existing resource using "crm configure load update FILE" Change-Id: I22730091d674145db4a1187b0904d9f88d9d8c6d Partial-Bug: #1753432	2018-06-07 08:59:04 -04:00
Liam Young	526ffd7587	Create DNS records when using DNS HA Deficiencies in the older maas API client that is used to manage DNS records for DNS HA prevent DNS records being created. This change creates the DNS records by directly interacting with the REST maas api (v2). If the old approach of pre-creating the DNS records is followed then those are used rather than creating new ones. Change-Id: I6e1d7e5f0a7d813c78dfc1e78743ae7b218fbd01 Closes-Bug: 1764832	2018-04-23 17:26:47 +00:00
James Page	6fc85dc5a2	bionic: ensure crmsh package is installed pacemaker no longer Recommends crmsh so explicitly install it for all Ubuntu series as it's required for charm operation. Also enable bionic amulet test as part of a full gate recheck. Change-Id: I06e0dcfec0a787f85655c89bf36e18253c75de2e	2018-03-16 10:54:49 +00:00
James Page	b74d4aac41	Support network space binding of hanode relation Rework hooks to support network space binding of the hanode peer relation to a specific network space. Note that the get_relation_ip function also deals with the 'prefer-ipv6' legacy configuration option handling, so it was safe to remove some charm specific code in this area. Change-Id: Ic69e97debddba42e3d4a140f7f9cfc95768f71c3 Closes-Bug: 1659464	2017-09-28 09:00:43 +01:00
Jenkins	0bdf97fa9b	Merge "Change MAAS DNS OCF to read IP from file"	2017-09-27 20:13:27 +00:00
Billy Olsen	8d81e41576	Change MAAS DNS OCF to read IP from file The MAAS DNS ocf resource is created specifying the IP address from the leader node. When the resource is moved to another node the IP address is not updated because the DNS record already points to the leader's IP address. This means that the DNS record will never be updated with any other unit's IP address when the resource is moved around the cluster. The IP address to use for the DNS record should be stored in the pacemaker resource as the configuration is the same across the entire cluster. This change makes it so that the IP address that should be bound is written to a file in /etc/maas_dns/$resource_name and used by the ocf:maas:dns resource when managing the DNS resource records. Migration is handled by checking the current version of the MAAS OCF resource file and determining if the OCF_RESOURCE_INSTANCE (the name of the resource) is present in the file. Change-Id: If4e07079dd66dac51cd77c2600106b9b562c2483 Closes-Bug: #1711476	2017-09-20 17:08:44 -07:00
Corey Bryant	41b60eb037	Ensure python2 is installed before hook execution Change-Id: I724eaea5f09191b548e5497f9bb2c777f882d290 Closes-Bug: 1606906	2017-08-24 14:19:04 +00:00
Felipe Reyes	e95488afa0	Add maintenance-mode configuration option This config option allows sysadmins to set pacemaker in maintenance mode which will stop monitoring on the configured resources, so services can be stopped/restarted and pacemaker won't start them again or migrate resources (e.g. virtual IPs). Change-Id: I232a043e6d9d45f2cf833d4f7c4d89b079f258bb Partial-Bug: 1698926	2017-08-16 17:44:44 +00:00
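
Commit 715d31e09f above describes making the install hook idempotent by catching FileExistsError when /etc/corosync already exists. A minimal sketch of that pattern follows; it is not the charm's actual code, and the helper name `ensure_dir` and the temporary directory used in the demonstration are assumptions made for illustration only.

```python
import os
import tempfile


def ensure_dir(path, mode=0o755):
    """Create `path` if it is missing (hypothetical helper, not from the charm).

    Returns True when the directory was created and False when it already
    existed, so a re-run of the same hook does not raise.
    """
    try:
        os.mkdir(path, mode)
        return True
    except FileExistsError:
        # A previous hook run already created the directory; move on.
        return False


# Demonstration against a scratch directory rather than the real /etc/corosync.
base = tempfile.mkdtemp()
target = os.path.join(base, "corosync")
first_run = ensure_dir(target)   # directory is created
second_run = ensure_dir(target)  # directory exists, exception is swallowed
```

Catching the exception instead of checking `os.path.exists()` first avoids a race between the check and the creation, which is one reason this form is often preferred for hooks that may be re-run.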

166 Commits