Commit Graph

277 Commits

Author SHA1 Message Date
Zuul 871f551e21 Merge "Get private-address for local unit from relation" 2023-10-03 15:20:00 +00:00
Tiago Pasqualini 71100249ee Get private-address for local unit from relation
Currently, the private-address for the local unit is queried using
unit_get, which can cause it to return an address from a different
binding. This patch changes it to always query from the relation.
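
A minimal sketch of the idea using charm-helpers' hookenv API; the
relation name 'hanode' and the helper name are assumptions for
illustration:

    from charmhelpers.core.hookenv import local_unit, relation_get, relation_ids

    def local_private_address(relation_name='hanode'):
        # Read our own private-address from the relation data instead of
        # unit_get('private-address'), so the address always matches the
        # binding of that relation.
        for rid in relation_ids(relation_name):
            addr = relation_get('private-address', unit=local_unit(), rid=rid)
            if addr:
                return addr
        return None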

Closes-bug: #2020669
Change-Id: I128420c572d5491b9af4cf34614f4534c787d02c
2023-09-21 11:00:34 -03:00
Gabriel Cocenza aa557b85b7 Add application version on HA Cluster charm
Closes-Bug: #2031438
Change-Id: I4dab721ebe42d4c43c09a98204ce8113892aa817
2023-08-16 11:01:31 -03:00
Felipe Reyes 8446b38347 Use get_property instead of get-property
hacluster uses the command "crm configure get-property <CMD>" to obtain
a property of the cluster. "get-property" has been deprecated in favor
of "get_property", and since crmsh-4.2.1 a warning is printed to
stdout[0], breaking the parsing.

    # crm configure get-property maintenance-mode 2>/dev/null
    WARNING: This command 'get-property' is deprecated, please use 'get_property'
    INFO: "get-property" is accepted as "get_property"
    true

[0] 86282af8e5
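
A small sketch of calling the new spelling and parsing its output (the
helper name is illustrative, not the charm's actual function):

    import subprocess

    def get_property(name):
        # Use the non-deprecated spelling so no deprecation warning is
        # mixed into stdout; stderr is discarded as before.
        out = subprocess.check_output(
            ['crm', 'configure', 'get_property', name],
            stderr=subprocess.DEVNULL)
        return out.decode().strip()

    # e.g. get_property('maintenance-mode') -> 'true'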

Change-Id: Id0ee9ab1873d14dcd1c960001cdeb8318f599ef5
Closes-Bug: #2008704
2023-02-27 11:02:28 -03:00
Corey Bryant a03b0b2a87 Only return hacluster nodes from list_nodes()
list_nodes() recently had some changes to run 'crm node show'
in jammy+ instead of 'crm node status'. The difference is that
'crm node show' returns the pacemaker-remote nodes in addition
to the hacluster nodes. This change limits the nodes returned
by list_nodes() to the hacluster nodes (i.e. the nodes that
have a node ID).
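
A hedged sketch of the filtering, assuming member lines of the form
'name(id): member' while pacemaker-remote entries carry no numeric ID:

    import re

    def hacluster_nodes(crm_node_show_output):
        # Keep only entries that carry a numeric node ID; remote nodes
        # without an ID are skipped.
        nodes = []
        for line in crm_node_show_output.splitlines():
            match = re.match(r'^(\S+?)\((\d+)\)', line)
            if match:
                nodes.append(match.group(1))
        return sorted(nodes)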

Closes-Bug: #1995295
Change-Id: Ia405d4270f56c949f79167f8b75c1304b598b918
2022-11-01 15:40:09 +01:00
Corey Bryant f73ca4d52f Update 'crm node show' parsing to trim ': member'
The command 'crm node show' is used on jammy to retrieve the list of
nodes defined in a cluster. The output for nodes includes ': member'
which breaks ensuing commands that are using list_nodes() output.

For example:
juju-3f6cb6-zaza-4135aa8b2509-8.project.serverstack: member

This change trims everything including and after the ':' from the
output.
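
The trim itself is a one-liner; a sketch assuming one node entry per
line:

    def node_name(line):
        # 'juju-...-8.project.serverstack: member' -> 'juju-...-8.project.serverstack'
        return line.split(':')[0].strip()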

Closes-Bug: #1994160
Change-Id: I54a4f854f3e293503ec97d99a49b6dc51ee50c87
2022-10-25 19:16:55 +00:00
Felipe Reyes 4e53bea076 Fix 'crm node show' parsing to get list of nodes.
The command 'crm node show' is used on jammy to retrieve the list of
nodes defined in a cluster, although this command also includes the
properties set on a node (e.g. standby=off) which breaks the current
logic parsing.

This change uses a regular expression to filter out all the lines from
the output that don't start with a non-whitespace character (^\S+).
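
A sketch of that filter; node attribute lines (e.g. standby=off) are
the ones that do not match ^\S+ and are dropped:

    import re

    def node_lines(crm_node_show_output):
        # Keep only lines starting with a non-whitespace character,
        # i.e. the node entries themselves.
        return [line for line in crm_node_show_output.splitlines()
                if re.match(r'^\S+', line)]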

Change-Id: I3e00daa1b877a7faae1370f08b2d9c5bd7795c5f
Closes-Bug: #1987685
Related-Bug: #1972022
2022-08-25 12:13:33 -04:00
Zuul cf1c3eeeb3 Merge "Drop the use of 'crm node status' on jammy." 2022-06-23 20:58:16 +00:00
Zuul 3acc36209d Merge "Install resource-agents-extra on jammy." 2022-06-23 20:58:04 +00:00
Robert Gildein 920d0ab927 Switch to render from charmhelpers
- add contrib/templating
- using render instead of render_template
- remove render_template function

Change-Id: I395d7dc06618998b9e6023ff649f4aa8c5930cc0
2022-06-01 16:16:17 +02:00
Felipe Reyes 1347ea00c1 Drop the use of 'crm node status' on jammy.
The version of crmsh available on jammy no longer has the 'crm node
status' subcommand, since it was removed[0]. This change uses the
command 'crm node attribute' to figure out if the node is in standby
mode when running on ubuntu>=jammy, and 'crm node show' to get the list
of nodes.

[0] https://github.com/ClusterLabs/crmsh/pull/753
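
A rough sketch of picking the right subcommand per release; the exact
'crm node attribute' arguments are an assumption:

    from charmhelpers.core.host import CompareHostReleases, lsb_release

    def standby_query_cmd(node):
        # 'crm node status' is gone on jammy and later, so query the
        # standby attribute directly; older releases keep the old command.
        release = CompareHostReleases(lsb_release()['DISTRIB_CODENAME'])
        if release >= 'jammy':
            return ['crm', 'node', 'attribute', node, 'show', 'standby']
        return ['crm', 'node', 'status', node]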

Change-Id: Iafb711be220573cb701527ec84de285edd8942cf
Closes-Bug: #1972022
2022-05-19 10:53:53 -04:00
Felipe Reyes 32ab2d4bda Install resource-agents-extra on jammy.
The install hooks rsync a set of scripts, and one of the destinations
is /usr/lib/stonith/plugins/external. This directory is created by the
installation of the package cluster-glue, which is pulled in as an
indirect dependency of pacemaker. This changed on >=jammy, where the
intermediate package resource-agents was split into
resource-agents-base and resource-agents-extra; the latter doesn't
get installed by default and is the one that depends on cluster-glue.

The specific chains of dependencies are:

focal:
pacemaker -> pacemaker-resource-agents -> resource-agents -> cluster-glue

jammy:
pacemaker -> pacemaker-resource-agents -> resource-agents-base
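
A hedged sketch of restoring the missing dependency on jammy and later
(the helper name is illustrative):

    from charmhelpers.core.host import CompareHostReleases, lsb_release
    from charmhelpers.fetch import apt_install

    def install_stonith_plugin_deps():
        # On >=jammy, resource-agents-extra (and through it cluster-glue,
        # which owns /usr/lib/stonith/plugins/external) is no longer
        # pulled in by pacemaker, so install it explicitly.
        release = CompareHostReleases(lsb_release()['DISTRIB_CODENAME'])
        if release >= 'jammy':
            apt_install(['resource-agents-extra'], fatal=True)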

Change-Id: Ia00061bff2ebe16d35d52b256c61243935edabba
Closes-Bug: #1971841
2022-05-06 09:55:15 -04:00
Felipe Reyes 715d31e09f Catch FileExistsError when creating /etc/corosync dir.
Hooks are expected to be idempotent. If the install hook for whatever
reason needs to be re-run and the /etc/corosync directory already exists,
for example because it was created in a previous run, the exception
FileExistsError will be raised. This change captures the exception and
moves on.
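
A minimal sketch of the idempotent directory creation:

    import os

    def ensure_corosync_dir(path='/etc/corosync'):
        try:
            os.makedirs(path)
        except FileExistsError:
            # A previous run already created the directory; nothing to do.
            pass

Equivalently, os.makedirs(path, exist_ok=True) avoids raising in the
first place.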

Change-Id: If43a5c95bb59c9cca7f1a975214a9f013ad6f4d6
Closes-Bug: #1971762
2022-05-05 15:03:58 -04:00
Rodrigo Barbieri d54de3d346 Prevent errors when private-address=None
Whenever a peer returns None as its IP, it results in
misconfiguration in corosync.conf, which results in
a series of cascading hook errors that are difficult to
sort out.

More specifically, this usually happens when network-get
does not work for the current binding. The main problem
is that when changing bindings, a hook fires before the
network-get data is updated. This hook fails and prevents
the network-get from being re-read.

This patch changes the code to ignore None IP entries, gracefully
exiting and deferring further configuration (due to an insufficient
number of peers) when that happens, so that a later hook can
successfully read the IP from the relation and set the IPs correctly
in corosync.
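
A small sketch of the filtering, with the peer map assumed to come from
relation data:

    def usable_peer_ips(peer_ips):
        # Drop peers whose private-address is still None (e.g. network-get
        # data not yet refreshed after a binding change); the caller then
        # defers corosync.conf rendering until enough peers report an
        # address.
        return {unit: ip for unit, ip in peer_ips.items() if ip is not None}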

Closes-bug: #1961448
Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
2022-04-11 16:58:33 -03:00
Billy Olsen d1191dbcab Render corosync.conf file prior to pkg install
Starting in focal, the ubuntu version of corosync package synced in from
debian includes node1 as the default name for the local node with a nodeid
of 1. This causes the cluster to have knowledge of this extra node1 node,
which affects quorum, etc. Installing the charm's corosync.conf file
before package installation prevents this condition from arising.

Additionally this change removes some Xenial bits in the charm and always
includes a nodelist in corosync.conf as it is compulsory in focal and
newer. It is optional in the bionic packages, so we'll always just
render the nodelist.
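
A rough sketch of the install-hook ordering; the template name and
context are placeholders:

    import os

    from charmhelpers.core.templating import render
    from charmhelpers.fetch import apt_install

    def install():
        # Render corosync.conf before installing the packages so the
        # package default (which defines a 'node1' node) never takes
        # effect.
        os.makedirs('/etc/corosync', exist_ok=True)
        render('corosync.conf.j2', '/etc/corosync/corosync.conf',
               context={'nodelist': []}, perms=0o644)
        apt_install(['corosync', 'pacemaker'], fatal=True)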

Change-Id: I06b9c23eb57274f0c99a3a05979c0cabf87c8118
Closes-Bug: #1874719
2022-03-16 08:13:49 -07:00
Stephan Pampel e2249d05e1 Set loglevel of "Pacemaker is ready" to TRACE
Closes-Bug: #1889482
Change-Id: Ie97d09f5bd319a4adf93abd44fc465c77fd20620
2021-08-20 14:41:37 +02:00
Zuul a5b408ff52 Merge "Safely delete node from ring" 2021-06-28 11:11:15 +00:00
David Ames 102d463aa3 Safely delete node from ring
Provide the delete-node-from-ring action to safely remove a known node
from the corosync ring.

Update the less safe update-ring action to avoid LP Bug #1933223 and
provide warnings in actions.yaml on its use.

Change-Id: I56cf2360ac41b12fc0a508881897ba63a5e89dbd
Closes-Bug: #1933223
2021-06-25 07:38:18 -07:00
Zuul 8215c5c9e3 Merge "Retry on "Transport endpoint is not connected"" 2021-06-25 07:45:16 +00:00
Xav Paice d17fdd276e Add option for no-quorum-policy
Adds a config item for what to do when the cluster does not have quorum.
This is useful with stateless services where, e.g., we only need a VIP
and that can be up on a single host with no problem.

Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have a VIP on multiple hosts,
causing ARP issues with the switch.

Closes-bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
2021-06-25 10:18:14 +12:00
David Ames 3b872ff4d2 Retry on "Transport endpoint is not connected"
The crm node delete already handles some expected failure modes. Add
"Transport endpoint is not connected" so that it retries the node
delete.
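
A sketch of the retry loop; the attempt count and sleep are arbitrary:

    import subprocess
    import time

    def delete_node(node, attempts=5):
        for attempt in range(attempts):
            try:
                subprocess.check_output(['crm', 'node', 'delete', node],
                                        stderr=subprocess.STDOUT)
                return
            except subprocess.CalledProcessError as e:
                transient = 'Transport endpoint is not connected' in e.output.decode()
                if transient and attempt < attempts - 1:
                    time.sleep(5)  # transient corosync/pacemaker hiccup; retry
                    continue
                raise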

Change-Id: I9727e7b5babcfed1444f6d4821498fbc16e69297
Closes-Bug: #1931588
Co-authored-by: Aurelien Lourot <aurelien.lourot@canonical.com>
2021-06-24 11:59:02 +02:00
Aurelien Lourot 06796e6518 Fix pacemaker-remote-relation-changed hook error
This was happening because
trigger_corosync_update_from_leader() was being called
not only in `ha` relation hooks but also in
`pacemaker-remote` relation hooks after the implementation
for the related bug landed.

Closes-Bug: #1920124
Related-Bug: #1400481
Change-Id: I4952ef694589de6b72f04b387e30ca2333bc4cbc
2021-03-19 13:29:30 +01:00
Robert Gildein 64e696ae74 Improved action to display the cluster status
The `state` action will provide details about the health of the cluster.
This action has one parameter to display the history of the cluster status,
which is false by default.

Closes-Bug: #1717831
Change-Id: Iaf6e4a75a36491eab8e6802a6f437e5f410ed29e
2021-03-16 14:12:48 +01:00
Alvaro Uria 457f88eda6 Adjust quorum after node removal
Add an `update-ring` action for that purpose.
Also print more on various pacemaker failures.
Also removed some dead code.

Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
2021-03-11 17:24:01 +01:00
Aurelien Lourot 6e1f20040c Remove some more dead code
Change-Id: Icf0c1a9c5e819bd2253d9a631e1ff6875bfd5200
Related-Bug: #1881114
2021-03-04 13:04:57 +01:00
Aurelien Lourot a9191136dc Fully deprecate stonith_enabled config option
The option has been deprecated since June 2020 thanks
to a 'blocked' message in assess_status_helper(),
but this commit:
1. makes it clear in config.yaml, and
2. removes the corresponding already dead code.

Change-Id: Ia6315273030e31b10125f2dd7a7fb7507d8a10b7
2021-03-04 11:10:34 +01:00
Zuul 71e771ac34 Merge "Fix docstring typo" 2021-02-23 08:58:54 +00:00
Przemysław Lal 2e218038b8 Fix docstring typo
Fix minor typo in need_resources_on_remotes() method docstring.

Signed-off-by: Przemysław Lal <przemyslaw.lal@canonical.com>
Change-Id: I1e7c377957552b179576bd4ec20089661379e763
2021-02-22 16:31:16 +01:00
James Page 6058f9985c Ensure crmsh is not removed during series upgrade
In bionic, crmsh was dropped as a dependency of corosync, so it
becomes a candidate for removal for older charm deployments
upgrading from xenial to bionic.

Mark the package as manually installed in the pre-upgrade hook
to ensure that it never becomes a candidate for removal.
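
A minimal sketch of the pre-series-upgrade step:

    import subprocess

    def pre_series_upgrade():
        # Mark crmsh as manually installed so apt autoremove never
        # selects it once corosync stops depending on it (bionic onwards).
        subprocess.check_call(['apt-mark', 'manual', 'crmsh'])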

Change-Id: I675684fee5410f86aace2e42515d3e325d8d12f8
Closes-Bug: 1900206
2021-01-29 10:57:09 +00:00
Zuul 355bbabe65 Merge "Increase default TimeoutStopSec value" 2020-12-17 20:26:40 +00:00
Ionut Balutoiu 4670f0effc Add config option for 'cluster-recheck-interval' property
This value is hard-coded to 60 seconds in the charm code.
This change adds a charm config option (with 60 secs as the
default value) in order to make the `cluster-recheck-interval`
property configurable.
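
A hedged sketch of applying the option by shelling out to crm; the
value format ('60s') is an assumption:

    import subprocess

    from charmhelpers.core.hookenv import config

    def set_cluster_recheck_interval():
        interval = config('cluster-recheck-interval')  # defaults to 60
        subprocess.check_call([
            'crm', 'configure', 'property',
            'cluster-recheck-interval={}s'.format(interval)])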

Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
2020-12-16 11:34:33 +00:00
Billy Olsen 9645aefdec Increase default TimeoutStopSec value
The charm installs systemd overrides of the TimeoutStopSec and
TimeoutStartSec parameters for the corosync and pacemaker services.
The default timeout stop parameter is changed to 60s, which is a
significant change from the package level default of 30 minutes. The
pacemaker systemd default is 30 minutes to allow time for resources
to safely move off the node before shutting down. It can take some
time for services to migrate away under a variety of circumstances (node
usage, the resource, etc).

This change increases the timeout to 10 minutes by default, which should
prevent things like unattended-upgrades from causing outages due to
services not starting because systemd timed out (and an instance was
already running).
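
A sketch of what the override drop-in amounts to; the drop-in path and
file name are assumptions:

    import os
    import subprocess

    OVERRIDE = "[Service]\nTimeoutStartSec=600\nTimeoutStopSec=600\n"

    def write_timeout_override(service):
        # e.g. service='corosync' or 'pacemaker'
        dropin_dir = '/etc/systemd/system/{}.service.d'.format(service)
        os.makedirs(dropin_dir, exist_ok=True)
        with open(os.path.join(dropin_dir, 'overrides.conf'), 'w') as f:
            f.write(OVERRIDE)
        subprocess.check_call(['systemctl', 'daemon-reload'])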

Change-Id: Ie88982fe987b742082a978ff2488693d0154123b
Closes-Bug: #1903745
2020-12-15 16:44:11 -07:00
Aurelien Lourot 2e799e5cf0 Fix install hook on Groovy
Also add Groovy to the test gate and sync libraries

Change-Id: If32560a88cfa6735bf5e502a70e6b84b0171f045
Closes-Bug: #1903546
2020-11-10 16:40:06 +01:00
Zuul a2797151f7 Merge "NRPE: Don't report paused hacluster nodes as CRITICAL error" 2020-11-09 12:21:25 +00:00
Billy Olsen 3080d64281 Remove the corosync_rings check in eoan+
Corosync 2.99 altered the status output for udp/udpu rings to
be hardcoded to 'OK'. This breaks the check_corosync_rings nrpe
check which is looking for 'ring $number active with no faults'.
Since the value has been hardcoded to show 'OK', the check itself
does not provide any real meaningful value.

Change-Id: I642ecf11946b1ea791a27c54f0bec54adbfecb83
Closes-Bug: #1902919
2020-11-06 14:20:00 -07:00
Martin Kalcok c385fef7b0 NRPE: Don't report paused hacluster nodes as CRITICAL error
Previously, paused hacluster units showed up as a CRITICAL error
in nagios even though they were only in 'standby' mode
in corosync.
The hacluster charm now uses the '-s' option of the check_crm
nrpe script to ignore alerts of the standby units.

Change-Id: I976d5ff01d0156fbaa91f9028ac81b44c96881af
Closes-Bug: #1880576
2020-11-06 14:19:42 +01:00
Andrea Ieri 0ce34b17be Improve resource failcount detection
The old check_crm script had separate checks for failcounts and failed
actions, but since failed actions cause failcounts, the two will always be
present together, and expire together.
Furthermore, the previous defaults effectively caused the failed actions
check to shadow the failcount one, because the former used to cause
CRITICALs, while the latter was only causing WARNINGs.

This version of check_crm deprecates failed actions detection in favor of
only failcount alerting, but adds support for separate warn/crit
thresholds.
Default thresholds are set at 3 and 10 for warn and crit, respectively.

Although sending criticals for high fail counter entries may seem
redundant when we already do that for stopped resources, some resources
are configured with infinite migration thresholds and will therefore
never show up as failed in crm_mon. Having separate fail counter
thresholds can therefore still be valuable, even if for most resources
migration-threshold will be set lower than the critical fail-counter threshold.

Closes-Bug: #1864040
Change-Id: I417416e20593160ddc7eb2e7f8460ab5f9465c00
2020-11-02 14:07:18 +00:00
Alex Kavanagh b8c9fc66b4 Sync libraries & common files prior to freeze
* charm-helpers sync for classic charms
* charms.ceph sync for ceph charms
* rebuild for reactive charms
* sync tox.ini files as needed
* sync requirements.txt files to sync to standard

Change-Id: I7c643447959cfd82234653fdbd2bab1c0594469c
2020-09-28 10:22:11 +01:00
Liam Young e02c6257ae Fix adding of stonith controlled resources.
There appears to be a window between a pacemaker remote resource
being added and the location properties for that resource being
added. In this window the resource is down and pacemaker may fence
the node.

The window is present because the charm currently does:

1) Set stonith-enabled=true cluster property
2) Add maas stonith device that controls pacemaker remote node that
   has not yet been added.
3) Add pacemaker remote node
4) Add pacemaker location rules.

I think the following two fixes are needed:

1) For initial deploys update the charm so it does not enable stonith
   until stonith resources and pacemaker remotes have been added.

2) For scale-out do not add the new pacemaker remote stonith resource
   until the corresponding pacemaker resource has been added along
   with its location rules.

Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490
Closes-Bug: #1884284
2020-09-09 09:35:30 +00:00
Liam Young 527fd2c704 Spread pacemaker remote resources across cluster.
Use location directives to spread pacemaker remote resources across
cluster. This is to prevent multiple resources being taken down in
the event of a single node failure. This would usually not be a
problem but if the node is being queried by masakari host
monitors at the time the node goes down then the query can hang.

Change-Id: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Partial-Bug: #1889094
Depends-On: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
2020-09-08 11:35:25 +00:00
Liam Young b40a6754b0 Create null stonith resource for lxd containers.
If stonith is enabled then when a compute node is detected as failed
it is powered down. This can include a lxd container which is also
part of the cluster. In this case because stonith is enabled at a
global level, pacemaker will try and power off the lxd container
too. But the container does not have a stonith device and this causes
the container to be marked as unclean (but not down). This running
unclean state prevents resources from being moved and causes any
pacemaker-remotes that are associated with the lost container to
lose their connection, which prevents masakari hostmonitor from
ascertaining the cluster health.

The way to work around this is to create a dummy stonith device for
the lxd containers. This allows the cluster to properly mark the lost
container as down and resources are relocated.

Change-Id: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
Partial-Bug: #1889094
2020-09-03 10:58:48 +00:00
Liam Young ca34574592 Ensure setup is only run on leader.
configure_cluster_global, configure_monitor_host and configure_stonith should
only be run by the leader, otherwise there is the risk of the updates
happening simultaneously and failing.
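
A minimal sketch of the guard:

    from charmhelpers.core.hookenv import is_leader

    def run_on_leader(*tasks):
        # Only the leader applies cluster-wide crm configuration, to
        # avoid concurrent, conflicting updates from multiple units.
        if not is_leader():
            return
        for task in tasks:
            task()

    # e.g. run_on_leader(configure_cluster_global, configure_monitor_host,
    #                    configure_stonith)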

Change-Id: I495ee093a8395433412d890396cd991c6acd97f3
Closes-Bug: #1884797
2020-07-20 18:06:49 +00:00
Zuul d2c7ed2821 Merge "Remove old configure_stonith" 2020-07-20 14:34:21 +00:00
Alex Kavanagh 24fa642247 Fix directory /etc/nagios/nrpe.d/ issue
Under certain deployment conditions, the charm can attempt to write to
the /etc/nagios/nrpe.d/ directory before it exists.  This directory is
created by the nrpe charm, but if the hacluster (this charm) gets
installed first, then it can be triggered to attempt to set up the nrpe
entries before the directory can be created by nrpe.  This change (and
the associated charm-helpers change) ensures that the charm will delay
the nrpe config until the directory is available (and thus, the nrpe
charm is fully installed)
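
A sketch of the guard on this charm's side (the check-writing callback
is illustrative):

    import os

    NRPE_D = '/etc/nagios/nrpe.d'

    def update_nrpe_config_if_ready(write_checks):
        # The directory is created by the nrpe charm; until it exists,
        # defer writing our checks and let a later hook retry.
        if not os.path.isdir(NRPE_D):
            return
        write_checks()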

Related charm-helpers: https://github.com/juju/charm-helpers/pull/492

Change-Id: Ibcbb5f56205b72c475807e3c34c64a00844908f4
Closes-Bug: #1882557
2020-07-15 15:13:08 +01:00
Liam Young eaaa5e5bd8 Remove old configure_stonith
The old configure_stonith last worked on xenial. This patch bypasses
it, given that it is broken and interfering with the new pacemaker
remote stonith resource setup. The code will be removed by a follow-up
patch. This patch is designed to be small for easy cherry-picking.

Change-Id: I2a704c1221bda242caaa5e87849d9984db3c6b71
Partial-Bug: #1881114
2020-06-25 05:10:26 +00:00
Liam Young 8f54a80d24 Remove hash from stonith resource name
The recent maas stonith resources, introduced to support stonith
with pacemaker-remotes, included a hash of the combined url and
api key in the resource name. But the charm only supports one
stonith resource (single maas_url / api key config options). Having
the hash makes managing the resources more complicated, especially
when the url or api key change. So remove any existing resource
(it is very unlikely there is one, as the feature is only just
out of preview state) and replace it with a single resource called
'st-maas'.

Change-Id: I053f1dc882eebfdef384cbbbfa7cabc82bce5f8b
2020-06-03 11:40:59 +00:00
Aurelien Lourot c30f8a8e19 Trace failures of set_unit_status()
Closes-Bug: #1878221
Change-Id: I9a741d51fb2bab7be12fd0496af6a18bbddd1709
2020-05-12 16:43:58 +02:00
Liam Young d860f3406c Check for peer series upgrade in pause and status
Check whether peers have sent series upgrade notifications before
pausing a unit. If notifications have been sent then HA services
will have been shutdown and pausing will fail.

Similarly, if series upgrade notifications have been sent then
do not try and issue crm commands when assessing status.

Change-Id: I4de0ffe5d5e24578db614c2e8640ebd32b8cd469
Closes-Bug: #1877937
2020-05-11 10:56:22 +00:00
Liam Young 4c9887b38c Use fqdn to refer to remote nodes
Now that nova-compute uses an FQDN to identify the hypervisor *1,
switch to registering remote nodes using their FQDN to keep
everything consistent. This is a requirement for masakari host
monitoring.
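
A minimal sketch of the naming change:

    import socket

    def remote_node_name():
        # Register pacemaker-remote nodes by FQDN so the name matches
        # the hypervisor name nova-compute now reports.
        return socket.getfqdn()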

*1 1869bfbc97

Change-Id: I6cd55fdf21482de908920bcc4d399ac03f903d58
2020-03-10 13:32:37 +00:00
José Pekkarinen c99eed495c Add support for maas_source_key for offline deployments.
Closes-Bug: #1856148
Change-Id: Id28d4c5c8c711ef53e9ec0422d80d23a6a844291
Signed-off-by: José Pekkarinen <jose.pekkarinen@canonical.com>
2020-02-26 09:59:15 +02:00