Commit Graph

75 Commits

Author SHA1 Message Date
Zuul 871f551e21 Merge "Get private-address for local unit from relation" 2023-10-03 15:20:00 +00:00
Tiago Pasqualini 71100249ee Get private-address for local unit from relation
Currently, the private-address for the local unit is queried using
unit_get, which can cause it to return an address from a different
binding. This patch changes it to always query from the relation.

Closes-bug: #2020669
Change-Id: I128420c572d5491b9af4cf34614f4534c787d02c
2023-09-21 11:00:34 -03:00
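A minimal sketch of the approach described above, assuming the charmhelpers hookenv API and the 'hanode' peer relation; the helper name is illustrative.

    from charmhelpers.core.hookenv import local_unit, relation_get, relation_ids

    def local_private_address(reltype='hanode'):
        # Read the local unit's private-address from the peer relation so the
        # value always matches the relation's binding, instead of unit_get().
        for rid in relation_ids(reltype):
            addr = relation_get(attribute='private-address',
                                unit=local_unit(), rid=rid)
            if addr:
                return addr
        return None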
Gabriel Cocenza aa557b85b7 Add application version on HA Cluster charm
Closes-Bug: #2031438
Change-Id: I4dab721ebe42d4c43c09a98204ce8113892aa817
2023-08-16 11:01:31 -03:00
Zuul cf1c3eeeb3 Merge "Drop the use of 'crm node status' on jammy." 2022-06-23 20:58:16 +00:00
Robert Gildein 920d0ab927 Switch to render from charmhelpers
- add contrib/templating
- use render instead of render_template
- remove render_template function

Change-Id: I395d7dc06618998b9e6023ff649f4aa8c5930cc0
2022-06-01 16:16:17 +02:00
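A hedged sketch of what the switch looks like, using charmhelpers.core.templating.render; the template name, target path, and context are illustrative.

    from charmhelpers.core.templating import render

    def emit_corosync_conf(context):
        # Render templates/corosync.conf with charmhelpers' render() rather
        # than a charm-local render_template() helper.
        render(source='corosync.conf',
               target='/etc/corosync/corosync.conf',
               context=context,
               perms=0o644)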
Felipe Reyes 1347ea00c1 Drop the use of 'crm node status' on jammy.
The version of crmsh available on jammy no longer has the 'crm node
status' subcommand, since it was removed[0]. This change uses the
command 'crm node attribute' to figure out if the node is in standby
mode when running on ubuntu>=jammy, and 'crm node show' to get the list
of nodes.

[0] https://github.com/ClusterLabs/crmsh/pull/753

Change-Id: Iafb711be220573cb701527ec84de285edd8942cf
Closes-Bug: #1972022
2022-05-19 10:53:53 -04:00
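A rough sketch of the replacement commands, not the charm's exact code; the output parsing is illustrative.

    import subprocess

    def node_is_standby(node):
        # 'crm node attribute <node> show standby' prints something like
        # 'scope=nodes  name=standby value=on' when the node is in standby.
        try:
            out = subprocess.check_output(
                ['crm', 'node', 'attribute', node, 'show', 'standby'],
                universal_newlines=True)
        except subprocess.CalledProcessError:
            return False  # attribute not set -> not in standby
        return 'value=on' in out

    def list_nodes():
        # 'crm node show' lists one node per line, e.g. 'juju-0-1(1000): member';
        # indented attribute lines are skipped.
        out = subprocess.check_output(['crm', 'node', 'show'],
                                      universal_newlines=True)
        return sorted(line.split('(')[0].strip()
                      for line in out.splitlines()
                      if line and not line[0].isspace())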
Rodrigo Barbieri d54de3d346 Prevent errors when private-address=None
Whenever a peer returns None as its IP, it results in
misconfiguration in corosync.conf, which results in
a series of cascading hook errors that are difficult to
sort out.

More specifically, this usually happens when network-get
does not work for the current binding. The main problem
is that when changing bindings, a hook fires before the
network-get data is updated. This hook fails and prevents
the network-get from being re-read.

This patch changes the code behavior to ignore None IP
entries, therefore gracefully exiting and deferring further
configuration due to insufficient number of peers when that
happens, so that a later hook can successfully read the IP
from the relation and set the IPs correctly in corosync.

Closes-bug: #1961448
Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
2022-04-11 16:58:33 -03:00
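A minimal sketch of the guard described above, assuming the charmhelpers hookenv API and the 'hanode' peer relation.

    from charmhelpers.core.hookenv import (
        log,
        related_units,
        relation_get,
        relation_ids,
    )

    def peer_ips(reltype='hanode'):
        # Skip peers whose private-address is still unset so a half-populated
        # relation never produces a 'None' entry in corosync.conf.
        ips = {}
        for rid in relation_ids(reltype):
            for unit in related_units(rid):
                addr = relation_get(attribute='private-address',
                                    unit=unit, rid=rid)
                if addr is None:
                    log('Ignoring peer {} with no private-address yet'.format(unit))
                    continue
                ips[unit] = addr
        return ips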
Stephan Pampel e2249d05e1 Set loglevel of "Pacemaker is ready" to TRACE
Closes-Bug: #1889482
Change-Id: Ie97d09f5bd319a4adf93abd44fc465c77fd20620
2021-08-20 14:41:37 +02:00
Zuul a5b408ff52 Merge "Safely delete node from ring" 2021-06-28 11:11:15 +00:00
David Ames 102d463aa3 Safely delete node from ring
Provide the delete-node-from-ring action to safely remove a known node
from the corosync ring.

Update the less safe update-ring action to avoid LP Bug #1933223 and
provide warnings in actions.yaml on its use.

Change-Id: I56cf2360ac41b12fc0a508881897ba63a5e89dbd
Closes-Bug: #1933223
2021-06-25 07:38:18 -07:00
Xav Paice d17fdd276e Add option for no-quorum-policy
Adds a config item for what to do when the cluster does not have quorum.
This is useful with stateless services where, e.g., we only need a VIP
and that can be up on a single host with no problem.

Though this would be a good relation data setting, many sites would
prefer to stop the resources rather than have a VIP on multiple hosts,
causing arp issues with the switch.

Closes-bug: #1850829
Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a
2021-06-25 10:18:14 +12:00
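A hedged sketch of how such a config item maps onto the cluster property; the option name 'no_quorum_policy' is illustrative, not necessarily the charm's.

    import subprocess
    from charmhelpers.core.hookenv import config

    def configure_no_quorum_policy():
        # Apply the configured policy (e.g. 'stop' or 'ignore') cluster-wide.
        policy = config('no_quorum_policy') or 'stop'
        subprocess.check_call(['crm', 'configure', 'property',
                               'no-quorum-policy={}'.format(policy)])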
Alvaro Uria 457f88eda6 Adjust quorum after node removal
Add an `update-ring` action for that purpose.
Also print more on various pacemaker failures.
Also remove some dead code.

Func-Test-PR: https://github.com/openstack-charmers/zaza-openstack-tests/pull/369
Change-Id: I35c0c9ce67fd459b9c3099346705d43d76bbdfe4
Closes-Bug: #1400481
Related-Bug: #1874719
Co-Authored-By: Aurelien Lourot <aurelien.lourot@canonical.com>
Co-Authored-By: Felipe Reyes <felipe.reyes@canonical.com>
2021-03-11 17:24:01 +01:00
Aurelien Lourot a9191136dc Fully deprecate stonith_enabled config option
This has been deprecated since June 2020 thanks
to a 'blocked' message in assess_status_helper()
but this commit:
1. makes it clear in config.yaml, and
2. removes the corresponding already dead code.

Change-Id: Ia6315273030e31b10125f2dd7a7fb7507d8a10b7
2021-03-04 11:10:34 +01:00
Przemysław Lal 2e218038b8 Fix docstring typo
Fix minor typo in need_resources_on_remotes() method docstring.

Signed-off-by: Przemysław Lal <przemyslaw.lal@canonical.com>
Change-Id: I1e7c377957552b179576bd4ec20089661379e763
2021-02-22 16:31:16 +01:00
Ionut Balutoiu 4670f0effc Add config option for 'cluster-recheck-interval' property
This value is hard-coded to 60 seconds in the charm code.
This change adds a charm config option (with 60 secs as the
default value) in order to make the `cluster-recheck-interval`
property configurable.

Change-Id: I58f8d4831cf8de0b25d4d026f865e9b8075efe8b
2020-12-16 11:34:33 +00:00
Alex Kavanagh b8c9fc66b4 Sync libraries & common files prior to freeze
* charm-helpers sync for classic charms
* charms.ceph sync for ceph charms
* rebuild for reactive charms
* sync tox.ini files as needed
* sync requirements.txt files to sync to standard

Change-Id: I7c643447959cfd82234653fdbd2bab1c0594469c
2020-09-28 10:22:11 +01:00
Liam Young e02c6257ae Fix adding of stonith controlled resources.
There appears to be a window between a pacemaker remote resource
being added and the location properties for that resource being
added. In this window the resource is down and pacemaker may fence
the node.

The window is present because the charm currently does:

1) Set stonith-enabled=true cluster property
2) Add maas stonith device that controls pacemaker remote node that
   has not yet been added.
3) Add pacemaker remote node
4) Add pacemaker location rules.

I think the following two fixes are needed:

1) For initial deploys update the charm so it does not enable stonith
   until stonith resources and pacemaker remotes have been added.

2) For scale-out do not add the new pacemaker remote stonith resource
   until the corresponding pacemaker resource has been added along
   with its location rules.

Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490
Closes-Bug: #1884284
2020-09-09 09:35:30 +00:00
Liam Young 527fd2c704 Spread pacemaker remote resources across cluster.
Use location directives to spread pacemaker remote resources across
cluster. This is to prevent multiple resources being taken down in
the event of a single node failure. This would usually not be a
problem but if the node is being queried by masakari host
monitors at the time the node goes down then the query can hang.

Change-Id: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
Partial-Bug: #1889094
Depends-On: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
2020-09-08 11:35:25 +00:00
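A rough sketch of the location directives, with illustrative resource and node names; the real charm derives both from the cluster state.

    import subprocess

    def spread_remote_resources(remote_resources, cluster_nodes):
        # Prefer a different cluster member for each pacemaker-remote resource
        # so a single node failure takes down at most one remote.
        for i, res in enumerate(sorted(remote_resources)):
            node = sorted(cluster_nodes)[i % len(cluster_nodes)]
            subprocess.check_call([
                'crm', 'configure', 'location',
                'loc-{}-{}'.format(res, node), res, '200:', node])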
Liam Young b40a6754b0 Create null stonith resource for lxd containers.
If stonith is enabled then when a compute node is detected as failed
it is powered down. This can include a lxd container which is also
part of the cluster. In this case because stonith is enabled at a
global level, pacemaker will try and power off the lxd container
too. But the container does not have a stonith device and this causes
the container to be marked as unclean (but not down). This running
unclean state prevents resources from being moved and causes any
pacemaker-remotes that are associated with the lost container to lose
their connection, which prevents masakari hostmonitor from
ascertaining the cluster health.

The way to work around this is to create a dummy stonith device for
the lxd containers. This allows the cluster to properly mark the lost
container as down and resources are relocated.

Change-Id: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
Partial-Bug: #1889094
2020-09-03 10:58:48 +00:00
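A hedged sketch of the dummy stonith device; 'stonith:null' is the no-op fencing agent, while the primitive name and parameters here are illustrative.

    import subprocess

    def add_null_stonith(container_hosts):
        # Cover the container nodes with a no-op stonith device so pacemaker
        # can mark a lost container as cleanly 'down' and relocate resources.
        subprocess.check_call([
            'crm', 'configure', 'primitive', 'st-null', 'stonith:null',
            'params', 'hostlist={}'.format(','.join(sorted(container_hosts))),
            'op', 'monitor', 'interval=60s'])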
Zuul d2c7ed2821 Merge "Remove old configure_stonith" 2020-07-20 14:34:21 +00:00
Liam Young eaaa5e5bd8 Remove old configure_stonith
The old configure_stonith last worked on xenial. This patch bypasses
it, given that it is broken and is interfering with the new pacemaker
remote stonith resource setup. The code will be removed by a follow-up
patch. This patch is designed to be small for easy cherry-picking.

Change-Id: I2a704c1221bda242caaa5e87849d9984db3c6b71
Partial-Bug: #1881114
2020-06-25 05:10:26 +00:00
Liam Young 8f54a80d24 Remove hash from stonith resource name
The recent maas stonith resources, introduced to support stonith
with pacemaker-remotes, included a hash of the combined url and
api key in the resource name. But the charm only supports one
stonith resource (single maas_url and api key config options). Having
the hash makes managing the resources more complicated, especially
when the url or api key change. So remove any existing resource
(it is very unlikely there is one, as the feature is only just
out of preview state) and replace it with a single resource called
'st-maas'.

Change-Id: I053f1dc882eebfdef384cbbbfa7cabc82bce5f8b
2020-06-03 11:40:59 +00:00
Liam Young d860f3406c Check for peer series upgrade in pause and status
Check whether peers have sent series upgrade notifications before
pausing a unit. If notifications have been sent then HA services
will have been shut down and pausing will fail.

Similarly, if series upgrade notifications have been sent then
do not try and issue crm commands when assessing status.

Change-Id: I4de0ffe5d5e24578db614c2e8640ebd32b8cd469
Closes-Bug: #1877937
2020-05-11 10:56:22 +00:00
Liam Young 4c9887b38c Use fqdn to refer to remote nodes
Now that nova-compute uses an fqdn to identify the hypervisor *1
switch to registering remote nodes using their fqdn to keep
everything consistent. This is a requirement for masakari host
monitoring.

*1 1869bfbc97

Change-Id: I6cd55fdf21482de908920bcc4d399ac03f903d58
2020-03-10 13:32:37 +00:00
José Pekkarinen c99eed495c
Add support for maas_source_key for offline deployments.
Closes-Bug: #1856148
Change-Id: Id28d4c5c8c711ef53e9ec0422d80d23a6a844291
Signed-off-by: José Pekkarinen <jose.pekkarinen@canonical.com>
2020-02-26 09:59:15 +02:00
Liam Young d3512ef320 Stop HA services across units for series upgrade
Stop HA services across all units of an application when doing a
series upgrade to avoid the situation where the cluster has some
nodes on LTS N-1 and some on LTS N.

1) In the 'pre-series-upgrade' send a notification to peers informing
   them that the unit is doing a series upgrade and to which Ubuntu
   version.
2) Peers receive notification. If they are on a later Ubuntu version
   than the one in the notification then they do nothing. Otherwise
   they shut down corosync and pacemaker and add an entry to the local
   kv store with waiting-unit-upgrade=True.
3) In the 'post-series-upgrade' hook the notification is removed from
   the peer relation. waiting-unit-upgrade is set to False and
   corosync and pacemaker are started.

The result of this is that when the first unit in the cluster starts
a series upgrade, all cluster services are shut down across all units.
They then rejoin the cluster one at a time when they have been
upgraded to the new version.

I added the waiting-unit-upgrade key to deal with the situation where
the first node clears the notification after it has successfully
upgraded; without the waiting-unit-upgrade key the peers would not know
they were in a mixed Ubuntu version cluster.

Change-Id: Id9167534e8933312c561a6acba40399bca437706
Closes-Bug: 1859150
2020-01-31 07:16:21 +00:00
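A hedged sketch of the flow in steps 1-3 above; relation, key, and helper names are illustrative rather than the charm's exact identifiers.

    from charmhelpers.core import unitdata
    from charmhelpers.core.hookenv import relation_ids, relation_set
    from charmhelpers.core.host import lsb_release, service_start, service_stop

    def pre_series_upgrade():
        # 1) Tell peers which Ubuntu series this unit is upgrading to.
        target = lsb_release()['DISTRIB_CODENAME']
        for rid in relation_ids('hanode'):
            relation_set(relation_id=rid,
                         relation_settings={'series-upgrading-to': target})

    def on_peer_upgrade_notification(peer_series_is_newer_or_equal):
        # 2) Peers not already on a later series shut down HA services and
        #    remember it in the local kv store (release comparison elided).
        if peer_series_is_newer_or_equal:
            service_stop('pacemaker')
            service_stop('corosync')
            unitdata.kv().set('waiting-unit-upgrade', True)

    def post_series_upgrade():
        # 3) Clear the flag and bring the cluster services back up.
        unitdata.kv().set('waiting-unit-upgrade', False)
        service_start('corosync')
        service_start('pacemaker')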
Alex Kavanagh bb4438cddb Fix minor spelling mistake in status message
... for the series upgrade status message.

Change-Id: I6e0b6e4945eabd8294600ffab103d85768b622a6
2019-12-09 16:07:18 +00:00
Ryan Beisner 6ed2bb0943
Standardize auxiliary file location across os-charms
Change-Id: Ifaa5453bc0703c77184184e05c53d21649f6b92e
Closes-Bug: #1843826
2019-09-12 15:51:49 -05:00
David Ames 9364440075 Make the workload status more robust
The current charm does not indicate to the end user when a specific
resource is not running. Neither does it indicate when a node is offline
or stopped.

Validate that configured resources are actually running and let the end
user know if they are not.

Closes-Bug: #1834263

Change-Id: I1171e71ae3b015b4b838b7ecf0de18eb10d7c8f2
2019-06-25 23:26:10 +00:00
Andrea Ieri e28f8a9adc Enable custom failure-timeout configuration
As explained here[0], setting failure-timeout means that the cib will 'forget'
that a resource agent action failed by setting failcount to 0:
- if $failure-timeout seconds have elapsed from the last failure
- if an event wakes up the policy engine (i.e. at the global resource
  recheck in an idle cluster)

By default the failure-timeout will be set to 0, which disables the feature;
however, this change allows for tuning.

[0] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_failure_response

Change-Id: Ia958a8c5472547c7cf0cb4ecd7e70cb226074b88
Closes-Bug: #1802310
2019-05-31 21:15:13 +00:00
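A hedged sketch of applying the configured value as a resource meta attribute via crm_resource; the config option name and the exact mechanism the charm uses are illustrative.

    import subprocess
    from charmhelpers.core.hookenv import config

    def set_failure_timeout(resource):
        # 0 keeps the upstream default behaviour (never forget failures).
        timeout = config('failure_timeout') or 0
        subprocess.check_call([
            'crm_resource', '--resource', resource, '--meta',
            '--set-parameter', 'failure-timeout',
            '--parameter-value', str(timeout)])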
Liam Young 09865333a4 Stop stonith-enabled being incorrectly set false.
Stonith is being disabled at the global cluster level despite it being needed
for pacemaker-remote nodes.

The legacy hacluster charm option 'stonith_enabled' covers the main 'member'
nodes and if it is set to false then stonith resources are not created for
them and the stonith-enabled cluster parameter is set to false. However, in a
masakari deploy stonith is not required for the member nodes but is for the
remote nodes. In this case stonith-enabled cluster option should be set to
true.

Change-Id: Ie1affa17dd3cfcd677aa866b6e3d1c1004bb13c9
Closes-Bug: #1824828
2019-04-15 15:13:24 +00:00
Liam Young 27699267ed Use IP supplied via relation for pacemaker remotes
When setting up resources for pacemaker remote nodes use the IP
address supplied by the remote node for communication. This
ensures that communication happens over the desired network
space.

Depends-On: I5aa6993ec702f97403d1a659e09a3fb2f5af4202
Change-Id: I9bb20b5f0b0d780fbf4cc0ac0e5f86fe277c4715
Closes-Bug: #1824514
2019-04-12 11:14:36 +00:00
Liam Young ed3ea84126 Install libmaas for stonith plugin
The libmaas stonith plugin uses libmaas which is supplied via
python3-libmaas. python3-libmaas is available from bionic onwards
so install it where possible.

Also, a drive-by fix to precreate the /etc/pacemaker dir if needed to fix
trusty installs.

Closes-Bug: #1823300
Closes-Bug: #1823302

Change-Id: Ib14146f7f667f9c52e11d222f950efcb7cb47a7f
2019-04-05 08:28:56 +00:00
Zuul 438677fe7e Merge "Add support for pacemaker-remotes" 2019-04-03 14:23:47 +00:00
Zuul 482f8bba3a Merge "New member_ready state in peer relation" 2019-04-03 14:21:29 +00:00
Zuul 7c1794c2c9 Merge "Add support for maas stonith" 2019-04-03 14:21:28 +00:00
Zuul 30ccef8c92 Merge "Add pacemaker authkey" 2019-04-03 14:15:47 +00:00
Liam Young e357f2a1b5 Add support for pacemaker-remotes
This change adds support for pacemaker-remotes joining the cluster
via the pacemaker-remote relation. The pacemaker-remotes can
advertise whether they should host resources. If the
pacemaker-remotes are only being used for failure detection
(as is the case with masakari host monitors) then they
will not host resources.

Pacemaker remotes are managed in the cluster as resources
which nominally run on a member in the main cluster. The
resource that corresponds to the pacemaker-remote is managed
via configure_pacemaker_remotes and configure_pacemaker_remote
functions.

If the pacemaker-remotes should not run resources then the
cluster needs to be switched to an opt-in cluster. In an
opt-in cluster location rules are needed to explicitly
allow a resource to run on a specific node. This behaviour
is controlled via the global cluster parameter 'symmetric-cluster',
which is set via the new method set_cluster_symmetry. The method
add_location_rules_for_local_nodes is used for creating these
explicit rules.

Change-Id: I0a66cfb1ecad02c2b185c5e6402e77f713d25f8b
2019-04-03 10:50:58 +00:00
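A rough sketch of the two behaviours the commit describes; the function names mirror the ones mentioned above, but the bodies are illustrative.

    import subprocess

    def set_cluster_symmetry(symmetric):
        # An opt-in cluster uses symmetric-cluster=false, so resources only
        # run where a location rule explicitly allows them.
        value = 'true' if symmetric else 'false'
        subprocess.check_call(['crm', 'configure', 'property',
                               'symmetric-cluster={}'.format(value)])

    def add_location_rules_for_local_nodes(resource, nodes):
        # Explicitly allow the resource on each real cluster member.
        for node in nodes:
            subprocess.check_call([
                'crm', 'configure', 'location',
                'loc-{}-{}'.format(resource, node), resource, 'inf:', node])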
Liam Young d3a16df5a9 New member_ready state in peer relation
This change adds a new 'member_ready' state to the peer relation.
The purpose of this is to indicate that a unit has rendered its
configuration and started its services. This is distinct from the
existing 'ready' indicator. 'ready' indicates that the unit can
render its config and start services and ensures that all units are
present prior to starting services. This seems to safeguard
against units starting too early and forming single node clusters or
small clusters with a sub-set of units. The 'member_ready' flag
is later in the process and if it is set it shows that this unit
has joined the cluster and can be referenced explicitly in any
resource configuration.

Change-Id: I80d42a628a3fe51bc6f8d02610031afd9386d7a4
2019-04-03 10:49:22 +00:00
Liam Young 3d34611e88 Add support for maas stonith
The change adds a stonith plugin for maas and method for creating
stonith resources that use the plugin.

Change-Id: I825d211d68facce94bee9c6b4b34debaa359e836
2019-04-03 10:49:22 +00:00
Liam Young f3873fe67f Add pacemaker authkey
To work with pacemaker remotes all pacemaker nodes (including the
remotes) need to share a common key in the same way that corosync
does. This change allows a user to set a pacemaker key via config
in the same way as corosync. If the pacemaker key value is unset
then the corosync key is used.

Change-Id: I75247e7f3af29fc0907a94ae8e1678bdb9ee64e2
2019-04-03 10:49:22 +00:00
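A hedged sketch of the key handling; the config option names and the base64 encoding are assumptions based on how the corosync key is handled.

    import base64
    import os
    from charmhelpers.core.hookenv import config

    def write_pacemaker_authkey():
        # Fall back to the corosync key when no pacemaker-specific key is set.
        key = config('pacemaker_key') or config('corosync_key')
        if not key:
            return
        os.makedirs('/etc/pacemaker', exist_ok=True)
        with open('/etc/pacemaker/authkey', 'wb') as f:
            f.write(base64.b64decode(key))
        os.chmod('/etc/pacemaker/authkey', 0o440)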
Chris MacNaughton fa9633f87a Ensure we have python3-jinja2
Also remove the conditional installs of python2 packages

Change-Id: I48e08c85d50080bf92d3b14882d9ce664fdb3621
2019-03-27 16:39:36 +01:00
Alex Kavanagh 02b406b6f3 Convert charm to Python 3
Change-Id: Ib7cc06b3b42f26f725a9ea79f09189cc72952d29
2019-03-14 12:40:07 +00:00
Andrea Ieri 81b386d379 Lower cluster-recheck-interval to 1 minute
The upstream default for cluster-recheck-interval is 15 minutes. This
renders any quick timer (i.e. failure-timeout) effectively meaningless,
as in many cases its expiration will only be checked 4 times per hour.
This commit lowers cluster-recheck-interval to 1 minute, which ensures a
fine enough granularity for specifying timers on a scale of minutes,
without overly stressing the policy engine.

Change-Id: I756d53e7267e6fa1b6fb348e219e30bafe757360
Closes-Bug: 1804667
2018-11-22 22:46:45 +01:00
Felipe Reyes 863bc4d05b After going into standby mode, give the resources time to be migrated
Retry up to 3 times with 5 seconds between attempts to let pacemaker
finish migrating resources off the node before failing.

Change-Id: Ia25489e7e702fe26cb3ee7d96c4cf2e53ead8a96
Closes-Bug: 1794992
2018-09-28 15:41:46 -03:00
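A minimal sketch of the retry behaviour, polling crm_mon output; the check is deliberately crude and illustrative.

    import subprocess
    import time

    def wait_for_resources_migrated(node, attempts=3, delay=5):
        # After putting the node in standby, give pacemaker a few chances to
        # move resources away before declaring failure.
        for _ in range(attempts):
            status = subprocess.check_output(['crm_mon', '-1'],
                                             universal_newlines=True)
            if 'Started {}'.format(node) not in status:
                return
            time.sleep(delay)
        raise RuntimeError('resources still active on {}'.format(node))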
David Ames ea6eb53059 Series Upgrade
Implement the series-upgrade feature, allowing moves between Ubuntu
series.

Change-Id: Idcc77b66e65633eb26e485e93ef7928b7f455ca8
2018-09-14 20:56:34 -07:00
Felipe Reyes 2f4a815856 Catch MAASConfigIncomplete and set workload message
validate_dns_ha() shouldn't have the side effect of setting the workload
status and message, so now it will raise MAASConfigIncomplete with a
message and ha_relation_changed() is catching the exception, setting the
workload in blocked status and short-circuiting the hook execution until
the user sets maas_url and maas_credentials.

Change-Id: I3f1bc6571d2461cb65f384f042423cfe33f4d2f8
Closes-Bug: 1790559
2018-09-03 22:56:20 -03:00
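A hedged sketch of the pattern described above; the validation logic and status message are illustrative.

    from charmhelpers.core.hookenv import log, status_set

    class MAASConfigIncomplete(Exception):
        """Raised when maas_url and maas_credentials are not both set."""

    def validate_dns_ha(cfg):
        # Raise instead of setting the workload status as a side effect.
        if not (cfg.get('maas_url') and cfg.get('maas_credentials')):
            raise MAASConfigIncomplete(
                'DNS HA requires both maas_url and maas_credentials to be set')

    def ha_relation_changed(cfg):
        try:
            validate_dns_ha(cfg)
        except MAASConfigIncomplete as exc:
            log(str(exc))
            status_set('blocked', str(exc))
            return  # short-circuit until the user completes the config
        # ... continue configuring DNS HA resources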
Trent Lloyd 8200221a30 Enforce no-quorum-policy=stop for all cluster sizes
Previously quorum was only enforced on clusters with 3 or more nodes
under the mistaken assumption that it is not possible to have quorum
with only 2 nodes. The corosync votequorum agent which is configured
allows for quorum in 2-node scenarios using the "two_node" option which
is already configured by the charm.

In this new scenario, corosync requires that both nodes are present in
order to initially form cluster quorum, but also allows a single
surviving node to keep quorum or take over once it was already started
while in contact with the other node.

The net effect of this change is that nodes are unable to startup
independently (which is when split brain situations are frequently seen
due to network startup delays, etc). There is no change to the runtime
behavior (there is still a risk that both nodes can go active if the
network connection between them is interrupted, this is an inherrent
risk of two-node clusters and requires a 3-node cluster to fix).

Thus we update the CRM configuration to always set no-quorum-policy=stop
regardless of whether the cluster has 2 or 3+ nodes.

In the event that you need to startup a cluster manually with only 1
node, first verify that the second node is definitely either powered off
or that corosync/pacemaker and all managed resources are stopped (thus
we can be sure it won't go split brain, because it cannot startup again
until it is in contact with the other node). Then you can override
cluster startup using this command to temporarily set the expected votes
to 1 instead of 2:
$ corosync-quorumtool -e1

Once the second node comes back up and corosync reconnects, the expected
vote count will automatically be reset to the configured value (or if
corosync is restarted).

Change-Id: Ica6a3ba387a4ab362400a25ff2ba0145e0218e1f
2018-02-09 14:10:46 +08:00
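For reference, a hedged illustration of the votequorum stanza the commit refers to; the charm renders its own corosync.conf, so this fragment is illustrative only.

    quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
    }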
James Page 22d30c2294 Support use of json for relation data
In order to provide consistent presentation of relation data
in Python 3 based charms, support passing of data using JSON
format which ensures that data on relations won't continually
change due to non-deterministic dictionary iteration in py3.

Change-Id: I364a60ca7b91327fe88ee729cf49ff8ab3f5e2b6
Closes-Bug: 1741304
2018-01-04 18:01:28 +00:00
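A minimal sketch of deterministic JSON relation data, assuming the 'ha' relation; the key names are illustrative.

    import json
    from charmhelpers.core.hookenv import relation_ids, relation_set

    def set_ha_relation_data(resources, resource_params):
        # Serialise once with sorted keys so the relation value is stable
        # across hook invocations under Python 3.
        settings = {
            'json_resources': json.dumps(resources, sort_keys=True),
            'json_resource_params': json.dumps(resource_params, sort_keys=True),
        }
        for rid in relation_ids('ha'):
            relation_set(relation_id=rid, relation_settings=settings)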
Billy Olsen 8d81e41576 Change MAAS DNS OCF to read IP from file
The MAAS DNS ocf resource is created specifying the IP address
from the leader node. When the resource is moved to another node
the IP address is not updated because the DNS record already
points to the leader's IP address. This means that the DNS record
will never be updated with any other unit's IP address when the
resource is moved around the cluster.

The IP address to use for the DNS record should be stored in the
pacemaker resource as the configuration is the same across the
entire cluster. This change makes it so that the IP address that
should be bound is written to a file in /etc/maas_dns/$resource_name
and used by the ocf:maas:dns resource when managing the DNS
resource records.

Migration is handled by checking the current version of the MAAS
OCF resource file and determining if the OCF_RESOURCE_INSTANCE
(the name of the resource) is present in the file.

Change-Id: If4e07079dd66dac51cd77c2600106b9b562c2483
Closes-Bug: #1711476
2017-09-20 17:08:44 -07:00
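A minimal sketch of the node-local file the ocf:maas:dns resource now reads; the helper name mirrors the description above but is illustrative.

    import os

    def write_maas_dns_address(resource_name, resource_addr):
        # Each node stores its own address under /etc/maas_dns/<resource_name>
        # so the DNS record follows whichever node currently runs the resource.
        os.makedirs('/etc/maas_dns', exist_ok=True)
        path = os.path.join('/etc/maas_dns', resource_name)
        with open(path, 'w') as f:
            f.write('{}\n'.format(resource_addr))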