Commit Graph

263 Commits

Author SHA1 Message Date
Monty Taylor 5f8b586576 Retire stackforge/cookbook-pacemaker 2015-10-17 16:02:55 -04:00
Vincent Untz 5b4bb9262d Merge pull request #128 from aspiers/crm-wait-sync-master
use crm --wait to avoid races (stoney)
2014-08-01 14:14:32 -04:00
Vincent Untz 883a6ca4e7 Merge pull request #130 from vuntz/no-Stopped-on-retry
Do not force target-role to Stopped on retry of creation of a primitive
2014-08-01 11:54:43 -04:00
Vincent Untz 071dd4970d Always remove deprecated target-role meta on update of a primitive
We deprecated setting target-role values via the meta attribute, in
favor of :start/:stop actions on the resource. So this should not be
relied upon anymore, and it's safe to drop this if we want to.

There is one racy case where this matters:
  - node1 and node2 try to create a primitive with the chef resource; on
    initial creation, we set target-role='Stopped' because we do not
    want to autostart primitives.
  - because they can't create it at the same time, node2 will fail on
    creation. If the chef resource is configured to retry, then node2
    will then try to update the primitive (since it now exists); but the
    chef resource is not reloaded so still has target-role='Stopped'.
  - if node1 had also started the primitive before node2 retries the
    :create, then the target-role will be changed from 'Started' to
    'Stopped' with the update.

This can result in a primitive not being started with [:create, :start].
Therefore, we just delete this deprecated bit from meta to avoid any issue.

[Thu, 31 Jul 2014 15:11:55 +0200] INFO: Processing pacemaker_primitive[vip-public-cluster-proposal-1] action create (crowbar-pacemaker::haproxy line 15)
[Thu, 31 Jul 2014 15:11:56 +0200] INFO: Creating new primitive resource 'vip-public-cluster-proposal-1'
[Thu, 31 Jul 2014 15:11:56 +0200] INFO: Processing execute[crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped" ] action run (/var/chef/cache/cookbooks/pacemaker/libraries/chef/mixin/pacemaker/standard_cib_object.rb line 91)

================================================================================
Error executing action `run` on resource 'execute[crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped" ]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped"  ----
STDOUT:
STDERR: ERROR: vip-public-cluster-proposal-1: id is already in use
---- End output of crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped"  ----
Ran crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped"  returned 1

[...]

[Thu, 31 Jul 2014 15:11:56 +0200] INFO: Retrying execution of pacemaker_primitive[vip-public-cluster-proposal-1], 0 attempt(s) left
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: Checking existing primitive resource 'vip-public-cluster-proposal-1' for modifications
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: vip-public-cluster-proposal-1's ip params didn't change
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: vip-public-cluster-proposal-1's target-role meta changed from  to Stopped
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: Processing execute[crm_resource --resource vip-public-cluster-proposal-1 --set-parameter "target-role" --parameter-value "Stopped" --meta] action run (/var/chef/cache/cookbooks/pacemaker/providers/primitive.rb line 88)
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: execute[crm_resource --resource vip-public-cluster-proposal-1 --set-parameter "target-role" --parameter-value "Stopped" --meta] ran successfully
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: Processing execute[crm configure primitive vip-public-cluster-proposal-1 ocf:heartbeat:IPaddr2 params ip="192.168.126.5" meta target-role="Stopped" ] action nothing (/var/chef/cache/cookbooks/pacemaker/libraries/chef/mixin/pacemaker/standard_cib_object.rb line 91)
[Thu, 31 Jul 2014 15:12:02 +0200] INFO: Processing execute[crm_resource --resource vip-public-cluster-proposal-1 --set-parameter "target-role" --parameter-value "Stopped" --meta] action nothing (/var/chef/cache/cookbooks/pacemaker/providers/primitive.rb line 88)
2014-08-01 17:50:48 +02:00
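
The race described in 071dd4970d can be illustrated with a minimal sketch. It is not the cookbook's provider code: `meta_for_update` is a hypothetical helper, and the cluster state is modelled as a plain hash purely for illustration.

```ruby
# Hypothetical helper (not from the cookbook): drop the deprecated
# target-role key from the meta attributes before an update is applied.
def meta_for_update(meta)
  meta.reject { |key, _value| key == 'target-role' }
end

# Simulated cluster state after node1 has run [:create, :start].
cluster_meta = { 'target-role' => 'Started' }

# node2 still holds the meta it intended to use for the initial creation,
# because the chef resource is not reloaded before the retry.
stale_meta = { 'target-role' => 'Stopped' }

# Without the fix, the retried update overwrites the running state.
without_fix = cluster_meta.merge(stale_meta)

# With the fix, target-role is never part of the update.
with_fix = cluster_meta.merge(meta_for_update(stale_meta))

puts without_fix['target-role'] # prints "Stopped"
puts with_fix['target-role']    # prints "Started"
```

In this model the only thing that changes the outcome is whether the stale target-role value reaches the update, which is the case the commit removes.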
Vincent Untz b9cdda08b7 Start STONITH primitive resources
For some reason, sometimes, I see them not running after the cluster
creation. Since we're supposed to explicitly start all primitives, this
should work better.
2014-07-31 16:08:08 +02:00
Adam Spiers 9847e38ec0 use --wait with crm resource start/stop to avoid races
https://mailman.suse.de/mailman/private/ha-devel/2014-May/003603.html
(sorry - SUSE internal-only mailing list; but thread available on request)
2014-07-30 11:11:06 +01:00
Ralf Haferkamp 0b93f5ea16 Fix extraction of stonith parameter from parameter string
Pacemaker::Resource.extract_hash() is written to extract things from a complete
CIB object definition. Therefore it rightfully expects whitespaces before the
data_type ("params" in this case). So we just add those here to make the string
parsable by extract_hash.

https://bugzilla.novell.com/show_bug.cgi?id=887244
2014-07-17 14:51:45 +02:00
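
To show why the leading whitespace matters in 0b93f5ea16, here is a rough sketch of a parser with the same expectation. `extract_section` is a hypothetical stand-in written for this illustration; it is not the actual Pacemaker::Resource.extract_hash implementation, and the parameter names are made up.

```ruby
require 'shellwords'

# Hypothetical stand-in for a parser written for complete object
# definitions, where the section keyword always follows whitespace.
def extract_section(definition, section)
  match = definition.match(/\s+#{Regexp.escape(section)} (.+?)\s*$/)
  return {} unless match

  # Split on shell-style quoting, then on the first '=' of each pair.
  Shellwords.split(match[1]).each_with_object({}) do |pair, hash|
    key, value = pair.split('=', 2)
    hash[key] = value
  end
end

bare = 'params hostname="node1" userid="admin"'

# A bare parameter string has nothing before the keyword, so nothing matches.
p extract_section(bare, 'params') # prints {}

# Prepending whitespace makes the same string parsable.
p extract_section(" #{bare}", 'params')
```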
Balazs Kutil 68836e12bf Use more descriptive msg on sbd init check failure
Stonith + SBD deployment can fail due to SBD not being created on the
shared disk. The message does not really tell what went wrong and what
should be done to fix it. Use a more descriptive one.

The SBD creation is a destructive op and a typo could wipe the user
data, so this is the reason that sbd create is not run by default.

An alternative would be to run sbd create only after the user types
the volume name into the input field in the UI twice, for confirmation.
2014-07-02 14:38:20 +02:00
Adam Spiers a03aa062f8 fix Travis breakage
Travis started failing with:

  uninitialized constant RSpec::Matchers::BuiltIn::RaiseError::MatchAliases (NameError)

due to bundler installing rspec-expectations-2.99 which is a
pre-release:

  https://github.com/rubygems/rubygems/issues/853
2014-06-19 15:05:12 +01:00
Vincent Untz c9ebbebe3a sbd: Wait for cluster to be up after corosync restart
Every time we start corosync, we need to wait for the cluster to be up.
And when we use sbd, we restart corosync to start sbd, so we need to
wait there too.
2014-05-11 22:22:19 -04:00
Vincent Untz 3044ba1927 Merge pull request #66 from aspiers/default-stopped-resources
prevent LWRPs starting newly-created resources by default
2014-04-26 09:55:41 +02:00
Vincent Untz 46584ae2da pacemaker: Change default op timeout from 10 minutes to 1 minute
10 minutes is way too long and people have time to think that something
is just totally broken. 1 minute is more reasonable. And in the worst
case, an attribute can be set to change this now.
2014-04-17 16:19:10 +02:00
Vincent Untz c8567bc5cd Merge pull request #91 from aspiers/guard-superclasses
changes to superclasses should trigger all tests
2014-04-16 08:33:29 +02:00
Adam Spiers e662b98246 changes to superclasses should trigger all tests 2014-04-15 19:36:58 +01:00
Vincent Untz ead286b713 Merge pull request #81 from aspiers/move-vip-definition
move pacemaker_vip_primitive definition to crowbar-pacemaker cookbook
2014-04-15 09:30:13 +02:00
Vincent Untz 11d74e9846 sbd: Pass -P to sbd
This option is used to "Check Pacemaker quorum and node health" and is
recommended by our HA experts.
2014-04-11 08:53:36 +02:00
Vincent Untz 3d2a86e754 stonith: Ease deployment with SBD
While we cannot fully deploy SBD fencing (the user needs to provide the
devices and to make sure there's a watchdog for this hardware), we can
help a bit by doing the last few bits.
2014-04-11 08:53:36 +02:00
Adam Spiers fdebd24117 newly created resources should not be started
In Pacemaker, target-role defaults to 'Started', but we want to allow
consumers of the LWRPs the choice whether their newly created resource
gets started or not, and we also want to adhere to the Principle of
Least Surprise.  Therefore we stick to the intuitive semantics that

  action :create

creates the resource with target-role="Stopped" in order to prevent it
from starting immediately, whereas

  action [:create, :start]

creates the resource and then starts it.

Since we are honouring :start / :stop actions to determine the
target-role value, if target-role is specified via meta, it will just be
overridden anyway.  So we also deprecate direct use of target-role meta
parameter in recipes.
2014-04-10 23:53:20 +01:00
Adam Spiers c340bd4be2 avoid test messing with fixture
Assigning to `fixture` means that the change to the fixture can leak
out to other tests, so we assign to a temporary variable instead.
2014-04-10 23:34:09 +01:00
Adam Spiers c08fb8a3bd use crm --force to ensure start/stop in batch mode
We want to allow a resource X to be created without starting it, and
then for a group/clone Y to be created referencing X, which can
subsequently be started.  This ensures that child resources are only
started via their parent, respecting the constraints their parent needs
to impose (e.g. ordering if the parent is a group).

However in this case, changing target-role on Y causes a "Do you want to
override target-role for child resource ..." interactive prompt.  When
STDIN is not a tty, --force is required to ensure that the target-role
on the child resource is overridden.  This only works with newer versions of crm which
are patched according to:

  https://bugzilla.novell.com/show_bug.cgi?id=868697
2014-04-10 21:33:13 +01:00
Adam Spiers 1b4357d278 move pacemaker_vip_primitive definition to crowbar-pacemaker cookbook
The pacemaker_vip_primitive definition only works in a Crowbar
environment (due to retrieving the VIP name/address from Crowbar's
data bag for the relevant network), so it belongs in the
crowbar-pacemaker cookbook not the pacemaker cookbook.
2014-04-08 19:17:53 +01:00
Vincent Untz 90ec14e7c3 Improve naming of stonith resource in shared mode 2014-04-07 09:54:54 +02:00
Vincent Untz 5b5e96f417 ha: Update naming scheme for vip resources
We prefix with vip- so that it's a bit clearer.
2014-04-07 09:03:20 +02:00
Adam Spiers f705052ec7 Merge pull request #77 from aspiers/guard-mixins
changes to mixins require a full test run
2014-04-04 17:53:50 +01:00
Adam Spiers 3f13c02a65 Merge pull request #72 from aspiers/guard-recipes
Guard recipes
2014-04-04 17:53:13 +01:00
Adam Spiers cbb143b7ca changes to mixins require a full test run 2014-04-04 17:50:28 +01:00
Vincent Untz de7ef4c45c Merge pull request #76 from aspiers/vip-definition-network-error
make pacemaker_vip_primitive more defensive
2014-04-04 14:39:28 +02:00
Vincent Untz 0ef38c7109 Merge pull request #75 from aspiers/ignore-pacemaker-tmp
add tmp/ to .gitignore
2014-04-04 14:39:21 +02:00
Adam Spiers 5bf65a0ba8 add tmp/ to .gitignore
For example, guard creates tmp/rspec_guard_result, and we
do not want to see that in `git status`.
2014-04-04 11:51:42 +01:00
Adam Spiers 220389b697 make pacemaker_vip_primitive more defensive
Currently we are seeing pacemaker_vip_primitive occasionally
generate NoMethodError exceptions like:

  undefined method `[]' for nil:NilClass

Clearly we need more helpful error messages when things go wrong.
2014-04-04 11:48:53 +01:00
Vincent Untz e014361a8c Add support for mail notifications
Inspired by
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-notification-email
2014-04-04 08:51:20 +02:00
Adam Spiers 6983740487 rerun LWRP tests when recipes change 2014-04-03 16:22:55 +01:00
Adam Spiers 49ccaaf757 remove unused parentheses 2014-04-03 16:20:58 +01:00
Vincent Untz dabbb24445 Merge pull request #56 from aspiers/fix-travis-badge
fix Travis build badge
2014-03-29 13:18:25 +01:00
Adam Spiers 8e9b532b5f fix Travis build badge 2014-03-29 11:47:59 +00:00
Vincent Untz dd48572ffe Refer to "Fencing agent", not "STONITH plugin"
This is the recommendation from HA people.
2014-03-28 23:49:03 +01:00
Vincent Untz ab21bdf3a8 Do not use a clone resource for shared STONITH plugin
This is not required anymore according to our HA experts.

We therefore rename everything to "shared" instead of "clone".
2014-03-28 23:49:03 +01:00
Vincent Untz 3016e33d0f For per_node STONITH, be a bit clever for auto-generation of host* param
Some STONITH plugins take hostname as a param, while others take hostlist.
And we can't really guess that. So for now, we check if there's a
hostname or hostlist key in the params, and if one of them is there, we
don't do anything. If neither is there, we try with hostname.
2014-03-28 23:49:02 +01:00
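
The host parameter rule from 3016e33d0f can be sketched as follows. The method name `with_host_param` and the sample parameters are assumptions made for this example; only the hostname/hostlist rule itself comes from the commit message.

```ruby
# Keys that a STONITH plugin may use to name the node(s) it can fence.
HOST_KEYS = %w[hostname hostlist].freeze

# Returns params unchanged when one of HOST_KEYS is already present;
# otherwise returns a copy with 'hostname' set to the node name.
def with_host_param(params, node_name)
  return params if HOST_KEYS.any? { |key| params.key?(key) }

  params.merge('hostname' => node_name)
end

# Hypothetical per-node parameters, for illustration only.
explicit = { 'hostlist' => 'node1 node2', 'userid' => 'admin' }
implicit = { 'userid' => 'admin' }

p with_host_param(explicit, 'node1')
p with_host_param(implicit, 'node1')
```

The first call leaves the parameters alone because hostlist is already set; the second falls back to hostname, as the commit message describes.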
Vincent Untz 7722560153 For per_node STONITH, have each node configure its own resources only
We have some races when creating STONITH resources because nodes are
creating resources at the same time. It turns out that, in the case of
crowbar, we can have each node configure its own resources only because
we know all nodes will run the code.
2014-03-28 23:49:02 +01:00
Vincent Untz 3e04b36985 Wait for corosync to be ready before doing anything
On the initial run, it takes 20-30 seconds before corosync is ready to
accept changes with "crm configure". So wait for that before we continue
(and apply crm-initial.conf, for instance).
2014-03-28 23:49:02 +01:00
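
Waiting for readiness, as in 3e04b36985, usually comes down to a polling loop with a deadline. The sketch below is generic Ruby rather than the recipe's code: `wait_until` is a hypothetical helper, and the readiness probe is a stand-in block, since the commit message does not say which command is used to detect that corosync is ready.

```ruby
# Polls the given block until it returns a truthy value or the timeout
# (in seconds) expires. Returns true on success, false on timeout.
def wait_until(timeout:, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Stand-in probe: reports "ready" on the third call.
attempts = 0
ready = wait_until(timeout: 5, interval: 0.01) do
  attempts += 1
  attempts >= 3
end

puts ready # prints "true"
```

A monotonic clock is used for the deadline so that wall-clock adjustments during the wait do not shorten or extend it.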
Vincent Untz 4915a3dd62 Move STONITH check to second pass 2014-03-28 23:49:02 +01:00
Vincent Untz 95563f791a Check that the specified STONITH plugin is available
We simply need to check if the plugin is in the output of "stonith -L".
2014-03-28 22:10:49 +01:00
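
A minimal sketch of the availability check from 95563f791a, assuming the listing contains one plugin name per line. `plugin_listed?` is a hypothetical helper and the sample listing is made up; in a real run the text would come from the output of "stonith -L".

```ruby
# Returns true when the plugin name appears as a whole line in the listing.
# Matching whole lines avoids false positives on partial names.
def plugin_listed?(listing, plugin)
  listing.each_line.map(&:strip).include?(plugin)
end

# Made-up sample listing, for illustration only.
sample = "external/ipmi\nexternal/sbd\nssh\n"

puts plugin_listed?(sample, 'external/sbd') # prints "true"
puts plugin_listed?(sample, 'external')     # prints "false"
```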
Vincent Untz 80e67ed039 Add a STONITH recipe
By default, STONITH is marked as disabled. It can be enabled and
configured:

 - manually (by the user)
 - with a stonith plugin that will be used with a clone resource
 - with a stonith plugin that will have per-node parameters
2014-03-28 22:01:32 +01:00
Vincent Untz 5044f71539 Make stonith-enabled/no-quorum-policy variables in crm-initial.conf
The values will change depending on options in the wrapper, or other
factors.
2014-03-28 22:01:31 +01:00
Vincent Untz e388359f6e Only load crm-initial.conf when the file got changed
There is no point in loading it if it's the same as before, since we
already imported these settings.

We load it immediately because we know that corosync is running already,
and it's a critical file that we do not want to fail loading because of
a later crash in chef-client.
2014-03-28 22:01:31 +01:00
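
The "only load when changed" idea from e388359f6e can be sketched with a content digest. This is an illustration only: `ConfigLoader` is a hypothetical class, the sample content is made up, and the recipe itself may rely on Chef's own change detection rather than anything like this.

```ruby
require 'digest'

# Loads configuration content only when it differs from the last load.
class ConfigLoader
  attr_reader :loads

  def initialize
    @last_digest = nil
    @loads = 0
  end

  # Returns true if the content was (re)loaded, false if it was unchanged.
  def apply(content)
    digest = Digest::SHA256.hexdigest(content)
    return false if digest == @last_digest

    @loads += 1 # placeholder for the real import step
    @last_digest = digest
    true
  end
end

loader = ConfigLoader.new
first  = loader.apply("property stonith-enabled=false\n")
second = loader.apply("property stonith-enabled=false\n")
third  = loader.apply("property stonith-enabled=true\n")

puts loader.loads # prints "2"
```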
Adam Spiers 5140ad4cb6 add Pacemaker-aware alternative service provider
This is an alternative provider for Chef's "service" platform resource.
It allows us to make an existing service resource Pacemaker-aware by
adding a single line to the service block.
2014-03-28 15:29:53 +00:00
Vincent Untz 73edd77d1a Merge pull request #51 from aspiers/order
refactoring and implement order LWRP
2014-03-26 12:02:49 +01:00
Adam Spiers cf0cd1a9fc implement order LWRP using library code 2014-03-25 18:37:51 +00:00
Adam Spiers d00a17fb1e implement Pacemaker::Constraint::Order 2014-03-25 18:37:50 +00:00
Adam Spiers e56a821156 test delete action better, even for non-runnable LWRPs 2014-03-25 18:37:50 +00:00