When /var/lib/nova/instances resides on NFS, we have seen migrations
fail with 'Failed to get "write" lock - Is another process using the
image' errors.
This has been tracked down to grace/lease timeouts not having expired
before attempting the migration/evacuate, so in these cases it might be
desirable to delay the nova evacuate call to give the storage time to
release the locks.
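
A minimal sketch of such a delay, assuming a hypothetical evacuate_delay
resource parameter (seen by the agent as OCF_RESKEY_evacuate_delay) and a
hypothetical wrapper name. The actual evacuate command is passed in as
arguments:

```shell
# Hypothetical parameter: seconds to wait before evacuating (default: none).
: "${OCF_RESKEY_evacuate_delay:=0}"

# Wait for the configured delay so the storage can release its locks,
# then run the command given as arguments.
delayed_evacuate() {
    local delay=${OCF_RESKEY_evacuate_delay}
    if [ "$delay" -gt 0 ]; then
        sleep "$delay"
    fi
    "$@"
}
```

With the delay left at 0 the wrapper runs the command immediately, so
setups that are not on NFS behave as before.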
Change-Id: Ie2fe784202d754eda38092479b1ab3ff4d02136a
Resolves: rhbz#1740069
In all the chaos of complicated patch disentanglement and rebasing,
I forgot to actually even syntax-check the thing :-(
Change-Id: Ib0a31efe2ff75fc55cf67d4de73e74ebafb219b0
Neutron already takes care of HA for DHCP agents. See the Neutron setting
`dhcp_agents_per_network` in `neutron.conf`. Having the OCF script call the
DHCP replication function of neutron-ha-tool would mean that the tool
tries to plug each network into each DHCP agent, which contradicts Neutron's
settings.
Change-Id: I87d9f7010092178c1677e14456b5e2606e5830dc
We don't need to run crudini twice to get the same config item; instead,
just remember the result of the first time.
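
A minimal sketch of the idea, not the actual change: the function name
and the messages are hypothetical, and the file, section and key are
whatever the caller passes in.

```shell
# Look up a config item once and reuse the stored value.
describe_config_item() {
    local file=$1 section=$2 key=$3 value
    # Single crudini call; the result is kept in $value for all later uses.
    value=$(crudini --get "$file" "$section" "$key" 2>/dev/null)
    if [ -n "$value" ]; then
        echo "$key is set to $value"
    else
        echo "$key is not set"
    fi
}
```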
Change-Id: I7591f5c7d1474447e29861e499d04b4b5bdb2a27
Apache-2.0 is the recommended license for OpenStack Big Tent
projects (see https://governance.openstack.org/reference/licensing.html)
and this simplifies the licensing of the overall git repo
quite a bit by removing an exception clause.
Change-Id: I827eb91fd18ced1848439d573cfe6df16ed27748
Closes-Bug: #1564844
When neutron routers need migration, make neutron-ha-tool's monitor
action return OCF_ERR_GENERIC not OCF_NOT_RUNNING. This is based on the
OCF Resource Agent Developer’s Guide, which says in the section for
OCF_ERR_GENERIC:
The action returned a generic error. A resource agent should use
this exit code only when none of the more specific error codes,
defined below, accurately describes the problem.
The cluster resource manager interprets this exit code as a soft
error. This means that unless specifically configured otherwise, the
resource manager will attempt to recover a resource which failed
with OCF_ERR_GENERIC in-place — usually by restarting the resource
on the same node.
-- http://www.linux-ha.org/doc/dev-guides/_literal_ocf_err_generic_literal_1.html
and also in the section for OCF_NOT_RUNNING:
If the resource is not running due to an error condition, the
monitor action should instead return one of the OCF_ERR_ exit codes
or OCF_FAILED_MASTER.
-- http://www.linux-ha.org/doc/dev-guides/_literal_ocf_not_running_literal_7.html
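
As a rough sketch of the resulting monitor logic: routers_need_migration
is a hypothetical placeholder for the real neutron-ha-tool check, and
the exit code values are the standard OCF ones that ocf-shellfuncs
normally provides.

```shell
# Standard OCF exit codes, defaulted here so the sketch runs standalone.
: "${OCF_SUCCESS:=0}" "${OCF_ERR_GENERIC:=1}" "${OCF_NOT_RUNNING:=7}"

# Hypothetical placeholder: succeeds when some router needs migrating.
routers_need_migration() {
    return 1
}

neutron_ha_tool_monitor() {
    if routers_need_migration; then
        # An error condition, not a cleanly stopped resource.
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```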
Change-Id: I55f78a5c341a8a552e06a252a9c6836877c0cf77
Make it clearer what the risks of not using shared storage are.
Information is based on:
http://docs.openstack.org/user-guide-admin/cli_nova_evacuate.html
which says "The command rebuilds the instance from the original image or
volume" but later says that "To preserve the user disk data on the
evacuated server, deploy Compute with a shared file system" and then use
--on-shared-storage.
Change-Id: I09600414eb0d7fff1cf301b11b3fa9a76fc08c77
https://github.com/SUSE-Cloud/cookbook-openstack-network/pull/1
adds (amongst many other things) support for neutron-ha-tool to retry
its connections to neutron-server. By taking advantage of this in
this OCF RA, we can make failover more robust.
Signed-off-by: Adam Spiers <aspiers@suse.com>
Change-Id: I41c37500f691e2e0ecfd6c31f1720f483513e447
The OpenStack mailing list moved from launchpad to openstack.org quite a
long time ago.
Change-Id: I8fcc16d223891c3cd12289b5ccd6a6a674bd2255
Signed-off-by: Adam Spiers <aspiers@suse.com>
neutron-ha-tool.py is now being maintained in the
neutron-ha-tool-maintenance branch of this fork:
https://github.com/SUSE-Cloud/cookbook-openstack-network/
One of the new changes in that branch is the option to obtain
os_password from /etc/neutron/os_password instead of from the Pacemaker
CIB. This is more secure and also avoids quoting issues with crmsh when
the password has unusual characters:
29e9759937
When we are using that approach, os_password is not set on the
primitive, so we change this parameter to no longer be required,
in order to avoid warnings from crm_verify etc.
Change-Id: I6cd675fc744c7cfb444bf524c6d6d6444f8e4368
Signed-off-by: Adam Spiers <aspiers@suse.com>
neutron-ha-tool.py is no longer available in the original upstream, so
it is now being maintained in the neutron-ha-tool-maintenance branch of
this fork:
https://github.com/SUSE-Cloud/cookbook-openstack-network/
Change-Id: If5145d76bd703c1e9f44b5ee6433216715755702
The neutron-ha-tool Pacemaker resource primitive is only intended to be
run on a single node at a time, i.e. in active/passive mode, rather than
as a clone. However until now, the RA didn't change behaviour depending
on whether it was supposed to be active on the current node. So if
Pacemaker did a probe on a node where it was not expecting it to be
active, the monitor action would typically return OCF_SUCCESS, causing
messages from pengine like:
error: Resource neutron-ha-tool (ocf::neutron-ha-tool) is active on 2 nodes attempting recovery
warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
and then Pacemaker could attempt unnecessary recovery according to the
value of the cluster-wide "multiple-active" option, which defaults to
"stop-start". This would stop the resource everywhere (which is a
no-op), and then start it on one node, resulting in unnecessary cluster
transitions and unnecessary runs of this RA's "start" action.
To avoid this, we introduce a state file to keep track of whether it's
active on the current node, and if not, skip the l3-agent check and
always return OCF_NOT_RUNNING. This is the same technique already used
by NovaEvacuate.
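
A minimal sketch of the state file technique, under stated assumptions:
the function names, the state file path and check_l3_agents are
hypothetical stand-ins, not the RA's actual code.

```shell
# Standard OCF exit codes, defaulted here so the sketch runs standalone.
: "${OCF_SUCCESS:=0}" "${OCF_NOT_RUNNING:=7}"
# Hypothetical state file location.
: "${STATEFILE:=${HA_RSCTMP:-/tmp}/neutron-ha-tool.active}"

# Hypothetical placeholder for the real l3-agent check.
check_l3_agents() {
    return "$OCF_SUCCESS"
}

ha_tool_start() {
    touch "$STATEFILE"
}

ha_tool_stop() {
    rm -f "$STATEFILE"
}

ha_tool_monitor() {
    if [ ! -e "$STATEFILE" ]; then
        # Never started on this node, so a probe must not report success.
        return "$OCF_NOT_RUNNING"
    fi
    check_l3_agents
}
```

A probe on a passive node then sees no state file and reports
OCF_NOT_RUNNING without contacting Neutron at all.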
Change-Id: I459e49d27802552ef5424d290ef3fca51640723b
Closes-Bug: #1555711
Signed-off-by: Adam Spiers <aspiers@suse.com>
Self-explanatory. This was rescued from the unmerged pull request on
the old repository:
https://github.com/madkiss/openstack-resource-agents/pull/21
(cherry picked from commit a4ba41bd23f5386afb4c3c6608f0a31211b5c179)
Change-Id: Id487efdbf1ec93242c30d4ec157fd482abbfc8b5
We have lots of long heredoc lines in the OCF scripts, and older bashate
versions consider these to be E006 violations.
From version 0.5.0, bashate doesn't check heredocs, so we specify this
version as a dependency.
In addition, this commit turns on E006 violation checking again.
Change-Id: I1ff675dd587239f0b7fd65c15b8df57a39a2c72b
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
The currently available bashate releases treat heredocs as
normal code lines, hence lines longer than 79 columns in these sections
are also considered E006 violations. As the OCF scripts
contain lots of heredocs, we are affected by this behaviour.
However, there is a commit in the bashate repository (649c7dc79948)
which modifies bashate to ignore long lines in heredocs.
Currently there is no bashate release which contains the above commit,
so we ignore E006 errors until a new bashate version is released.
Change-Id: I33a9737ce1ec7eddab0b24ddedefe5c17da03b7a
Partial-Bug: #1550203
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
This commit fixes a bashate E010 violation:
[E] E010: The "do" should be on same line as for: ' for i in `ps -o pid --no-headers --ppid $pid`'
- /home/adam/SUSE/cloud/OpenStack/git/openstack-resource-agents/ocf/cinder-volume : L219
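
For illustration, the flagged loop in a form that satisfies E010, with
"do" on the same line as "for". The wrapper function name and the loop
body are hypothetical, and the backticks are written as $(...) here
purely as a style choice:

```shell
# Print the PIDs of the direct children of the given process.
list_child_pids() {
    local pid=$1 i
    for i in $(ps -o pid --no-headers --ppid "$pid"); do
        echo "$i"
    done
}
```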
Change-Id: I25b6e05336b1679818ad6f876bf94679a6d5ac10
Partial-Bug: #1550203
Signed-off-by: Adam Spiers <aspiers@suse.com>
This commit fixes bashate E003 (indents are a multiple of 4 spaces)
violations in the OCF scripts.
Partial-Bug: #1550203
Change-Id: I6fbc935bd5f9b383ca97c45f2dd89d7d33a5780f
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
This commit fixes bashate E002 (indents are only spaces, and not hard
tabs) violations
Partial-Bug: #1550203
Change-Id: I7d156d47023781be74e6fa8daef6ffc311b55d9d
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
This commit moves the syntax-check test from a make target to tox.ini
Change-Id: Id15320c589afea2b3a4a5cff5e7fa9c5c2b9d0b8
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
This commit implements a simple tox.ini configuration to run the bashate
style checker against all files in the ocf directory.
Partial-Bug: #1508559
Change-Id: I34b3fc108a86d902d0d856f632b5221e14f1f118
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
These can be quite useful in some setups.
This depends on https://github.com/ClusterLabs/fence-agents/pull/37
Change-Id: I2cfef0a4bf7f94f74041c8fee236788c7a110cc5
Signed-off-by: Vincent Untz <vuntz@suse.com>
When we check the availability of the Neutron API service, we now simply
check the response code of the "List API version" call instead of getting a
token from Keystone and then checking a Neutron API endpoint using that
token. This way we no longer need a token, so the check
does not depend on the availability of Keystone.
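
A minimal sketch of such a check using curl. The function name, the
timeout and the assumption that a healthy API root answers with HTTP 200
are illustrative, and the URL is supplied by the caller:

```shell
# Succeeds when an unauthenticated GET on the API root returns HTTP 200.
neutron_api_is_up() {
    local url=$1 code
    # Discard the response body and print only the numeric status code.
    code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
        --max-time 10 "$url")
    [ "$code" = "200" ]
}
```

Because no token is requested, a Keystone outage cannot make this check
fail on its own.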
Partial-Bug: #1511721
Change-Id: I5fee8d47bd8e9af9f415b9f74c4f9325ac99df2f
Signed-off-by: Norbert Illes <norbert.e.illes@ericsson.com>
When no evacuation has been done yet, we're spamming syslog with:
Could not query value of evacuate: attribute does not exist
So let's just filter this out, since it's known to be expected on
initial setup.
As this requires a bashism, also move the script to use bash.
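
One way to do this kind of filtering is bash process substitution, which
is the sort of bashism that rules out plain sh. This is a sketch under
that assumption, with a hypothetical wrapper name, and not necessarily
the exact construct used in the script:

```shell
# Run a command, dropping the one stderr line that is expected before any
# evacuation has happened. All other stderr output is passed through.
run_filtered() {
    "$@" 2> >(grep -v "attribute does not exist" >&2)
}
```

The wrapper returns the exit status of the wrapped command, so callers
can still tell a failed query from a successful one.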
Change-Id: I3351919febc0ef0101e4a08ce6eb412e3c7cfc76