Merge "Pausing Principal units and hacluster without sending false alerts"
This commit is contained in:
commit
85afbf9397
|
@ -0,0 +1,453 @@
|
||||||
|
..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html

======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================

Overall, the goal is to raise "warning" alerts instead of "critical" ones,
helping a human operator understand that not all services are completely
healthy, while reducing the severity during an on-going maintenance
operation. NRPE checks will be reconfigured once the services under
maintenance are set back to normal (resume).

The following logic will be applied when pausing/resuming a unit:

- Pausing a principal unit pauses the subordinate hacluster;
- Resuming a principal unit resumes the subordinate hacluster;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.


Problem Description
===================

We need to stop sending false alerts when the hacluster subordinate of an
OpenStack charm unit is paused, or when the principal unit itself is paused
for maintenance. This will help operators receive more meaningful alerts.

There are several charms that use hacluster and NRPE and may benefit from
this:

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy


Pausing Principal Unit
----------------------

If, for example, three keystone units (keystone/0, keystone/1 and keystone/2)
are deployed and keystone/0 is paused:

1) haproxy_servers on the other units (keystone/1 and keystone/2) will alert,
   because the apache2 service on keystone/0 is down;

2) haproxy, apache2.service and memcached.service on keystone/0 will also
   alert;

3) it's possible that corosync and pacemaker have placed the VIP on that same
   unit, at which point the service will fail because haproxy is disabled, so
   the hacluster subordinate unit should also be paused.

Note: the services affected when pausing a principal unit may change
depending on the principal charm.

Pausing hacluster unit
----------------------

Pausing hacluster sets the cluster node (e.g. keystone) into standby mode.
A standby node will have its resources stopped (haproxy, apache2), which will
trigger false alerts. To solve this issue, the hacluster units should inform
the keystone unit that they are paused. A way of doing this is through the ha
relation.


Proposed Change
===============

Pausing Principal Unit
----------------------

The pause action on a principal unit should share the event with its peers to
modify their behavior (until the resume action is triggered). It should also
share the status (paused/resumed) with the subordinate unit so that it can
catch up to the same status.

File actions.py in the principal unit:

.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        resume_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)

After pausing a principal unit, the unit-state-{unit_name} field will change
to NOTREADY, e.g.:

.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster
    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY

Note: the unit-state-{unit_name} field is already implemented; I'm just
proposing to use this field, changing the value to NOTREADY when a unit is
paused and back to READY when resumed.

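For context, the shared peer states can be read back with helpers that
already exist in charmhelpers.contrib.openstack.utils. A minimal sketch,
assuming the UNIT_NOTREADY constant and the get_peers_unit_state helper:

.. code-block:: python

    from charmhelpers.contrib.openstack.utils import (
        UNIT_NOTREADY,
        get_peers_unit_state,
    )


    def paused_peers():
        """Return peer unit names whose shared state is NOTREADY."""
        return [
            unit for unit, state in get_peers_unit_state().items()
            if state == UNIT_NOTREADY
        ]
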
With every unit knowing which one is paused, it is possible to change the
check_haproxy.sh script to accept a flag that reports the paused keystone
units as warnings. The Bash script currently cannot receive flags.

check_haproxy.sh could be rewritten from Bash to Python to accept a flag
warning that a specific hostname (e.g. check_haproxy.py --warning keystone-0)
is under maintenance.

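A minimal sketch of what such a rewrite could look like is shown below. The
CSV field offsets follow the standard haproxy stats interface; the socket
path, flag handling and output messages are assumptions:

.. code-block:: python

    #!/usr/bin/env python3
    # Hypothetical sketch of check_haproxy.py; the Nagios exit codes are
    # conventional, the socket path and flag handling are assumptions.
    import argparse
    import socket


    def get_stats(socket_path='/var/run/haproxy/admin.sock'):
        """Return the raw CSV output of 'show stat' from the stats socket."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(socket_path)
        sock.sendall(b'show stat\n')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b''.join(chunks).decode()


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--warning', default='',
            help='comma-separated hostnames under maintenance')
        args = parser.parse_args()
        maintenance = set(filter(None, args.warning.split(',')))

        down = []
        for line in get_stats().splitlines():
            fields = line.split(',')
            # svname is field 1 and status is field 17 in the stats CSV
            if len(fields) > 17 and fields[17] == 'DOWN':
                down.append(fields[1])

        critical = [s for s in down if s not in maintenance]
        warnings = [s for s in down if s in maintenance]
        if critical:
            print('CRITICAL: servers down: {}'.format(', '.join(critical)))
            raise SystemExit(2)
        if warnings:
            print('WARNING: under maintenance: {}'.format(
                ', '.join(warnings)))
            raise SystemExit(1)
        print('OK: all haproxy servers up')
        raise SystemExit(0)


    if __name__ == '__main__':
        main()
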
The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
first check whether there is any paused unit in the cluster, and then add the
warning flag if necessary:

.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """
        Add checks for each service in list

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        cmd = "check_haproxy.py"

        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]

        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))

        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))

        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes its state, e.g. from READY to NOTREADY, it's
necessary to rewrite the NRPE files on the other principal units in the
cluster because, otherwise, they won't be able to warn that a unit is under
maintenance.

File responsible for hooks in the classic charms:

.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # Logic added to update the NRPE config on all principal units
        # when a status is changed
        update_nrpe_config()

Note: in reactive charms this might be slightly different, using handlers,
but the main idea is to run update_nrpe_config every time a config in the
cluster is changed. This will prevent false alerts on the other units in the
cluster.

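A rough reactive-style equivalent, assuming a charms.reactive handler (the
flag name is illustrative and depends on the interface layer in use):

.. code-block:: python

    # Hypothetical reactive handler; the 'cluster.changed' flag name is
    # illustrative and depends on the interface layer in use.
    from charms.reactive import clear_flag, when


    @when('cluster.changed')
    def on_cluster_changed():
        # Re-render the NRPE checks whenever cluster data changes, so every
        # unit learns which peers are paused.
        update_nrpe_config()
        clear_flag('cluster.changed')

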
Services from Principal Unit
----------------------------

Removing the .cfg files for those services from /etc/nagios/nrpe.d when the
unit is paused would stop the critical errors. The downside of this approach
is that Nagios won't show user-friendly messages saying that the specific
services (apache2, memcached, etc.) are under maintenance; on the other hand,
it's simpler to achieve.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...

        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()

        # Added logic to remove the service checks while the unit is paused
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        # end of added logic

        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()

The new logic to remove those service checks is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

.. code-block:: python

    # Added logic to remove the checks for apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )

The status of these services will disappear from Nagios after a few minutes.
When the resume action is used, the services are initially restored as
PENDING, but after a few minutes the checks run again.

Pausing hacluster unit
----------------------

File actions.py in charm-hacluster:

.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # Logic added to inform keystone that the unit has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start.
        """
        resume_unit()
        # Logic added to inform keystone that the unit has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)


Pausing a hacluster unit would then share a new variable, paused, that can be
used in the principal units.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(), restart_functions=restart_function_map())
    def ha_changed():
        # Added logic to pause the keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())

            update_nrpe_config()
            inform_peers_if_ready(check_api_unit_ready)
            # Inform the subordinate unit that this unit is paused or resumed
            relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the NRPE config will be enough to trigger
the necessary logic to remove the service checks.

In a situation where the principal unit is paused, hacluster should also be
paused. For this to happen, it can use ha-relation-changed from
charm-hacluster:

.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # Inserted logic:
        # pause if the principal unit is paused
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()

        # Share the subordinate unit status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=is_unit_paused_set()
            )

Alternatives
------------

One alternative for the principal unit service checks is to change systemd.py
in charm-nrpe to accept a -w flag, like the proposal for check_haproxy.py.

This way it would not be necessary to remove the .cfg files for the principal
unit services, but it would be necessary to adapt the
`add_init_service_checks` function to accept services with the warning flag.

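A sketch of how `add_init_service_checks` could be adapted under this
alternative; the warn parameter and the -w flag on check_systemd.py are part
of the proposal, not existing options:

.. code-block:: python

    # Hypothetical adaptation; the warn parameter and the -w flag are
    # assumptions of this alternative, not existing options.
    def add_init_service_checks(nrpe, services, unit_name, warn=False):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                check_cmd = 'check_systemd.py %s' % svc
                if warn:
                    # Report WARNING instead of CRITICAL while the unit
                    # is under maintenance
                    check_cmd += ' -w'
                nrpe.add_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd=check_cmd
                )
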
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gabrielcocenza

Gerrit Topic
------------

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches
related to this spec.

.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts

Work Items
----------

- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

- charm-nrpe (alternative)

  - systemd.py

- charm-hacluster

  - actions.py

Repositories
------------

No new git repository is required.

Documentation
-------------

It will be necessary to document the impact of pausing/resuming a
subordinate hacluster and the side effects on OpenStack API charms.

Security
--------

No additional security concerns.

Testing
-------

Code changes will be covered by unit and functional tests. The functional
tests will use a bundle with keystone, hacluster, nrpe and nagios.

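A minimal sketch of such a functional test, assuming Zaza is used (the
subordinate unit name and the expected workload statuses are illustrative):

.. code-block:: python

    # Hypothetical Zaza-based functional test; unit names and expected
    # workload statuses are illustrative.
    import unittest

    import zaza.model


    class TestPauseResumeNoFalseAlerts(unittest.TestCase):

        def test_pause_keystone_pauses_hacluster(self):
            # Pausing the principal should also pause the subordinate
            zaza.model.run_action('keystone/0', 'pause')
            zaza.model.block_until_unit_wl_status(
                'keystone/0', 'maintenance')
            zaza.model.block_until_unit_wl_status(
                'keystone-hacluster/0', 'maintenance')

            # Resuming should bring both units back to active
            zaza.model.run_action('keystone/0', 'resume')
            zaza.model.block_until_unit_wl_status('keystone/0', 'active')
            zaza.model.block_until_unit_wl_status(
                'keystone-hacluster/0', 'active')
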
Dependencies
============

None