..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say for
  a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html
  To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html

======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================

Overall, the goal is to leave "warning" alerts instead of "critical"
ones, so that a human operator can see that not all services are
completely healthy while reducing the criticality caused by an on-going
maintenance operation. NRPE checks will be reconfigured once the
services under maintenance are set back to normal (resume).

The following logic will be applied when pausing/resuming a unit:

- Pausing a principal unit pauses the subordinate hacluster unit;
- Resuming a principal unit resumes the subordinate hacluster unit;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.

Problem Description
===================

We need to stop sending false alerts when the hacluster subordinate of
an OpenStack charm unit is paused, or when the principal unit itself is
paused for maintenance. This should help operators receive more
meaningful alerts.

There are several charms that use hacluster and NRPE and may benefit
from this:

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

Pausing Principal Unit
----------------------

If, for example, 3 keystone units (keystone/0, keystone/1 and
keystone/2) are deployed and keystone/0 is paused:

1) the haproxy_servers check on the other units (keystone/1 and
   keystone/2) will alert, because the apache2 service on keystone/0 is
   down;
2) the haproxy, apache2.service and memcached.service checks on
   keystone/0 will also alert;
3) it is possible that corosync and pacemaker have placed the VIP on
   that same unit, at which point the service will fail because haproxy
   is disabled. Therefore the hacluster subordinate unit should also be
   paused.

Note: the services affected when pausing a principal unit may vary
depending on the principal charm.

Pausing hacluster unit
----------------------

Pausing a hacluster unit sets the cluster node, e.g. keystone, in
standby mode. A standby node will have its resources stopped (e.g.
haproxy), which will fire false alerts. To solve this issue, the
hacluster units should inform the keystone unit that they are paused. A
way of doing this is through the ha relation.

Proposed Change
===============

Pausing Principal Unit
----------------------

The pause action on a principal unit should share the event with its
peers so that they can modify their behaviour (until the resume action
is triggered). It should also share the status (paused/resumed) with
the subordinate unit so that it can catch up to the same status.

File actions.py in the principal unit:

.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())
        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()
        # Logic added to inform the hacluster subordinate unit that this
        # unit has been paused (relation data is stored as strings)
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused='True')


    def resume(args):
        resume_unit_helper(register_configs())
        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()
        # Logic added to inform the hacluster subordinate unit that this
        # unit has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused='False')

After pausing a principal unit, the ``unit-state-{unit_name}`` field on
the cluster relation changes to NOTREADY. E.g.:

.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster
    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY

Note: the ``unit-state-{unit_name}`` field is already implemented; the
proposal is simply to use it, changing the value to NOTREADY when a
unit is paused and back to READY when it is resumed.

With every unit knowing which peers are paused, it is possible to
change the check_haproxy.sh script to accept a flag listing the
keystone units that are paused. The current Bash script does not accept
flags, so check_haproxy.sh could be rewritten from Bash to Python so
that it accepts a flag indicating which hostnames are under maintenance
(e.g. ``check_haproxy.py --warning keystone-0``). A minimal sketch of a
possible plugin is included further below.

The file nrpe.py in charmhelpers/contrib/charmsupport should be changed
to first check whether there is any paused unit in the cluster and then
add the warning flag if necessary:

.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """
        Add checks for each service in list

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        # get_peers_unit_state(), is_unit_paused_set() and UNIT_NOTREADY
        # would need to come from charmhelpers.contrib.openstack.utils
        # (or an equivalent shared helper); local_unit() comes from
        # charmhelpers.core.hookenv.
        cmd = "check_haproxy.py"
        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]
        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))
        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))
        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes its state, e.g. from READY to NOTREADY,
it is necessary to rewrite the NRPE files on the other principal units
in the cluster; otherwise they would not be able to warn that a unit is
under maintenance.

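The following is a minimal, illustrative sketch of what
``check_haproxy.py`` could look like. The socket path, stats parsing
and naming are assumptions for illustration (the real plugin would
reuse whatever stats source the current check_haproxy.sh uses), not the
final implementation:

.. code-block:: python

    #!/usr/bin/env python3
    # Illustrative sketch only: socket path and parsing are assumptions.
    import argparse
    import csv
    import socket
    import sys

    OK, WARNING, CRITICAL = 0, 1, 2
    HAPROXY_SOCKET = '/var/run/haproxy/admin.sock'  # assumed location


    def haproxy_stats(path=HAPROXY_SOCKET):
        """Return haproxy 'show stat' rows as a list of dicts."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        sock.sendall(b'show stat\n')
        data = b''
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
        sock.close()
        # The output starts with '# ' followed by the CSV header line.
        return list(csv.DictReader(data.decode().lstrip('# ').splitlines()))


    def main():
        parser = argparse.ArgumentParser(description='Check haproxy servers')
        parser.add_argument('-w', '--warning', default='',
                            help='comma separated hostnames under maintenance,'
                                 ' e.g. keystone-0,keystone-1')
        args = parser.parse_args()
        maintenance = set(filter(None, args.warning.split(',')))

        # Servers reported as DOWN (frontend/backend summary rows excluded).
        down = {row['svname'] for row in haproxy_stats()
                if row['svname'] not in ('FRONTEND', 'BACKEND')
                and row['status'].startswith('DOWN')}

        critical = down - maintenance
        if critical:
            print('CRITICAL: servers down: {}'.format(
                ', '.join(sorted(critical))))
            sys.exit(CRITICAL)
        if down:
            print('WARNING: servers under maintenance: {}'.format(
                ', '.join(sorted(down))))
            sys.exit(WARNING)
        print('OK: all haproxy servers up')
        sys.exit(OK)


    if __name__ == '__main__':
        main()

The exit codes follow the standard Nagios plugin convention (0 OK,
1 WARNING, 2 CRITICAL), which is what turns a paused unit into a
warning instead of a critical alert.
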
The file responsible for hooks in the classic charms:

.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # Logic added to update the NRPE config on all principal units
        # when a unit's state changes
        update_nrpe_config()

Note: in reactive charms this might look slightly different, using
handlers, but the main idea is to call update_nrpe_config every time a
configuration in the cluster changes. This will prevent false alerts on
the other units in the cluster.

Services from Principal Unit
----------------------------

Removing the ``.cfg`` files for those services from /etc/nagios/nrpe.d
when the unit is paused would stop the critical alerts. The downside of
this approach is that Nagios will not show a user-friendly message
saying that the specific services (apache2, memcached, etc.) are under
maintenance; on the other hand, it is simpler to implement.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...
        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()
        # Added logic to remove the service checks while the unit is
        # paused
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup, _services, current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup, _services, current_unit
            )
        # end of added logic
        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()

The new logic to remove those service checks is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

.. code-block:: python

    # Added logic to remove the checks for apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )

The status of the services will disappear from Nagios after a few
minutes. When the resume action is run, the checks are initially
restored as PENDING and are executed again after a few minutes.

Pausing hacluster unit
----------------------

File actions.py in charm-hacluster:

.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # Logic added to inform keystone that the unit has been paused
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused='True')


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start.
        """
        resume_unit()
        # Logic added to inform keystone that the unit has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused='False')

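The new behaviour of the action lends itself to a small unit test. The
sketch below is illustrative only; it assumes the action module is
importable as ``actions`` and that mocks are arranged the way the
existing charm-hacluster test suite does:

.. code-block:: python

    # Hypothetical unit test for the new pause action behaviour.
    from unittest import mock

    import actions


    @mock.patch('actions.relation_set')
    @mock.patch('actions.relation_ids', return_value=['ha:1'])
    @mock.patch('actions.pause_unit')
    def test_pause_informs_principal(pause_unit, relation_ids, relation_set):
        actions.pause(args=None)
        # The unit is paused and the principal is informed over the ha
        # relation.
        pause_unit.assert_called_once_with()
        relation_set.assert_called_once_with(relation_id='ha:1',
                                             paused='True')
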
Pausing a hacluster unit results in sharing a new ``paused`` variable
on the ha relation that can be used by the principal units.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(),
                       restart_functions=restart_function_map())
    def ha_changed():
        # Added logic to pause the keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())
                    update_nrpe_config()
                    inform_peers_if_ready(check_api_unit_ready)
            # Inform the subordinate unit whether this unit is paused or
            # resumed
            relation_set(relation_id=rid,
                         paused=str(is_unit_paused_set()))

Informing the peers and updating the NRPE configuration is enough to
trigger the logic that removes the service checks.

In a situation where the principal unit is paused, hacluster should
also be paused. For this to happen, the ha-relation-changed hook from
charm-hacluster can be used:

.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # Inserted logic: pause or resume this unit if the principal
        # unit has been paused or resumed
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()
        # Share the subordinate unit's status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=str(is_unit_paused_set()),
            )

Alternatives
------------

An alternative for the principal unit service checks is to change
check_systemd.py in charm-nrpe to accept a ``-w`` flag, like the
proposal for check_haproxy.py. This way it would not be necessary to
remove the ``.cfg`` files for the principal unit services, but the
``add_init_service_checks`` function would need to be adapted so that
it can add services with the warning flag.

Implementation
==============

Assignee(s)
-----------

Primary assignee: gabrielcocenza

Gerrit Topic
------------

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all
patches related to this spec.

.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts

Work Items
----------

- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy
- charm-nrpe (Alternative)

  - check_systemd.py

- charm-hacluster

  - actions.py

Repositories
------------

No new git repository is required.

Documentation
-------------

It will be necessary to document the impact of pausing/resuming a
subordinate hacluster unit and the side effects on OpenStack API
charms.

Security
--------

No additional security concerns.

Testing
-------

Code changes will be covered by unit and functional tests. The
functional tests will use a bundle with keystone, hacluster, nrpe and
nagios.

Dependencies
============

None