diff --git a/specs/yoga/approved/pausing-charms-hacluster-no-false-alerts.rst b/specs/yoga/approved/pausing-charms-hacluster-no-false-alerts.rst
new file mode 100644
index 0000000..0605c60
--- /dev/null
+++ b/specs/yoga/approved/pausing-charms-hacluster-no-false-alerts.rst
@@ -0,0 +1,453 @@
+..
+  Copyright 2021 Canonical Ltd
+
+  This work is licensed under a Creative Commons Attribution 3.0
+  Unported License.
+  http://creativecommons.org/licenses/by/3.0/legalcode
+
+..
+  This template should be in ReSTructured text. Please do not delete
+  any of the sections in this template. If you have nothing to say
+  for a whole section, just write: "None". For help with syntax, see
+  http://sphinx-doc.org/rest.html To test out your formatting, see
+  http://www.tele3.cz/jbar/rest/rest.html
+
+======================================================================
+Pausing Charms with subordinate hacluster without sending false alerts
+======================================================================
+
+The overall goal is to raise "warning" alerts instead of "critical" ones,
+so that a human operator can see that not all services are fully healthy
+while the severity is reduced during an on-going maintenance operation.
+NRPE checks will be reconfigured back to normal once the services under
+maintenance are resumed.
+
+The following logic will be applied when pausing/resuming a unit:
+
+- Pausing a principal unit pauses the subordinate hacluster unit;
+- Resuming a principal unit resumes the subordinate hacluster unit;
+- Pausing a hacluster unit pauses the principal unit;
+- Resuming a hacluster unit resumes the principal unit.
+
+
+Problem Description
+===================
+
+We need to stop sending false alerts when the hacluster subordinate of an
+OpenStack charm unit is paused, or when the principal unit itself is paused
+for maintenance. This helps operators receive more meaningful alerts.
+
+There are several charms that use hacluster and NRPE and may benefit from
+this:
+
+- charm-ceilometer
+- charm-ceph-radosgw
+- charm-designate
+- charm-keystone
+- charm-neutron-api
+- charm-nova-cloud-controller
+- charm-openstack-dashboard
+- charm-cinder
+- charm-glance
+- charm-heat
+- charm-swift-proxy
+
+
+Pausing Principal Unit
+----------------------
+
+If, for example, 3 keystone units (keystone/0, keystone/1 and keystone/2)
+are deployed and keystone/0 is paused:
+
+1) the haproxy_servers check on the other units (keystone/1 and keystone/2)
+   will alert, because the apache2 service on keystone/0 is down;
+
+2) the haproxy, apache2.service and memcached.service checks on keystone/0
+   will also alert;
+
+3) it is possible that corosync and pacemaker have the VIP placed on that
+   same unit, at which point the service will fail because haproxy is
+   disabled, so the hacluster subordinate unit should also be paused.
+
+Note: the services affected when pausing a principal unit may change
+depending on the principal charm.
+
+Pausing hacluster unit
+----------------------
+
+Pausing hacluster puts the cluster node, e.g. a keystone unit, in standby
+mode. A standby node has its resources stopped (e.g. haproxy), which fires
+false alerts. To solve this issue, the hacluster units should inform the
+keystone unit that they are paused. A way of doing this is through the ha
+relation.
+
+
+Proposed Change
+===============
+
+Pausing Principal Unit
+----------------------
+
+The pause action on a principal unit should share the event with its peers
+so that they can modify their behaviour (until the resume action is
+triggered). It should also share the status (paused/resumed) with the
+subordinate unit so that it can catch up to the same status.
+
+File actions.py in the principal unit:
+
+.. code-block:: python
+
+    def pause(args):
+        pause_unit_helper(register_configs())
+
+        # Logic added to share the event with peers
+        inform_peers_if_ready(check_api_unit_ready)
+        if is_nrpe_joined():
+            update_nrpe_config()
+
+        # Logic added to inform the hacluster subordinate that the unit
+        # has been paused
+        for r_id in relation_ids('ha'):
+            relation_set(relation_id=r_id, paused=True)
+
+
+    def resume(args):
+        resume_unit_helper(register_configs())
+
+        # Logic added to share the event with peers
+        inform_peers_if_ready(check_api_unit_ready)
+        if is_nrpe_joined():
+            update_nrpe_config()
+
+        # Logic added to inform the hacluster subordinate that the unit
+        # has been resumed
+        for r_id in relation_ids('ha'):
+            relation_set(relation_id=r_id, paused=False)
+
+After pausing a principal unit, the unit changes its unit-state-{unit_name}
+field to NOTREADY. E.g.:
+
+.. code-block:: yaml
+
+    juju show-unit keystone/0 --endpoint cluster
+    keystone/0:
+      workload-version: 17.0.0
+      machine: "1"
+      opened-ports:
+      - 5000/tcp
+      public-address: 10.5.2.64
+      charm: cs:~openstack-charmers-next/keystone-562
+      leader: true
+      relation-info:
+      - endpoint: cluster
+        related-endpoint: cluster
+        application-data: {}
+        local-unit:
+          in-scope: true
+          data:
+            admin-address: 10.5.2.64
+            egress-subnets: 10.5.2.64/32
+            ingress-address: 10.5.2.64
+            internal-address: 10.5.2.64
+            private-address: 10.5.2.64
+            public-address: 10.5.2.64
+            unit-state-keystone-0: NOTREADY
+
+Note: the unit-state-{unit_name} field is already implemented; this spec
+only proposes to use it, setting the value to NOTREADY when a unit is
+paused and back to READY when it is resumed.
+
+With every unit knowing which peers are paused, it becomes possible to
+change the check_haproxy.sh script to accept a flag listing the keystone
+units that are paused, so that the check can warn instead of going
+critical. The current Bash script cannot receive flags.
+
+check_haproxy.sh could therefore be rewritten from Bash to Python so that
+it can accept a flag indicating that a specific hostname is under
+maintenance (e.g. check_haproxy.py --warning keystone-0).
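+As an illustration only, the sketch below shows one possible shape for such
+a plugin. The stats URL, the CSV parsing and the exact status handling are
+assumptions made for this example; the real implementation would reuse
+whatever mechanism the current check_haproxy.sh uses to reach the HAProxy
+statistics endpoint.
+
+.. code-block:: python
+
+    #!/usr/bin/env python3
+    # Sketch of a possible check_haproxy.py. The stats URL and the status
+    # handling are assumptions for illustration purposes.
+    import argparse
+    import csv
+    import sys
+    import urllib.request
+
+    NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL = 0, 1, 2
+
+    # Assumed location of the HAProxy statistics page exposed by the charm.
+    STATS_URL = "http://localhost:8888/;csv"
+
+
+    def fetch_backend_rows(url=STATS_URL):
+        """Return the HAProxy stats CSV rows describing backend servers."""
+        with urllib.request.urlopen(url) as response:
+            reader = csv.DictReader(
+                line.decode("utf-8").lstrip("# ") for line in response)
+            return [row for row in reader
+                    if row.get("svname") not in ("FRONTEND", "BACKEND")]
+
+
+    def main():
+        parser = argparse.ArgumentParser(
+            description="Check HAProxy backend servers")
+        parser.add_argument(
+            "--warning", default="",
+            help="comma separated hostnames under maintenance, "
+                 "e.g. keystone-0")
+        args = parser.parse_args()
+        maintenance = {host for host in args.warning.split(",") if host}
+
+        down = [row for row in fetch_backend_rows()
+                if row.get("status", "").startswith("DOWN")]
+        # Servers that belong to paused units only downgrade the result to
+        # WARNING instead of raising CRITICAL.
+        critical = [r["svname"] for r in down
+                    if r["svname"] not in maintenance]
+        warning = [r["svname"] for r in down if r["svname"] in maintenance]
+
+        if critical:
+            print("CRITICAL: servers down: {}".format(", ".join(critical)))
+            sys.exit(NAGIOS_CRITICAL)
+        if warning:
+            print("WARNING: servers under maintenance: {}".format(
+                ", ".join(warning)))
+            sys.exit(NAGIOS_WARNING)
+        print("OK: all HAProxy backend servers up")
+        sys.exit(NAGIOS_OK)
+
+
+    if __name__ == "__main__":
+        main()
+
+The important point is that a backend server belonging to a unit listed in
+the warning flag only downgrades the check result; any other failed backend
+still raises a critical alert.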
+The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
+first check whether any unit in the cluster is paused and, if so, add the
+warning flag:
+
+.. code-block:: python
+
+    def add_haproxy_checks(nrpe, unit_name):
+        """Add checks for each service in list.
+
+        :param NRPE nrpe: NRPE object to add check to
+        :param str unit_name: Unit name to use in check description
+        """
+        cmd = "check_haproxy.py"
+
+        peers_states = get_peers_unit_state()
+        units_not_ready = [
+            unit.replace('/', '-')
+            for unit, state in peers_states.items()
+            if state == UNIT_NOTREADY
+        ]
+
+        if is_unit_paused_set():
+            units_not_ready.append(local_unit().replace('/', '-'))
+
+        if units_not_ready:
+            cmd += " --warning {}".format(','.join(units_not_ready))
+
+        nrpe.add_check(
+            shortname='haproxy_servers',
+            description='Check HAProxy {%s}' % unit_name,
+            check_cmd=cmd)
+        nrpe.add_check(
+            shortname='haproxy_queue',
+            description='Check HAProxy queue depth {%s}' % unit_name,
+            check_cmd='check_haproxy_queue_depth.sh')
+
+When a principal unit changes its state, e.g. from READY to NOTREADY, it is
+necessary to rewrite the NRPE files on the other principal units in the
+cluster; otherwise they will not be able to warn that a unit is under
+maintenance.
+
+File responsible for hooks in the classic charms:
+
+.. code-block:: python
+
+    @hooks.hook('cluster-relation-changed')
+    @restart_on_change(restart_map(), stopstart=True)
+    def cluster_changed():
+        # Logic added to update the NRPE config on all principal units
+        # whenever a peer's status changes
+        update_nrpe_config()
+
+Note: in reactive charms the wiring is slightly different and uses
+handlers, but the main idea is the same: run update_nrpe_config every time
+the cluster configuration changes. This prevents false alerts on the other
+units in the cluster.
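+For illustration, a reactive charm could wire this up roughly as follows.
+The flag names and the use of the charmhelpers NRPE helpers here are
+assumptions; the exact flags depend on the interfaces each charm uses.
+
+.. code-block:: python
+
+    # Rough sketch of an equivalent handler in a reactive charm; the flag
+    # names below are assumptions and will differ per charm.
+    import charms.reactive as reactive
+    import charmhelpers.contrib.charmsupport.nrpe as nrpe
+
+
+    @reactive.when('endpoint.cluster.changed')
+    @reactive.when('nrpe-external-master.available')
+    def update_nrpe_on_cluster_change():
+        """Rewrite the NRPE checks whenever a peer's state changes."""
+        hostname = nrpe.get_nagios_hostname()
+        current_unit = nrpe.get_nagios_unit_name()
+        nrpe_setup = nrpe.NRPE(hostname=hostname)
+        # add_haproxy_checks() picks up the peers' READY/NOTREADY state
+        # and adds the --warning flag when any unit is paused (see above).
+        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
+        nrpe_setup.write()
+        reactive.clear_flag('endpoint.cluster.changed')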
+Services from Principal Unit
+----------------------------
+
+Removing the .cfg files under /etc/nagios/nrpe.d for those services when
+the unit is paused would stop the critical alerts. The downside of this
+approach is that Nagios will not show a user friendly message saying that
+the specific services (apache2, memcached, etc.) are under maintenance; on
+the other hand, it is simpler to implement.
+
+File responsible for hooks in a classic charm:
+
+.. code-block:: python
+
+    @hooks.hook('nrpe-external-master-relation-joined',
+                'nrpe-external-master-relation-changed')
+    def update_nrpe_config():
+        # logic before change
+        # ...
+
+        nrpe_setup = nrpe.NRPE(hostname=hostname)
+        nrpe.copy_nrpe_checks()
+
+        # Added logic to remove the service checks when the unit is paused
+        if is_unit_paused_set():
+            nrpe.remove_init_service_checks(
+                nrpe_setup,
+                _services,
+                current_unit
+            )
+        else:
+            nrpe.add_init_service_checks(
+                nrpe_setup,
+                _services,
+                current_unit
+            )
+        # End of added logic
+
+        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
+        nrpe_setup.write()
+
+The new logic to remove those services is presented below.
+
+File charmhelpers/contrib/charmsupport/nrpe.py:
+
+.. code-block:: python
+
+    # Added logic to remove the checks for apache2, memcached, etc.
+    def remove_init_service_checks(nrpe, services, unit_name):
+        for svc in services:
+            if host.init_is_systemd(service_name=svc):
+                nrpe.remove_check(
+                    shortname=svc,
+                    description='process check {%s}' % unit_name,
+                    check_cmd='check_systemd.py %s' % svc
+                )
+
+The status of the services will disappear from Nagios after a few minutes.
+When the resume action is run, the services are initially restored as
+PENDING and the checks complete a few minutes later.
+
+Pausing hacluster unit
+----------------------
+
+File actions.py in charm-hacluster:
+
+.. code-block:: python
+
+    def pause(args):
+        """Pause the hacluster services.
+
+        @raises Exception should the service fail to stop.
+        """
+        pause_unit()
+        # Logic added to inform keystone that the unit has been paused
+        for r_id in relation_ids('ha'):
+            relation_set(relation_id=r_id, paused=True)
+
+
+    def resume(args):
+        """Resume the hacluster services.
+
+        @raises Exception should the service fail to start.
+        """
+        resume_unit()
+        # Logic added to inform keystone that the unit has been resumed
+        for r_id in relation_ids('ha'):
+            relation_set(relation_id=r_id, paused=False)
+
+Pausing a hacluster unit therefore shares a new variable, paused, that can
+be used by the principal units.
+
+File responsible for hooks in a classic charm:
+
+.. code-block:: python
+
+    @hooks.hook('ha-relation-changed')
+    @restart_on_change(restart_map(),
+                       restart_functions=restart_function_map())
+    def ha_changed():
+        # Added logic to pause/resume the keystone unit when the hacluster
+        # subordinate is paused/resumed
+        for rid in relation_ids('ha'):
+            for unit in related_units(rid):
+                paused = relation_get('paused', rid=rid, unit=unit)
+                clustered = relation_get('clustered', rid=rid, unit=unit)
+                if clustered and is_db_ready():
+                    if paused == 'True':
+                        pause_unit_helper(register_configs())
+                    elif paused == 'False':
+                        resume_unit_helper(register_configs())
+
+            update_nrpe_config()
+            inform_peers_if_ready(check_api_unit_ready)
+            # Inform the subordinate unit whether this unit is paused or
+            # resumed
+            relation_set(relation_id=rid, paused=is_unit_paused_set())
+
+Informing the peers and updating the NRPE config is enough to trigger the
+necessary logic to remove the service checks.
+
+In the situation where the principal unit is paused, hacluster should also
+be paused. For this to happen, the ha-relation-changed hook of
+charm-hacluster can be used:
+
+.. code-block:: python
+
+    @hooks.hook('ha-relation-joined',
+                'ha-relation-changed',
+                'peer-availability-relation-joined',
+                'peer-availability-relation-changed',
+                'pacemaker-remote-relation-changed')
+    def ha_relation_changed():
+        # Inserted logic: pause/resume if the principal unit is
+        # paused/resumed
+        paused = relation_get('paused')
+        if paused == 'True':
+            pause_unit()
+        elif paused == 'False':
+            resume_unit()
+
+        # Share the subordinate unit status
+        for rel_id in relation_ids('ha'):
+            relation_set(
+                relation_id=rel_id,
+                clustered="yes",
+                paused=is_unit_paused_set()
+            )
+
+Alternatives
+------------
+
+An alternative to removing the principal unit service checks is to change
+systemd.py in charm-nrpe to accept a -w flag, like the one proposed for
+check_haproxy.py.
+
+This way it would not be necessary to remove the .cfg files for the
+services of the principal unit, but the add_init_service_checks function
+would need to be adapted so that it can add services with the warning
+flag.
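+As a rough sketch of that alternative (the option name and the systemctl
+call are assumptions for illustration; the real plugin in charm-nrpe may
+inspect systemd differently), check_systemd.py could downgrade a stopped
+service to WARNING when the unit is flagged as being under maintenance:
+
+.. code-block:: python
+
+    #!/usr/bin/env python3
+    # Sketch only: downgrade a stopped service to WARNING when '-w' marks
+    # the unit as under maintenance.
+    import argparse
+    import subprocess
+    import sys
+
+    NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL = 0, 1, 2
+
+
+    def main():
+        parser = argparse.ArgumentParser(
+            description="Check a systemd service")
+        parser.add_argument(
+            "service", help="systemd unit to check, e.g. apache2")
+        parser.add_argument(
+            "-w", "--warn-only", action="store_true",
+            help="unit is under maintenance; report WARNING instead of "
+                 "CRITICAL when the service is stopped")
+        args = parser.parse_args()
+
+        # 'systemctl is-active' exits 0 only when the service is running.
+        active = subprocess.call(
+            ["systemctl", "is-active", "--quiet", args.service]) == 0
+
+        if active:
+            print("OK: {} is active".format(args.service))
+            sys.exit(NAGIOS_OK)
+        if args.warn_only:
+            print("WARNING: {} is stopped (unit under "
+                  "maintenance)".format(args.service))
+            sys.exit(NAGIOS_WARNING)
+        print("CRITICAL: {} is not active".format(args.service))
+        sys.exit(NAGIOS_CRITICAL)
+
+
+    if __name__ == "__main__":
+        main()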
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  gabrielcocenza
+
+Gerrit Topic
+------------
+
+Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches
+related to this spec.
+
+.. code-block:: bash
+
+    git-review -t pausing-charms-hacluster-no-false-alerts
+
+Work Items
+----------
+
+- charmhelpers
+
+  - nrpe.py
+  - check_haproxy.py
+
+- charm-ceilometer
+- charm-ceph-radosgw
+- charm-designate
+- charm-keystone
+- charm-neutron-api
+- charm-nova-cloud-controller
+- charm-openstack-dashboard
+- charm-cinder
+- charm-glance
+- charm-heat
+- charm-swift-proxy
+- charm-nrpe (Alternative)
+
+  - systemd.py
+
+- charm-hacluster
+
+  - actions.py
+
+Repositories
+------------
+
+No new git repository is required.
+
+Documentation
+-------------
+
+It will be necessary to document the impact of pausing/resuming a
+subordinate hacluster unit and the side effects on OpenStack API charms.
+
+Security
+--------
+
+No additional security concerns.
+
+Testing
+-------
+
+Code changes will be covered by unit and functional tests. The functional
+tests will use a bundle with keystone, hacluster, nrpe and nagios.
+
+Dependencies
+============
+
+None