..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html


======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================

The overall goal is to raise "warning" alerts instead of "critical" ones,
helping a human operator understand that not all services are completely
healthy, while reducing the severity during an ongoing maintenance
operation. NRPE checks will be reconfigured once the services under
maintenance are set back to normal (resume).

The following logic will be applied when pausing/resuming a unit:

- Pausing a principal unit pauses the subordinate hacluster unit;
- Resuming a principal unit resumes the subordinate hacluster unit;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.


Problem Description
===================

We need to stop sending false alerts when the hacluster subordinate of an
OpenStack charm unit is paused, or when the principal unit itself is paused
for maintenance. This helps operators receive more actionable alerts.

There are several charms that use hacluster and NRPE and may benefit from
this:

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy


Pausing Principal Unit
----------------------

If, for example, three keystone units (keystone/0, keystone/1 and
keystone/2) are deployed and keystone/0 is paused:

1) the haproxy_servers check on the other units (keystone/1 and keystone/2)
   will alert, because the apache2 service on keystone/0 is down;

2) the haproxy, apache2.service and memcached.service checks on keystone/0
   will also alert;

3) corosync and pacemaker may have placed the VIP on that same unit, at
   which point the service will fail because haproxy is disabled, so the
   hacluster subordinate unit should also be paused.

Note: the services affected when pausing a principal unit may vary
depending on the principal charm.

Pausing hacluster unit
----------------------

Pausing hacluster puts the cluster node, e.g. keystone, in standby mode.
A standby node has its resources stopped (haproxy, apache2), which fires
false alerts. To solve this issue, the hacluster units should inform the
keystone unit that they are paused. One way of doing this is through the
ha relation.


Proposed Change
===============

Pausing Principal Unit
----------------------

The pause action on a principal unit should share the event with its peers
so that they modify their behaviour (until the resume action is triggered).
It should also share its status (paused/resumed) with the subordinate unit
so that the subordinate can catch up to the same status.


File actions.py in the principal unit:

.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())

        # logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # logic added to inform the hacluster subordinate that the
        # unit has been paused
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        resume_unit_helper(register_configs())

        # logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # logic added to inform the hacluster subordinate that the
        # unit has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=False)


After pausing a principal unit, it will change unit-state-{unit_name}
to NOTREADY, e.g.:

.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster
    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY

Note: the unit-state-{unit_name} field is already implemented; the
proposal is simply to use this field, changing the value to NOTREADY when
a unit is paused and back to READY when it is resumed.

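
As an illustration, a helper along the lines of the `get_peers_unit_state`
used further below could derive each peer's state from the published
unit-state-{unit_name} keys. This is a hypothetical, simplified sketch:
the real helper would read the data via `relation_ids('cluster')` and
`relation_get`, while here the relation data is passed in as a plain
dictionary for clarity.

```python
# Hypothetical sketch only: `relation_data` stands in for what the Juju
# relation would return per peer; the real helper takes no arguments and
# reads the cluster relation itself.

UNIT_READY = "READY"
UNIT_NOTREADY = "NOTREADY"


def get_peers_unit_state(relation_data):
    """Map each peer unit to its advertised state.

    :param dict relation_data: {unit_name: {key: value}} as read from the
        cluster relation, where each peer publishes unit-state-<unit-name>.
    """
    states = {}
    for unit, data in relation_data.items():
        key = "unit-state-%s" % unit.replace("/", "-")
        # A peer that has not published a state is assumed READY.
        states[unit] = data.get(key, UNIT_READY)
    return states
```

A paused keystone/0 would then show up as NOTREADY while its peers remain
READY.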

With every unit knowing which peers are paused, it is possible to change
the check_haproxy.sh script so that it accepts a flag listing the keystone
units that are paused. The Bash script is currently unable to receive
flags.

check_haproxy.sh could be rewritten from Bash to Python so that it accepts
a flag marking a specific hostname as under maintenance (e.g.
check_haproxy.py --warning keystone-0).
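
A minimal sketch of what the rewritten check's flag handling could look
like. The function names and the exact alerting policy are assumptions,
and the parsing of the actual haproxy status output is left out; the
point is only to show how a --warning list of maintenance hosts could
downgrade the result from CRITICAL to WARNING.

```python
#!/usr/bin/env python3
# Sketch of flag handling for a hypothetical check_haproxy.py rewrite.
# The real haproxy backend probing is out of scope here.
import argparse

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Check HAProxy backends")
    parser.add_argument(
        "--warning",
        default="",
        help="comma-separated hostnames under maintenance "
             "(e.g. keystone-0,keystone-1)",
    )
    return parser.parse_args(argv)


def classify(down_backends, maintenance_hosts):
    """Return (exit_code, message) given the backends that are down."""
    unexpected = [b for b in down_backends if b not in maintenance_hosts]
    expected = [b for b in down_backends if b in maintenance_hosts]
    if unexpected:
        return CRITICAL, "CRITICAL: backends down: %s" % ",".join(unexpected)
    if expected:
        return WARNING, "WARNING: under maintenance: %s" % ",".join(expected)
    return OK, "OK: all backends up"
```

With this split, a backend that is down because its unit is paused only
produces a warning, while any other down backend still raises a critical
alert.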

The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
first check whether there is any paused unit in the cluster, and then add
the warning flag if necessary:

.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """Add checks for each service in list.

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        cmd = "check_haproxy.py"

        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]

        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))

        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))

        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes state, e.g. from READY to NOTREADY, it is
necessary to rewrite the NRPE files on the other principal units in the
cluster; otherwise, they will not be able to warn that a unit is under
maintenance.

File responsible for hooks in the classic charms:

.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # logic added to update the nrpe config on all principal units
        # when a unit's state changes
        update_nrpe_config()

Note: in reactive charms this might look slightly different, using
handlers, but the main idea is to call update_nrpe_config every time the
cluster configuration changes. This prevents false alerts on the other
units in the cluster.


Services from Principal Unit
----------------------------

Removing the .cfg files for these services from /etc/nagios/nrpe.d when
the unit is paused would stop the critical alerts. The downside of this
approach is that Nagios will not show a user-friendly message saying that
specific services (apache2, memcached, etc.) are under maintenance; on the
other hand, it is simpler to achieve.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...

        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()

        # added logic to remove service checks while the unit is paused
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        # end of added logic

        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()

The new logic to remove those service checks is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

.. code-block:: python

    # added logic to remove apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )

The status of these services will disappear from Nagios after a few
minutes. When the resume action is used, the services are restored,
initially as PENDING, and after a few minutes the checks run again.


Pausing hacluster unit
----------------------

File actions.py in charm-hacluster:

.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # logic added to inform keystone that the unit has been paused
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start.
        """
        resume_unit()
        # logic added to inform keystone that the unit has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=False)

Pausing a hacluster unit thus shares a new variable, paused, that can be
used by the principal units.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(), restart_functions=restart_function_map())
    def ha_changed():
        # added logic to pause the keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())

            update_nrpe_config()
            inform_peers_if_ready(check_api_unit_ready)
            # inform the subordinate unit whether this unit is paused
            # or resumed
            relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the nrpe config is enough to trigger the
necessary logic to remove the service checks.


In a situation where the principal unit is paused, hacluster should also
be paused. For this to happen, ha-relation-changed in charm-hacluster can
be used:

.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # inserted logic: pause/resume if the principal unit has been
        # paused/resumed
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()

        # share the subordinate unit's status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=is_unit_paused_set()
            )


Alternatives
------------

One alternative for the principal unit service checks is to change
systemd.py in charm-nrpe to accept a warning flag, like the one proposed
for check_haproxy.py.

This way it would not be necessary to remove the .cfg files for the
principal unit services, but it would be necessary to adapt the
`add_init_service_checks` function so that it can register services with
the warning flag.
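
A hypothetical sketch of this alternative: a maintenance flag downgrades a
failed service check from CRITICAL to WARNING instead of removing the
check. The function name and signature are illustrative only, and the
actual systemd state probing done by charm-nrpe's systemd.py is stubbed
out as a boolean parameter here.

```python
# Sketch only: in the real check, `active` would come from querying
# systemd (e.g. systemctl is-active), not from a parameter.

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def check_service(service, active, maintenance=False):
    """Return (exit_code, message) for a systemd service check.

    :param str service: systemd unit name, e.g. 'apache2'
    :param bool active: whether the service is currently active
    :param bool maintenance: whether the unit was flagged with -w
    """
    if active:
        return OK, "OK: %s is active" % service
    if maintenance:
        return WARNING, "WARNING: %s down (unit under maintenance)" % service
    return CRITICAL, "CRITICAL: %s is not active" % service
```

The upside over removing the .cfg files is that Nagios keeps showing the
service with an explicit maintenance message instead of the check
disappearing entirely.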

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gabrielcocenza

Gerrit Topic
------------

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all
patches related to this spec.

.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts

Work Items
----------

- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

- charm-nrpe (alternative)

  - systemd.py

- charm-hacluster

  - actions.py

Repositories
------------

No new git repository is required.

Documentation
-------------

It will be necessary to document the impact of pausing/resuming a
subordinate hacluster and the side effects on OpenStack API charms.

Security
--------

No additional security concerns.

Testing
-------

Code changes will be covered by unit and functional tests. Functional
tests will use a bundle with keystone, hacluster, nrpe and nagios.

Dependencies
============

None