Merge "Pausing Principal units and hacluster without sending false alerts"
This commit is contained in:
commit
85afbf9397
|
@ -0,0 +1,453 @@
|
||||||
|
..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html

======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================

Overall, the goal is to raise "warning" alerts instead of "critical" ones,
helping a human operator understand that not all services are completely
healthy, while reducing the severity during an on-going maintenance
operation. NRPE checks will be reconfigured once the services under
maintenance are set back to normal (resume).

The following logic will be applied when pausing/resuming a unit:

- Pausing a principal unit pauses the subordinate hacluster;
- Resuming a principal unit resumes the subordinate hacluster;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.


Problem Description
===================

We need to stop sending false alerts when the hacluster subordinate of an
OpenStack charm unit is paused, or when the principal unit itself is paused
for maintenance. This will help operators receive more meaningful alerts.

There are several charms that use hacluster and NRPE and may benefit from
this:

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy


Pausing Principal Unit
----------------------

If, for example, three keystone units (keystone/0, keystone/1 and keystone/2)
are deployed and keystone/0 is paused:

1) haproxy_servers on the other units (keystone/1 and keystone/2) will alert,
   because the apache2 service on keystone/0 is down;

2) haproxy, apache2.service and memcached.service on keystone/0 will also
   alert;

3) it's possible that corosync and pacemaker have placed the VIP on that same
   unit, at which point the service will fail because haproxy is disabled, so
   the hacluster subordinate unit should also be paused.

Note: the services affected when pausing a principal unit may change
depending on the principal charm.

Pausing hacluster unit
----------------------

Pausing hacluster sets the cluster node (e.g. keystone) into standby mode.
A standby node will have its resources stopped (haproxy, apache2), which will
trigger false alerts. To solve this issue, the hacluster units should inform
the keystone unit that they are paused. A way of doing this is through the ha
relation.


Proposed Change
===============

Pausing Principal Unit
----------------------

The pause action on a principal unit should share the event with its peers to
modify their behavior (until the resume action is triggered). It should also
share the status (paused/resumed) with the subordinate unit so that it can
catch up to the same status.

File actions.py in the principal unit:

.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        resume_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)

After pausing a principal unit, the unit-state-{unit_name} field will change
to NOTREADY, e.g.:

.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster
    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY

Note: the unit-state-{unit_name} field is already implemented; I'm just
proposing to use this field, changing the value to NOTREADY when a unit is
paused and back to READY when resumed.

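For context, the shared peer states can be read back with helpers that
already exist in charmhelpers.contrib.openstack.utils. A minimal sketch,
assuming the UNIT_NOTREADY constant and the get_peers_unit_state helper:

.. code-block:: python

    from charmhelpers.contrib.openstack.utils import (
        UNIT_NOTREADY,
        get_peers_unit_state,
    )


    def paused_peers():
        """Return peer unit names whose shared state is NOTREADY."""
        return [
            unit for unit, state in get_peers_unit_state().items()
            if state == UNIT_NOTREADY
        ]
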
With every unit knowing which one is paused, it is possible to change the
check_haproxy.sh script to accept a flag that reports the paused keystone
units as warnings. The Bash script currently cannot receive flags.

check_haproxy.sh could be rewritten from Bash to Python to accept a flag
warning that a specific hostname (e.g. check_haproxy.py --warning keystone-0)
is under maintenance.

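A minimal sketch of what such a rewrite could look like is shown below. The
CSV field offsets follow the standard haproxy stats interface; the socket
path, flag handling and output messages are assumptions:

.. code-block:: python

    #!/usr/bin/env python3
    # Hypothetical sketch of check_haproxy.py; the Nagios exit codes are
    # conventional, the socket path and flag handling are assumptions.
    import argparse
    import socket


    def get_stats(socket_path='/var/run/haproxy/admin.sock'):
        """Return the raw CSV output of 'show stat' from the stats socket."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(socket_path)
        sock.sendall(b'show stat\n')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b''.join(chunks).decode()


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--warning', default='',
            help='comma-separated hostnames under maintenance')
        args = parser.parse_args()
        maintenance = set(filter(None, args.warning.split(',')))

        down = []
        for line in get_stats().splitlines():
            fields = line.split(',')
            # svname is field 1 and status is field 17 in the stats CSV
            if len(fields) > 17 and fields[17] == 'DOWN':
                down.append(fields[1])

        critical = [s for s in down if s not in maintenance]
        warnings = [s for s in down if s in maintenance]
        if critical:
            print('CRITICAL: servers down: {}'.format(', '.join(critical)))
            raise SystemExit(2)
        if warnings:
            print('WARNING: under maintenance: {}'.format(
                ', '.join(warnings)))
            raise SystemExit(1)
        print('OK: all haproxy servers up')
        raise SystemExit(0)


    if __name__ == '__main__':
        main()
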
The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
first check whether there is any paused unit in the cluster, and then add the
warning flag if necessary:

.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """
        Add checks for each service in list

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        cmd = "check_haproxy.py"

        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]

        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))

        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))

        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes its state, e.g. from READY to NOTREADY, it's
necessary to rewrite the NRPE files on the other principal units in the
cluster because, otherwise, they won't be able to warn that a unit is under
maintenance.

File responsible for hooks in the classic charms:

.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # Logic added to update the NRPE config on all principal units
        # when a status is changed
        update_nrpe_config()

Note: in reactive charms this might be slightly different, using handlers,
but the main idea is to run update_nrpe_config every time a config in the
cluster is changed. This will prevent false alerts on the other units in the
cluster.

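A rough reactive-style equivalent, assuming a charms.reactive handler (the
flag name is illustrative and depends on the interface layer in use):

.. code-block:: python

    # Hypothetical reactive handler; the 'cluster.changed' flag name is
    # illustrative and depends on the interface layer in use.
    from charms.reactive import clear_flag, when


    @when('cluster.changed')
    def on_cluster_changed():
        # Re-render the NRPE checks whenever cluster data changes, so every
        # unit learns which peers are paused.
        update_nrpe_config()
        clear_flag('cluster.changed')

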
Services from Principal Unit
----------------------------

Removing the .cfg files for those services from /etc/nagios/nrpe.d when the
unit is paused would stop the critical errors. The downside of this approach
is that Nagios won't show user-friendly messages saying that the specific
services (apache2, memcached, etc.) are under maintenance; on the other hand,
it's simpler to achieve.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...

        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()

        # Added logic to remove the service checks while the unit is paused
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        # end of added logic

        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()

The new logic to remove those service checks is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

.. code-block:: python

    # Added logic to remove the checks for apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )

The status of these services will disappear from Nagios after a few minutes.
When the resume action is used, the services are initially restored as
PENDING, but after a few minutes the checks run again.

Pausing hacluster unit
----------------------

File actions.py in charm-hacluster:

.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # Logic added to inform keystone that the unit has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start.
        """
        resume_unit()
        # Logic added to inform keystone that the unit has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)


Pausing a hacluster unit would then share a new variable, paused, that can be
used in the principal units.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(), restart_functions=restart_function_map())
    def ha_changed():
        # Added logic to pause the keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())

            update_nrpe_config()
            inform_peers_if_ready(check_api_unit_ready)
            # Inform the subordinate unit that this unit is paused or resumed
            relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the NRPE config will be enough to trigger
the necessary logic to remove the service checks.

In a situation where the principal unit is paused, hacluster should also be
paused. For this to happen, it can use ha-relation-changed from
charm-hacluster:

.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # Inserted logic:
        # pause if the principal unit is paused
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()

        # Share the subordinate unit status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=is_unit_paused_set()
            )

Alternatives
------------

One alternative for the principal unit service checks is to change systemd.py
in charm-nrpe to accept a -w flag, like the proposal for check_haproxy.py.

This way it would not be necessary to remove the .cfg files for the principal
unit services, but it would be necessary to adapt the
`add_init_service_checks` function to accept services with the warning flag.

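A sketch of how `add_init_service_checks` could be adapted under this
alternative; the warn parameter and the -w flag on check_systemd.py are part
of the proposal, not existing options:

.. code-block:: python

    # Hypothetical adaptation; the warn parameter and the -w flag are
    # assumptions of this alternative, not existing options.
    def add_init_service_checks(nrpe, services, unit_name, warn=False):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                check_cmd = 'check_systemd.py %s' % svc
                if warn:
                    # Report WARNING instead of CRITICAL while the unit
                    # is under maintenance
                    check_cmd += ' -w'
                nrpe.add_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd=check_cmd
                )
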
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gabrielcocenza

Gerrit Topic
------------

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches
related to this spec.

.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts

Work Items
----------

- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

- charm-nrpe (alternative)

  - systemd.py

- charm-hacluster

  - actions.py

Repositories
------------

No new git repository is required.

Documentation
-------------

It will be necessary to document the impact of pausing/resuming a
subordinate hacluster and the side effects on OpenStack API charms.

Security
--------

No additional security concerns.

Testing
-------

Code changes will be covered by unit and functional tests. The functional
tests will use a bundle with keystone, hacluster, nrpe and nagios.

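A minimal sketch of such a functional test, assuming Zaza is used (the
subordinate unit name and the expected workload statuses are illustrative):

.. code-block:: python

    # Hypothetical Zaza-based functional test; unit names and expected
    # workload statuses are illustrative.
    import unittest

    import zaza.model


    class TestPauseResumeNoFalseAlerts(unittest.TestCase):

        def test_pause_keystone_pauses_hacluster(self):
            # Pausing the principal should also pause the subordinate
            zaza.model.run_action('keystone/0', 'pause')
            zaza.model.block_until_unit_wl_status(
                'keystone/0', 'maintenance')
            zaza.model.block_until_unit_wl_status(
                'keystone-hacluster/0', 'maintenance')

            # Resuming should bring both units back to active
            zaza.model.run_action('keystone/0', 'resume')
            zaza.model.block_until_unit_wl_status('keystone/0', 'active')
            zaza.model.block_until_unit_wl_status(
                'keystone-hacluster/0', 'active')
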
Dependencies
============

None