Merge "Pausing Principal units and hacluster without sending false alerts"

This commit is contained in:
Zuul 2021-12-16 16:35:11 +00:00 committed by Gerrit Code Review
commit 85afbf9397
1 changed file with 453 additions and 0 deletions

..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html
======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================
Overall, the goal is to raise “warning” alerts instead of “critical” ones.
This helps a human operator understand that not all services are completely
healthy, while reducing the severity during an ongoing maintenance operation.
NRPE checks will be reconfigured once the services under maintenance are set
back to normal (resume).
The following logic will be applied when pausing/resuming a unit:
- Pausing a principal unit pauses the subordinate hacluster;
- Resuming a principal unit resumes the subordinate hacluster;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.
Problem Description
===================
We need to stop sending false alerts when the hacluster subordinate of an
OpenStack charm unit is paused, or when the principal unit itself is paused
for maintenance. This may help operators receive more meaningful alerts.
There are several charms that use hacluster and NRPE and may benefit from
this:
- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy
Pausing Principal Unit
----------------------
If, for example, three keystone units (keystone/0, keystone/1 and keystone/2)
are deployed and keystone/0 is paused:

1) the haproxy_servers check on the other units (keystone/1 and keystone/2)
   will alert, because the apache2 service on keystone/0 is down;
2) the haproxy, apache2.service and memcached.service checks on keystone/0
   will also alert;
3) it is possible that corosync and pacemaker have the VIP placed on that same
   unit, at which point the service will fail because haproxy is disabled, so
   the hacluster subordinate unit should also be paused.

Note: the services affected when pausing a principal unit may vary depending
on the principal charm.
Pausing hacluster unit
----------------------
Pausing hacluster puts the cluster node, e.g. keystone, into standby mode.
A standby node has its resources stopped (hacluster, apache2), which will
fire false alerts. To solve this issue, the hacluster units should inform
the keystone unit that they are paused. A way of doing this is through the ha
relation.
Proposed Change
===============
Pausing Principal Unit
----------------------
The pause action on a principal unit should share the event with its peers so
that they can modify their behaviour (until the resume action is triggered).
It should also share the status (paused/resumed) with the subordinate unit so
that it can catch up to the same status.
File actions.py in the principal charm:
.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())
        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()
        # logic added to inform hacluster subordinate unit has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        resume_unit_helper(register_configs())
        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()
        # logic added to inform hacluster subordinate unit has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)
After pausing a principal unit, the unit changes unit-state-{unit_name}
to NOTREADY, e.g.:
.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster

    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY
Note: the unit-state-{unit_name} field is already implemented; the proposal is
to reuse this field, changing the value to NOTREADY when a unit is paused and
back to READY when it is resumed.
With every unit knowing which peers are paused, it is possible to change the
check_haproxy.sh script to warn about keystone units that are paused. The Bash
script currently cannot receive flags, so check_haproxy.sh could be rewritten
from Bash to Python to accept a flag marking a specific hostname as under
maintenance (e.g. check_haproxy.py --warning keystone-0).
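A minimal sketch of such a rewrite is shown below. The stats URL, the CSV
column positions and the assumption that unit hostnames appear as backend
server names are illustrative only; a real implementation would need to match
the charm's actual HAProxy configuration and credentials.

.. code-block:: python

    #!/usr/bin/env python3
    # Sketch of check_haproxy.py: downgrade alerts for servers that belong
    # to units under maintenance.  STATS_URL and the CSV column positions
    # are assumptions for illustration only.
    import argparse
    import sys
    import urllib.request

    STATS_URL = "http://localhost:8888/;csv"  # assumed local stats endpoint


    def fetch_stats(url):
        """Return the HAProxy stats CSV rows as lists of fields."""
        with urllib.request.urlopen(url) as resp:
            lines = resp.read().decode().splitlines()
        return [line.split(',')
                for line in lines if line and not line.startswith('#')]


    def main():
        parser = argparse.ArgumentParser(description="Check HAProxy backends")
        parser.add_argument("--warning", default="",
                            help="comma-separated hostnames under maintenance")
        args = parser.parse_args()
        maintenance = set(filter(None, args.warning.split(',')))

        crit, warn = [], []
        for row in fetch_stats(STATS_URL):
            svname, status = row[1], row[17]  # svname and status columns
            if svname in ("FRONTEND", "BACKEND") or status.startswith("UP"):
                continue
            # A DOWN server belonging to a paused unit is only a warning.
            (warn if svname in maintenance else crit).append(svname)

        if crit:
            print("CRITICAL: servers down: {}".format(','.join(crit)))
            sys.exit(2)
        if warn:
            print("WARNING: servers under maintenance: {}".format(','.join(warn)))
            sys.exit(1)
        print("OK: all HAProxy servers up")
        sys.exit(0)


    if __name__ == "__main__":
        main()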
The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
first check whether any unit in the cluster is paused and then add the warning
flag if necessary:
.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """
        Add checks for each service in list

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        cmd = "check_haproxy.py"
        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]
        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))
        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))
        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')
When a principal unit changes its state, e.g. from READY to NOTREADY, it is
necessary to rewrite the NRPE files on the other principal units in the
cluster; otherwise, they will not be able to warn that a unit is under
maintenance.
File responsible for hooks in the classic charms:
.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # logic added to update nrpe_config in all principal units when
        # a status is changed
        update_nrpe_config()
Note: in reactive charms this might be slightly different because handlers are
used, but the main idea is to call update_nrpe_config every time the cluster
configuration changes. This will prevent false alerts on the other units in
the cluster.
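For illustration only, a handler in a reactive charm could look roughly like
the sketch below; the 'cluster.changed' flag name and the update_nrpe_config
helper are placeholders, since the real flag comes from the charm's peer
interface layer and the NRPE handling differs per charm.

.. code-block:: python

    # Hedged sketch for a reactive charm; the flag name and the
    # update_nrpe_config helper are placeholders.
    import charms.reactive as reactive


    def update_nrpe_config():
        """Placeholder: re-render this charm's NRPE checks (charm specific)."""


    @reactive.when('cluster.changed')
    def refresh_nrpe_on_cluster_change():
        # Re-render NRPE checks whenever peer state changes so a newly
        # paused peer is reported as a warning instead of a critical.
        update_nrpe_config()
        reactive.clear_flag('cluster.changed')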
Services from Principal Unit
------------------------------
Removing the .cfg files for those services from /etc/nagios/nrpe.d when the
unit is paused would stop the critical alerts. The downside of this approach
is that Nagios will not show user-friendly messages saying that the specific
services (apache2, memcached, etc.) are under maintenance; on the other hand,
it is simpler to achieve.
File responsible for hooks in a classic charm:
.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...
        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()
        # added logic to remove services
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        # end of added logic
        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()
The new logic to remove those services is presented below.
File charmhelpers/contrib/charmsupport/nrpe.py:
.. code-block:: python

    # added logic to remove apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )
The status of the services will disappear from Nagios after a few minutes.
When the resume action is used, the service checks are restored, initially as
PENDING, and after a few minutes the checks run again.
Pausing hacluster unit
----------------------
File actions.py in charm-hacluster:
.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # logic added to inform keystone that unit has been paused
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start."""
        resume_unit()
        # logic added to inform keystone that unit has been resumed
        relid = relation_ids('ha')
        for r_id in relid:
            relation_set(relation_id=r_id, paused=False)
Pausing a hacluster unit results in sharing a new variable, paused, that can
be used by the principal units.
File responsible for hooks in a classic charm:
.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(), restart_functions=restart_function_map())
    def ha_changed():
        # Added logic to pause keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())
                    update_nrpe_config()
                    inform_peers_if_ready(check_api_unit_ready)
            # inform the subordinate unit whether this unit is paused or resumed
            relation_set(relation_id=rid, paused=is_unit_paused_set())
Informing peers and updating the NRPE config is enough to trigger the
necessary logic to remove the service checks.
When the principal unit is paused, hacluster should also be paused. For this
to happen, it can use the ha-relation-changed hook in charm-hacluster:
.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # Inserted logic
        # pauses if the principal unit is paused
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()
        # share the subordinate unit status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=is_unit_paused_set()
            )
Alternatives
------------
One alternative for the principal unit service checks is to change systemd.py
in charm-nrpe to accept a -w flag, like the proposal for check_haproxy.py.
This way it would not be necessary to remove the .cfg files for the principal
unit's services, but it would be necessary to adapt the
`add_init_service_checks` function so it can accept services with the warning
flag.
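A simplified sketch of how `add_init_service_checks` could be adapted is shown
below. The `-w` flag on check_systemd.py does not exist today and is part of
this alternative, the `paused` parameter is a hypothetical addition, and only
the systemd branch is shown (the real charmhelpers function also handles
non-systemd init systems).

.. code-block:: python

    # Simplified sketch: the '-w' flag on check_systemd.py and the
    # 'paused' parameter are hypothetical parts of this alternative.
    from charmhelpers.core import host


    def add_init_service_checks(nrpe, services, unit_name, paused=False):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                cmd = 'check_systemd.py %s' % svc
                if paused:
                    cmd += ' -w'  # report WARNING instead of CRITICAL
                nrpe.add_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd=cmd
                )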
Implementation
==============
Assignee(s)
-----------
Primary assignee:
  gabrielcocenza
Gerrit Topic
------------
Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches
related to this spec.
.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts
Work Items
----------
- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy
- charm-nrpe (Alternative)

  - systemd.py

- charm-hacluster

  - actions.py
Repositories
------------
No new git repository is required.
Documentation
-------------
It will be necessary to document the impact of pausing/resuming a subordinate
hacluster unit and the side effects on OpenStack API charms.
Security
--------
No additional security concerns.
Testing
-------
Code changes will be covered by unit and functional tests. The functional
tests will use a bundle with keystone, hacluster, nrpe and nagios.
Dependencies
============
None