Merge "NVMe monitoring and healing agent for NVMe connector."

Zuul 2020-12-23 13:46:20 +00:00 committed by Gerrit Code Review
commit bb8e6bc242
1 changed file with 211 additions and 0 deletions


..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
============================
NVMe Connector Healing Agent
============================
https://blueprints.launchpad.net/cinder/+spec/nvmeof-client-raid-healing-agent

A daemon that monitors the NVMe connections and MDRAID arrays created by the
NVMe connector, identifies faulted volume replicas, requests new replicas,
and replaces the faulted replicas with the new ones.
Problem description
===================
When the NVMe connector connects a replicated volume, OpenStack sees it as a
single volume and has no way of monitoring, managing, or healing the replicas
in the underlying MDRAID array. This agent will take care of that.
It will monitor the state of the MDRAID arrays and reconcile their physical
state on the host with the expected state from the volume provisioner,
replacing broken legs.
With backend volume replicas, it is the storage array that takes care of
monitoring and replacing unhealthy replicas; NVMe MDRAID moves that data
replication responsibility from the backend to the consumer.
Currently there is no mechanism to monitor and heal these replicated volumes.
We cannot do it on the Cinder side: even if the Cinder driver detected the
issue and created a replacement volume, there is no mechanism to report the
connection information of the replacement volume to the consumer.
So the monitoring and healing need to happen on the volume consumer side.
This agent will also be of great benefit when certain replicas of an attached
replicated volume go faulty: by notifying the volume provisioner of the
faulty devices, it allows them to be marked as faulty, both to avoid using
stale data on re-attachment and to have them replaced entirely.
Use Cases
=========
When replicated NVMe volumes stay attached to an instance for a long time,
one of the replicas may go faulty.
This agent will detect the failure and attempt to replace the replica,
self-healing the MDRAID without the need to detach and re-attach the volume.
Proposed change
===============
Add an "NVMe agent" class that will be initialized by the NVMe connector
during volume connection on a host.
Initializing this agent will spawn a monitoring task that repeats
periodically. We propose this to be a native thread if possible,
but if necessary it can be an independent process.
The first proposal was to use Python's event scheduler (``sched.scheduler``),
but other alternatives, such as spawning a separate process communicated with
over a socket, may be chosen instead.
One key problem this choice must address is the scenario where the compute
service goes down while the VMs continue operating (and their volumes remain
attached); we do not want to lose the agent in that case.
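
A minimal sketch of the thread-based option follows; the names used here
(``NVMeAgent``, ``_monitor_once``, the 30 second default interval) are
illustrative placeholders, not a final interface:

.. code-block:: python

    import threading


    class NVMeAgent(object):
        """Illustrative periodic monitor spawned by the NVMe connector."""

        def __init__(self, interval=30):
            self._interval = interval
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._run, daemon=True)

        def start(self):
            self._thread.start()

        def stop(self):
            self._stop.set()

        def _run(self):
            # wait() doubles as the sleep between iterations and returns
            # True as soon as stop() is called, which ends the loop.
            while not self._stop.wait(self._interval):
                self._monitor_once()

        def _monitor_once(self):
            # Check NVMe devices and MDRAID arrays and heal as needed
            # (see the self-healing flow below).
            pass

Note that a daemon thread dies with its parent process, which is exactly the
concern above; an independent process would survive a compute service outage.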
When initialized, the agent will read access information for the volume
provisioner from a pre-determined config file location, in a vendor-specific
format, the contents of which should be provided there by the systems
operator.
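
For illustration only, since the real format and location are vendor
specific, reading such a file might look as follows; the path and keys are
hypothetical:

.. code-block:: python

    import configparser

    # Hypothetical location and keys; each vendor defines its own format.
    CONF_PATH = '/etc/os-brick/nvme_agent.conf'

    parser = configparser.ConfigParser()
    parser.read(CONF_PATH)

    provisioner_url = parser.get('provisioner', 'url')
    provisioner_token = parser.get('provisioner', 'token')
    poll_interval = parser.getint('agent', 'interval', fallback=30)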
The task will monitor the NVMe devices and the MDRAID arrays built on top of
them. It will know which NVMe devices and MDRAID arrays to monitor based on
metadata from the volume provisioner (the backend), to which it will have a
custom interface.
It will notify the volume provisioner of failed devices when necessary.
It will attempt to connect to new NVMe devices / replicas and swap them
into the MDRAID.
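
How the task detects a faulty leg is up to the implementation; one
possibility (a sketch, not a committed design) is scanning ``/proc/mdstat``,
where the kernel flags failed members with ``(F)``:

.. code-block:: python

    import re


    def find_faulty_members(mdstat_path='/proc/mdstat'):
        """Return {array: [failed member devices]} parsed from mdstat."""
        faulty = {}
        with open(mdstat_path) as f:
            for line in f:
                # Array lines look like:
                #   md0 : active raid1 nvme1n1[1](F) nvme0n1[0]
                match = re.match(r'^(md\d+) : ', line)
                if not match:
                    continue
                failed = re.findall(r'(\S+)\[\d+\]\(F\)', line)
                if failed:
                    faulty[match.group(1)] = failed
        return faulty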
A typical self-healing flow (sketched in code after the list):
1. a volume replica goes faulty
2. the agent notices the faulty replica and reports it to the provisioner
3. the provisioner marks the replica as bad (so it won't be used later
   unless resynced)
4. the agent keeps pulling volume information from the provisioner
5. a grace period passes with the agent seeing no state change of the faulty
   replica from the provisioner, so it sends an explicit request to replace
   the replica
6. the provisioner replaces the replica and updates the volume information
7. the agent pulls the volume replica information and notices a replica has
   changed
8. the agent carries out the replica replacement
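
Sketched in code, the flow could look like this; every name below
(``provisioner`` and its methods, ``connect_nvme``, ``replace_in_raid``, the
grace period value) is a placeholder for the vendor interface described under
Developer impact:

.. code-block:: python

    import time

    GRACE_PERIOD = 60  # seconds; illustrative value


    def heal(provisioner, volume_id, array, faulty_dev):
        # Steps 2-3: report the faulty replica; the provisioner marks it bad.
        provisioner.report_faulty(volume_id, faulty_dev)

        # Steps 4-5: keep pulling volume information; if the grace period
        # passes with no change, explicitly request a replacement.
        deadline = time.time() + GRACE_PERIOD
        info = provisioner.get_volume_info(volume_id)
        while time.time() < deadline and not info.replica_changed(faulty_dev):
            time.sleep(5)
            info = provisioner.get_volume_info(volume_id)
        if not info.replica_changed(faulty_dev):
            # Step 6 happens on the provisioner side after this request.
            provisioner.request_replacement(volume_id, faulty_dev)
            info = provisioner.get_volume_info(volume_id)

        # Steps 7-8: connect the new replica and swap it into the MDRAID.
        new_dev = connect_nvme(info.replacement_target(faulty_dev))
        replace_in_raid(array, faulty_dev, new_dev)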
Alternatives
------------
An operator could use a script of their own to monitor connections and fix
them manually.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
The agent will call NVMe connector methods that execute ``nvme`` and
``mdadm`` under sudo.
This will happen in the new agent task spawned from os-brick.
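
For context, the kind of privileged calls involved (shown here with plain
``subprocess`` and ``sudo``; in practice os-brick routes such commands
through its root helper) would be along these lines:

.. code-block:: python

    import subprocess


    def connect_replica(transport, addr, port, nqn):
        # Attach the replacement replica over NVMe-oF.
        subprocess.run(['sudo', 'nvme', 'connect', '-t', transport,
                        '-a', addr, '-s', str(port), '-n', nqn], check=True)


    def replace_raid_member(array, old_dev, new_dev):
        # Fail and remove the broken leg, then add the replacement;
        # mdadm resyncs onto the new member automatically.
        for args in (['--fail', old_dev], ['--remove', old_dev],
                     ['--add', new_dev]):
            subprocess.run(['sudo', 'mdadm', '--manage', array] + args,
                           check=True)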
Active/Active HA impact
-----------------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
To allow multiple vendor implementations, the specific methods / logic for:
- probing the volume provisioner
- pulling / parsing volume metadata from provisioner
- reporting volume state changes to provisioner
- requesting provisioner to replace replica
will need to be implemented on a per-vendor basis.
The architecture is such that the agent will be a generic class providing
the interface, with the Kioxia implementation as the first example of a
vendor-specific implementation.
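
A sketch of that generic interface follows; the method names are illustrative
and would be settled during review:

.. code-block:: python

    import abc


    class VolumeProvisionerClient(metaclass=abc.ABCMeta):
        """Per-vendor client used by the generic healing agent."""

        @abc.abstractmethod
        def probe(self):
            """Check that the volume provisioner is reachable."""

        @abc.abstractmethod
        def get_volume_info(self, volume_id):
            """Pull and parse replica metadata for a volume."""

        @abc.abstractmethod
        def report_faulty(self, volume_id, replica):
            """Report a replica state change to the provisioner."""

        @abc.abstractmethod
        def request_replacement(self, volume_id, replica):
            """Ask the provisioner to replace a faulty replica."""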
Implementation
==============
Assignee(s)
-----------
Zohar Mamedov
zoharm
Work Items
----------
- The NVMe connector will launch the monitoring task on ``connect_volume``
  if it is not already running.
- The task monitors NVMe devices and MDRAID arrays created by the connector.
- When a replica goes faulty (as well as on other events, such as
  disconnects), call the interface method for notifying the volume
  provisioner.
- When replicated volume devices are changed by the volume provisioner,
  reconcile the physical state of the NVMe devices and MDRAID arrays on the
  host.
Dependencies
============
None
Testing
=======
We should be able to accept this with just unit tests.
Documentation Impact
====================
Document that using the NVMe connector with replicated volumes will
optionally launch this agent.
References
==========
Architectural diagram
https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png