Merge "Mechanism to Prevent Race Conditions Spec"

2016-12-15 21:06:30 +00:00 · 2016-12-15 21:06:30 +00:00 · 3c764415e0
parent 6fd17c1ceb 1b377ede1c
commit 3c764415e0
2 changed files with 291 additions and 0 deletions
--- a/doc/source/conf.py
+++ b/doc/source/conf.py
@ -23,6 +23,7 @@ sys.path.insert(0, os.path.abspath('../..'))
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
 extensions = [
    'sphinx.ext.autodoc',
+    'sphinx.ext.graphviz',
    'oslosphinx',
    'yasfb',
 ]
--- a/specs/ocata/mechanism-to-prevent-race-conditions.rst
+++ b/specs/ocata/mechanism-to-prevent-race-conditions.rst
@ -0,0 +1,290 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+====================================
+Mechanism to Prevent Race Conditions
+====================================
+
+https://blueprints.launchpad.net/manila/+spec/eliminate-race-conditions
+
+This proposal is to develop a general solution for preventing race
+conditions which will work across services and in deployments where
+there are multiple copies of the same services (commonly known as
+Active/Active HA deployments).
+
+The focus is on keeping all state in the database, and protecting changes to
+database state using briefly-held locks. Also concurrent operations which
+are mutually exclusive should fail as early as possible with a helpful error
+code to simplify the retry logic of upper layers.
+
+Problem description
+===================
+
+Certain operations in Manila should not be allowed to proceed in parallel,
+because the result of one operation would prevent the other operation from
+completing successfully.
+
+For example, taking a snapshot of a share cannot happen simultaneously while
+deleting that share. Either the snapshot must occur first, which prevents the
+delete -- or the delete must occur first, which prevents the snapshot.
+Unfortunately, not enough state is stored in the database to prevent these
+operations from racing with each other, so in practice two API calls can both
+proceed through the API service to the share manager, where eventually an
+error will occur and one or both operations will fail mysteriously.
+
+There are multiple scenarios like the above where undefined behavior results.
+This specification does not attempt to enumerate all of them because the goal
+is to describe a mechanism for fixing these kinds of issues rather than
+explicitly fixing all such issues. Generally speaking, race conditions should
+be treated as bugs, but up until now Manila has lacked the tools to fix these
+bugs reliably.
+
+Use cases
+=========
+
+Specific cases:
+
+* Two snapshot operations should not be able to occur at the same time. One
+  must complete before the second can begin. This ensures that snapshots occur
+  in a known order, and prevents the useless situation of having 2 identical
+  snapshots.
+
+* Taking a snapshot of a share should prevent a delete of that share.
+
+* Valid changes to access rules should always be accepted regardless of the
+  state of the existing access rules. Although rules are applied to the backend
+  asynchronously, it's valid to add multiple rules faster than the system can
+  apply them and expect Manila to catch up.
+
+General cases:
+
+* These guarantees must be enforceable using a database running in a clustered
+  configuration such as Galera. This prevents obvious solutions such as relying
+  on DB row-level locking.
+
+* These guarantees must be enforceable while running multiple copies of Manila
+  services, including multiple API, scheduler, and share manager services.
+  This prevents obvious solutions like in-process locks.
+
+* These guarantees must be enforceable while running in a distributed
+  configuration where cooperating services are on different nodes (physical,
+  VMs, or containers). This prevents our existing approach of using file
+  locks. Even though network-based file locking solutions exist, they represent
+  a single point of failure and are unacceptable in properly distributed
+  environments.
+
+* The Manila services should be able to automatically and gracefully recover
+  from crashes and other unplanned downtime. This means that implicit state
+  should be avoided and because long-held locks are implicit state they should
+  be avoided in favor of short-held locks with explicit state.
+
+Proposed change
+===============
+
+More transitional states will be added so that operations which can conflict
+with other operations can be explicitly detected by looking at the share
+state. This spec only proposes one specific new state to address the races
+involving snapshots, but more generally provides a framework for resolving
+similar races as they are discovered.
+
+Transitions between states will always be done while holding a distributed
+lock -- a lock implemented by a distributed lock manager (DLM). Use of
+distributed locks ensures that all services see the same locking state even if
+services run on different nodes, and even across transient failures such as
+node failures and network partitions. The lock will be held only for the
+duration of the database test-and-set operation to minimize lock contention.
+
+No locks will be held during calls from the share manager to the share driver.
+Mutual exclusion between driver calls will be achieved with state checks.
+
+No locks will be held during RPC calls or casts.
+
+Alternatives
+------------
+
+The approach used by Cinder which relies on elaborate SQL calls to
+compare-and-swap fields was considered but rejected for the following reasons:
+
+* The code in Cinder can't be shared with Manila because it relies on OVO (Oslo
+  Versioned Objects)
+
+* Not enough people understand how it works so it's likely to be hard to
+  maintain.
+
+* Cinder's compare-and-swap approach limits the kind of state changes you can
+  make because updating multiple tables atomically is impossible. Locks
+  don't suffer from this restriction.
+
+Data model impact
+-----------------
+
+New states will be added:
+
+* Snapshotting
+
+* States for access rules covered in `Access rules spec`_
+
+
+.. graphviz::
+
+    digraph share_states {
+        label="Share States"
+        // Transitional States
+        creating[shape=hexagon];
+        manage_starting[shape=hexagon];
+        deleting[shape=hexagon];
+        snapshotting[shape=hexagon,color=gold4, fontcolor=gold4];
+        migrating[shape=hexagon];
+        shrinking[shape=hexagon];
+        extending[shape=hexagon];
+        unmanage_starting[shape=hexagon];
+        replication_change[shape=hexagon];
+        // Error states
+        error[color=red4, fontcolor=red4];
+        shrinking_error[color=red4, fontcolor=red4];
+        shrinking_possible_data_loss_error[color=red4, fontcolor=red4];
+        extending_error[color=red4, fontcolor=red4];
+        unmanage_error[color=red4, fontcolor=red4];
+        manage_error[color=red4, fontcolor=red4];
+        error_deleting[color=red4, fontcolor=red4];
+        // Other states
+        new[color=blue, fontcolor=blue];
+        available[color=darkgreen, fontcolor=darkgreen];
+        deleted[shape=box, color=navy, fontcolor=navy];
+        unmanaged[shape=box, color=navy, fontcolor=navy];
+        // User requested transitions
+        new -> creating[label="create"];
+        new -> manage_starting[label="manage"];
+        available -> deleting[label="delete"];
+        available -> snapshotting[label="create snapshot", color=gold4, fontcolor=gold4];
+        available -> migrating[label="migrate"];
+        available -> shrinking[label="shrink"];
+        available -> extending[label="extend"];
+        available -> unmanage_starting[label="unmanage"];
+        available -> replication_change[label="add replica"];
+        // Automatic transitions
+        creating -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        deleting -> deleted[label="success", color=darkgreen, fontcolor=darkgreen];
+        snapshotting -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        manage_starting -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        unmanage_starting -> unmanaged[label="success", color=darkgreen, fontcolor=darkgreen];
+        extending -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        shrinking -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        replication_change -> available[label="success", color=darkgreen, fontcolor=darkgreen];
+        // Reset transitions
+        error -> available[label="reset"];
+        shrinking_error -> available[label="reset"];
+        extending_error -> available[label="reset"];
+        unmanage_error -> available[label="reset"];
+        manage_error -> available[label="reset"];
+        error_deleting -> available[label="reset"];
+        // Error transitions
+        creating -> error[label="fail", color=red4, fontcolor=red4];
+        migrating -> error[label="fail", color=red4, fontcolor=red4];
+        shrinking -> shrinking_error[label="fail", color=red4, fontcolor=red4];
+        shrinking -> shrinking_possible_data_loss_error[label="fail", color=red4, fontcolor=red4];
+        extending -> extending_error[label="fail", color=red4, fontcolor=red4];
+        unmanage_starting -> unmanage_error[label="fail", color=red4, fontcolor=red4];
+        manage_starting -> manage_error[label="fail", color=red4, fontcolor=red4];
+        snapshotting -> error[label="fail", color=red4, fontcolor=red4];
+        deleting -> error_deleting[label="fail", color=red4, fontcolor=red4];
+    }
+
+REST API impact
+---------------
+
+New states will be visible through any API that shows states. Also new
+error conditions will become possible as we detect races earlier and
+report them directly.
+
+The behavioral changes related to locking will not be microversioned, as it
+won't be possible or desirable to emulate the old behavior once the changes
+are implemented. However in cases where new states are added, those changes
+will be microversioned so that clients which depend on the new states can
+detect that the server supports them.
+
+Driver impact
+-------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Distributed locking is expected to moderately slow down state changes. Also
+adding more state changes will slow down operations that require them.
+
+Other deployer impact
+---------------------
+
+Requirement to deploy and configure suitable `Tooz`_ backend. Since Manila will
+depend on tooz for correctness, tooz backends that fail to meet the API
+contract won't be suitable.
+
+Developer impact
+----------------
+
+This will be significant. Developers will need to follow the new model for
+all features that involve state changes. Also care will be needed with locks
+to avoid deadlock situations. Holding locks for very limited time will help
+avoid deadlocks but in case 2 locks are ever held at the same time, they need
+to be deadlock-proofed by establishing a lock order.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+  bswartz
+
+Work Items
+----------
+
+* Add snapshotting state
+
+* Complete tooz integration
+
+* Wrap state changes with tooz locks
+
+Dependencies
+============
+
+Tooz
+
+Testing
+=======
+
+Existing tests will help ensure no regressions, but to detect race
+conditions we need rally tests or similarly high-concurrency tests.
+
+Documentation Impact
+====================
+
+Admin guide - need to document tooz requirements.
+
+Developer reference - need to document state machines and locking protocol
+
+References
+==========
+
+_`Access rules spec`: https://review.openstack.org/#/c/399049/
+
+_`Tooz` integration: https://review.openstack.org/#/c/318336/