diff --git a/specs/ocata/failure-remoting.rst b/specs/ocata/failure-remoting.rst new file mode 100644 index 0000000..85f43ce --- /dev/null +++ b/specs/ocata/failure-remoting.rst @@ -0,0 +1,498 @@ +=========================================== + Unifying and cleaning up failure remoting +=========================================== + +https://blueprints.launchpad.net/oslo.messaging/+spec/failure-remoting + +We currently have a couple ways of remoting failures (exceptions + +traceback information) that occur on remote systems back to their +source. These different ways have differences that make each solution +valid and applicable to its problem area. To encourage unification, this +spec will work on a proposal that can take the best aspects of both +implementations and leave the weaknesses of both behind to make a +best of breed implementation. + +Problem description +=================== + +There is a repeated desire to be able to serialize an exception, an +exception type, and as much information about the exceptions cause (ie +its traceback) when a creator on a remote system fails to some +other system (typically transmitted over some RPC or REST or other +non-local interface). For brevity sake let us call the tuple +of ``(exception_type, value, traceback)`` (typically created from some +call to `sys.exc_info`_) a **failure** object. When on a local machine +and the failure is created inside its own process the exception, its class +and its traceback are natively supported and can be examined, output, +logged (typically using the `traceback`_ module), handled (via ``try/catch`` +blocks) and analyzed; but when that exception is remotely created and +sent to a receiver the recreation of that failure becomes that much more +complicated for a few reasons: + +* Serialization of a traceback object (which typically contains references + to local stack frames) into some serializable format typically means that + the reconstructed traceback will not be as *rich* as it was when created + on the local process due to the fact that those local stack frames + will *not* exist in the receivers process. This implies that traceback + serialization/deserialization is a lossy process and by side-effect + this means that for remote exceptions the `traceback`_ module + can *not* be used and/or that the information it produces may + not be accurate. +* Input validation must now be performed, ensuring that the serialized format + created by the sender is actually valid (this excludes using `pickle`_ + for serialization/deserialization due to its widely known security + vulnerabilities). +* The receiver of the failure, if it desires to *try* to recreate an + exception object from the serialized version **must** have access to the + same exception type/class that was used to create the original + exception; this may not always be possible (depending on modules and classes + accessible from the receivers ``sys.path``). +* Any contained exception value (typically a ``string``, but not limited to) + will need to be reconstructed (this may not always be possible, for + example if the originating exception value references some local file + handle or other non-serializable object, such as a local threading lock). + +.. _sys.exc_info: https://docs.python.org/2/library/sys.html#sys.exc_info +.. _pickle: https://docs.python.org/2/library/pickle.html +.. _traceback: https://docs.python.org/2/library/traceback.html + +What exists +=========== + +There are a few known implementations of failure capturing, serialization +and deserialization/reconstruction. Let us dive into how each one works and +analyze the benefits and drawbacks of each approach. + +Oslo.privsep +------------ + +Source: + +* https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187 +* https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187 + +Commentary +~~~~~~~~~~ + +* Sends back class + module name across socket channel + exception arguments. +* Drops traceback (logs it on priviliged side). +* Recreates new class object with sent across arguments (and reraises) + on unpriviliged side (ideally nothing leaks across?). + +Oslo.messaging +-------------- + +Source: + +* https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L164 +* https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L204 + +A similar (same?) copy seems to be in nova (for cells?): + +* https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1878 +* https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1918 + +Docs: unknown + +Commentary +~~~~~~~~~~ + +Serializes: yes (to json); keyword arguments of exception are extracted +from optional exception attribute ``kwargs``, class name and module name +of exception are captured with final data format being:: + + data = { + 'class': cls_name, + 'module': mod_name, + 'message': six.text_type(exception), + 'tb': tb, + 'args': exception.args, + 'kwargs': kwargs + } + +Deserializes: yes; previous json data is loaded as a dictionary. + +Validates: No; `jsonschema`_ validation is not currently performed. + +Reconstructs: yes (with limitations); message of exception from +``message`` in ``data`` is loaded and concated with traceback from ``tb`` +dictionary element, module received is then verified against a provided list +and if module received is not allowed a generic exception is raised which +attempts to encapsulate the received failure. This generic +exception (which does retain the traceback) is created via:: + + oslo_messaging.RemoteError(data.get('class'), data.get('message'), + data.get('tb')) + +Otherwise if the module is one of the allowed types the exception class +object is recreated by using:: + + klass = + exception = klass(*data.get('args', []), **data.get('kwargs', {})) + +Then if this works, to ensure the ``__str__`` and ``__unicode__`` methods +correctly return the ``message`` key in the previously mentioned ``data`` +dictionary a dynamic exception type is created with a dynamically created +function that returns provided ``message``; then the ``exception`` created +above has its ``__class__`` attribute replaced to be this new dynamic +exception type (woah!):: + + exc_type = type(exception) + str_override = lambda self: message + new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,), + {'__str__': str_override, '__unicode__': str_override}) + new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX) + exception.__class__ = new_ex_type + +if this doesn't work then ``exception`` is returned +untouched and instead the ``exception.args`` list is replaced with a new +``args`` list that has the ``message`` from the ``data`` dict as its first +entry (replacing the prior ``args`` first entry with its own). + +Notes: + +* Appears to lose remote traceback info during above reconstruction + process (unless `RemoteError`_ is returned, which does not + lose the traceback, but does lose the original type + associated + information). +* Does not capture `chained`_ exception information. +* Copied (or some version of it) into nova cells (currently unknown what + version/sha the nova folks copied from). + +.. _RemoteError: http://docs.openstack.org/developer/\ + oslo.messaging/rpcclient.html#oslo_messaging.RemoteError + +TaskFlow +-------- + +Source: + +* https://github.com/openstack/taskflow/blob/1.21.0/taskflow/types/failure.py + +Docs: + +* http://docs.openstack.org/developer/taskflow/types.html#module-taskflow.types.failure + +Commentary +~~~~~~~~~~ + +Serializes: True; translates exception (or ``sys.exc_info`` call) into +a dictionary using ``to_dict`` method. Example:: + + >>> from taskflow.types import failure + >>> try: + ... raise IOError("I have broke") + ... except Exception: + ... f = failure.Failure() + ... + >>> print(json.dumps(f.to_dict(), indent=4, sort_keys=True)) + { + "causes": [], + "exc_type_names": [ + "IOError", + "EnvironmentError", + "StandardError", + "Exception" + ], + "exception_str": "I have broke", + "traceback_str": " File \"\", line 2, in \n", + "version": 1 + } + +Deserializes: True; loads from json into dictionary. + +Validates: True; uses `jsonschema`_ with schema:: + + SCHEMA = { + "$ref": "#/definitions/cause", + "definitions": { + "cause": { + "type": "object", + 'properties': { + 'version': { + "type": "integer", + "minimum": 0, + }, + 'exception_str': { + "type": "string", + }, + 'traceback_str': { + "type": "string", + }, + 'exc_type_names': { + "type": "array", + "items": { + "type": "string", + }, + "minItems": 1, + }, + 'causes': { + "type": "array", + "items": { + "$ref": "#/definitions/cause", + }, + } + }, + "required": [ + "exception_str", + 'traceback_str', + 'exc_type_names', + ], + "additionalProperties": True, + }, + }, + } + +Reconstructs: True when failure objects are raised locally (when serialization +is not used). False when serialized using ``to_dict``; Instead of going +through process like defined in ``oslo.messaging`` above this object +instead wraps originating exception(s) in a new exception `WrappedFailure`_ and +exposes its type (string version of) information and its traceback in a +new exception and provides accessors and useful methods (defined on the +failure class) to contained information for introspection purposes. + +Notes: + +* Captures (and serializes and deserializes) `chained`_ exceptions (as + nested failure objects). Seen in above schema as ``causes`` key (which + self-references the schema object). + +.. _chained: https://www.python.org/dev/peps/pep-3134/ +.. _WrappedFailure: http://docs.openstack.org/developer/\ + taskflow/exceptions.html#taskflow.exceptions.WrappedFailure +.. _jsonschema: http://json-schema.org/ + +Twisted +------- + +Source: + +* https://github.com/twisted/twisted/blob/twisted-15.4.0/twisted/python/failure.py + +Docs: + +* http://twistedmatrix.com/documents/current/api/twisted.python.failure.html + +Commentary +~~~~~~~~~~ + +Example:: + + >>> from twisted.python import failure + >>> import pickle + >>> import traceback + >>> def blow_up(): + ... raise ValueError("broken") + >>> try: + ... blow_up() + ... except ValueError: + ... f = failure.Failure() + >>> print(f) + [Failure instance: Traceback: : broken + --- --- + :2: + :2:blow_up + ] + >>> f.raiseException() + Traceback (most recent call last): + File "", line 1, in + File "", line 2, in + File "", line 2, in blow_up + ValueError: broken + >>> f_p = pickle.dumps(f) + >>> f_2 = pickle.loads(f_p) + >>> f_2.raiseException() + Traceback (most recent call last): + File "", line 1, in + File "", line 2, in raiseException + ValueError: broken + >>> print(f_2.tb) + None + >>> traceback.print_tb(f_2.getTracebackObject()) + File "", line 2, in + File "", line 2, in blow_up + +Serializes: `pickle`_ supported via ``__getstate__`` method. Since +they have created a *mostly* working replacement for the frame information +that a traceback stores it becomes possible to better integrate with +the `traceback`_ module (which accesses that frame information to try to +create useful traceback details). + +Deserializes: Yes, via `pickle`_. + +Validates: No (`pickle`_ is known to be vulnerable anyway to loading +arbitrary code). + +Reconstructs: Partially, a frame-like replica structure is created that +*mostly* works like the original (except it can't be re-raised, but it +can be passed to the `traceback`_ module to have its functions seemingly +work). + +Proposed change +=============== + +Create a new library, https://pypi.python.org/pypi/failure (or other better +named library) that encompasses the combination of the 3-4 models described +above. + +It would primarily provide a ``Failure`` object (like provided by +taskflow and twisted) as its main exposed API. That failure +class would have a ``__get_state__`` method so that it can be pickled (for +situations where this is desired) and a ``to_dict`` and ``from_dict`` that +can be used for json serialization and deserialization. It would also have +introspection APIs (similar to what is provided by twisted and taskflow) so +that the underlying exception information can be accessed in nice manner. + +Basic examples of these API(s) that would be great to have (and have +proven themselves useful):: + + @classmethod + def validate(cls, data): + """Validate input data matches expected failure format.""" + + def check(self, *exc_classes): + """Check if any of ``exc_classes`` caused the failure. + + ... + + """ + + def reraise(self): + """Re-raise captured exception.""" + + @property + def causes(self): + """Tuple of all *inner* failure *causes* of this failure. + + ... + + """ + + def pformat(self, traceback=False): + """Pretty formats the failure object into a string.""" + + @classmethod + def from_dict(cls, data): + """Converts this from a dictionary to a object.""" + + def to_dict(self): + """Converts this object to a dictionary.""" + + def copy(self): + """Copies this object.""" + +To take advantage of the re-raising capabilities in oslo.messaging this +class should also have a ``reraise`` method that can attempt to reraise the +given failure (if and only if it matches a given list of exception types). It +would **not** attempt to dynamically create a ``__str__`` and ``__repr__`` +method (the class manipulation magic happening in oslo.messaging) to avoid +the peculiarities of this chunk of code. If the contained failure does +not match a known list of failures, then ``reraise`` will return false and +it will not re-raise anything (leaving it up to the caller to decide what +to do in this situation, perhaps at this point a common `WrappedFailure`_ +like exception should be raised?). + +The validation logic using `jsonschema`_ would be taken from taskflow and +used when deserializing so that errors with *bad* data can be found +earlier (at data load time) rather than later (at data access time). + +To provide the twisted like integration with the traceback module (by +turning the internal format of a traceback into a pure python object +representation) there has been discussed if the `traceback2`_ module can +provide equivalent functionality, if it can then it should be used to +achieve similar integration (it would be even better if the integration +would also allow for re-raising this pure python trackback and frame +representation as an actual traceback, although this may not be a reasonable +expectation). + +.. _traceback2: https://pypi.python.org/pypi/traceback2/ + +Alternatives +------------ + +Keep having multiple variations, each with their own weaknesses and +benefits, instead of unifying them under a single library. + +Impact on Existing APIs +----------------------- + +Ideally none, as the users should still get the same functionality, but +if this is done correctly they will get more meaningful tracebacks, more +meaningful introspection on failure objects and overall better and more +consistent failures. + +Security impact +--------------- + +Performance Impact +------------------ + +N/A + +Configuration Impact +-------------------- + +N/A + +Developer Impact +---------------- + +This should make developers lives better. + +Testing Impact +-------------- + +Having the failure code in its own library, allows it to be easily mocked +and tested (vs say having it deeply embedded in oslo.messaging where it is +not so easily testable/reviewable...); so overall this should improve +test coverage (and overall code quality). + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: harlowja + +Milestones +---------- + +Target Milestone for completion: Mikita + +Work Items +---------- + +#. Create skeleton library. +#. Get skeleton up on gerrit and integrated into oslo pipelines. +#. Start to move around code from oslo.messaging and taskflow + and refactor to start to form this new library; use concepts and + learning from twisted and bolt-ons (and others) to help make this + library the best it can be. +#. Review and code and repeat. +#. Release and integrate. +#. Delete older dead code. +#. Profit! + +Incubation +========== + +N/A + +Documentation Impact +==================== + +Dependencies +============ + +References +========== + +N/A (all inline) + +.. note:: + + This work is licensed under a Creative Commons Attribution 3.0 + Unported License. + http://creativecommons.org/licenses/by/3.0/legalcode +