===========================================
 Unifying and cleaning up failure remoting
===========================================

https://blueprints.launchpad.net/oslo.messaging/+spec/failure-remoting

We currently have several ways of remoting failures (exceptions +
traceback information) that occur on remote systems back to their
source. Each has differences that make it valid and applicable to its
own problem area. To encourage unification, this spec proposes taking the
best aspects of the existing implementations and leaving their weaknesses
behind to make a best-of-breed implementation.

Problem description
===================

There is a recurring need to serialize an exception, its type, and as much
information about the exception's cause (i.e. its traceback) as possible
when a failure occurs on a remote system and must be reported to some
other system (typically transmitted over RPC, REST or another
non-local interface). For brevity's sake let us call the tuple
of ``(exception_type, value, traceback)`` (typically created from some
call to `sys.exc_info`_) a **failure** object. When the failure is created
inside the local process, the exception, its class and its traceback are
natively supported and can be examined, output, logged (typically using
the `traceback`_ module), handled (via ``try/except`` blocks) and analyzed.
When that exception is created remotely and sent to a receiver, recreating
the failure becomes much more complicated for a few reasons:

* Serialization of a traceback object (which typically contains references
  to local stack frames) into some serializable format typically means that
  the reconstructed traceback will not be as *rich* as it was when created
  on the local process due to the fact that those local stack frames
  will *not* exist in the receiver's process. This implies that traceback
  serialization/deserialization is a lossy process and by side-effect
  this means that for remote exceptions the `traceback`_ module
  can *not* be used and/or that the information it produces may
  not be accurate.
* Input validation must now be performed, ensuring that the serialized format
  created by the sender is actually valid (this excludes using `pickle`_
  for serialization/deserialization due to its widely known security
  vulnerabilities).
* The receiver of the failure, if it desires to *try* to recreate an
  exception object from the serialized version **must** have access to the
  same exception type/class that was used to create the original
  exception; this may not always be possible (depending on modules and classes
  accessible from the receiver's ``sys.path``).
* Any contained exception value (typically a string, but not limited to one)
  will need to be reconstructed (this may not always be possible, for
  example if the originating exception value references some local file
  handle or other non-serializable object, such as a local threading lock).
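
The first point can be seen with the standard library alone. The sketch
below (Python 3; ``failing`` and ``capture`` are hypothetical names used only
for illustration) shows that a live traceback object cannot be handed to
``json`` as-is, so only its formatted text survives the trip::

    import json
    import sys
    import traceback


    def failing():
        raise ValueError("broken")


    def capture():
        """Return a JSON-safe dict describing the active exception."""
        exc_type, value, tb = sys.exc_info()
        return {
            'class': exc_type.__name__,
            'module': exc_type.__module__,
            'message': str(value),
            # The traceback object references live frames, so only its
            # formatted text is kept; the frames themselves are lost.
            'tb': traceback.format_tb(tb),
        }


    try:
        failing()
    except ValueError:
        wire = json.dumps(capture())
        try:
            json.dumps(sys.exc_info()[2])
            tb_is_serializable = True
        except TypeError:
            tb_is_serializable = False

    # On the receiving side ``tb`` is a list of strings, not a traceback.
    received = json.loads(wire)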

.. _sys.exc_info: https://docs.python.org/2/library/sys.html#sys.exc_info
.. _pickle: https://docs.python.org/2/library/pickle.html
.. _traceback: https://docs.python.org/2/library/traceback.html

What exists
===========

There are a few known implementations of failure capturing, serialization
and deserialization/reconstruction. Let us dive into how each one works and
analyze the benefits and drawbacks of each approach.

Oslo.privsep
------------

Source:

* https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187

Commentary
~~~~~~~~~~

* Sends back class + module name across socket channel + exception arguments.
* Drops traceback (logs it on privileged side).
* Recreates new class object with sent across arguments (and reraises)
  on unprivileged side (ideally nothing leaks across?).
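
A minimal sketch of that approach, assuming Python 3 and that the exception
class is importable on the receiving side. This is not the oslo.privsep
code itself; ``dump_exception`` and ``load_exception`` are hypothetical
names::

    import importlib
    import json
    import logging

    LOG = logging.getLogger('privsep_sketch')


    def dump_exception(exc):
        """Privileged side: log the traceback, send names + arguments."""
        LOG.debug("privileged call failed", exc_info=exc)
        cls = type(exc)
        return json.dumps([cls.__module__, cls.__name__, list(exc.args)])


    def load_exception(payload):
        """Unprivileged side: rebuild an exception of the same class."""
        module_name, class_name, args = json.loads(payload)
        klass = getattr(importlib.import_module(module_name), class_name)
        if not (isinstance(klass, type) and issubclass(klass, Exception)):
            raise TypeError("%s.%s is not an exception type"
                            % (module_name, class_name))
        return klass(*args)


    try:
        raise KeyError("missing-key")
    except KeyError as e:
        payload = dump_exception(e)

    # The rebuilt exception has the same type and arguments, but no
    # traceback: that stayed in the sender's log.
    rebuilt = load_exception(payload)

Importing a module named by the peer is only reasonable when the peer is
trusted.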

Oslo.messaging
--------------

Source:

* https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L164
* https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L204

A similar (same?) copy seems to be in nova (for cells?):

* https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1878
* https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1918

Docs: unknown

Commentary
~~~~~~~~~~

Serializes: yes (to json); keyword arguments of exception are extracted
from optional exception attribute ``kwargs``, class name and module name
of exception are captured with final data format being::

    data = {
        'class': cls_name,
        'module': mod_name,
        'message': six.text_type(exception),
        'tb': tb,
        'args': exception.args,
        'kwargs': kwargs
    }

Deserializes: yes; previous json data is loaded as a dictionary.

Validates: No; `jsonschema`_ validation is not currently performed.

Reconstructs: yes (with limitations); the exception message is loaded from
``message`` in ``data`` and concatenated with the traceback from the ``tb``
dictionary element. The received module is then verified against a provided
list; if it is not allowed, a generic exception is raised which
attempts to encapsulate the received failure. This generic
exception (which does retain the traceback) is created via::

    oslo_messaging.RemoteError(data.get('class'), data.get('message'),
                               data.get('tb'))

Otherwise, if the module is one of the allowed modules, the exception class
object is recreated by using::

    klass = <load module and class and verify class is an exception type>
    exception = klass(*data.get('args', []), **data.get('kwargs', {}))

Then if this works, to ensure the ``__str__`` and ``__unicode__`` methods
correctly return the ``message`` key in the previously mentioned ``data``
dictionary a dynamic exception type is created with a dynamically created
function that returns provided ``message``; then the ``exception`` created
above has its ``__class__`` attribute replaced to be this new dynamic
exception type (woah!)::

    ex_type = type(exception)
    str_override = lambda self: message
    new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
                       {'__str__': str_override, '__unicode__': str_override})
    new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX)
    exception.__class__ = new_ex_type

If this class replacement doesn't work, the type of ``exception`` is left
untouched and instead its ``exception.args`` is replaced with a new
``args`` tuple that has the ``message`` from the ``data`` dict as its first
entry (replacing the prior ``args`` first entry with its own).
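
The steps above can be condensed into one runnable sketch. It assumes
Python 3 (so only ``__str__`` is overridden), assumes ``tb`` is a list of
already formatted traceback lines, and uses a simplified local
``RemoteError`` stand-in plus a hypothetical ``rebuild`` function rather
than the real oslo.messaging code::

    import importlib

    _REMOTE_POSTFIX = '_Remote'


    class RemoteError(Exception):
        """Simplified stand-in for ``oslo_messaging.RemoteError``."""

        def __init__(self, exc_type=None, value=None, traceback=None):
            self.exc_type = exc_type
            self.value = value
            self.traceback = traceback
            super().__init__("Remote error: %s %s" % (exc_type, value))


    def rebuild(data, allowed_modules):
        """Rebuild an exception from ``data`` following the steps above."""
        message = data.get('message', '') + "\n" + "".join(data.get('tb', []))
        module = data.get('module')
        generic = RemoteError(data.get('class'), data.get('message'),
                              data.get('tb'))
        if module not in allowed_modules:
            return generic
        try:
            klass = getattr(importlib.import_module(module), data.get('class'))
            if not issubclass(klass, Exception):
                return generic
            exception = klass(*data.get('args', []), **data.get('kwargs', {}))
        except (AttributeError, ImportError, TypeError):
            return generic
        ex_type = type(exception)
        new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
                           {'__str__': lambda self: message})
        new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX)
        try:
            exception.__class__ = new_ex_type
        except TypeError:
            # Class swap refused (e.g. for built-in exception types):
            # fall back to replacing the first positional argument.
            exception.args = (message,) + exception.args[1:]
        return exception


    data = {
        'class': 'BadZipFile',
        'module': 'zipfile',
        'message': 'bad archive',
        'tb': ['  File "x.py", line 1, in f\n'],
        'args': ['bad archive'],
        'kwargs': {},
    }
    allowed = rebuild(data, allowed_modules=['zipfile'])
    denied = rebuild(data, allowed_modules=[])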

Notes:

* Appears to lose remote traceback info during above reconstruction
  process (unless `RemoteError`_ is returned, which does not
  lose the traceback, but does lose the original type + associated
  information).
* Does not capture `chained`_ exception information.
* Copied (or some version of it) into nova cells (currently unknown what
  version/sha the nova folks copied from).

.. _RemoteError: http://docs.openstack.org/developer/\
                 oslo.messaging/rpcclient.html#oslo_messaging.RemoteError

TaskFlow
--------

Source:

* https://github.com/openstack/taskflow/blob/1.21.0/taskflow/types/failure.py

Docs:

* http://docs.openstack.org/developer/taskflow/types.html#module-taskflow.types.failure

Commentary
~~~~~~~~~~

Serializes: True; translates exception (or ``sys.exc_info`` call) into
a dictionary using ``to_dict`` method. Example::

    >>> from taskflow.types import failure
    >>> try:
    ...    raise IOError("I have broke")
    ... except Exception:
    ...    f = failure.Failure()
    ...
    >>> print(json.dumps(f.to_dict(), indent=4, sort_keys=True))
    {
        "causes": [],
        "exc_type_names": [
            "IOError",
            "EnvironmentError",
            "StandardError",
            "Exception"
        ],
        "exception_str": "I have broke",
        "traceback_str": "  File \"<stdin>\", line 2, in <module>\n",
        "version": 1
    }

Deserializes: True; loads from json into dictionary.

Validates: True; uses `jsonschema`_ with schema::

    SCHEMA = {
        "$ref": "#/definitions/cause",
        "definitions": {
            "cause": {
                "type": "object",
                'properties': {
                    'version': {
                        "type": "integer",
                        "minimum": 0,
                    },
                    'exception_str': {
                        "type": "string",
                    },
                    'traceback_str': {
                        "type": "string",
                    },
                    'exc_type_names': {
                        "type": "array",
                        "items": {
                            "type": "string",
                        },
                        "minItems": 1,
                    },
                    'causes': {
                        "type": "array",
                        "items": {
                            "$ref": "#/definitions/cause",
                        },
                    }
                },
                "required": [
                    "exception_str",
                    'traceback_str',
                    'exc_type_names',
                ],
                "additionalProperties": True,
            },
        },
    }

Reconstructs: True when failure objects are raised locally (when serialization
is not used). False when serialized using ``to_dict``: instead of going
through a process like the one described for ``oslo.messaging`` above, this
object wraps the originating exception(s) in a new `WrappedFailure`_
exception, exposes their type information (as strings) and traceback through
it, and provides accessors and useful methods (defined on the
failure class) for introspecting the contained information.

Notes:

* Captures (and serializes and deserializes) `chained`_ exceptions (as
  nested failure objects). Seen in above schema as ``causes`` key (which
  self-references the schema object).
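
How chained exceptions can end up as nested ``causes`` can be sketched with
the standard library alone. The code below assumes Python 3 and loosely
follows the dictionary layout shown above; ``failure_to_dict`` and
``load_config`` are hypothetical names and this is not the TaskFlow
implementation::

    import json
    import traceback


    def failure_to_dict(exc):
        """Convert ``exc`` and its chained causes into nested dicts."""
        cause = exc.__cause__
        if cause is None and not exc.__suppress_context__:
            cause = exc.__context__
        return {
            'version': 1,
            'exception_str': str(exc),
            'traceback_str': "".join(traceback.format_tb(exc.__traceback__)),
            'exc_type_names': [c.__name__ for c in type(exc).__mro__
                               if issubclass(c, Exception)],
            'causes': [failure_to_dict(cause)] if cause is not None else [],
        }


    def load_config():
        try:
            raise OSError("disk unplugged")
        except OSError as e:
            raise RuntimeError("could not load config") from e


    try:
        load_config()
    except RuntimeError as e:
        as_dict = failure_to_dict(e)

    wire = json.dumps(as_dict, indent=4, sort_keys=True)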

.. _chained: https://www.python.org/dev/peps/pep-3134/
.. _WrappedFailure: http://docs.openstack.org/developer/\
                    taskflow/exceptions.html#taskflow.exceptions.WrappedFailure
.. _jsonschema: http://json-schema.org/

Twisted
-------

Source:

* https://github.com/twisted/twisted/blob/twisted-15.4.0/twisted/python/failure.py

Docs:

* http://twistedmatrix.com/documents/current/api/twisted.python.failure.html

Commentary
~~~~~~~~~~

Example::

    >>> from twisted.python import failure
    >>> import pickle
    >>> import traceback
    >>> def blow_up():
    ...    raise ValueError("broken")
    >>> try:
    ...    blow_up()
    ... except ValueError:
    ...    f = failure.Failure()
    >>> print(f)
    [Failure instance: Traceback: <type 'exceptions.ValueError'>: broken
    --- <exception caught here> ---
    <stdin>:2:<module>
    <stdin>:2:blow_up
    ]
    >>> f.raiseException()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 2, in <module>
      File "<stdin>", line 2, in blow_up
    ValueError: broken
    >>> f_p = pickle.dumps(f)
    >>> f_2 = pickle.loads(f_p)
    >>> f_2.raiseException()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 2, in raiseException
    ValueError: broken
    >>> print(f_2.tb)
    None
    >>> traceback.print_tb(f_2.getTracebackObject())
      File "<stdin>", line 2, in <module>
      File "<stdin>", line 2, in blow_up

Serializes: `pickle`_ supported via ``__getstate__`` method. Since
Twisted has created a *mostly* working replacement for the frame information
that a traceback stores it becomes possible to better integrate with
the `traceback`_ module (which accesses that frame information to try to
create useful traceback details).

Deserializes: Yes, via `pickle`_.

Validates: No (`pickle`_ is known to be vulnerable anyway to loading
arbitrary code).

Reconstructs: Partially, a frame-like replica structure is created that
*mostly* works like the original (except it can't be re-raised, but it
can be passed to the `traceback`_ module to have its functions seemingly
work).

Proposed change
===============

Create a new library, https://pypi.python.org/pypi/failure (or other better
named library) that encompasses the combination of the 3-4 models described
above.

It would primarily provide a ``Failure`` object (like provided by
taskflow and twisted) as its main exposed API. That failure
class would have a ``__getstate__`` method so that it can be pickled (for
situations where this is desired) and a ``to_dict`` and ``from_dict`` that
can be used for json serialization and deserialization. It would also have
introspection APIs (similar to what is provided by twisted and taskflow) so
that the underlying exception information can be accessed in a nice manner.

Basic examples of these API(s) that would be great to have (and have
proven themselves useful)::

    @classmethod
    def validate(cls, data):
        """Validate input data matches expected failure format."""

    def check(self, *exc_classes):
        """Check if any of ``exc_classes`` caused the failure.

        ...

        """

    def reraise(self):
        """Re-raise captured exception."""

    @property
    def causes(self):
        """Tuple of all *inner* failure *causes* of this failure.

        ...

        """

    def pformat(self, traceback=False):
        """Pretty formats the failure object into a string."""

    @classmethod
    def from_dict(cls, data):
        """Converts a dictionary into a failure object."""

    def to_dict(self):
        """Converts this object to a dictionary."""

    def copy(self):
        """Copies this object."""

To take advantage of the re-raising capabilities in oslo.messaging this
class should also have a ``reraise`` method that can attempt to reraise the
given failure (if and only if it matches a given list of exception types). It
would **not** attempt to dynamically create ``__str__`` and ``__unicode__``
methods (the class manipulation magic happening in oslo.messaging) to avoid
the peculiarities of that chunk of code. If the contained failure does
not match a known list of failures, then ``reraise`` will return ``False``
and will not re-raise anything (leaving it up to the caller to decide what
to do in this situation; perhaps at this point a common `WrappedFailure`_
like exception should be raised?).
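
One possible shape for that behaviour, written as a standalone function to
keep the sketch short. The name ``reraise_if_allowed`` is hypothetical, and
matching only on the most specific type name and rebuilding the exception
from its message alone are simplifications::

    def reraise_if_allowed(failure, allowed_types):
        """Re-raise ``failure`` when its type is allowed, else return False.

        ``failure`` is a dictionary in the ``to_dict`` layout.
        """
        remote_name = failure['exc_type_names'][0]
        for exc_type in allowed_types:
            if exc_type.__name__ == remote_name:
                raise exc_type(failure['exception_str'])
        return False


    failure = {
        'exc_type_names': ['ValueError', 'Exception'],
        'exception_str': 'broken',
        'traceback_str': '',
    }

    try:
        reraise_if_allowed(failure, (KeyError, ValueError))
    except ValueError as e:
        reraised = e

    # Not in the allowed list: nothing is raised, the caller decides.
    handled = reraise_if_allowed(failure, (KeyError,))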

The validation logic using `jsonschema`_ would be taken from taskflow and
used when deserializing so that errors with *bad* data can be found
earlier (at data load time) rather than later (at data access time).
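
Pulling these pieces together, a heavily reduced ``Failure`` could look
like the sketch below. It is only an illustration under stated assumptions:
Python 3, a hypothetical ``capture`` constructor, type matching on bare
class names, and hand-written checks standing in for the `jsonschema`_
validation so that the example has no third-party dependency::

    import json
    import sys
    import traceback


    class Failure(object):
        """Hypothetical, heavily reduced failure object."""

        _REQUIRED = ('exception_str', 'traceback_str', 'exc_type_names')

        def __init__(self, exc_type_names, exception_str, traceback_str,
                     causes=()):
            self.exc_type_names = tuple(exc_type_names)
            self.exception_str = exception_str
            self.traceback_str = traceback_str
            self.causes = tuple(causes)

        @classmethod
        def capture(cls):
            """Build a failure from the exception being handled."""
            exc_type, value, tb = sys.exc_info()
            names = [c.__name__ for c in exc_type.__mro__
                     if issubclass(c, Exception)]
            return cls(names, str(value), "".join(traceback.format_tb(tb)))

        @classmethod
        def validate(cls, data):
            """Raise ``ValueError`` if ``data`` is not a failure dict."""
            if not isinstance(data, dict):
                raise ValueError("failure data must be a dictionary")
            for key in cls._REQUIRED:
                if key not in data:
                    raise ValueError("missing required key '%s'" % key)
            names = data['exc_type_names']
            if (not isinstance(names, list) or not names
                    or not all(isinstance(n, str) for n in names)):
                raise ValueError("'exc_type_names' must be a non-empty "
                                 "list of strings")
            for cause in data.get('causes', []):
                cls.validate(cause)

        @classmethod
        def from_dict(cls, data):
            cls.validate(data)  # fail at load time, not at access time
            causes = [cls.from_dict(c) for c in data.get('causes', [])]
            return cls(data['exc_type_names'], data['exception_str'],
                       data['traceback_str'], causes)

        def to_dict(self):
            return {
                'version': 1,
                'exc_type_names': list(self.exc_type_names),
                'exception_str': self.exception_str,
                'traceback_str': self.traceback_str,
                'causes': [c.to_dict() for c in self.causes],
            }

        def check(self, *exc_classes):
            """Return the first of ``exc_classes`` found in the names."""
            for exc_class in exc_classes:
                if exc_class.__name__ in self.exc_type_names:
                    return exc_class
            return None


    try:
        raise KeyError("flavor")
    except KeyError:
        local = Failure.capture()

    remote = Failure.from_dict(json.loads(json.dumps(local.to_dict())))

After the round trip ``remote.check(LookupError)`` still matches, because
the whole class hierarchy travels as names, while a dictionary missing a
required key is rejected by ``from_dict`` before any attribute is read.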

To provide the twisted-like integration with the `traceback`_ module (by
turning the internal format of a traceback into a pure python object
representation) it has been discussed whether the `traceback2`_ module can
provide equivalent functionality; if it can, then it should be used to
achieve similar integration (it would be even better if the integration
would also allow for re-raising this pure python traceback and frame
representation as an actual traceback, although this may not be a reasonable
expectation).
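
As a rough illustration of what such a pure python representation allows,
the sketch below uses the Python 3 standard library ``traceback`` module
(which ``traceback2`` backports). The helper names ``frames_to_json`` and
``frames_from_json`` are hypothetical. The frame records survive a JSON
round trip and can still be formatted, but they cannot be re-raised as a
real traceback::

    import json
    import traceback


    def frames_to_json(tb):
        """Turn a live traceback into a JSON list of frame records."""
        return json.dumps([[f.filename, f.lineno, f.name, f.line]
                           for f in traceback.extract_tb(tb)])


    def frames_from_json(payload):
        """Rebuild a ``StackSummary`` (plain objects, no live frames)."""
        return traceback.StackSummary.from_list(
            [tuple(record) for record in json.loads(payload)])


    def blow_up():
        raise ValueError("broken")


    try:
        blow_up()
    except ValueError as e:
        payload = frames_to_json(e.__traceback__)

    summary = frames_from_json(payload)
    text = "".join(summary.format())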

.. _traceback2: https://pypi.python.org/pypi/traceback2/

Alternatives
------------

Keep having multiple variations, each with their own weaknesses and
benefits, instead of unifying them under a single library.

Impact on Existing APIs
-----------------------

Ideally none, as the users should still get the same functionality, but
if this is done correctly they will get more meaningful tracebacks, more
meaningful introspection on failure objects and overall better and more
consistent failures.

Security impact
---------------

Performance Impact
------------------

N/A

Configuration Impact
--------------------

N/A

Developer Impact
----------------

This should make developers' lives better.

Testing Impact
--------------

Having the failure code in its own library allows it to be easily mocked
and tested (vs say having it deeply embedded in oslo.messaging where it is
not so easily testable/reviewable...); so overall this should improve
test coverage (and overall code quality).

Implementation
==============

Assignee(s)
-----------

Primary assignee: harlowja

Milestones
----------

Target Milestone for completion: Mitaka

Work Items
----------

#. Create skeleton library.
#. Get skeleton up on gerrit and integrated into oslo pipelines.
#. Start to move around code from oslo.messaging and taskflow
   and refactor to start to form this new library; use concepts and
   learning from twisted and bolt-ons (and others) to help make this
   library the best it can be.
#. Review, code and repeat.
#. Release and integrate.
#. Delete older dead code.
#. Profit!

Incubation
==========

N/A

Documentation Impact
====================

Dependencies
============

References
==========

N/A (all inline)

.. note::

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode