Unifying and cleaning up failure remoting
https://blueprints.launchpad.net/oslo.messaging/+spec/failure-remoting
We currently have a couple of ways of remoting failures (exceptions + traceback information) that occur on remote systems back to their source. Their differences make each solution valid and applicable to its own problem area. To encourage unification, this spec proposes taking the best aspects of these implementations and leaving their weaknesses behind to make a best-of-breed implementation.
Problem description
There is a repeated desire to serialize an exception, its type, and as much information as possible about its cause (i.e. its traceback) when a failure occurs on a remote system and must be reported to some other system (typically over RPC, REST, or some other non-local interface). For brevity, let us call the tuple of (exception_type, value, traceback) (typically created from a call to sys.exc_info) a failure object. When a failure is created inside the local process, the exception, its class, and its traceback are natively supported and can be examined, output, logged (typically using the traceback module), handled (via try/except blocks), and analyzed. When the exception is created remotely and sent to a receiver, recreating that failure becomes more complicated for a few reasons:
- Serialization of a traceback object (which typically contains references to local stack frames) into some serializable format typically means that the reconstructed traceback will not be as rich as it was in the originating process, because those local stack frames will not exist in the receiver's process. This implies that traceback serialization/deserialization is a lossy process, and as a side effect the traceback module cannot be used for remote exceptions and/or the information it produces may not be accurate.
- Input validation must now be performed, ensuring that the serialized format created by the sender is actually valid (this excludes using pickle for serialization/deserialization due to its widely known security vulnerabilities).
- The receiver of the failure, if it desires to try to recreate an exception object from the serialized version, must have access to the same exception type/class that was used to create the original exception; this may not always be possible (depending on the modules and classes accessible from the receiver's sys.path).
- Any contained exception value (typically a string, but not limited to one) will need to be reconstructed (this may not always be possible, for example if the originating exception value references some local file handle or other non-serializable object, such as a local threading lock).
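To make the lossiness concrete, here is a minimal sketch (not taken from any of the libraries discussed in this spec) that reduces a failure to JSON-safe data using only the standard library. The function names capture_failure and format_remote_frames and the dictionary keys are hypothetical; the point is that only names, text, and (filename, lineno, name, line) tuples survive the trip.

```python
import json
import sys
import traceback


def capture_failure():
    """Turn the exception currently being handled into JSON-safe data."""
    exc_type, exc_value, tb = sys.exc_info()
    return {
        # Class objects are not serializable, so only names travel.
        "exc_type_name": exc_type.__name__,
        "exc_module": exc_type.__module__,
        # The value is flattened to text; richer attributes are lost.
        "exception_str": str(exc_value),
        # Live frames are reduced to (filename, lineno, name, line).
        "frames": [
            [f.filename, f.lineno, f.name, f.line]
            for f in traceback.extract_tb(tb)
        ],
    }


def format_remote_frames(data):
    """Render the transported frames on the receiving side."""
    frames = [tuple(frame) for frame in data["frames"]]
    return "".join(traceback.format_list(frames))


def blow_up():
    raise ValueError("broken")


try:
    blow_up()
except ValueError:
    wire = json.dumps(capture_failure())

received = json.loads(wire)
print(format_remote_frames(received))
```

The receiver can still print something that looks like a traceback, but it holds no frame objects, no local variables, and no exception class, which is exactly the information the items above say is hard to carry across.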
What exists
There are a few known implementations of failure capturing, serialization and deserialization/reconstruction. Let us dive into how each one works and analyze the benefits and drawbacks of each approach.
Oslo.privsep
Source:
- https://github.com/openstack/oslo.privsep/blob/1.13.0/oslo_privsep/daemon.py#L181-L187
Commentary
- Sends back class + module name across socket channel + exception arguments.
- Drops traceback (logs it on privileged side).
- Recreates new class object with sent across arguments (and reraises) on unprivileged side (ideally nothing leaks across?).
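The approach described in these bullets can be sketched roughly as follows. This is an illustration only and not the oslo.privsep code: the function names dump_exception and load_exception are hypothetical, and JSON is assumed as the wire encoding purely for the example.

```python
import importlib
import json


def dump_exception(exc):
    """Sending side: keep only what is needed to rebuild the exception."""
    cls = type(exc)
    return json.dumps({
        "module": cls.__module__,
        "name": cls.__name__,
        "args": list(exc.args),
    })


def load_exception(payload):
    """Receiving side: import the class by name and re-instantiate it."""
    data = json.loads(payload)
    module = importlib.import_module(data["module"])
    cls = getattr(module, data["name"])
    # Refuse anything that is not an exception class.
    if not (isinstance(cls, type) and issubclass(cls, Exception)):
        raise TypeError("%s.%s is not an exception class"
                        % (data["module"], data["name"]))
    return cls(*data["args"])


payload = dump_exception(KeyError("missing-key"))
rebuilt = load_exception(payload)
print(type(rebuilt).__name__, rebuilt.args)
```

Note that the rebuilt exception has no traceback attached, and that the receiver must be able to import the named module, which is the sys.path limitation called out in the problem description.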
Oslo.messaging
Source:
- https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L164
- https://github.com/openstack/oslo.messaging/blob/2.5.0/oslo_messaging/_drivers/common.py#L204
A similar (same?) copy seems to be in nova (for cells?):
- https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1878
- https://github.com/openstack/nova/blob/stable/liberty/nova/cells/messaging.py#L1918
Docs: unknown
Commentary
Serializes: yes (to json); keyword arguments of the exception are extracted from the optional exception attribute kwargs, and the class name and module name of the exception are captured, with the final data format being:
data = {
    'class': cls_name,
    'module': mod_name,
    'message': six.text_type(exception),
    'tb': tb,
    'args': exception.args,
    'kwargs': kwargs
}
Deserializes: yes; previous json data is loaded as a dictionary.
Validates: No; jsonschema validation is not currently performed.
Reconstructs: yes (with limitations); the exception message is loaded from the message key of data and concatenated with the traceback from the tb key. The received module is then verified against a provided list; if it is not allowed, a generic exception is raised that attempts to encapsulate the received failure. This generic exception (which does retain the traceback) is created via:
oslo_messaging.RemoteError(data.get('class'), data.get('message'),
                           data.get('tb'))
Otherwise if the module is one of the allowed types the exception class object is recreated by using:
klass = <load module and class and verify class is an exception type>
exception = klass(*data.get('args', []), **data.get('kwargs', {}))
Then, if this works, a dynamic exception type is created with a dynamically created function that returns the provided message, to ensure that the __str__ and __unicode__ methods correctly return the message key of the previously mentioned data dictionary; the exception created above then has its __class__ attribute replaced with this new dynamic exception type (woah!):
ex_type = type(exception)
str_override = lambda self: message
new_ex_type = type(ex_type.__name__ + _REMOTE_POSTFIX, (ex_type,),
                   {'__str__': str_override, '__unicode__': str_override})
new_ex_type.__module__ = '%s%s' % (module, _REMOTE_POSTFIX)
exception.__class__ = new_ex_type
If this doesn't work, the exception keeps its original class and instead its exception.args list is replaced with a new args list that has the message from the data dict as its first entry (replacing the prior first entry of args).
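The class swap and its fallback can be tried in isolation with the sketch below. It is a simplified illustration and not the oslo.messaging code: the helper override_message and the QuotaExceeded class are hypothetical, the value of _REMOTE_POSTFIX is assumed, and the failure of the __class__ assignment is assumed to surface as a TypeError.

```python
_REMOTE_POSTFIX = "_Remote"  # assumed value, for illustration only


class QuotaExceeded(Exception):
    """Hypothetical application exception used only for this example."""


def override_message(exception, message, module):
    """Make str(exception) return ``message`` on the receiving side."""
    ex_type = type(exception)
    new_ex_type = type(
        ex_type.__name__ + _REMOTE_POSTFIX,
        (ex_type,),
        {"__str__": lambda self: message},
    )
    new_ex_type.__module__ = "%s%s" % (module, _REMOTE_POSTFIX)
    try:
        # Swap the instance's class for the dynamic subclass.
        exception.__class__ = new_ex_type
    except TypeError:
        # Some types refuse __class__ assignment; fall back to
        # replacing the first positional argument instead.
        exception.args = (message,) + exception.args[1:]
    return exception


remote = override_message(QuotaExceeded("local text"),
                          "remote text", "example.exceptions")
print(type(remote).__name__, str(remote))
```

Either branch leaves an object that still passes isinstance checks against the original class, which is what lets existing except clauses on the caller side keep working.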
Notes:
- Appears to lose remote traceback info during above reconstruction process (unless RemoteError is returned, which does not lose the traceback, but does lose the original type + associated information).
- Does not capture chained exception information.
- Copied (or some version of it) into nova cells (currently unknown what version/sha the nova folks copied from).
TaskFlow
Source:
Docs:
Commentary
Serializes: True; translates an exception (or the result of a sys.exc_info call) into a dictionary using the to_dict method. Example:
>>> import json
>>> from taskflow.types import failure
>>> try:
...     raise IOError("I have broke")
... except Exception:
...     f = failure.Failure()
...
>>> print(json.dumps(f.to_dict(), indent=4, sort_keys=True))
{
    "causes": [],
    "exc_type_names": [
        "IOError",
        "EnvironmentError",
        "StandardError",
        "Exception"
    ],
    "exception_str": "I have broke",
    "traceback_str": " File \"<stdin>\", line 2, in <module>\n",
    "version": 1
}
Deserializes: True; loads from json into dictionary.
Validates: True; uses jsonschema with schema:
SCHEMA = {
    "$ref": "#/definitions/cause",
    "definitions": {
        "cause": {
            "type": "object",
            'properties': {
                'version': {
                    "type": "integer",
                    "minimum": 0,
                },
                'exception_str': {
                    "type": "string",
                },
                'traceback_str': {
                    "type": "string",
                },
                'exc_type_names': {
                    "type": "array",
                    "items": {
                        "type": "string",
                    },
                    "minItems": 1,
                },
                'causes': {
                    "type": "array",
                    "items": {
                        "$ref": "#/definitions/cause",
                    },
                }
            },
            "required": [
                "exception_str",
                'traceback_str',
                'exc_type_names',
            ],
            "additionalProperties": True,
        },
    },
}
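To show what the schema above enforces, the following is a hand-written, standard-library-only approximation of the same rules. It is not how taskflow validates (taskflow uses jsonschema with the schema shown); the function validate_cause is hypothetical and exists only to spell out the required keys, the types, and the recursive causes check.

```python
def validate_cause(data):
    """Raise ValueError if ``data`` does not look like a failure dict."""
    if not isinstance(data, dict):
        raise ValueError("failure must be an object")
    for key in ("exception_str", "traceback_str", "exc_type_names"):
        if key not in data:
            raise ValueError("missing required key: %s" % key)
    for key in ("exception_str", "traceback_str"):
        if not isinstance(data[key], str):
            raise ValueError("%s must be a string" % key)
    names = data["exc_type_names"]
    if not isinstance(names, list) or not names:
        raise ValueError("exc_type_names must be a non-empty array")
    if not all(isinstance(name, str) for name in names):
        raise ValueError("exc_type_names must contain only strings")
    version = data.get("version", 0)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (isinstance(version, bool) or not isinstance(version, int)
            or version < 0):
        raise ValueError("version must be a non-negative integer")
    causes = data.get("causes", [])
    if not isinstance(causes, list):
        raise ValueError("causes must be an array")
    # Each cause is itself a failure, mirroring the schema's self-reference.
    for cause in causes:
        validate_cause(cause)


validate_cause({
    "version": 1,
    "exception_str": "I have broke",
    "traceback_str": "",
    "exc_type_names": ["IOError", "Exception"],
    "causes": [],
})
```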
Reconstructs: True when failure objects are raised locally (when serialization is not used). False when serialized using to_dict; instead of going through a process like the one described for oslo.messaging above, this object wraps the originating exception(s) in a new WrappedFailure exception that exposes their type information (as strings) and traceback, and provides accessors and useful methods (defined on the failure class) for introspecting the contained information.
Notes:
- Captures (and serializes and deserializes) chained exceptions (as nested failure objects). Seen in the above schema as the causes key (which self-references the schema object).
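As a rough illustration of how chained exceptions could map onto the nested causes key, consider the sketch below. It is not taskflow's implementation: the function failure_to_dict is hypothetical, and it assumes Python 3, where explicit chaining is exposed through the __cause__ attribute.

```python
import traceback


def failure_to_dict(exc):
    """Convert an exception and its explicit causes into nested dicts."""
    causes = []
    if exc.__cause__ is not None:
        causes.append(failure_to_dict(exc.__cause__))
    return {
        "version": 1,
        "exception_str": str(exc),
        "traceback_str": "".join(traceback.format_tb(exc.__traceback__)),
        # Class names along the MRO, leaving out BaseException and object.
        "exc_type_names": [
            cls.__name__ for cls in type(exc).__mro__
            if cls not in (BaseException, object)
        ],
        "causes": causes,
    }


try:
    try:
        raise KeyError("db-host")
    except KeyError as inner:
        raise RuntimeError("configuration is incomplete") from inner
except RuntimeError as outer:
    data = failure_to_dict(outer)

print(data["exc_type_names"], len(data["causes"]))
```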
Twisted
Source:
Docs:
Commentary
Example:
>>> from twisted.python import failure
>>> import pickle
>>> import traceback
>>> def blow_up():
...     raise ValueError("broken")
>>> try:
...     blow_up()
... except ValueError:
...     f = failure.Failure()
>>> print(f)
[Failure instance: Traceback: <type 'exceptions.ValueError'>: broken
--- <exception caught here> ---
<stdin>:2:<module>
<stdin>:2:blow_up
]
>>> f.raiseException()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in blow_up
ValueError: broken
>>> f_p = pickle.dumps(f)
>>> f_2 = pickle.loads(f_p)
>>> f_2.raiseException()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 2, in raiseException
ValueError: broken
>>> print(f_2.tb)
None
>>> traceback.print_tb(f_2.getTracebackObject())
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in blow_up
Serializes: Yes; pickle is supported via the __getstate__ method. Since Twisted has created a mostly working replacement for the frame information that a traceback stores, it becomes possible to better integrate with the traceback module (which accesses that frame information to try to create useful traceback details).
Deserializes: Yes, via pickle.
Validates: No (pickle is known to be vulnerable anyway to loading arbitrary code).
Reconstructs: Partially; a frame-like replica structure is created that mostly works like the original (except it can't be re-raised, but it can be passed to the traceback module to have its functions seemingly work).
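The general idea of swapping a live traceback for plain data at pickling time can be sketched as follows. This is not Twisted's implementation: the class PicklableFailure and its attributes are hypothetical, and plain tuples are used here instead of frame-like replica objects. The example pickles the state returned by __getstate__ directly; pickle.dumps on the object itself would call the same hook.

```python
import pickle
import sys
import traceback


class PicklableFailure(object):
    """Hypothetical failure holder whose state survives pickling."""

    def __init__(self):
        exc_type, self.value, self.tb = sys.exc_info()
        self.type_name = exc_type.__name__
        # Plain tuples stand in for the frames the traceback refers to.
        self.frames = [
            (f.filename, f.lineno, f.name, f.line)
            for f in traceback.extract_tb(self.tb)
        ]

    def __getstate__(self):
        state = self.__dict__.copy()
        # Traceback objects cannot be pickled, so drop the live one.
        state["tb"] = None
        # The value may hold unpicklable objects; keep its text only.
        state["value"] = str(self.value)
        return state


def blow_up():
    raise ValueError("broken")


try:
    blow_up()
except ValueError:
    f = PicklableFailure()

restored_state = pickle.loads(pickle.dumps(f.__getstate__()))
print(restored_state["tb"])
print("".join(traceback.format_list(restored_state["frames"])))
```

As in the Twisted session above, the live traceback is gone after the round trip, yet the traceback module can still format the frame data that was kept.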
Proposed change
Create a new library, https://pypi.org/project/failure (or other better named library) that encompasses the combination of the 3-4 models described above.
It would primarily provide a Failure object (like the ones provided by taskflow and twisted) as its main exposed API. That failure class would have a __getstate__ method so that it can be pickled (for situations where this is desired) and to_dict and from_dict methods that can be used for json serialization and deserialization. It would also have introspection APIs (similar to what is provided by twisted and taskflow) so that the underlying exception information can be accessed in a nice manner.
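A minimal sketch of such an object is shown below, assuming the dictionary layout used by taskflow earlier in this spec. It is only an illustration of the shape of the API and not the proposed library: the from_exc_info constructor and the attribute names are hypothetical, and matching in check is done by class name only.

```python
import sys
import traceback


class Failure(object):
    """Illustrative failure holder; not the proposed library's code."""

    def __init__(self, exc_type_names, exception_str, traceback_str,
                 causes=()):
        self.exc_type_names = tuple(exc_type_names)
        self.exception_str = exception_str
        self.traceback_str = traceback_str
        self._causes = tuple(causes)

    @classmethod
    def from_exc_info(cls):
        """Capture the exception currently being handled."""
        exc_type, value, tb = sys.exc_info()
        names = [c.__name__ for c in exc_type.__mro__
                 if c not in (BaseException, object)]
        return cls(names, str(value), "".join(traceback.format_tb(tb)))

    def check(self, *exc_classes):
        """Return the first class whose name matches, else None."""
        for exc_class in exc_classes:
            if exc_class.__name__ in self.exc_type_names:
                return exc_class
        return None

    @property
    def causes(self):
        return self._causes

    def to_dict(self):
        return {
            "version": 1,
            "exc_type_names": list(self.exc_type_names),
            "exception_str": self.exception_str,
            "traceback_str": self.traceback_str,
            "causes": [c.to_dict() for c in self._causes],
        }

    @classmethod
    def from_dict(cls, data):
        causes = [cls.from_dict(c) for c in data.get("causes", [])]
        return cls(data["exc_type_names"], data["exception_str"],
                   data["traceback_str"], causes)


try:
    raise OSError("I have broke")
except Exception:
    f = Failure.from_exc_info()

f_2 = Failure.from_dict(f.to_dict())
print(f_2.check(ValueError, OSError))
```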
Basic examples of these API(s) that would be great to have (and have proven themselves useful):
@classmethod
def validate(cls, data):
    """Validate input data matches expected failure format."""

def check(self, *exc_classes):
    """Check if any of ``exc_classes`` caused the failure.
    ...
    """

def reraise(self):
    """Re-raise captured exception."""

@property
def causes(self):
    """Tuple of all *inner* failure *causes* of this failure.
    ...
    """

def pformat(self, traceback=False):
    """Pretty formats the failure object into a string."""

@classmethod
def from_dict(cls, data):
    """Converts this from a dictionary to an object."""

def to_dict(self):
    """Converts this object to a dictionary."""

def copy(self):
    """Copies this object."""
To take advantage of the re-raising capabilities in oslo.messaging, this class should also have a reraise method that can attempt to reraise the given failure (if and only if it matches a given list of exception types). It would not attempt to dynamically create __str__ and __unicode__ methods (the class manipulation magic happening in oslo.messaging), to avoid the peculiarities of that chunk of code. If the contained failure does not match a known list of failures, then reraise will return false and will not re-raise anything (leaving it up to the caller to decide what to do in this situation; perhaps at this point a common WrappedFailure-like exception should be raised?).
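One possible reading of that reraise behaviour is sketched here as a standalone function over the serialized dictionary. The signature, the name-only matching, and the choice to rebuild the exception from its string form are all assumptions made for illustration; a real implementation would likely also compare module names.

```python
def reraise(failure_data, allowed_types):
    """Raise a matching allowed exception, or return False."""
    primary_name = failure_data["exc_type_names"][0]
    for exc_type in allowed_types:
        if exc_type.__name__ == primary_name:
            raise exc_type(failure_data["exception_str"])
    # No allowed type matched; the caller decides what happens next.
    return False


data = {
    "exc_type_names": ["ValueError", "Exception"],
    "exception_str": "broken",
    "traceback_str": "",
}

if reraise(data, [KeyError]) is False:
    print("not allowed, nothing raised")

try:
    reraise(data, [KeyError, ValueError])
except ValueError as exc:
    print("re-raised:", exc)
```

Returning False rather than raising a generic error keeps the policy decision with the caller, which matches the open question above about whether a WrappedFailure-like exception should be raised at that point.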
The validation logic using jsonschema would be taken from taskflow and used when deserializing so that errors with bad data can be found earlier (at data load time) rather than later (at data access time).
To provide the twisted-like integration with the traceback module (by turning the internal format of a traceback into a pure python object representation), it has been discussed whether the traceback2 module can provide equivalent functionality; if it can, then it should be used to achieve similar integration (it would be even better if the integration also allowed re-raising this pure python traceback and frame representation as an actual traceback, although this may not be a reasonable expectation).
Alternatives
Keep having multiple variations, each with their own weaknesses and benefits, instead of unifying them under a single library.
Impact on Existing APIs
Ideally none, as the users should still get the same functionality, but if this is done correctly they will get more meaningful tracebacks, more meaningful introspection on failure objects and overall better and more consistent failures.
Security impact
Performance Impact
N/A
Configuration Impact
N/A
Developer Impact
This should make developers' lives better.
Testing Impact
Having the failure code in its own library allows it to be easily mocked and tested (versus, say, having it deeply embedded in oslo.messaging, where it is not so easily testable/reviewable); overall this should improve test coverage (and overall code quality).
Implementation
Assignee(s)
Primary assignee: harlowja
Milestones
Target Milestone for completion: Mitaka
Work Items
- Create skeleton library.
- Get skeleton up on gerrit and integrated into oslo pipelines.
- Start to move around code from oslo.messaging and taskflow and refactor to start to form this new library; use concepts and learning from twisted and bolt-ons (and others) to help make this library the best it can be.
- Review and code and repeat.
- Release and integrate.
- Delete older dead code.
- Profit!
Incubation
N/A
Documentation Impact
Dependencies
References
N/A (all inline)
Note
This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode