Add some S3 doc

Perhaps some more to follow, someday, but this is nice to have.

Change-Id: I2235f903105049432de24d89a88b40f753fd93d6

parent 2c6232c9ad
commit 141a67e7c7

@@ -0,0 +1,87 @@

==============================
EDP with S3-like Object Stores
==============================

Overview and rationale of S3 integration
========================================
Since the Rocky release, Sahara clusters have full support for interaction with
S3-like object stores, for example Ceph Rados Gateway. Through the abstractions
offered by EDP, a Sahara job execution may consume input data and job binaries
stored in S3, as well as write back its output data to S3.

The copying of job binaries from S3 to a cluster is performed by the botocore
library. A job's input and output to and from S3 are handled by the Hadoop-S3A
driver.

It's also worth noting that the Hadoop-S3A driver may be more mature and
performant than the Hadoop-SwiftFS driver (either as hosted by Apache or in
the sahara-extra repository).

Sahara clusters are also provisioned such that data in S3-like storage can be
accessed when manually interacting with the cluster; in other words, the
needed libraries are properly situated.

Considerations for deployers
============================
The S3 integration features can function without any specific deployment
requirement. This is because the EDP S3 abstractions can point to an arbitrary
S3 endpoint.

Deployers may want to consider using Sahara's optional integration with secret
storage to protect the S3 access and secret keys that users will provide. Also,
if using Rados Gateway for S3, deployers may want to use Keystone for RGW auth
so that users can simply request Keystone EC2 credentials to access RGW's S3.

S3 user experience
==================
The sections below describe how to use the S3 integration features.

EDP job binaries in S3
----------------------
The ``url`` must be in the format ``s3://bucket/path/to/object``, similar to
the format used for binaries in Swift. The ``extra`` structure must contain
``accesskey``, ``secretkey``, and ``endpoint``, which is the URL of the S3
service, including the protocol ``http`` or ``https``.

As mentioned above, the binary will be copied to the cluster before execution
by use of the botocore library. This also means that the set of credentials
used to access this binary may be entirely different than those for accessing
a data source.

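To make the expected shapes concrete, here is a minimal sketch (the helper function and all example values are illustrative, not part of Sahara's API) of how such a URL decomposes into the bucket and object key that a botocore-based copy would operate on:

```python
from urllib.parse import urlparse

# The ``extra`` structure described above; values are placeholders.
extra = {
    "accesskey": "my-access-key",
    "secretkey": "my-secret-key",
    "endpoint": "https://rgw.example.com:8080",  # protocol included
}

def parse_s3_binary_url(url):
    """Split an ``s3://bucket/path/to/object`` URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URL, got %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = parse_s3_binary_url("s3://mybucket/jobs/wordcount.jar")
print(bucket, key)  # mybucket jobs/wordcount.jar
```
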
EDP data sources in S3
----------------------
The ``url`` should be in the format ``s3://bucket/path/to/object``, although
upon execution the protocol will be automatically changed to ``s3a``. The
``credentials`` structure does not have any required values, although the
following may be set:

* ``accesskey`` and ``secretkey``
* ``endpoint``, which is the URL of the S3 service, without the protocol
* ``ssl``, which must be a boolean
* ``bucket_in_path``, to indicate whether the S3 service uses
  virtual-hosted-style or path-style URLs, and must be a boolean

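The automatic protocol change can be pictured with a small sketch (illustrative only, not Sahara's actual implementation), alongside an example ``credentials`` structure using the optional keys listed above:

```python
# Example ``credentials`` structure; every key is optional and the
# values here are placeholders.
credentials = {
    "accesskey": "my-access-key",
    "secretkey": "my-secret-key",
    "endpoint": "rgw.example.com:8080",  # no protocol, unlike job binaries
    "ssl": True,                         # must be a boolean
    "bucket_in_path": True,              # boolean: path-style URLs
}

def adapt_s3_url(url):
    """Rewrite s3:// to s3a://, mirroring the change made at execution time."""
    if url.startswith("s3://"):
        return "s3a://" + url[len("s3://"):]
    return url

print(adapt_s3_url("s3://mybucket/input/data.csv"))
# s3a://mybucket/input/data.csv
```
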
The values above are optional, as they may be set in the cluster's
``core-site.xml`` or as configuration values of the job execution, as dictated
by the options understood by the Hadoop-S3A driver:

* ``fs.s3a.access.key``, corresponding to ``accesskey``
* ``fs.s3a.secret.key``, corresponding to ``secretkey``
* ``fs.s3a.endpoint``, corresponding to ``endpoint``
* ``fs.s3a.connection.ssl.enabled``, corresponding to ``ssl``
* ``fs.s3a.path.style.access``, corresponding to ``bucket_in_path``

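For example, the mapping above might be expressed in a cluster's ``core-site.xml`` as follows (the values are placeholders; the property names are those understood by the Hadoop-S3A driver):

```xml
<property>
  <name>fs.s3a.access.key</name>
  <value>my-access-key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>my-secret-key</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>rgw.example.com:8080</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
```
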
In the case of ``fs.s3a.path.style.access``, a default value is determined by
the Hadoop-S3A driver if none is set: virtual-hosted-style URLs are assumed
unless told otherwise, or if the endpoint is a raw IP address.

Additional configuration values are supported by the Hadoop-S3A driver, and are
discussed in its official documentation.

It is recommended that the EDP data source abstraction be used, rather than
handling bare arguments and configuration values.

If any S3 configuration values are to be set at execution time, including cases
where those values are carried by the EDP data source abstraction, then
``edp.spark.adapt_for_swift`` or ``edp.java.adapt_for_oozie`` must be set to
``true`` as appropriate.

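As an illustration, a job execution that sets S3 configuration values at execution time might carry a structure like the following (a hedged sketch; the exact shape of the request depends on the client used, but the key names are the ones described above):

```python
# Hypothetical EDP job-execution configuration. The
# ``edp.spark.adapt_for_swift`` flag is set to True because S3
# configuration values are being supplied at execution time.
job_configs = {
    "configs": {
        "edp.spark.adapt_for_swift": True,
        "fs.s3a.endpoint": "rgw.example.com:8080",
        "fs.s3a.connection.ssl.enabled": True,
    }
}
print(sorted(job_configs["configs"]))  # lists the three config keys
```
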
@@ -39,6 +39,7 @@ Elastic Data Processing
   :maxdepth: 2

   edp
   edp-s3


Guest Images