author Jeremy Freudberg <> 2018-08-07 17:30:26 -0400
committer Jeremy Freudberg <> 2018-08-07 21:31:04 +0000
commit 141a67e7c77908099b6870fd517f32682b3bc846
parent 2c6232c9ad37079020ac72a29a54ab794cc80500
Add some S3 doc

Perhaps some more to follow, someday, but this is nice to have.

Change-Id: I2235f903105049432de24d89a88b40f753fd93d6

Notes (review):
    Code-Review+2: Luigi Toscano <>
    Code-Review+2: Telles Mota Vidal Nóbrega <>
    Workflow+1: Telles Mota Vidal Nóbrega <>
    Verified+2: Zuul
    Submitted-by: Zuul
    Submitted-at: Thu, 09 Aug 2018 15:19:24 +0000
    Reviewed-on:
    Project: openstack/sahara
    Branch: refs/heads/master

2 files changed, 88 insertions, 0 deletions
diff --git a/doc/source/user/edp-s3.rst b/doc/source/user/edp-s3.rst
new file mode 100644
index 0000000..20507b6
--- /dev/null
+++ b/doc/source/user/edp-s3.rst
@@ -0,0 +1,87 @@
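==============================
EDP with S3-like Object Stores
==============================

Overview and rationale of S3 integration
========================================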
Since the Rocky release, Sahara clusters have full support for interaction with
S3-like object stores, for example Ceph Rados Gateway. Through the abstractions
offered by EDP, a Sahara job execution may consume input data and job binaries
stored in S3, as well as write back its output data to S3.

The copying of job binaries from S3 to a cluster is performed by the botocore
library. A job's input and output to and from S3 is handled by the Hadoop-S3A
driver.

It's also worth noting that the Hadoop-S3A driver may be more mature and
performant than the Hadoop-SwiftFS driver (either as hosted by Apache or in
the sahara-extra repository).

Sahara clusters are also provisioned so that data in S3-like storage can be
accessed when interacting with the cluster manually; in other words, the
needed libraries are already in place.

Considerations for deployers
============================
The S3 integration features can function without any specific deployment
requirement. This is because the EDP S3 abstractions can point to an arbitrary
S3 endpoint.

Deployers may want to consider using Sahara's optional integration with secret
storage to protect the S3 access and secret keys that users will provide. Also,
if using Rados Gateway for S3, deployers may want to use Keystone for RGW auth
so that users can simply request Keystone EC2 credentials to access RGW's S3.

S3 user experience
==================
The following sections describe how to use the S3 integration features.

EDP job binaries in S3
----------------------
The ``url`` must be in the format ``s3://bucket/path/to/object``, similar to
the format used for binaries in Swift. The ``extra`` structure must contain
``accesskey``, ``secretkey``, and ``endpoint``; the last is the URL of the S3
service, including the protocol (``http`` or ``https``).

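
The requirements above can be sketched as a small validation helper. This is
only an illustration, not Sahara's own code: the function name, the ``name``
field, the bucket name, and the endpoint ``https://rgw.example.com`` are
assumptions made for the example; only ``url``, ``extra``, and the three keys
inside ``extra`` come from the text above.

```python
REQUIRED_EXTRA_KEYS = ("accesskey", "secretkey", "endpoint")


def s3_job_binary(name, url, extra):
    """Build a job-binary description after checking the rules stated above.

    Hypothetical helper: the shape of the returned dict is an assumption.
    """
    if not url.startswith("s3://"):
        raise ValueError("url must look like s3://bucket/path/to/object")

    missing = [key for key in REQUIRED_EXTRA_KEYS if not extra.get(key)]
    if missing:
        raise ValueError("extra is missing: " + ", ".join(missing))

    # For job binaries the endpoint carries its protocol.
    if not extra["endpoint"].startswith(("http://", "https://")):
        raise ValueError("endpoint must start with http:// or https://")

    return {"name": name, "url": url, "extra": dict(extra)}


binary = s3_job_binary(
    "wordcount.jar",
    "s3://jobs/lib/wordcount.jar",
    {
        "accesskey": "EXAMPLE_ACCESS_KEY",
        "secretkey": "EXAMPLE_SECRET_KEY",
        "endpoint": "https://rgw.example.com",
    },
)
```

Checking the structure before submitting it gives an early, readable error
instead of a failure at job execution time.
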
As mentioned above, the binary will be copied to the cluster before execution,
by use of the botocore library. This also means that the set of credentials
used to access this binary may be entirely different from those for accessing
a data source.

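
To give a rough idea of what such a copy involves, the sketch below fetches an
object with botocore using the values from ``extra``. It is not the code Sahara
runs; the function name, the ``make_client`` hook (there so the logic can be
exercised without a live S3 service), and the placeholder region are all
assumptions.

```python
from urllib.parse import urlsplit


def fetch_job_binary(url, extra, make_client=None):
    """Return the bytes of the object named by an s3:// URL (illustrative)."""
    parts = urlsplit(url)
    key = parts.path.lstrip("/")
    if parts.scheme != "s3" or not parts.netloc or not key:
        raise ValueError("expected s3://bucket/path/to/object, got %r" % url)
    bucket = parts.netloc

    if make_client is None:
        # Imported lazily so the sketch can be read without botocore installed.
        import botocore.session

        def make_client(endpoint, accesskey, secretkey):
            return botocore.session.get_session().create_client(
                "s3",
                endpoint_url=endpoint,
                aws_access_key_id=accesskey,
                aws_secret_access_key=secretkey,
                region_name="us-east-1",  # assumption: placeholder for signing
            )

    client = make_client(
        extra["endpoint"], extra["accesskey"], extra["secretkey"]
    )
    # get_object returns a dict whose "Body" is a readable stream.
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

Because the credentials are passed per call, nothing here depends on the
credentials configured for any data source.
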
EDP data sources in S3
----------------------
The ``url`` should be in the format ``s3://bucket/path/to/object``, although
upon execution the protocol will be automatically changed to ``s3a``. The
``credentials`` structure has no required values, although the following may
be set:

* ``accesskey`` and ``secretkey``
* ``endpoint``, which is the URL of the S3 service, without the protocol
* ``ssl``, which must be a boolean
* ``bucket_in_path``, a boolean indicating whether the S3 service uses
  virtual-hosted-style or path-style URLs

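
A minimal sketch of these rules follows. The helper names and sample values
are hypothetical, and the checks only mirror what the list above states; they
are not Sahara's validation code.

```python
BOOLEAN_KEYS = ("ssl", "bucket_in_path")


def check_s3_credentials(credentials):
    """Validate the optional ``credentials`` values described above."""
    for key in BOOLEAN_KEYS:
        if key in credentials and not isinstance(credentials[key], bool):
            raise ValueError("%s must be a boolean" % key)

    # Unlike job binaries, the data source endpoint has no protocol.
    if "://" in credentials.get("endpoint", ""):
        raise ValueError("endpoint must not include the protocol")

    return dict(credentials)


def runtime_url(url):
    """Show the s3 -> s3a protocol change applied at execution time."""
    if not url.startswith("s3://"):
        raise ValueError("url should look like s3://bucket/path/to/object")
    return "s3a://" + url[len("s3://"):]


credentials = check_s3_credentials(
    {"endpoint": "rgw.example.com", "ssl": True, "bucket_in_path": True}
)
```

Here ``runtime_url("s3://data/input/logs")`` yields ``s3a://data/input/logs``,
which is the form the Hadoop-S3A driver sees.
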
The values above are optional because they may instead be set in the cluster's
``core-site.xml`` or as configuration values of the job execution, using the
following options understood by the Hadoop-S3A driver:

* ``fs.s3a.access.key``, corresponding to ``accesskey``
* ``fs.s3a.secret.key``, corresponding to ``secretkey``
* ``fs.s3a.endpoint``, corresponding to ``endpoint``
* ``fs.s3a.connection.ssl.enabled``, corresponding to ``ssl``
* ````, corresponding to ``bucket_in_path``

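
The correspondence can be written down as a lookup table. The sketch below is
an assumption-laden illustration: the function name is hypothetical, rendering
booleans as the strings ``true`` and ``false`` is a guess at how the values
would be passed along, and ``bucket_in_path`` is left out because its option
name is not legible in the list above.

```python
# Data source credential key -> Hadoop-S3A option, as listed above.
S3A_OPTION_FOR = {
    "accesskey": "fs.s3a.access.key",
    "secretkey": "fs.s3a.secret.key",
    "endpoint": "fs.s3a.endpoint",
    "ssl": "fs.s3a.connection.ssl.enabled",
}


def to_s3a_configs(credentials):
    """Translate whichever credential values are present into S3A options."""
    configs = {}
    for key, option in S3A_OPTION_FOR.items():
        if key not in credentials:
            continue  # every value is optional
        value = credentials[key]
        if isinstance(value, bool):
            # Assumption: booleans are rendered as lowercase strings.
            value = "true" if value else "false"
        configs[option] = value
    return configs


configs = to_s3a_configs({"endpoint": "rgw.example.com", "ssl": False})
```

Keys that are absent from ``credentials`` simply produce no option, which
leaves any value from ``core-site.xml`` untouched.
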
In the case of ````, a default value is determined by
the Hadoop-S3A driver if none is set: virtual-hosted-style URLs are assumed
unless the option is set explicitly or the endpoint is a raw IP address.

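
The default described above can be modelled as follows. This only illustrates
the stated rule and is not the Hadoop-S3A driver's implementation; the function
names are hypothetical, and the endpoint is assumed to be given without a
protocol, optionally with a port.

```python
import ipaddress
from urllib.parse import urlsplit


def endpoint_is_raw_ip(endpoint):
    """Return True if the endpoint host is an IPv4 or IPv6 literal."""
    # Prefixing "//" lets urlsplit separate the host from an optional port.
    host = urlsplit("//" + endpoint).hostname or ""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def effective_bucket_in_path(endpoint, bucket_in_path=None):
    """Path-style if set explicitly; otherwise only for raw IP endpoints."""
    if bucket_in_path is not None:
        return bucket_in_path
    return endpoint_is_raw_ip(endpoint)
```

A raw IP cannot carry a bucket name as a subdomain, which is why path-style
URLs are the only workable default in that case.
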
Additional configuration values are supported by the Hadoop-S3A driver, and are
discussed in its official documentation.

It is recommended that the EDP data source abstraction be used, rather than
handling bare arguments and configuration values.

If any S3 configuration values are to be set at execution time, including
cases where those values come from the EDP data source abstraction, then
``edp.spark.adapt_for_swift`` or ```` must be set to ``true`` as appropriate.

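
For a Spark job, this might look like the sketch below. The function name and
the ``configs``/``args`` layout of the returned structure are assumptions made
for illustration; only the ``edp.spark.adapt_for_swift`` flag and the
``fs.s3a.*`` option names come from this document.

```python
def spark_job_configs(s3a_configs, args=()):
    """Combine S3A options with the flag required at execution time."""
    configs = dict(s3a_configs)
    # Needed whenever S3 configuration values are set at execution time.
    configs["edp.spark.adapt_for_swift"] = True
    return {"configs": configs, "args": list(args)}


job_configs = spark_job_configs(
    {"fs.s3a.endpoint": "rgw.example.com"},
    args=["s3a://data/input", "s3a://data/output"],
)
```
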
diff --git a/doc/source/user/index.rst b/doc/source/user/index.rst
index a8e5a0a..2c15fb0 100644
--- a/doc/source/user/index.rst
+++ b/doc/source/user/index.rst
@@ -39,6 +39,7 @@ Elastic Data Processing
    :maxdepth: 2
 
    edp
+   edp-s3
 
 
 Guest Images