* Create S3 data source type for EDP
* Support storing S3 secret key in Castellan
* Unit tests for new data source type
* Document new data source type and related ideas
* Add support for S3 configs in Spark and Oozie workflows
* Hide S3 credentials in job execution info, like for Swift
* Release note
Change-Id: I3ae5b9879b54f81d34bc7cd6a6f754347ce82f33
Server installs of Ubuntu Xenial or later no longer include Python 2
by default. Sahara should be able to dynamically edit the remotely
executed Python script based on which Python interpreter is available.
Change-Id: Ie0fdd829d1b0ff019329957fbdbbfd150320b8ab
Closes-Bug: #1739009
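A minimal sketch of what such interpreter selection might look like; the helper names and candidate list are assumptions for illustration, not Sahara's actual code:

```python
# Hypothetical helpers: pick an available interpreter on the remote
# host (preferring python3) and patch the script's shebang to match.
def choose_python(available_commands):
    """Return the first available python command, preferring python3."""
    for candidate in ("python3", "python2", "python"):
        if candidate in available_commands:
            return candidate
    raise RuntimeError("no python interpreter found on remote host")


def rewrite_shebang(script_text, interpreter):
    """Replace the script's shebang line with the chosen interpreter."""
    lines = script_text.splitlines(True)
    if lines and lines[0].startswith("#!"):
        lines[0] = "#!/usr/bin/env %s\n" % interpreter
    return "".join(lines)
```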
Changes to make the integration of the existing code with the data source
and job binary abstractions possible.
Change-Id: I524f25ac95bb634b0583113792460c17217acc34
Implements: blueprint data-source-plugin
We have faced issues where passing jars via the driver classpath
has no effect in the case of Ambari clusters.
Passing the drivers via the jars option should solve this issue.
Change-Id: I17828fee9d17b6bddbbf6d3e9bdcf7d40c2d28a1
Closes-bug: #1535106
Spark 1.6.0 is now available for deployment.
Also added the current working directory to the driver classpath so
the spark.xml file is read properly.
Change-Id: I9a46a503c7e52d756c7de8c8694dbfc51f80f2be
Co-Authored-By: Vitaly Gridnev <vgridnev@mirantis.com>
bp: support-spark-160
This change adds a utils module to the castellan service package. This
module contains three wrapper functions that help reduce the overhead
of working with castellan.
* add sahara.service.castellan.utils module
* fixup previous usages of the castellan key manager
Change-Id: I6ad4e98ab41788022104ad2886e0ab74e4061ec3
Partial-Implements: blueprint improved-secret-storage
Add run_scheduled_job in base_engine, and implement it in oozie engine.
Implements bp: enable-scheduled-edp-jobs
Change-Id: I2a0b3724396b4bed5cd2a4bc1392f849eb902e3e
This change adds the sahara key manager and converts the proxy passwords
and swift passwords to use the castellan interface.
* adding sahara key manager
* adding castellan to requirements
* removing barbicanclient from requirements
* removing sahara.utils.keymgr and related tests
* adding castellan wrapper configs to sahara list_opts
* creating a castellan validate_config to help setup
* updating documentation for castellan usage
* fixing up tests to work with castellan
* converting all proxy password usages to use castellan
* converting job binaries to use castellan when user credentials are
applied
* converting data source to use castellan when user credentials are
applied
Change-Id: I8cb08a365c6175744970b1037501792fe1ddb0c7
Partial-Implements: blueprint improved-secret-storage
Closes-Bug: #1431944
It might be useful to be able to override the cluster's default value
of the driver-class-path option for a particular job execution. This
change adds a new config option, "edp.spark.driver.classpath", that
can override driver-class-path to suit users' needs.
Implements blueprint: spark-override-classpath
Change-Id: I94055c2ccb70c953620b62ed18c27895c3588327
We added a ":" to the spark driver-classpath to fix spark/swift
integration for spark 1.3.1. However, the ":" is only necessary
for client deploy mode so we should leave it out in other cases.
Note, this is a refinement to the original bug fix. The ":" in
other deployment modes does not break anything. However, it is
better not to include it since it is unnecessary.
Partial-bug: #1486544
Change-Id: Iaacbb090d0065922fab034d9d4a1f765ad7e05e3
The "oozie_job_id" column in table "job_executions" represents
oozie_job_id only when the edp engine is oozie. When it is spark
engin, oozie_job_id = pid@instance_id, when it is storm engine,
oozie_job_id = topology_name@instance_id.
Rename oozie_job_id to engine_job_id to aviod confusing.
Change-Id: I2671b91a315b2c7a2b805ce4d494252860a7fe6c
Closes-bug: 1479575
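For illustration, the non-Oozie id formats above can be sketched as follows (these helper names are hypothetical; the real engines build the strings internally):

```python
# Spark uses pid@instance_id; Storm uses topology_name@instance_id.
def spark_engine_job_id(pid, instance_id):
    return "%s@%s" % (pid, instance_id)


def storm_engine_job_id(topology_name, instance_id):
    return "%s@%s" % (topology_name, instance_id)


def split_engine_job_id(engine_job_id):
    """Split a non-Oozie engine_job_id into (left part, instance id)."""
    left, _, instance_id = engine_job_id.partition("@")
    return left, instance_id
```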
For Spark/Swift integration, we use a wrapper class to set up
the hadoop environment. For this to succeed, the current
working directory must be on the classpath. Newer versions of
Spark have changed how the default classpath is generated, so
Sahara must ensure explicitly that the working dir will be
included.
Change-Id: I6680bf8736cada93e87821ef37de3c3b4202ead4
Closes-Bug: #1486544
This change will allow data sources with urls of the form
"manila://share-id/path", similar to manila urls for job binaries.
The Sahara native url will be logged in the JobExecution, but
the true runtime url (file:///mnt/path) for manila shares will
be used in the cluster.
Partial-implements: blueprint manila-as-a-data-source
Change-Id: I0b43491decbe6cb0ec0b84314cf9b407b9e3fb4a
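A sketch of the native-to-runtime URL translation described above; the mount-point argument is an assumption, since the real mount location is determined when the share is attached to the cluster:

```python
from urllib.parse import urlparse


def to_runtime_url(native_url, mount_point):
    """Translate a manila:// URL to the file:// form used in the cluster.

    Hypothetical helper: the Sahara-native URL is kept for logging in
    the JobExecution, while this runtime form is what the job receives.
    """
    parsed = urlparse(native_url)
    if parsed.scheme != "manila":
        return native_url
    return "file://%s%s" % (mount_point, parsed.path)
```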
When we consider things like Manila NFS shares as data sources,
the possibility arises that the native form of a data source URL
in Sahara may not be the same as the form of the URL needed at
runtime in the cluster. Allow them to differ, so that we can
still record the Sahara native url form in JobExecution objects
for accurate reference while passing the correct runtime url
in job arguments, etc.
This is a base CR that will be further built on later. In this
change, native urls and runtime urls are always identical.
Change-Id: I53f4cf11320e112ffd0c4ae93b7d1f300df86878
Partial-implements: blueprint manila-as-a-data-source
Changes to support manila shares as a binary store.
Oozie, Spark and Storm jobs can now run with job
binaries stored in manila shares.
Change-Id: I2f5fbe3d36ef4b87e5cadd337854e95ed95ebaa0
Implements: bp manila-as-binary-store
We should define a set of CLUSTER_STATUS constants instead of using
raw strings in code.
1. Add cluster.py in utils/
2. Add cluster status constants.
3. Move cluster-operation-related methods from general.py to cluster.py
Change-Id: Id95d982a911ab5d0f789265e03bff2256cf75856
1) Fixed the path to hadoop-swift.jar - in Cloudera
it is named hadoop-openstack.jar
2) Fixed the options for launching the wrapper with yarn-cluster
(more details at http://spark.apache.org/docs/latest/running-on-yarn.html
'Important notes' section).
3) Fixed the issue of swift credentials visibility in the Yarn cluster.
4) Fixed the related unit test that contained the same error.
Change-Id: I5e8c72f0e362792f06245b3744a32342abc42389
Closes-bug: 1474128
Spark jobs in Cloudera 5.3.0 and 5.4.0 plugins are now supported.
Required unit tests have been added. Merged with current
master HEAD.
Change-Id: Ic8fde97e424e45c6f31f7794749793b26c844915
Implements: blueprint spark-jobs-for-cdh-5-3-0
The EDP Spark engine was importing a config helper from the Spark
plugin.
The helper was moved to common plugin utils and now is imported from
there by both the plugin and the engine.
This is part of the sahara and plugins split.
Partially-implements bp: move-plugins-to-separate-repo
Change-Id: Ie84cc163a09bf1e7b58fcdb08e0647a85492593b
Added the ability to use placeholders in data source URLs. Currently
supported placeholders:
* %RANDSTR(len)% - will be replaced with a random string of
lowercase letters of length `len`.
* %JOB_EXEC_ID% - will be replaced with the job execution ID.
Resulting URLs will be stored in a new field in the job_executions
table. Using the 'info' field does not look like a good solution
since it is reserved for Oozie status.
Next steps:
* write documentation
* update horizon
Implements blueprint: edp-datasource-placeholders
Change-Id: I1d9282b210047982c062b24bd03cf2331ab7599e
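The substitution could look roughly like this (a simplified sketch; the real implementation lives in Sahara's EDP code):

```python
import random
import re
import string


def resolve_placeholders(url, job_execution_id):
    """Expand %RANDSTR(len)% and %JOB_EXEC_ID% in a data source URL."""
    def rand_repl(match):
        length = int(match.group(1))
        return "".join(random.choice(string.ascii_lowercase)
                       for _ in range(length))

    url = re.sub(r"%RANDSTR\((\d+)\)%", rand_repl, url)
    return url.replace("%JOB_EXEC_ID%", job_execution_id)
```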
Adding configuration of the hosts file for HDFS access (already added
to Oozie engine) to Spark.
Change-Id: I3d2a372d3f4a4e502e2c0e111a1e29fb4f9b9fcf
Partially-implements: blueprint edp-spark-external-hdfs
Changes:
* using oslo_config instead of oslo.config
* using oslo_concurrency instead of oslo.concurrency
* using oslo_db instead of oslo.db
* using oslo_i18n instead of oslo.i18n
* using oslo_messaging instead of oslo.messaging
* using oslo_middleware instead of oslo.middleware
* using oslo_serialization instead of oslo.serialization
* using oslo_utils instead of oslo.utils
Change-Id: Ib0f18603ca5b0885256a39a96a3620d05260a272
Closes-bug: #1414587
This change allows Spark jobs to access Swift URLs without
any need to modify the Spark job code itself. There are a
number of things necessary to make this work:
* add a "edp.spark.adapt_for_swift" config value to control the
feature
* generate a modified spark-submit command when the feature is
enabled
* add the hadoop-swift.jar to the Spark classpaths for the
driver and executors (cluster launch)
* include the general Swift configs in the Hadoop core-site-xml
and make Spark read the Hadoop core-site.xml (cluster launch)
* upload an xml file containing the Swift authentication configs
for Hadoop
* run a wrapper class that reads the extra Hadoop configuration
and adds it to the configuration for the job
Changes in other CRs:
* add the hadoop-swift.jar to the Spark images
* add the SparkWrapper code to sahara-extra
Partial-Implements: blueprint edp-spark-swift-integration
Change-Id: I03dca4400c832f3ba8bc508d4fb2aa98dede8d80
The command issued by Sahara to run jobs with spark-submit does
not put the application jar in the right place according to the
help text of spark-submit. This does not make jobs fail, but it
is good to be consistent in case something changes.
Change-Id: I50c2a969e4f747820c06d5dba39b6a8442bb5c30
Closes-Bug: #1410247
This change adds options that allow DataSource objects to be
referenced by name or uuid in the job_configs dictionary of a
job_execution. If a reference to a DataSource is found, the path
information replaces the reference.
Note, references are partially resolved in early processing to
determine whether or not a proxy user must be created. References
are fully resolved in run_job().
Implements: blueprint edp-data-sources-in-job-configs
Change-Id: I5be62b798b86a8aaf933c2cc6b6d5a252f0a8627
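A minimal sketch of the reference resolution; the lookup structure and helper name here are assumptions for illustration:

```python
# Hypothetical resolution step: any job argument that names a known
# DataSource is replaced by that data source's path/URL.
def resolve_data_source_refs(args, data_sources):
    """Replace arguments that reference a DataSource with its URL.

    data_sources maps names (or uuids) to objects with a 'url' key.
    """
    resolved = []
    for arg in args:
        ds = data_sources.get(arg)
        resolved.append(ds["url"] if ds else arg)
    return resolved
```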
The plugins dir contains a 'general' module which looks like yet
another plugin alongside vanilla, fake, hdp, spark and cdh. But it is
not a plugin and contains only two files. Moved them one level up to
avoid such confusion.
Closes-Bug: #1378178
Change-Id: Ia600e4c584d48a3227552f0051cc3bf906206bed
Now the EDP engine is fully responsible for validation of data for
job execution.
Other changes:
* Removed API calls from validation to remove a circular dependency
* Removed plugins patching in validation to allow non-vanilla
plugins testing
* Renamed job_executor to job_execution
Change-Id: I14c86f33b355cb4317e96a70109d8d72d52d3c00
Closes-Bug: #1357512
Changes
* refactoring get_raw_binary to accept proxy configs
* refactoring get_raw_data to use proxy Swift connection when necessary
* adding function to get a Swift Connection object from proxy user
* refactoring upload_job_files_to_hdfs and upload_job_files to use proxy
user when necessary
* changing JobBinary JSON schema to allow blank username/password if
proxy domains are being used
* adding function to get the Swift public endpoint for the current
project
* adding test for JobBinary creation without credentials
Partial-implements: blueprint edp-swift-trust-authentication
Change-Id: I02e76016194fbbb62b8ab7b304eecc53d580a79c
+ Moved 'get_hdfs_user' method from plugin SPI to EDP engine
Further steps: move other EDP-specific methods to the EDP engine
Change-Id: I0537397894012f496ea4abc2661aa8331fbf6bd3
Partial-Bug: #1357512
To help standardize using job statuses across modules this patch
introduces a set of constants in sahara.utils.edp for the statuses
currently in use.
Changes
* add job status constants for DONEWITHERROR, FAILED, KILLED,
PENDING, RUNNING, and SUCCEEDED
* add a list constant for the terminated statuses
* update references from string variables to constants
Partial-implements: blueprint edp-swift-trust-authentication
Change-Id: Ib0c47a5c002e135f2e2eed0a9066144c830926b3
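Based on the statuses listed above, the constants could look roughly like this (a sketch; the actual names live in sahara.utils.edp):

```python
JOB_STATUS_DONEWITHERROR = "DONEWITHERROR"
JOB_STATUS_FAILED = "FAILED"
JOB_STATUS_KILLED = "KILLED"
JOB_STATUS_PENDING = "PENDING"
JOB_STATUS_RUNNING = "RUNNING"
JOB_STATUS_SUCCEEDED = "SUCCEEDED"

# Statuses from which a job will make no further progress.
JOB_STATUSES_TERMINATED = [
    JOB_STATUS_DONEWITHERROR,
    JOB_STATUS_FAILED,
    JOB_STATUS_KILLED,
    JOB_STATUS_SUCCEEDED,
]


def is_terminated(status):
    return status in JOB_STATUSES_TERMINATED
```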
* Added support for JOB_TYPE_SPARK
* Modified create_workflow_dir to not use job_workflow_postfix (only for hdfs)
* Rewrote unit tests in test_job.py to be oriented around
validation behavior for classes of job types rather than
features of validation, and included JOB_TYPE_SPARK.
* Moved test_job_executor_java.py into test_job_executor and added JOB_TYPE_SPARK
* Added tests for create_workflow_dir and upload_job_files
Partial-implements: blueprint edp-spark-job-type
Change-Id: Ifd91123afea9e921ac441751a37aa6afae0bbd66
This change adds an EDP engine for a Spark standalone cluster.
The engine uses the spark-submit script and various linux
commands via ssh to run, monitor, and terminate Spark jobs.
Currently, the Spark engine can launch "Java" job types (this is
the same type used to submit an Oozie Java action on Hadoop clusters).
A directory is created for each Spark job on the master node which
contains jar files, the script used to launch the job, the
job's stderr and stdout, and a result file containing the exit
status of spark-submit. The directory is named after the Sahara
job and the job execution id so it is easy to locate. Preserving
these files is a big help in debugging jobs.
A few general improvements are included:
* engine.cancel_job() may return updated job status
* engine.run_job() may return job status and fields for job_execution.extra
in addition to job id
Still to do:
* create a proper Spark job type (new CR)
* make the job dir location on the master node configurable (new CR)
* add something to clean up job directories on the master node (new CR)
* allow users to pass some general options to spark-submit itself (new CR)
Partial implements: blueprint edp-spark-standalone
Change-Id: I2c84e9cdb75e846754896d7c435e94bc6cc397ff
This change creates an abstract base class that defines three
simple operations on jobs -- run, check status, and cancel. The
existing Oozie implementation becomes one implementation of this
class, and a stub for Spark clusters has been added.
The EDP job engine will be chosen based on information in the
cluster object.
Implements: blueprint edp-refactor-job-manager
Change-Id: I725688b0071b2c2a133cd167ae934f59e488c734