Commit Graph

46 Commits

Author SHA1 Message Date
Telles Nobrega abc8f57055 Python 3 fixes
String-to-bytes compatibility.

Story: #2006258
Task: #35875
Change-Id: Id0ad0f3c644af52f41217105b249df78d0b722cc
2019-10-02 08:29:03 -03:00
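A hedged sketch of the typical pattern behind such a fix (the helper names are illustrative, not the exact ones in this patch):

    def to_bytes(value, encoding='utf-8'):
        # Encode text to bytes only when needed; bytes pass through.
        if isinstance(value, bytes):
            return value
        return value.encode(encoding)

    def to_str(value, encoding='utf-8'):
        # Decode bytes to text only when needed; str passes through.
        if isinstance(value, bytes):
            return value.decode(encoding)
        return value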
Jeremy Freudberg a449558ac0 S3 data source
* Create S3 data source type for EDP
* Support storing S3 secret key in Castellan
* Unit tests for new data source type
* Document new data source type and related ideas
* Add support for S3 configs in Spark and Oozie workflows
* Hide S3 credentials in job execution info, like for Swift
* Release note

Change-Id: I3ae5b9879b54f81d34bc7cd6a6f754347ce82f33
2018-07-02 14:27:46 -04:00
Shu Yingya d02e61aa68 Dynamically add python version into launch_command
Ubuntu Xenial and later servers no longer install Python 2 by default.
Sahara should be able to dynamically edit the remotely executed
Python script based on which Python interpreter is available.

Change-Id: Ie0fdd829d1b0ff019329957fbdbbfd150320b8ab
Closes-Bug: #1739009
2018-01-24 10:55:15 +08:00
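A minimal sketch of the idea, assuming a remote object exposing an execute_command method (the helper below is illustrative, not the exact Sahara code):

    def get_python_version(remote):
        # Prefer python3 when present, fall back to python2 otherwise.
        code, _stdout = remote.execute_command(
            'which python3', raise_when_error=False)
        return 'python3' if code == 0 else 'python2'

    def build_launch_command(remote, script_path):
        return '%s %s' % (get_python_version(remote), script_path)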
Marianne Linhares Monteiro 2def30f412 Code integration with the abstractions
Changes to make the integration of the existing code with the data source
and job binary abstractions possible.

Change-Id: I524f25ac95bb634b0583113792460c17217acc34
Implements: blueprint data-source-plugin
2017-03-19 19:43:40 -03:00
Luong Anh Tuan 158bd893b9 Replaces uuid.uuid4 with uuidutils.generate_uuid()
Change-Id: Ib72d9e74c70437678c72cebc31aee60a9e140e23
Closes-Bug: #1082248
2016-11-07 13:13:57 +07:00
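The replacement is mechanical, for example:

    # Before
    import uuid
    job_id = str(uuid.uuid4())

    # After
    from oslo_utils import uuidutils
    job_id = uuidutils.generate_uuid()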
Vitaly Gridnev e577f12f8e Inject drivers to jars in Ambari Spark engine
We have faced issues where passing jars to the
driver classpath has no effect in the case of Ambari clusters.
Passing the drivers to jars should solve this issue.

Change-Id: I17828fee9d17b6bddbbf6d3e9bdcf7d40c2d28a1
Closes-bug: #1535106
2016-03-14 17:29:49 +03:00
Michael Ionkin 47d9e68d6e Added support of Spark 1.6.0
Spark 1.6.0 is now available for deployment.
Also added the current working directory to the driver class path for
proper reading of the spark.xml file.

Change-Id: I9a46a503c7e52d756c7de8c8694dbfc51f80f2be
Co-Authored-By: Vitaly Gridnev <vgridnev@mirantis.com>
bp: support-spark-160
2016-02-08 19:51:25 +03:00
Jenkins 11ffd666e2 Merge "[EDP] Add scheduling EDP jobs in sahara (oozie engine implementation)" 2016-01-20 16:34:03 +00:00
Michael McCune 423d80498b add helper functions for key manager
This change adds a utils module to the castellan service package. This
module contains three wrapper functions to help reduce the overhead of
working with castellan.

* add sahara.service.castellan.utils module
* fixup previous usages of the castellan key manager

Change-Id: I6ad4e98ab41788022104ad2886e0ab74e4061ec3
Partial-Implements: blueprint improved-secret-storage
2016-01-11 10:12:01 -05:00
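A rough sketch of what such wrappers look like; the function signatures and context handling here are assumptions, not the exact module contents:

    from castellan.common.objects import passphrase
    from castellan import key_manager

    def store_secret(secret, ctx=None):
        # Wrap the secret and return the castellan reference id.
        key = passphrase.Passphrase(secret)
        return key_manager.API().store(ctx, key)

    def get_secret(id, ctx=None):
        # Look up the stored secret payload by its reference id.
        key = key_manager.API().get(ctx, id)
        return key.get_encoded()

    def delete_secret(id, ctx=None):
        # Remove the secret from the backing key manager.
        key_manager.API().delete(ctx, id)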
luhuichun 330157c299 [EDP] Add scheduling EDP jobs in sahara (oozie engine implementation)
Add run_scheduled_job in base_engine, and implement it in oozie engine.

Implements bp: enable-scheduled-edp-jobs

Change-Id: I2a0b3724396b4bed5cd2a4bc1392f849eb902e3e
2015-12-24 22:09:33 +08:00
Michael McCune d148dd4d55 Initial key manager implementation
This change adds the sahara key manager and converts the proxy passwords
and swift passwords to use the castellan interface.

* adding sahara key manager
* adding castellan to requirements
* removing barbicanclient from requirements
* removing sahara.utils.keymgr and related tests
* adding castellan wrapper configs to sahara list_opts
* creating a castellan validate_config to help setup
* updating documentation for castellan usage
* fixing up tests to work with castellan
* converting all proxy password usages to use castellan
* converting job binaries to use castellan when user credentials are
  applied
* converting data source to use castellan when user credentials are
  applied

Change-Id: I8cb08a365c6175744970b1037501792fe1ddb0c7
Partial-Implements: blueprint improved-secret-storage
Closes-Bug: #1431944
2015-12-22 15:07:12 -05:00
Vitaly Gridnev 14cece1ead Support overriding of driver classpath in Spark jobs
It might be useful to have the ability to override the cluster-wide
default of the driver-class-path option for a particular job execution.
The change adds a new config option, "edp.spark.driver.classpath", that
can override driver-class-path for users' needs.

Implements blueprint: spark-override-classpath
Change-Id: I94055c2ccb70c953620b62ed18c27895c3588327
2015-11-06 13:56:49 +03:00
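For example, a job execution could carry the override in its configs; the surrounding job_configs layout and the classpath value below are illustrative, only the config key comes from the change:

    job_configs = {
        'configs': {
            # Overrides the cluster-wide driver-class-path for this run only.
            'edp.spark.driver.classpath': '/opt/extra/jars/*',
        }
    }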
Jenkins b5bc56cac4 Merge "Rename oozie_job_id" 2015-09-18 05:19:33 +00:00
Trevor McKay bad0476f2a Only add current directory to classpath for client deploy mode
We added a ":" to the spark driver-classpath to fix spark/swift
integration for spark 1.3.1. However, the ":" is only necessary
for client deploy mode so we should leave it out in other cases.

Note, this is a refinement to the original bug fix.  The ":" in
other deployment modes does not break anything.  However, it is
better not to include it since it is unnecessary.

Partial-bug: #1486544
Change-Id: Iaacbb090d0065922fab034d9d4a1f765ad7e05e3
2015-09-09 10:29:56 -04:00
Sergey Gotliv e7d6799155 Adding support for the Spark Shell job
Implements: blueprint edp-add-spark-shell-action
Change-Id: I6d2ec02f854ab2eeeab2413bb56f1a359a3837c1
2015-08-27 13:29:36 +03:00
Jenkins 99d7629e9e Merge "Ensure working dir is on driver class path for Spark/Swift" 2015-08-27 05:18:35 +00:00
Li, Chen 6c0db8e84a Rename oozie_job_id
The "oozie_job_id" column in table "job_executions" represents
oozie_job_id only when the edp engine is oozie. When it is spark
engin, oozie_job_id = pid@instance_id, when it is storm engine,
oozie_job_id = topology_name@instance_id.

Rename oozie_job_id to engine_job_id to avoid confusion.

Change-Id: I2671b91a315b2c7a2b805ce4d494252860a7fe6c
Closes-bug: 1479575
2015-08-26 14:58:48 +08:00
Trevor McKay 1018a540a5 Ensure working dir is on driver class path for Spark/Swift
For Spark/Swift integration, we use a wrapper class to set up
the hadoop environment.  For this to succeed, the current
working directory must be on the classpath. Newer versions of
Spark have changed how the default classpath is generated, so
Sahara must ensure explicitly that the working dir will be
included.

Change-Id: I6680bf8736cada93e87821ef37de3c3b4202ead4
Close-Bug: #1486544
2015-08-21 21:52:54 +03:00
Trevor McKay 7e10f34cad Add manila nfs data sources
This change will allow data sources with urls of the form
"manila://share-id/path", similar to manila urls for job binaries.
The Sahara native url will be logged in the JobExecution, but
the true runtime url (file:///mnt/path) for manila shares will
be used in the cluster.

Partial-implements: blueprint manila-as-a-data-source
Change-Id: I0b43491decbe6cb0ec0b84314cf9b407b9e3fb4a
2015-08-18 17:12:05 -04:00
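The native-to-runtime translation amounts to something like the following sketch; the /mnt mount point and the helper name are assumptions:

    def runtime_url(native_url):
        # 'manila://share-id/path' -> 'file:///mnt/share-id/path'
        prefix = 'manila://'
        if native_url.startswith(prefix):
            return 'file:///mnt/' + native_url[len(prefix):]
        return native_url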
Jenkins 033e2228e5 Merge "Allow Sahara native urls and runtime urls to differ for datasources" 2015-08-18 15:51:29 +00:00
Jenkins 7e6df0b0c7 Merge "Support manila shares as binary store" 2015-08-17 18:49:52 +00:00
Trevor McKay 660eb7f295 Allow Sahara native urls and runtime urls to differ for datasources
When we consider things like Manila nfs shares as datasources,
the possibility arises that the native form of a datasource url
in Sahara may not be the same as the form of the url needed at
runtime in the cluster. Allow them to differ, so that we can
still record the Sahara native url form in JobExecution objects
for accurate reference while passing the correct runtime url
in job arguments, etc.

This is a base CR that will be further built on later. In this
change, native urls and runtime urls are always identical.

Change-Id: I53f4cf11320e112ffd0c4ae93b7d1f300df86878
Partial-implements: blueprint manila-as-a-data-source
2015-08-11 09:36:27 -04:00
Chad Roberts 6761a01b09 Support manila shares as binary store
Changes to support manila shares as a binary store.
Oozie, Spark and Storm jobs can now run with job
binaries stored in manila shares.

Change-Id: I2f5fbe3d36ef4b87e5cadd337854e95ed95ebaa0
Implements: bp manila-as-binary-store
2015-08-11 09:36:27 -04:00
Li, Chen e1f5bcf08c Add CLUSTER_STATUS
We should define a set of CLUSTER_STATUS constants instead of using
literal strings in the code.

1. Add cluster.py in utils/
2. Add cluster status constants.
3. Move cluster operation related methods from general.py to cluster.py

Change-Id: Id95d982a911ab5d0f789265e03bff2256cf75856
2015-08-03 09:12:36 +08:00
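A sketch of such a constants module; the exact names and set of statuses in sahara/utils/cluster.py may differ:

    # sahara/utils/cluster.py (sketch)
    CLUSTER_STATUS_VALIDATING = "Validating"
    CLUSTER_STATUS_WAITING = "Waiting"
    CLUSTER_STATUS_ACTIVE = "Active"
    CLUSTER_STATUS_DELETING = "Deleting"
    CLUSTER_STATUS_ERROR = "Error"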
Oleg Borisenko 1bc9ec4656 EDP Spark jobs work with Swift
1) Fixed the path to hadoop-swift.jar - in Cloudera
   it's named hadoop-openstack.jar
2) Fixed the options for launching the wrapper with yarn-cluster
   (more details at http://spark.apache.org/docs/latest/running-on-yarn.html
    'Important notes' section).
3) Fixed the visibility of swift credentials in a Yarn cluster.
4) Fixed the related unit test that contained the same error.

Change-Id: I5e8c72f0e362792f06245b3744a32342abc42389
Closes-bug: 1474128
2015-07-28 17:57:41 +03:00
Alexander Aleksiyants 74159dfdd2 Spark job for Cloudera 5.3.0 and 5.4.0 added
Spark jobs in Cloudera 5.3.0 and 5.4.0 plugins are now supported.
Required unit tests have been added. Merged with current
master HEAD.

Change-Id: Ic8fde97e424e45c6f31f7794749793b26c844915
Implements: blueprint spark-jobs-for-cdh-5-3-0
2015-07-10 17:45:11 +03:00
Nikita Konovalov f7d1ec55a8 Removed dependency on Spark plugin in edp code
The EDP Spark engine was importing a config helper from the Spark
plugin.
The helper was moved to the common plugin utils and is now imported
from there by both the plugin and the engine.

This is part of the sahara and plugins split.

Partially-implements bp: move-plugins-to-separate-repo
Change-Id: Ie84cc163a09bf1e7b58fcdb08e0647a85492593b
2015-06-17 09:18:22 +00:00
Andrew Lazarev 7bae4261d0 Implemented support of placeholders in datasource URLs
Added the ability to use placeholders in datasource URLs. Currently
supported placeholders:
* %RANDSTR(len)% - will be replaced with a random string of
  lowercase letters of length `len`.
* %JOB_EXEC_ID% - will be replaced with the job execution ID.

Resulting URLs will be stored in a new field in the job_execution
table. Using the 'info' field doesn't look like a good solution since
it is reserved for the oozie status.

Next steps:
* write documentation
* update horizon

Implements blueprint: edp-datasource-placeholders

Change-Id: I1d9282b210047982c062b24bd03cf2331ab7599e
2015-05-06 20:50:03 +00:00
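A minimal sketch of how such placeholders can be resolved; the function name and regex are illustrative:

    import random
    import re
    import string

    def resolve_placeholders(url, job_exec_id):
        def _rand(match):
            length = int(match.group(1))
            return ''.join(random.choice(string.ascii_lowercase)
                           for _ in range(length))
        url = re.sub(r'%RANDSTR\((\d+)\)%', _rand, url)
        return url.replace('%JOB_EXEC_ID%', job_exec_id)

    # resolve_placeholders('swift://bucket/out-%RANDSTR(6)%-%JOB_EXEC_ID%', je_id)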
Ethan Gafford f197165e82 [EDP][Spark] Configure cluster for external hdfs
Adding configuration of the hosts file for HDFS access (already added
to Oozie engine) to Spark.

Change-Id: I3d2a372d3f4a4e502e2c0e111a1e29fb4f9b9fcf
Partially-implements: blueprint edp-spark-external-hdfs
2015-03-05 17:30:34 -05:00
Andrey Pavlov 5c5491f9de Using oslo_* instead of oslo.*
Changes:
* using oslo_config instead of oslo.config
* using oslo_concurrency instead of oslo.concurrency
* using oslo_db instead of oslo.db
* using oslo_i18n instead of oslo.i18n
* using oslo_messaging instead of oslo.messaging
* using oslo_middleware instead of oslo.middleware
* using oslo_serialization instead of oslo.serialization
* using oslo_utils instead of oslo.utils

Change-Id: Ib0f18603ca5b0885256a39a96a3620d05260a272
Closes-bug: #1414587
2015-02-04 13:19:28 +03:00
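The change is purely at the import level, for example:

    # Before
    from oslo.config import cfg

    # After
    from oslo_config import cfg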
Trevor McKay bfe01ead79 Add Swift integration with Spark
This change allows Spark jobs to access Swift URLs without
any need to modify the Spark job code itself. There are a
number of things necessary to make this work:

* add a "edp.spark.adapt_for_swift" config value to control the
  feature
* generate a modified spark-submit command when the feature is
  enabled
* add the hadoop-swift.jar to the Spark classpaths for the
  driver and executors (cluster launch)
* include the general Swift configs in the Hadoop core-site-xml
  and make Spark read the Hadoop core-site.xml (cluster launch)
* upload an xml file containing the Swift authentication configs
  for Hadoop
* run a wrapper class that reads the extra Hadoop configuration
  and adds it to the configuration for the job

Changes in other CRs:
* add the hadoop-swift.jar to the Spark images
* add the SparkWrapper code to sahara-extra

Partial-Implements: blueprint edp-spark-swift-integration
Change-Id: I03dca4400c832f3ba8bc508d4fb2aa98dede8d80
2015-02-03 10:34:32 -05:00
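Enabling the feature on a per-job basis looks roughly like this; the job_configs layout and the example argument are a sketch, only the config key itself comes from the commit:

    job_configs = {
        'configs': {
            # When true, Sahara generates the wrapped spark-submit command
            # and supplies the extra Hadoop/Swift configuration to the job.
            'edp.spark.adapt_for_swift': True,
        },
        'args': ['swift://container.sahara/input.txt'],
    }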
Trevor McKay 7eac9f188d Follow the argument order specified in spark-submit help
The command issued by Sahara to run jobs with spark-submit does
not put the application jar in the right place according to the
help text of spark-submit. This does not make jobs fail, but it
is good to be consistent in case something changes.

Change-Id: I50c2a969e4f747820c06d5dba39b6a8442bb5c30
Closes-Bug: #1410247
2015-01-26 12:43:44 -05:00
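For reference, spark-submit's help text documents the order as "Usage: spark-submit [options] <app jar | python file> [app arguments]"; building the command accordingly might look like this sketch (the variable values are placeholders):

    main_class = 'org.example.Main'
    app_jar = 'app.jar'
    app_args = ['arg1', 'arg2']

    cmd = ['spark-submit',
           '--class', main_class,             # options come first
           '--master', 'spark://master:7077',
           app_jar] + app_args                # app jar, then its arguments
    launch_command = ' '.join(cmd)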
Trevor McKay 8750ddc121 Add options supporting DataSource identifiers in job_configs
This change adds options that allow DataSource objects to be
referenced by name or uuid in the job_configs dictionary of a
job_execution. If a reference to a DataSource is found, the path
information replaces the reference.

Note, references are partially resolved in early processing to
determine whether or not a proxy user must be created.  References
are fully resolved in run_job().

Implements: blueprint edp-data-sources-in-job-configs
Change-Id: I5be62b798b86a8aaf933c2cc6b6d5a252f0a8627
2015-01-14 18:20:05 +00:00
Andrey Pavlov 89fbce96f1 Fixed problem with canceling during pending
Change-Id: Icbe3cd39fa28d6561607e679e3e7cd5b2a64751a
Closes-bug: #1369979
2014-10-10 18:07:41 +04:00
Alexander Ignatov 1a9bf1f24e Moved exceptions.py and utils.py up to plugins dir
The plugins dir contains a 'general' module which looks like yet another
plugin alongside vanilla, fake, hdp, spark and cdh. But it isn't one and
contains only two files. Moved them one level up to avoid such
confusion.

Closes-Bug: #1378178

Change-Id: Ia600e4c584d48a3227552f0051cc3bf906206bed
2014-10-07 09:00:19 +04:00
Jenkins ed4e658522 Merge "Moved validate_edp from plugin SPI to edp_engine" 2014-09-16 15:09:49 +00:00
Jenkins 40b4772fd8 Merge "Added missing translation for service.edp.spark" 2014-09-12 05:28:50 +00:00
Andrew Lazarev e55238a881 Moved validate_edp from plugin SPI to edp_engine
Now the EDP engine is fully responsible for validating data for
job execution.

Other changes:
* Removed API calls from validation to remove a circular dependency
* Removed plugin patching in validation to allow testing of
  non-vanilla plugins
* Renamed job_executor to job_execution

Change-Id: I14c86f33b355cb4317e96a70109d8d72d52d3c00
Closes-Bug: #1357512
2014-09-10 10:10:41 -07:00
Andrey Pavlov 2904333584 Added missing translation for service.edp.spark
Change-Id: I3d4edcd1715578abc6f582bb085c09544707bd8d
2014-09-09 17:41:20 +04:00
Michael McCune f3b2a30309 Updating JobBinaries to use proxy for Swift access
Changes
* refactoring get_raw_binary to accept proxy configs
* refactoring get_raw_data to use proxy Swift connection when necessary
* adding function to get a Swift Connection object from proxy user
* refactoring upload_job_files_to_hdfs and upload_job_files to use proxy
  user when necessary
* changing JobBinary JSON schema to allow blank username/password if
  proxy domains are being used
* adding function to get the Swift public endpoint for the current
  project
* adding test for JobBinary creation without credentials

Partial-implements: blueprint edp-swift-trust-authentication
Change-Id: I02e76016194fbbb62b8ab7b304eecc53d580a79c
2014-09-09 09:11:58 -04:00
Andrew Lazarev 42526b808b Made EDP engine plugin specific
+ Moved 'get_hdfs_user' method from plugin SPI to EDP engine

Further steps: move other EDP-specific methods to the EDP engine

Change-Id: I0537397894012f496ea4abc2661aa8331fbf6bd3
Partial-Bug: #1357512
2014-08-21 12:45:43 -07:00
Jenkins ddc8482ac9 Merge "Add a Spark job type for EDP" 2014-08-06 11:13:34 +00:00
Michael McCune 67f60dae57 Adding job execution status constants
To help standardize using job statuses across modules this patch
introduces a set of constants in sahara.utils.edp for the statuses
currently in use.

Changes
* add job status constants for DONEWITHERROR, FAILED, KILLED,
PENDING, RUNNING, and SUCCEEDED
* add a list constant for the terminated statuses
* update references from string variables to constants

Partial-implements: blueprint edp-swift-trust-authentication
Change-Id: Ib0c47a5c002e135f2e2eed0a9066144c830926b3
2014-08-01 15:55:51 -04:00
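A sketch of the constants; the exact spellings in sahara.utils.edp may differ:

    # sahara/utils/edp.py (sketch)
    JOB_STATUS_DONEWITHERROR = 'DONEWITHERROR'
    JOB_STATUS_FAILED = 'FAILED'
    JOB_STATUS_KILLED = 'KILLED'
    JOB_STATUS_PENDING = 'PENDING'
    JOB_STATUS_RUNNING = 'RUNNING'
    JOB_STATUS_SUCCEEDED = 'SUCCEEDED'

    # statuses from which a job execution can no longer progress
    JOB_STATUSES_TERMINATED = [
        JOB_STATUS_DONEWITHERROR,
        JOB_STATUS_FAILED,
        JOB_STATUS_KILLED,
        JOB_STATUS_SUCCEEDED,
    ]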
Trevor McKay cb60f67a50 Add a Spark job type for EDP
* Added support for JOB_TYPE_SPARK
* Modified create_workflow_dir to not use job_workflow_postfix (only for hdfs)
* Rewrote unit tests in test_job.py to be oriented around
  validation behavior for classes of job types rather than
  features of validation, and included JOB_TYPE_SPARK.
* Moved test_job_executor_java.py into test_job_executor and added JOB_TYPE_SPARK
* Added tests for create_workflow_dir and upload_job_files

Partial-implements: blueprint edp-spark-job-type
Change-Id: Ifd91123afea9e921ac441751a37aa6afae0bbd66
2014-08-01 13:42:54 -04:00
Trevor McKay 5698799ee3 Implement EDP for a Spark standalone cluster
This change adds an EDP engine for a Spark standalone cluster.
The engine uses the spark-submit script and various linux
commands via ssh to run, monitor, and terminate Spark jobs.

Currently, the Spark engine can launch "Java" job types (this is
the same type used to submit an Oozie Java action on Hadoop clusters).

A directory is created for each Spark job on the master node which
contains jar files, the script used to launch the job, the
job's stderr and stdout, and a result file containing the exit
status of spark-submit.  The directory is named after the Sahara
job and the job execution id so it is easy to locate.  Preserving
these files is a big help in debugging jobs.

A few general improvements are included:
* engine.cancel_job() may return updated job status
* engine.run_job() may return job status and fields for job_execution.extra
in addition to job id

Still to do:
* create a proper Spark job type (new CR)
* make the job dir location on the master node configurable (new CR)
* add something to clean up job directories on the master node (new CR)
* allow users to pass some general options to spark-submit itself (new CR)

Partial implements: blueprint edp-spark-standalone

Change-Id: I2c84e9cdb75e846754896d7c435e94bc6cc397ff
2014-07-30 17:13:42 -04:00
Trevor McKay 9198e31187 Refactor the job manager to allow multiple execution engines
This change creates an abstract base class that defines three
simple operations on jobs -- run, check status, and cancel. The
existing Oozie implementation becomes one implementation of this
class, and a stub for Spark clusters has been added.

The EDP job engine will be chosen based on information in the
cluster object.

Implements: blueprint edp-refactor-job-manager

Change-Id: I725688b0071b2c2a133cd167ae934f59e488c734
2014-07-09 10:18:29 -04:00
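The resulting interface is roughly the following sketch; the real base class carries additional helpers and validation hooks:

    import abc

    class JobEngine(abc.ABC):
        # Minimal contract every EDP engine (Oozie, Spark, ...) must implement.

        @abc.abstractmethod
        def run_job(self, job_execution):
            """Start the job and return its engine-specific id."""

        @abc.abstractmethod
        def get_job_status(self, job_execution):
            """Return the current status of the job."""

        @abc.abstractmethod
        def cancel_job(self, job_execution):
            """Stop the job; may return an updated status."""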