Data Source and Job Binary Pluggability

Sahara allows multiple types of data sources and job binaries. However,
there is no clean abstraction around them, and the code that deals with
them is often very difficult to read and modify. This change proposes
clean abstractions that each data source type and job binary type can
implement differently depending on its own needs.

Change-Id: Ia3c088e72a09d680ed87d10b4a68477f8cf9642e
Partial-implements: bp data-source-plugin
Marianne Linhares 2016-12-28 01:51:01 -03:00 committed by Marianne Linhares Monteiro
==========================================
Data Source and Job Binary Pluggability
==========================================

https://blueprints.launchpad.net/sahara/+spec/data-source-plugin

Sahara allows multiple types of data sources and job binaries. However,
there is no clean abstraction around them, and the code that deals with them
is often very difficult to read and modify. This change proposes clean
abstractions that each data source type and job binary type can implement
differently depending on its own needs.

Problem description
===================

Currently, the data source and job binary code is spread across different
folders and files in Sahara, making it hard to change and to extend. Right
now, a developer who wants to create a new data source needs to look all
over the code and modify things in many places (and it is almost impossible
to know all of them without deep experience with the code). Once this change
is complete, developers will be able to keep their code in a single
directory and write a new data source by implementing an abstract class.
This will make it much easier for users to enable data sources that they
write themselves (and hopefully contribute upstream), and it will also allow
operators to disable data sources that their own stack does not support.

Proposed change
===============

This change proposes to implement the data source and job binary
abstractions as plugins, so that their code can be loaded dynamically
through a well defined interface. The existing types of data sources and
job binaries will be refactored accordingly.

The interfaces that will be implemented are described below.

*Data Source Interface*

* prepare_cluster(cluster, kwargs) - Makes a cluster ready to use this data
  source. Each data source type implements this differently: Manila will
  mount the share, Swift will verify credentials, etc.
* construct_url(url, job_exec_id) - Resolves placeholders in the data source
  URL.
* get_urls(cluster, url, job_exec_id) - Returns the data source URL and the
  runtime URL that the data source must be referenced as, as a tuple of the
  form (url, runtime_url).
* get_runtime_url(url, cluster) - Constructs a runtime URL for the data
  source if one is needed; by default, when no runtime URL is needed, it
  returns the native URL.
* validate(data) - Checks whether or not the data passed through the API to
  create or update a data source is valid.
* _validate_url(url) - This method is optional and can be used by the
  validate method to check whether or not the data source URL is valid.
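
For illustration only, a minimal sketch of what this abstraction could look
like (the class name, module layout and default behaviors are assumptions,
not the final API):

.. code-block:: python

    import abc

    import six


    @six.add_metaclass(abc.ABCMeta)
    class DataSourceType(object):
        """Hypothetical base class for pluggable data source types."""

        def prepare_cluster(self, cluster, **kwargs):
            # No-op by default; Manila would mount the share here,
            # Swift would verify credentials, etc.
            pass

        def construct_url(self, url, job_exec_id):
            # Resolve placeholders (e.g. %JOB_EXEC_ID%) in the URL;
            # returned unchanged in this sketch.
            return url

        def get_runtime_url(self, url, cluster):
            # By default the native URL is also the runtime URL.
            return url

        def get_urls(self, cluster, url, job_exec_id):
            # Returns a tuple of the form (url, runtime_url).
            url = self.construct_url(url, job_exec_id)
            return url, self.get_runtime_url(url, cluster)

        @abc.abstractmethod
        def validate(self, data):
            """Check API data used to create or update a data source."""

        def _validate_url(self, url):
            # Optional helper for validate().
            pass
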
*Job Binary Interface*

* prepare_cluster(cluster, kwargs) - Makes a cluster ready to use this job
  binary. Each job binary type implements this differently: Manila will
  mount the share, Swift will verify credentials, etc.
* copy_binary_to_cluster(job_binary, cluster) - If necessary, pulls binary
  data from the binary store and copies it to a useful path on the cluster.
* get_local_path(cluster, job_binary) - Returns the path on the local
  cluster for the binary.
* validate(data) - Checks whether or not the data passed through the API to
  create or update a job binary is valid.
* _validate_url(url) - This method is optional and can be used by the
  validate method to check whether or not the job binary URL is valid.
* validate_job_location_format(entry) - Pre-checks whether or not the API
  entry is valid.
* get_raw_data(job_binary, kwargs) - Used only by the API, it returns the
  raw binary. If the type does not support this operation it should raise
  NotImplementedException.
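
Again for illustration only, a rough sketch of the job binary abstraction
(names and default behaviors are assumptions):

.. code-block:: python

    import abc

    import six


    @six.add_metaclass(abc.ABCMeta)
    class JobBinaryType(object):
        """Hypothetical base class for pluggable job binary types."""

        def prepare_cluster(self, cluster, **kwargs):
            # No-op by default, as for data sources.
            pass

        @abc.abstractmethod
        def copy_binary_to_cluster(self, job_binary, cluster):
            """Copy the binary to the cluster and return its path."""

        def get_local_path(self, cluster, job_binary):
            # Path on the local cluster for the binary, if any.
            return None

        @abc.abstractmethod
        def validate(self, data):
            """Check API data used to create or update a job binary."""

        def _validate_url(self, url):
            # Optional helper for validate().
            pass

        def validate_job_location_format(self, entry):
            # Pre-check of the API entry; accept everything by default.
            return True

        def get_raw_data(self, job_binary, **kwargs):
            # Used only by the API. Plain NotImplementedError keeps this
            # sketch standalone; the spec text names
            # NotImplementedException for this case.
            raise NotImplementedError()
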
These interfaces will be organized in the following folder structure (an
illustrative layout is sketched below):

* services/edp/data_sources - Will contain the data source interface and
  the implementations of the data source types.
* services/edp/job_binaries - Will contain the job binary interface and
  the implementations of the job binary types.
* services/edp/utils - Will contain utility code that can be shared by
  data source implementations and job binary implementations. For example,
  the Manila data source implementation will probably share some code with
  the Manila job binary implementation.
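
For example, the resulting tree could look like this (file and subpackage
names are assumptions only):

::

    services/edp/
        data_sources/
            base.py          # data source interface
            hdfs/
            manila/
            swift/
        job_binaries/
            base.py          # job binary interface
            internal_db/
            manila/
            swift/
        utils/               # helpers shared by both kinds of plugins
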
Some details of these interfaces (parameters, method names, parameter names)
may still change while the work is implemented, but the main structure and
idea should stay the same.
Also, a plugin manager will be needed to deal directly with the different
types of data sources and job binaries and to provide methods for operators
to disable/enable data sources and job binaries dynamically. This plugin
manager is not detailed here because it is going to be similar to the plugin
manager that already exists for the cluster plugins.
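
As a rough illustration, such a manager could be built on stevedore, like
the existing cluster plugin manager (the entry-point namespace, class and
method names below are assumptions):

.. code-block:: python

    from stevedore import extension


    class DataSourceManager(object):
        """Hypothetical manager for data source type plugins."""

        def __init__(self):
            # Load every data source type registered under an
            # illustrative entry-point namespace.
            mgr = extension.ExtensionManager(
                namespace='sahara.data_source.types',
                invoke_on_load=True)
            self.data_sources = {ext.name: ext.obj for ext in mgr}
            self.disabled = set()

        def get_data_source(self, name):
            if name in self.disabled or name not in self.data_sources:
                raise KeyError('Unknown data source type: %s' % name)
            return self.data_sources[name]

        def disable_data_source(self, name):
            # Lets operators turn off types their stack does not support.
            self.disabled.add(name)
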

Alternatives
------------

A clear alternative is to leave things the way they are, but then Sahara
would remain difficult to extend and to understand. An alternative to the
abstractions defined in the Proposed change section would be to have a
single abstraction instead of two interfaces for data sources and job
binaries, since these interfaces have a lot in common. Implementing this
alternative would remove the services/edp/utils folder, leaving the code
more unified and compact, but job binary and data source code would then be
treated as a single plugin, which could limit the pluggability this change
aims for (for example, an operator would not be able to disable Manila for
data sources while keeping it enabled for job binaries). Because of this it
was not considered the best approach; instead we keep job binaries and data
sources apart, and in exchange we need the utils folder to avoid code
duplication.

Data model impact
-----------------

None

REST API impact
---------------

Some new methods to manage the supported types of data sources and job
binaries will probably be needed (similar to the methods already offered by
plugins):

* data_sources_types_list() ; job_binaries_types_list()
* data_sources_types_get(type_name) ; job_binaries_types_get(type_name)
* data_sources_types_update(type_name, data) ;
  job_binaries_types_update(type_name, data)
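
A hypothetical sketch of how these calls could delegate to the plugin
managers (only the function names above come from this spec; payload shapes
and everything else are illustrative):

.. code-block:: python

    # Assumes a manager object like the DataSourceManager sketched
    # earlier; the job binary variants would mirror these.
    DATA_SOURCES = DataSourceManager()


    def data_sources_types_list():
        return [{'name': name}
                for name in sorted(DATA_SOURCES.data_sources)]


    def data_sources_types_get(type_name):
        return {'name': type_name,
                'enabled': type_name not in DATA_SOURCES.disabled}


    def data_sources_types_update(type_name, data):
        # e.g. data = {'enabled': False} to disable a type.
        if not data.get('enabled', True):
            DATA_SOURCES.disable_data_source(type_name)
        return data_sources_types_get(type_name)
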

Other end user impact
---------------------

None

Deployer impact
---------------

None

Developer impact
----------------

After this change is implemented, developers will be able to add and enable
new data sources and job binaries easily by implementing the abstractions;
a minimal example follows.
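
For instance, a hypothetical new data source type could look like this (the
base class and import path come from the sketches above and are not final):

.. code-block:: python

    # Illustrative only: the base class and module path are assumptions.
    from sahara.services.edp.data_sources.base import DataSourceType


    class MyDataSourceType(DataSourceType):

        def validate(self, data):
            self._validate_url(data['url'])

        def _validate_url(self, url):
            if not url.startswith('mydata://'):
                raise ValueError('URL must start with mydata://')
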

Sahara-image-elements impact
----------------------------

None

Sahara-dashboard / Horizon impact
---------------------------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee: mariannelinharesm

Other contributors: egafford

Work Items
----------

* Creation of a plugin for data sources containing the data source
  abstraction
* Creation of a plugin for job binaries containing the job binary
  abstraction
* Internal HDFS plugin
* External HDFS plugin
* Swift plugin
* Manila plugin
* Allow job engines to declare which data sources/job binaries they are
  capable of using (this may or may not be needed, depending on whether
  there is a job type that does not support a particular data source or job
  binary type)
* Changes in the API

Dependencies
============

None

Testing
=======

This change will only require changes to existing unit tests.

Documentation Impact
====================

It will be necessary to add a devref document about the new abstractions.

References
==========

None