Fix HDFS url description, and other various edits

HDFS url description is wrong as a result of code changes. This was
the major motivation for this CR.

Additional changes

* formatted for 80 characters
* consistent use of '.' at the end of bullets
* added mention of Spark
* adding '.sahara' suffix is no longer necessary
* some other minor changes

Closes-Bug: 1376457
Change-Id: I72134bcdf6c42911d07e65952a9a56331d896699
(cherry picked from commit a718ec7ddf)
Trevor McKay 2014-10-01 17:23:29 -04:00 committed by Andrew Lazarev
parent ff3bf76318
commit 3630ccffb2
1 changed file with 199 additions and 110 deletions


Sahara (Data Processing) UI User Guide
======================================
This guide assumes that you already have the Sahara service and Horizon
dashboard up and running. Don't forget to make sure that Sahara is registered in
Keystone. If you require assistance with that, please see the
`installation guide <../installation.guide.html>`_.
Launching a cluster via the Sahara UI
-------------------------------------
Registering an Image
--------------------
1) Navigate to the "Project" dashboard, then the "Data Processing" tab, then
click on the "Image Registry" panel
2) From that page, click on the "Register Image" button at the top right
3) Choose the image that you'd like to register with Sahara
4) Enter the username of the cloud-init user on the image
5) Click on the tags that you want to add to the image. (A version, e.g. 1.2.1,
and a type, e.g. vanilla, are required for cluster functionality)
6) Click the "Done" button to finish the registration
Create Node Group Templates
---------------------------
1) Navigate to the "Project" dashboard, then the "Data Processing" tab, then
click on the "Node Group Templates" panel
2) From that page, click on the "Create Template" button at the top right
3) Choose your desired Plugin name and Version from the dropdowns and click
"Create"
4) Give your Node Group Template a name (description is optional)
5) Choose a flavor for this template (based on your CPU/memory/disk needs)
6) Choose the storage location for your instance; this can be either "Ephemeral
Drive" or "Cinder Volume". If you choose "Cinder Volume", you will need to add
additional configuration
7) Choose which processes should be run for any instances that are spawned from
this Node Group Template
8) Click on the "Create" button to finish creating your Node Group Template
Create a Cluster Template
-------------------------
1) Navigate to the "Project" dashboard, then the "Data Processing" tab, then
click on the "Cluster Templates" panel
2) From that page, click on the "Create Template" button at the top right
3) Choose your desired Plugin name and Version from the dropdowns and click
"Create"
4) Under the "Details" tab, you must give your template a name
5) Under the "Node Groups" tab, you should add one or more nodes that can be
based on one or more templates
- To do this, start by choosing a Node Group Template from the dropdown and
click the "+" button
- You can adjust the number of nodes to be spawned for this node group via
the text box or the "-" and "+" buttons
- Repeat these steps if you need nodes from additional node group templates
6) Optionally, you can adjust your configuration further by using the "General
Parameters", "HDFS Parameters" and "MapReduce Parameters" tabs
7) Click on the "Create" button to finish creating your Cluster Template
Launching a Cluster
-------------------
1) Navigate to the "Project" dashboard, then the "Data Processing" tab, then
click on the "Clusters" panel
2) Click on the "Launch Cluster" button at the top right
3) Choose your desired Plugin name and Version from the dropdowns and click
"Create"
4) Give your cluster a name (required)
5) Choose which cluster template should be used for your cluster
6) Choose the image that should be used for your cluster (if you do not see any
options here, see `Registering an Image`_ above)
7) Optionally choose a keypair that can be used to authenticate to your cluster
instances
8) Click on the "Create" button to start your cluster
- Your cluster's status will display on the Clusters table
- It will likely take several minutes to reach the "Active" state (one way to
poll the status from a script is sketched after this list)
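
If you prefer to wait for the cluster from a script instead of watching the
Clusters table, the rough sketch below polls the cluster status with
python-saharaclient. The client constructor arguments, the ``clusters.get``
call and the ``status`` field are assumptions based on the client available
around this release, and the credentials and cluster id are placeholders;
verify all of them against your environment::

    # Rough sketch: wait for a cluster to reach the "Active" state.
    # All credentials and ids below are placeholders.
    import time

    from saharaclient.api import client as sahara_client

    sahara = sahara_client.Client(username="demo",
                                  api_key="password",
                                  project_name="demo",
                                  auth_url="http://127.0.0.1:5000/v2.0")

    cluster_id = "REPLACE-WITH-CLUSTER-ID"
    while True:
        status = sahara.clusters.get(cluster_id).status
        print(status)
        if status in ("Active", "Error"):
            break
        time.sleep(30)
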
Scaling a Cluster
-----------------
1) From the Data Processing/Clusters page, click on the "Scale Cluster" button
of the row that contains the cluster that you want to scale
2) You can adjust the numbers of instances for existing Node Group Templates
3) You can also add a new Node Group Template and choose a number of instances
to launch
- This can be done by selecting your desired Node Group Template from the
dropdown and clicking the "+" button
- Your new Node Group will appear below and you can adjust the number of
instances via the text box or the "+" and "-" buttons
4) To confirm the scaling settings and trigger the spawning/deletion of
instances, click on "Scale"
Elastic Data Processing (EDP)
-----------------------------
Data Sources
------------
Data Sources are where the input and output from your jobs are housed.
1) From the Data Processing/Data Sources page, click on the "Create Data Source"
button at the top right
2) Give your Data Source a name
3) Enter the URL of the Data Source (the accepted URL forms are illustrated by
the sketch after this list)
- For a Swift object, enter <container>/<path> (e.g. *mycontainer/inputfile*).
Sahara will prepend *swift://* for you
- For an HDFS object, enter an absolute path, a relative path or a full URL:
+ */my/absolute/path* indicates an absolute path in the cluster HDFS
+ *my/path* indicates the path */user/hadoop/my/path* in the cluster HDFS
assuming the defined HDFS user is *hadoop*
+ *hdfs://host:port/path* can be used to indicate any HDFS location
4) Enter the username and password for the Data Source (also see
`Additional Notes`_)
5) Enter an optional description
6) Click on "Create"
7) Repeat for additional Data Sources
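
The sketch below illustrates how the values entered above map to the final
Data Source URLs. Sahara performs this resolution itself, so the function is
purely illustrative and assumes the default HDFS user is *hadoop*::

    # Illustration only: mirrors the URL rules described in step 3 above.
    def resolve_data_source_url(source_type, entered_url, hdfs_user="hadoop"):
        if source_type == "swift":
            # "<container>/<path>" becomes "swift://<container>/<path>"
            return "swift://" + entered_url
        if entered_url.startswith("hdfs://") or entered_url.startswith("/"):
            # a full URL or an absolute path is used as-is
            return entered_url
        # a relative path is resolved against the HDFS user's home directory
        return "/user/%s/%s" % (hdfs_user, entered_url)

    print(resolve_data_source_url("swift", "mycontainer/inputfile"))
    # swift://mycontainer/inputfile
    print(resolve_data_source_url("hdfs", "my/path"))
    # /user/hadoop/my/path
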
Job Binaries
------------
Job Binaries are where you define/upload the source code (mains and libraries)
for your job.
1) From the Data Processing/Job Binaries page, click on the "Create Job Binary"
button at the top right
2) Give your Job Binary a name (this can be different than the actual filename)
3) Choose the type of storage for your Job Binary
- For "Swift", enter the URL of your binary (<container>/<path>) as well as
the username and password (also see `Additional Notes`_ and the upload sketch
after this list)
- For "Internal database", you can choose from "Create a script" or "Upload
a new file"
4) Enter an optional description
5) Click on "Create"
6) Repeat for additional Job Binaries
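
If you want to stage a binary in Swift from a script rather than through the
dashboard, a minimal sketch using python-swiftclient is shown below. The auth
URL, credentials, container and file names are placeholders rather than values
taken from this guide::

    # Minimal sketch: upload a job binary to a Swift container.
    # All names and credentials below are placeholders.
    from swiftclient import client as swift_client

    conn = swift_client.Connection(
        authurl="http://127.0.0.1:5000/v2.0",   # your Keystone auth URL
        user="demo",
        key="password",
        tenant_name="demo",
        auth_version="2")

    conn.put_container("mycontainer")
    with open("wordcount.jar", "rb") as f:       # hypothetical job binary
        conn.put_object("mycontainer", "wordcount.jar", contents=f)

    # The Job Binary URL to enter in the dashboard is then:
    #     mycontainer/wordcount.jar
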
Jobs
----
Jobs are where you define the type of job you'd like to run as well as which
"Job Binaries" are required.
1) From the Data Processing/Jobs page, click on the "Create Job" button at the
top right
2) Give your Job a name
3) Choose the type of job you'd like to run
4) Choose the main binary from the dropdown
- This is required for Hive, Pig, and Spark jobs
- Other job types do not use a main binary
5) Enter an optional description for your Job
6) Click on the "Libs" tab and choose any libraries needed by your job
- MapReduce and Java jobs require at least one library
- Other job types may optionally use libraries
7) Click on "Create"
Job Executions
--------------
Job Executions are what you get by "Launching" a job. You can monitor the
status of your job to see when it has completed its run.
1) From the Data Processing/Jobs page, find the row that contains the job you
want to launch and click on the "Launch Job" button at the right side of that
row
2) Choose the cluster (already running--see `Launching a Cluster`_ above) on
which you would like the job to run
3) Choose the Input and Output Data Sources (Data Sources defined above)
4) If additional configuration is required, click on the "Configure" tab
- Additional configuration properties can be defined by clicking on the "Add"
button
- An example configuration entry might be mapred.mapper.class for the Name and
org.apache.oozie.example.SampleMapper for the Value
5) Click on "Launch". To monitor the status of your job, you can navigate to
the Sahara/Job Executions panel or poll it from a script, as sketched after
this list
6) You can relaunch a Job Execution from the Job Executions page by using the
"Relaunch on New Cluster" or "Relaunch on Existing Cluster" links
- Relaunch on New Cluster will take you through the forms to start a new
cluster before letting you specify input/output Data Sources and job
configuration
- Relaunch on Existing Cluster will prompt you for input/output Data Sources
as well as allow you to change job configuration before launching the job
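
As with waiting for a cluster, you can poll a Job Execution from a script. The
rough sketch below follows the same pattern and carries the same caveats: the
python-saharaclient constructor arguments and the ``info["status"]`` field are
assumptions to verify against your installed client, and the credentials and
id are placeholders::

    # Rough sketch: poll a job execution until it reaches a terminal state.
    import time

    from saharaclient.api import client as sahara_client

    sahara = sahara_client.Client(username="demo",
                                  api_key="password",
                                  project_name="demo",
                                  auth_url="http://127.0.0.1:5000/v2.0")

    job_execution_id = "REPLACE-WITH-JOB-EXECUTION-ID"
    while True:
        status = sahara.job_executions.get(job_execution_id).info["status"]
        print(status)
        # Terminal statuses come from Oozie; your deployment may report others.
        if status in ("SUCCEEDED", "KILLED", "FAILED"):
            break
        time.sleep(10)
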
Example Jobs
------------
There are sample jobs located in the sahara repository. In this section, we
will give a walkthrough on how to run those jobs via the Horizon UI. These steps
assume that you already have a cluster up and running (in the "Active" state).
1) Sample Pig job -
https://github.com/openstack/sahara/tree/master/etc/edp-examples/pig-job
- Load the input data file from
https://github.com/openstack/sahara/tree/master/etc/edp-examples/pig-job/data/input
into swift
- Click on Project/Object Store/Containers and create a container with any
name ("samplecontainer" for our purposes here)
- Click on Upload Object and give the object a name
("piginput" in this case)
- Navigate to Data Processing/Data Sources, Click on Create Data Source
- Name your Data Source ("pig-input-ds" in this sample)
- Type = Swift, URL samplecontainer/piginput, fill in the Source
username/password fields with your username/password and click "Create"
- Create another Data Source to use as output for the job
- Name = pig-output-ds, Type = Swift, URL = samplecontainer/pigoutput,
Source username/password, "Create"
- Store your Job Binaries in the Sahara database
- Navigate to Data Processing/Job Binaries, Click on Create Job Binary
- Name = example.pig, Storage type = Internal database, click Browse and
find example.pig wherever you checked out the sahara project
<sahara root>/etc/edp-examples/pig-job
- Create another Job Binary: Name = udf.jar, Storage type = Internal
database, click Browse and find udf.jar wherever you checked out the
sahara project <sahara root>/etc/edp-examples/pig-job
- Create a Job
- Name = pigsample, Job Type = Pig, Choose "example.pig" as the main binary
- Click on the "Libs" tab and choose "udf.jar", then hit the "Choose" button
beneath the dropdown, then click on "Create"
- Launch your job
- To launch your job from the Jobs page, click on the down arrow at the far
right of the screen and choose "Launch on Existing Cluster"
- For the input, choose "pig-input-ds", for output choose "pig-output-ds".
Also choose whichever cluster you'd like to run the job on
- For this job, no additional configuration is necessary, so you can just
click on "Launch"
- You will be taken to the "Job Executions" page where you can see your job
progress through "PENDING, RUNNING, SUCCEEDED" phases
- When your job finishes with "SUCCEEDED", you can navigate back to Object
Store/Containers and browse to the samplecontainer to see your output.
It should be in the "pigoutput" folder
2) Sample Spark job -
https://github.com/openstack/sahara/tree/master/etc/edp-examples/edp-spark
- Store the Job Binary in the Sahara database
- Navigate to Data Processing/Job Binaries, Click on Create Job Binary
- Name = sparkexample.jar, Storage type = Internal database, Browse to the
location <sahara root>/etc/edp-examples/edp-spark and choose
spark-example.jar, Click "Create"
- Create a Job
- Name = sparkexamplejob, Job Type = Spark,
Main binary = Choose sparkexample.jar, Click "Create"
- Launch your job
- To launch your job from the Jobs page, click on the down arrow at the far
right of the screen and choose "Launch on Existing Cluster"
- Choose whichever cluster you'd like to run the job on
- Click on the "Configure" tab
- Set the main class to be: org.apache.spark.examples.SparkPi
- Under Arguments, click Add and fill in the number of "Slices" you want to
use for the job. For this example, let's use 100 as the value (the sketch
after this list shows what SparkPi does with this number)
- Click on Launch
- You will be taken to the "Job Executions" page where you can see your job
progress through "PENDING, RUNNING, SUCCEEDED" phases
- When your job finishes with "SUCCEEDED", you can see your results by
sshing to the Spark "master" node
- The output is located at /tmp/spark-edp/<name of job>/<job execution id>.
You can do ``cat stdout`` which should display something like
"Pi is roughly 3.14156132"
- It should be noted that for more complex jobs, the input/output may be
elsewhere. This particular job just writes to stdout, which is logged in
the folder under /tmp
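
For reference, the sketch below is a rough PySpark equivalent of what the
SparkPi example computes. It is not the spark-example.jar binary used above,
but it shows how the "Slices" argument becomes the number of partitions used
for the Monte Carlo estimate of Pi::

    # Rough PySpark equivalent of the SparkPi example (illustration only).
    from operator import add
    from random import random

    from pyspark import SparkContext

    sc = SparkContext(appName="PiSketch")

    slices = 100                        # the "Slices" argument from the UI
    samples = 10000 * slices            # total number of random points

    def in_unit_circle(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = (sc.parallelize(range(samples), slices)
               .map(in_unit_circle)
               .reduce(add))
    print("Pi is roughly %f" % (4.0 * count / samples))
    sc.stop()
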
Additional Notes
----------------
1) Throughout the Sahara UI, you will find that you cannot delete an object
if another object depends on it.
An example of this would be trying to delete a Job that has an existing Job
Execution. In order to be able to delete that job, you would first need to
delete any Job Executions that relate to that job.
2) In the examples above, we mention adding your username/password for the Swift
Data Sources. It should be noted that it is possible to configure Sahara such
that the username/password credentials are *not* required. For more
information on that, please refer to:
:doc:`Sahara Advanced Configuration Guide <../userdoc/advanced.configuration.guide>`