New Pig example with a User Defined Function

Replace the existing Pig/UDF example with an original one.
The license of the replaced example was not totally clear.

The input and expected output of the old example have been kept, as they
are also used by other tests.

Even though we are working towards removing jar files from this repository,
rebuilding Hadoop-based jars is not trivial, so the jar file (compiled with
target JVM 1.6) is part of the patch.

Change-Id: Ib86f63458797dc10b19334177dab01e16894ca57
Luigi Toscano 2016-08-23 01:22:41 +02:00
parent 5fa799dde2
commit 77fa63e19e
16 changed files with 89 additions and 34 deletions

View File

@ -3,13 +3,13 @@ edp_jobs_flow:
- type: Pig
input_datasource:
type: swift
source: edp-examples/edp-pig/trim-spaces/data/input
source: edp-examples/edp-pig/cleanup-string/data/input
output_datasource:
type: hdfs
destination: /user/hadoop/edp-output
main_lib:
type: swift
source: edp-examples/edp-pig/trim-spaces/example.pig
source: edp-examples/edp-pig/cleanup-string/example.pig
additional_libs:
- type: swift
source: edp-examples/edp-pig/trim-spaces/udf.jar
source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar

View File

@ -0,0 +1,35 @@
=========================
Pig StringCleaner Example
=========================
Overview
--------
This is an (almost useless) example of a Pig job which uses a custom UDF (User
Defined Function).
- ``StringCleaner.java`` is a Pig UDF which strips some characters from the
input;
- ``example.pig`` is the main Pig code which uses the UDF.
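The effect of the UDF on the sample data can be tried without a Pig installation. The following is a minimal sketch, not part of the patch: the class name ``StringCleanerSketch`` and the ``clean`` helper are hypothetical, and only the regular expression is taken from ``StringCleaner.java``.

```java
import java.util.regex.Pattern;

// Standalone sketch (hypothetical class, not shipped with the example):
// applies the same character filter as StringCleaner, without Pig.
final class StringCleanerSketch {
    // Same character class as the UDF: keep ASCII letters, digits, '-' and '_'.
    private static final Pattern UNWANTED = Pattern.compile("[^A-Za-z0-9-_]+");

    // Removes every run of characters that is not in the allowed set.
    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The three sample input lines used by the example.
        String[] samples = {
            "Strange==Variable01 With Garbage",
            "a\"-confused Input",
            "this _will_ be_ mo re _compact"
        };
        for (String line : samples) {
            System.out.println(clean(line));
        }
        // prints:
        // StrangeVariable01WithGarbage
        // a-confusedInput
        // this_will_be_more_compact
    }
}
```

With Java 11 or later, the sketch can be run directly with ``java StringCleanerSketch.java``.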
Compiling the UDF
-----------------
To build the jar, add ``pig`` to the classpath.
$ cd src
$ mkdir build
$ javac -source 1.6 -target 1.6 -cp /path/to/pig.jar -d build StringCleaner.java
$ jar -cvf edp-pig-udf-stringcleaner.jar -C build/ .
Running from the Sahara UI
--------------------------
The procedure does not differ from the usual steps for other Pig jobs.
Create a job template where:
- the main library points to the job binary for ``example.pig``;
- the additional library points to the job binary for ``edp-pig-udf-stringcleaner.jar``.
Create a job from that job template and attach the input and output data
sources.

View File

@ -0,0 +1,3 @@
StrangeVariable01WithGarbage
a-confusedInput
this_will_be_more_compact

View File

@ -0,0 +1,3 @@
Strange==Variable01 With Garbage
a"-confused Input
this _will_ be_ mo re _compact

View File

@ -0,0 +1,3 @@
A = load '$INPUT' using PigStorage() as (lines: chararray);
B = foreach A generate org.openstack.sahara.examples.pig.StringCleaner(lines);
store B into '$OUTPUT' USING PigStorage();

View File

@ -0,0 +1,25 @@
/*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy
* of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package org.openstack.sahara.examples.pig;
import org.apache.pig.PrimitiveEvalFunc;
public class StringCleaner extends PrimitiveEvalFunc<String, String>
{
public String exec(String input) {
// Useless example which removes everything except a few basic Latin
// characters, digits and separators ('-' and '_')
return input.replaceAll("[^A-Za-z0-9-_]+", "");
}
}
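As written, ``exec`` would throw a ``NullPointerException`` if it were ever called with a ``null`` string. Whether Pig passes ``null`` to this UDF for the example data is an assumption not covered by this commit. A possible null-tolerant variant of the same filter is sketched below as a standalone method; the class name ``NullSafeCleanerSketch`` is hypothetical.

```java
// Standalone sketch (hypothetical class): same filter as StringCleaner.exec,
// but a null input is returned as null instead of raising an exception.
final class NullSafeCleanerSketch {
    static String clean(String input) {
        if (input == null) {
            // Assumption: propagating null is the desired behaviour.
            return null;
        }
        return input.replaceAll("[^A-Za-z0-9-_]+", "");
    }

    public static void main(String[] args) {
        System.out.println(clean(null));     // prints "null"
        System.out.println(clean("a b=c"));  // prints "abc"
    }
}
```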

View File

@ -1,11 +0,0 @@
Example Pig job
===============
This script trims spaces in input text
This sample pig job is based on examples in Chapter 11 of
"Hadoop: The Definitive Guide" by Tom White, published by O'Reilly Media.
The original code can be found in a maven project at
https://github.com/tomwhite/hadoop-book

View File

@ -1,3 +0,0 @@
A = load '$INPUT' using PigStorage(':') as (fruit: chararray);
B = foreach A generate com.hadoopbook.pig.Trim(fruit);
store B into '$OUTPUT' USING PigStorage();

View File

@ -105,12 +105,12 @@ This example assumes the following:
3. In the demo user's project, the following files are stored in swift in the
container ``edp-examples``, as follows:
* The file at ``edp-examples/edp-pig/trim-spaces/example.pig`` is stored
at path ``swift://edp-examples/edp-pig/trim-spaces/example.pig``.
* The file at ``edp-pig/trim-spaces/udf.jar`` is stored at
path ``swift://edp-examples/edp-pig/trim-spaces/udf.jar``.
* The file at ``edp-examples/edp-pig/trim-spaces/data/input`` is stored at
path ``swift://edp-examples/edp-pig/trim-spaces/data/input``.
* The file at ``edp-examples/edp-pig/cleanup-string/example.pig`` is stored
at path ``swift://edp-examples/edp-pig/cleanup-string/example.pig``.
* The file at ``edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar`` is stored at
path ``swift://edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar``.
* The file at ``edp-examples/edp-pig/cleanup-string/data/input`` is stored at
path ``swift://edp-examples/edp-pig/cleanup-string/data/input``.
Steps
-----

View File

@ -2,7 +2,7 @@
"name": "demo-pig-input",
"description": "A data source for Pig input, stored in Swift",
"type": "swift",
"url": "swift://edp-examples.sahara/edp-pig/trim-spaces/data/input",
"url": "swift://edp-examples.sahara/edp-pig/cleanup-string/data/input",
"credentials": {
"user": "demo",
"password": "password"

View File

@ -2,7 +2,7 @@
"name": "demo-pig-output",
"description": "A data source for Pig output, stored in Swift",
"type": "swift",
"url": "swift://edp-examples.sahara/edp-pig/trim-spaces/data/output",
"url": "swift://edp-examples.sahara/edp-pig/cleanup-string/data/output",
"credentials": {
"user": "demo",
"password": "password"

View File

@ -1,7 +1,7 @@
{
"name": "example.pig",
"description": "An example pig script",
"url": "swift://edp-examples/edp-pig/trim-spaces/example.pig",
"url": "swift://edp-examples/edp-pig/cleanup-string/example.pig",
"extra": {
"user": "demo",
"password": "password"

View File

@ -1,7 +1,7 @@
{
"name": "udf.jar",
"name": "edp-pig-udf-stringcleaner.jar",
"description": "An example pig UDF library",
"url": "swift://edp-examples/edp-pig/trim-spaces/udf.jar",
"url": "swift://edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar",
"extra": {
"user": "demo",
"password": "password"

View File

@ -3,16 +3,16 @@ edp_jobs_flow:
- type: Pig
input_datasource:
type: swift
source: edp-examples/edp-pig/trim-spaces/data/input
source: edp-examples/edp-pig/cleanup-string/data/input
output_datasource:
type: swift
destination: edp-output
main_lib:
type: swift
source: edp-examples/edp-pig/trim-spaces/example.pig
source: edp-examples/edp-pig/cleanup-string/example.pig
additional_libs:
- type: swift
source: edp-examples/edp-pig/trim-spaces/udf.jar
source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar
mapreduce_job:
- type: MapReduce
input_datasource:
@ -89,16 +89,16 @@ edp_jobs_flow:
- type: Pig
input_datasource:
type: maprfs
source: edp-examples/edp-pig/trim-spaces/data/input
source: edp-examples/edp-pig/cleanup-string/data/input
output_datasource:
type: maprfs
destination: /user/hadoop/edp-output
main_lib:
type: swift
source: edp-examples/edp-pig/trim-spaces/example.pig
source: edp-examples/edp-pig/cleanup-string/example.pig
additional_libs:
- type: swift
source: edp-examples/edp-pig/trim-spaces/udf.jar
source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar
mapr:
- type: MapReduce
input_datasource: