New Pig example with a User Defined Function
Replace the existing Pig/UDF example with an original one, because the license of the replaced example was not entirely clear. The input and expected output of the old example have been kept, as they are also used by other tests.

Even though we are working towards removing jar files from this repository, rebuilding Hadoop-based jars is not trivial, so the jar file (compiled with target JVM 1.6) is part of the patch.

Change-Id: Ib86f63458797dc10b19334177dab01e16894ca57
parent 5fa799dde2
commit 77fa63e19e
@@ -3,13 +3,13 @@ edp_jobs_flow:
     - type: Pig
       input_datasource:
         type: swift
-        source: edp-examples/edp-pig/trim-spaces/data/input
+        source: edp-examples/edp-pig/cleanup-string/data/input
       output_datasource:
         type: hdfs
         destination: /user/hadoop/edp-output
       main_lib:
         type: swift
-        source: edp-examples/edp-pig/trim-spaces/example.pig
+        source: edp-examples/edp-pig/cleanup-string/example.pig
       additional_libs:
         - type: swift
-          source: edp-examples/edp-pig/trim-spaces/udf.jar
+          source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar
@@ -0,0 +1,35 @@
+=========================
+Pig StringCleaner Example
+=========================
+
+Overview
+--------
+This is an (almost useless) example of a Pig job which uses a custom UDF (User
+Defined Function).
+
+- ``StringCleaner.java`` is a Pig UDF which removes all characters except
+  ASCII letters, digits, ``-`` and ``_`` from the input;
+- ``example.pig`` is the main Pig code which uses the UDF.
+
+
+Compiling the UDF
+-----------------
+
+To build the jar, add ``pig`` to the classpath::
+
+    $ cd src
+    $ mkdir build
+    $ javac -source 1.6 -target 1.6 -cp /path/to/pig.jar -d build StringCleaner.java
+    $ jar -cvf edp-pig-udf-stringcleaner.jar -C build/ .
+
+Running from the Sahara UI
+--------------------------
+
+The procedure does not differ from the usual steps for other Pig jobs.
+
+Create a job template where:
+- the main library points to the job binary for ``example.pig``;
+- the additional library contains the job binary for ``edp-pig-udf-stringcleaner.jar``.
+
+Create a job from that job template and attach the input and output data
+sources.
@@ -0,0 +1,3 @@
+StrangeVariable01WithGarbage
+a-confusedInput
+this_will_be_more_compact
@@ -0,0 +1,3 @@
+Strange==Variable01 With Garbage
+a"-confused Input
+this _will_ be_ mo re _compact
Binary file not shown.
@@ -0,0 +1,3 @@
+A = load '$INPUT' using PigStorage() as (lines: chararray);
+B = foreach A generate org.openstack.sahara.examples.pig.StringCleaner(lines);
+store B into '$OUTPUT' USING PigStorage();
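As a rough illustration of what this script does to the sample data, the plain-Java sketch below applies the regular expression used by ``StringCleaner.java`` in this commit to each input line. The class ``CleanupFlowDemo`` and its methods are hypothetical names introduced only for this sketch. It does not use Pig, so loading and storing are replaced by in-memory lists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class (not part of the commit): mimics the data flow of
// example.pig with in-memory lists instead of Pig relations.
public class CleanupFlowDemo {

    // Same expression as in StringCleaner.exec(): drop every run of characters
    // that is not an ASCII letter, a digit, '-' or '_'.
    public static String clean(String line) {
        return line.replaceAll("[^A-Za-z0-9-_]+", "");
    }

    // Counterpart of "B = foreach A generate StringCleaner(lines)".
    public static List<String> run(List<String> lines) {
        return lines.stream()
                .map(CleanupFlowDemo::clean)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The three input lines added by this commit.
        List<String> input = Arrays.asList(
                "Strange==Variable01 With Garbage",
                "a\"-confused Input",
                "this _will_ be_ mo re _compact");
        for (String line : run(input)) {
            System.out.println(line);
        }
    }
}
```

Run on the three input lines from this commit, the sketch prints the three lines of the expected output file.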
@@ -0,0 +1,25 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not
+ * use this file except in compliance with the License. You may obtain a copy
+ * of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package org.openstack.sahara.examples.pig;
+
+import org.apache.pig.PrimitiveEvalFunc;
+
+public class StringCleaner extends PrimitiveEvalFunc<String, String>
+{
+    public String exec(String input) {
+        // Useless example which removes all but a few basic Latin characters
+        // and separators
+        return input.replaceAll("[^A-Za-z0-9-_]+", "");
+    }
+}
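One possible review note on ``exec``: it calls ``replaceAll`` directly on its argument, so it would throw a ``NullPointerException`` if it were ever handed a null field, and ``String.replaceAll`` recompiles the expression on each call. The sketch below shows one way both points could be addressed. ``NullSafeCleaner`` is a hypothetical plain-Java helper, not part of the commit, and returning null for null input is an assumption about the desired behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of the commit): the same cleanup logic with a
// null guard and a precompiled pattern.
public class NullSafeCleaner {

    // Compiled once and reused for every call.
    private static final Pattern UNWANTED = Pattern.compile("[^A-Za-z0-9-_]+");

    // Assumption: a null input maps to a null output instead of an exception.
    public static String clean(String input) {
        if (input == null) {
            return null;
        }
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("a\"-confused Input"));
        System.out.println(clean(null));
    }
}
```

The same guard could be placed at the top of ``exec`` in the UDF itself; the helper is kept free of Pig classes here only so that it compiles without ``pig.jar``.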
@@ -1,11 +0,0 @@
-Example Pig job
-===============
-
-This script trims spaces in input text
-
-This sample pig job is based on examples in Chapter 11 of
-"Hadoop: The Definitive Guide" by Tom White, published by O'Reilly Media.
-The original code can be found in a maven project at
-
-https://github.com/tomwhite/hadoop-book
-
@@ -1,3 +0,0 @@
-A = load '$INPUT' using PigStorage(':') as (fruit: chararray);
-B = foreach A generate com.hadoopbook.pig.Trim(fruit);
-store B into '$OUTPUT' USING PigStorage();
Binary file not shown.
@@ -105,12 +105,12 @@ This example assumes the following:
 3. In the demo user's project, the following files are stored in swift in the
    container ``edp-examples``, as follows:
 
-   * The file at ``edp-examples/edp-pig/trim-spaces/example.pig`` is stored
-     at path ``swift://edp-examples/edp-pig/trim-spaces/example.pig``.
-   * The file at ``edp-pig/trim-spaces/udf.jar`` is stored at
-     path ``swift://edp-examples/edp-pig/trim-spaces/udf.jar``.
-   * The file at ``edp-examples/edp-pig/trim-spaces/data/input`` is stored at
-     path ``swift://edp-examples/edp-pig/trim-spaces/data/input``.
+   * The file at ``edp-examples/edp-pig/cleanup-string/example.pig`` is stored
+     at path ``swift://edp-examples/edp-pig/cleanup-string/example.pig``.
+   * The file at ``edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar`` is stored at
+     path ``swift://edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar``.
+   * The file at ``edp-examples/edp-pig/cleanup-string/data/input`` is stored at
+     path ``swift://edp-examples/edp-pig/cleanup-string/data/input``.
 
 Steps
 -----
@@ -2,7 +2,7 @@
     "name": "demo-pig-input",
     "description": "A data source for Pig input, stored in Swift",
     "type": "swift",
-    "url": "swift://edp-examples.sahara/edp-pig/trim-spaces/data/input",
+    "url": "swift://edp-examples.sahara/edp-pig/cleanup-string/data/input",
     "credentials": {
         "user": "demo",
         "password": "password"
@@ -2,7 +2,7 @@
     "name": "demo-pig-output",
     "description": "A data source for Pig output, stored in Swift",
     "type": "swift",
-    "url": "swift://edp-examples.sahara/edp-pig/trim-spaces/data/output",
+    "url": "swift://edp-examples.sahara/edp-pig/cleanup-string/data/output",
     "credentials": {
         "user": "demo",
         "password": "password"
@@ -1,7 +1,7 @@
 {
     "name": "example.pig",
     "description": "An example pig script",
-    "url": "swift://edp-examples/edp-pig/trim-spaces/example.pig",
+    "url": "swift://edp-examples/edp-pig/cleanup-string/example.pig",
     "extra": {
         "user": "demo",
         "password": "password"
@@ -1,7 +1,7 @@
 {
-    "name": "udf.jar",
+    "name": "edp-pig-udf-stringcleaner.jar",
     "description": "An example pig UDF library",
-    "url": "swift://edp-examples/edp-pig/trim-spaces/udf.jar",
+    "url": "swift://edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar",
     "extra": {
         "user": "demo",
         "password": "password"
@@ -3,16 +3,16 @@ edp_jobs_flow:
     - type: Pig
       input_datasource:
         type: swift
-        source: edp-examples/edp-pig/trim-spaces/data/input
+        source: edp-examples/edp-pig/cleanup-string/data/input
       output_datasource:
         type: swift
         destination: edp-output
       main_lib:
         type: swift
-        source: edp-examples/edp-pig/trim-spaces/example.pig
+        source: edp-examples/edp-pig/cleanup-string/example.pig
       additional_libs:
         - type: swift
-          source: edp-examples/edp-pig/trim-spaces/udf.jar
+          source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar
   mapreduce_job:
     - type: MapReduce
       input_datasource:
@@ -89,16 +89,16 @@ edp_jobs_flow:
     - type: Pig
       input_datasource:
         type: maprfs
-        source: edp-examples/edp-pig/trim-spaces/data/input
+        source: edp-examples/edp-pig/cleanup-string/data/input
       output_datasource:
         type: maprfs
         destination: /user/hadoop/edp-output
       main_lib:
         type: swift
-        source: edp-examples/edp-pig/trim-spaces/example.pig
+        source: edp-examples/edp-pig/cleanup-string/example.pig
       additional_libs:
         - type: swift
-          source: edp-examples/edp-pig/trim-spaces/udf.jar
+          source: edp-examples/edp-pig/cleanup-string/edp-pig-udf-stringcleaner.jar
   mapr:
     - type: MapReduce
       input_datasource: