Apache PIG - User Defined Functions
![Page 1: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/1.jpg)
Apache Pig UDFs
Extending Pig to solve complex tasks
UDF = User Defined Functions
![Page 2: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/2.jpg)
Your speaker today:
Christoph Bauer
Java developer for 10+ years
one of the founders
Helping our clients to use and understand their (big) data
working in "BigData" since 2010
![Page 3: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/3.jpg)
Why use PIG
● ad-hoc way for creating and executing map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce
![Page 4: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/4.jpg)
Done.
http://leesfishandphotos.blogspot.de
![Page 5: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/5.jpg)
Oh, wait...
![Page 6: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/6.jpg)
UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in Java, Python, ...
![Page 7: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/7.jpg)
Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
● much more...
![Page 8: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/8.jpg)
User Defined Functions
● EvalFunc<T>
    public T exec(Tuple input)
● FilterFunc
    public Boolean exec(Tuple input)
● Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }
● Load/Store Functions
    public Tuple getNext()
    public void putNext(Tuple input);
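The three methods of the Algebraic interface name the classes Pig runs for the initial, intermediate and final stage of an aggregate. The following is a minimal plain-Java sketch of that staging idea for a COUNT-style aggregate. It does not use any Pig classes, and the names `CountStages`, `initial`, `intermed` and `fin` are hypothetical, chosen only to mirror the interface.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: no Pig classes are used, all names are hypothetical.
class CountStages {
    // Initial stage: turn one input record into a partial result (a count of 1).
    static long initial(String record) {
        return 1L;
    }

    // Intermediate stage: merge partial results into a new partial result.
    static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merge the remaining partial results into the answer.
    static long fin(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        // Two groups of records are combined separately, then merged once more.
        long left = intermed(Arrays.asList(initial("a"), initial("b")));
        long right = intermed(Arrays.asList(initial("c")));
        System.out.println(fin(Arrays.asList(left, right))); // prints 3
    }
}
```

The point of the split is that the intermediate stage can run on partial data close to where it was produced, so only small partial results have to be merged at the end.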
![Page 9: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/9.jpg)
What? Why?
[Figure: one company record (companyName) with several versions of companyAddress and a series of Net Worth values over time]
2010 | companyName | current Address | historical Net Worth
2011 | companyName | current Address | historical Net Worth
2012 | companyName | current Address | historical Net Worth
![Page 10: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/10.jpg)
Example
r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }
...apply UDF
r1, t1, q1:"v1", q2:"v4"
r1, t3, q1:"v1", q2:"v4"
r1, t5, q1:"v2", q2:"v4"
SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
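The example rows above can be reproduced with a small plain-Java sketch. It assumes that a snapshot at time t takes the newest version written at or before t, and that LATEST takes the version with the highest timestamp. A `TreeMap` stands in for the bag of (timestamp, value) tuples, and the class and method names are hypothetical, not the UDFs from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: a TreeMap replaces the Pig bag of (timestamp, value) tuples.
class VersionedValues {
    // LATEST: the value with the highest timestamp, or null if there are no versions.
    static String latest(TreeMap<Long, String> versions) {
        Map.Entry<Long, String> e = versions.lastEntry();
        return e == null ? null : e.getValue();
    }

    // Value that was current at time ts: the newest version written at or before ts.
    static String valueAt(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> e = versions.floorEntry(ts);
        return e == null ? null : e.getValue();
    }

    // SNAPSHOTS: one value per timestamp in [start, end), stepping by period.
    static List<String> snapshots(TreeMap<Long, String> versions, long start, long end, long period) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        List<String> out = new ArrayList<>();
        for (long ts = start; ts < end; ts += period) {
            out.add(valueAt(versions, ts));
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> q1 = new TreeMap<>();
        q1.put(1L, "v1");
        q1.put(4L, "v2");
        TreeMap<Long, String> q2 = new TreeMap<>();
        q2.put(2L, "v3");
        q2.put(7L, "v4");
        System.out.println(snapshots(q1, 1, 6, 2)); // prints [v1, v1, v2]
        System.out.println(latest(q2));             // prints v4
    }
}
```

With t1..t7 mapped to 1..7, the snapshots at t1, t3 and t5 yield v1, v1 and v2 for q1, and LATEST yields v4 for q2, matching the three output rows.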
![Page 11: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/11.jpg)
LATEST

    public class LATEST extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            // body shown on the next slide
        }
    }
![Page 12: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/12.jpg)
LATEST (contd.)

    public Tuple exec(Tuple input) throws IOException {
        int numTuples = input.size();
        Tuple result = tupleFactory.newTuple(numTuples);
        for (int i = 0; i < numTuples; i++) {
            switch (input.getType(i)) {
            case DataType.BAG:
                DataBag bag = (DataBag) input.get(i);
                Object val = extractLatestValueFromBag(bag);
                if (val != null) {
                    result.set(i, val);
                }
                break;
            case DataType.MAP:
                // ... MAPs need different handling
            default:
                // warn ...
            }
        }
        return result;
    }

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }
![Page 13: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/13.jpg)
SNAPSHOT

    public class SNAPSHOTS extends EvalFunc<DataBag> {
        @Override
        public DataBag exec(Tuple input) throws IOException {
            List<Tuple> listOfTuples = new ArrayList<Tuple>();
            DateTime dtCur = new DateTime(start);
            // add one millisecond so that the end timestamp itself is included
            DateTime dtEnd = new DateTime(end).plus(1L);
            while (dtCur.isBefore(dtEnd)) {
                // snapshot() expects the timestamp as a long
                listOfTuples.add(snapshot(input, dtCur.getMillis()));
                dtCur = dtCur.plus(period);
            }
            DataBag bag = factory.newDefaultBag(listOfTuples);
            return bag;
        }
![Page 14: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/14.jpg)
SNAPSHOT (contd.)

    protected Tuple snapshot(Tuple input, long ts) throws... {
        int numTuples = input.size();
        Tuple result = tupleFactory.newTuple(numTuples + 1);
        result.set(0, ts);
        for (int i = 0; i < numTuples; i++) {
            switch (input.getType(i)) {
            case DataType.BAG:
                DataBag bag = (DataBag) input.get(i);
                Object val = extractTSValueFromBag(bag, ts);
                result.set(i + 1, val);
                break;
            case DataType.MAP:
                // handle MAPs
            default:
            }
        }
        return result;
    }

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }
![Page 15: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/15.jpg)
PigLatin

    REGISTER 'my-udf.jar';
    DEFINE LATEST myudf.Latest();
    DEFINE SNAPSHOT myudf.Snapshot ('2000-01-01 2013-01-01 1y');
    A = LOAD 'inputTable' AS (id, q1, q2);
    B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
    C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
    STORE C INTO 'output.csv';

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }
![Page 16: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/16.jpg)
Passing parameters to UDFs

    DEFINE SNAPSHOT cool.udf.Snapshot ('2000-01-01 2013-01-01 1y');

    ...
    public SNAPSHOTS(String start, String end, String increment) {
        super();
        this.start = Long.parseLong(start);
        this.end = Long.parseLong(end);
        this.increment = Long.parseLong(increment);
    }
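Constructor arguments from DEFINE arrive as strings, so the UDF has to parse them itself. The DEFINE on this slide passes dates and '1y', which `Long.parseLong` would reject with a `NumberFormatException`. The sketch below shows one possible way to parse such values, assuming three separate arguments, ISO dates (yyyy-MM-dd) interpreted as UTC midnight, and an increment such as '1y'. The class `SnapshotArgs` is hypothetical, and it uses `java.time` rather than the Joda `DateTime` seen in the slides.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Pig: parses the three constructor strings.
class SnapshotArgs {
    final LocalDate start;
    final LocalDate end;
    final Period increment;

    SnapshotArgs(String start, String end, String increment) {
        this.start = LocalDate.parse(start); // expects yyyy-MM-dd
        this.end = LocalDate.parse(end);
        // "1y" becomes "P1Y"; Period.parse accepts the units Y, M (months), W and D.
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
        if (this.increment.isZero() || this.increment.isNegative()) {
            throw new IllegalArgumentException("increment must be positive: " + increment);
        }
    }

    // Snapshot timestamps in epoch milliseconds, with the end date included.
    List<Long> timestamps() {
        List<Long> out = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plus(increment)) {
            out.add(d.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli());
        }
        return out;
    }

    public static void main(String[] args) {
        SnapshotArgs a = new SnapshotArgs("2000-01-01", "2013-01-01", "1y");
        System.out.println(a.timestamps().size()); // prints 14
    }
}
```

Under these assumptions the matching Pig statement would pass three quoted strings, for example `DEFINE SNAPSHOT cool.udf.Snapshot('2000-01-01', '2013-01-01', '1y');`, since Pig hands each string in the DEFINE to one String parameter of the constructor.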
![Page 17: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/17.jpg)
I didn't talk about
● UDFs run as a single instance in every mapper, reducer, ... use instance variables for locally shared objects
● Watch your heap when using Lucene Indexes, or implementing the Algebraic interface
● do implement
    public Schema outputSchema(Schema input)
● report progress when doing time consuming stuff
● Performance?
![Page 18: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/18.jpg)
SNAPSHOT (contd.)

    @Override
    public Schema outputSchema(Schema input) {
        List<FieldSchema> out = new ArrayList<FieldSchema>();
        // first column of every snapshot tuple is the snapshot timestamp
        out.add(new FieldSchema("snapshot", DataType.LONG));
        // followed by the input fields with their original alias and type
        for (FieldSchema fieldSchema : input.getFields()) {
            String alias = fieldSchema.alias;
            byte type = fieldSchema.type;
            out.add(new FieldSchema(alias, type));
        }
        Schema bagSchema = new Schema(out);
        try {
            return new Schema(new FieldSchema(
                getSchemaName("snapshots", input), bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            // exception is ignored here; the method then returns null
        }
        return null;
    }
![Page 19: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/19.jpg)
Reality check
● These UDFs are in production
● Producing reports with up to 60GB
● Data is stored in HBase
![Page 20: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/20.jpg)
Wrapping it up
We at Oberbaum Concept developed a bunch of PIG Functions handling versioned data in HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest
Right now we are trying to push our changes back into PIG.
![Page 21: Apache PIG - User Defined Functions](https://reader034.fdocuments.in/reader034/viewer/2022042513/5563a6e3d8b42a2b6a8b53b6/html5/thumbnails/21.jpg)
Questions?