Pig workshop

Posted on 10-May-2015

Description

Slides that I used for my Pig Workshop

Transcript of Pig workshop

Pig Workshop
Sudar Muthu

http://sudarmuthu.com
http://twitter.com/sudarmuthu

https://github.com/sudar

Research Engineer by profession
I mine useful information from data
You might recognize me from other HasGeek events
Blog at http://sudarmuthu.com
Builds robots as a hobby ;)

Who am I?

HasGeek

Special Thanks

What I will not cover?

What is BigData, or why it is needed?
What is MapReduce?
What is Hadoop?
Internal architecture of Pig

http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig


What we will see today?

What is Pig
How to use it
Loading and storing data
Pig Latin
SQL vs Pig
Writing UDFs
Debugging Pig scripts
Optimizing Pig scripts
When to use Pig


So, all of you have Pig installed right? ;)

“Platform for analyzing large sets of data”

What is Pig?

Pig Shell (Grunt)
Pig Language (Pig Latin)
Libraries (PiggyBank)
User Defined Functions (UDF)

Components of Pig

It is a data flow language
Provides standard data processing operations
Insulates you from Hadoop complexity
Abstracts MapReduce
Increases programmer productivity

… but there are cases where Pig is not suitable.

Why Pig?

Pig Modes

For this workshop, we will be using Pig only in local mode.

Getting to know your Pig shell

Similar to Python's shell. Launch it with: pig -x local

Inline in the shell
From a file
Streaming through another executable
Embedded in other languages

Different ways of executing Pig Scripts

Pigs eat anything

Loading and Storing data

file = LOAD 'data/dropbox-policy.txt' AS (line);

data = LOAD 'data/tweets.csv' USING PigStorage(',');

data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (field1, field2, field3);

Loading Data into Pig

PigStorage – for most cases
TextLoader – to load text files
JsonLoader – to load JSON files
Custom loaders – you can write your own custom loaders as well

Loading Data into Pig
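The other loaders follow the same `USING` pattern as PigStorage. A minimal sketch, assuming the sample files used elsewhere in these slides:

```pig
-- TextLoader reads each line as a single chararray field
lines = LOAD 'data/dropbox-policy.txt' USING TextLoader() AS (line:chararray);

-- PigStorage with an explicit delimiter and schema
data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (f1:chararray, f2:chararray);
```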

DUMP input;

Very useful for debugging, but don’t use it on huge datasets

Viewing Data

STORE data INTO 'output_location';

STORE data INTO 'output_location' USING PigStorage();

STORE data INTO 'output_location' USING PigStorage(',');

STORE data INTO 'output_location' USING BinStorage();

Storing Data from Pig

Similar to `LOAD`, a lot of options are available
Can store locally or in HDFS
You can write your own custom storage function as well

Storing Data

data = LOAD 'data/data-bag.txt' USING PigStorage(',');

STORE data INTO 'data/output/load-store' USING PigStorage('|');

https://github.com/sudar/pig-samples/load-store.pig

Load and Store example

Pig Latin

Scalar Types
Complex Types

Data Types

int, long – (32, 64 bit) integer
float, double – (32, 64 bit) floating point
boolean – true/false
chararray – string in UTF-8
bytearray – blob (DataByteArray in Java)

If you don't specify anything, bytearray is used by default

Scalar Types

tuple – an ordered set of fields
bag – a collection of tuples
map – a set of key/value pairs

Complex Types

Row with one or more fields
Fields can be of any data type
Ordering is important
Enclosed inside parentheses ()

Eg: (Sudar, Muthu, Haris, Dinesh)
(Sudar, 176, 80.2F)

Tuple

Set of tuples
SQL equivalent is a table
Each tuple can have a different set of fields
Can have duplicates
An inner bag uses curly braces {}
An outer bag doesn't use anything

Bag

Outer bag

(1,2,3)
(1,2,4)
(2,3,4)
(3,4,5)
(4,5,6)

https://github.com/sudar/pig-samples/data-bag.pig

Bag - Example

Inner bag

(1,{(1,2,3),(1,2,4)})
(2,{(2,3,4)})
(3,{(3,4,5)})
(4,{(4,5,6)})

https://github.com/sudar/pig-samples/data-bag.pig

Bag - Example

Set of key/value pairs
Similar to HashMap in Java
Key must be unique
Key must be of chararray data type
Values can be of any type
Key/value is separated by #
Map is enclosed by []

Map

[name#sudar, height#176, weight#80.5F]

[name#(sudar, muthu), height#176, weight#80.5F]

[name#(sudar, muthu), languages#(Java, Pig, Python)]

Map - Example

Similar to SQL NULL
Denotes that the value of a data element is unknown
Any data type can be null

Null

We can specify a schema (a collection of data types) in `LOAD` statements

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);

data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

Schemas in Load statement

Fields can be looked up by

Position
Name
Map lookup

Expressions

data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

by_pos = FOREACH data GENERATE $0;
DUMP by_pos;

by_field = FOREACH data GENERATE f2;
DUMP by_field;

by_map = FOREACH data GENERATE f3#'name';
DUMP by_map;

https://github.com/sudar/pig-samples/lookup.pig

Expressions - Example

Operators

All usual arithmetic operators are supported

Addition (+)
Subtraction (-)
Multiplication (*)
Division (/)
Modulo (%)

Arithmetic Operators

All usual boolean operators are supported

AND OR NOT

Boolean Operators

All usual comparison operators are supported

== != < > <= >=

Comparison Operators

FOREACH
FLATTEN
GROUP
FILTER
COUNT
ORDER BY
DISTINCT
LIMIT
JOIN

Relational Operators

Generates data transformations based on columns of data

x = FOREACH data GENERATE *;

x = FOREACH data GENERATE $0, $1;

x = FOREACH data GENERATE $0 AS first, $1 AS second;

FOREACH

Un-nests tuples and bags. Most of the time this results in a cross product.

(a, (b, c)) => (a,b,c)

({(a,b),(d,e)}) => (a,b) and (d,e)

(a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)

FLATTEN
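To make the un-nesting concrete, here is a sketch that groups the earlier data-bag.txt sample and then flattens the grouped bag back into plain tuples:

```pig
data    = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f1;                       -- (group, {bag of matching tuples})
flat    = FOREACH grouped GENERATE FLATTEN(data); -- un-nests the bag back into rows
DUMP flat;
```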

Groups data in one or more relations
Groups tuples that have the same group key
Similar to the SQL GROUP BY operator

outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP outerbag;

innerbag = GROUP outerbag BY f1;
DUMP innerbag;

https://github.com/sudar/pig-samples/group-by.pig

GROUP

Selects tuples from a relation based on some condition

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

filtered = FILTER data BY f1 == 1;
DUMP filtered;

https://github.com/sudar/pig-samples/filter-by.pig

FILTER

Counts the number of tuples in a relation

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f2;

counted = FOREACH grouped GENERATE group, COUNT(data);
DUMP counted;

https://github.com/sudar/pig-samples/count.pig

COUNT

Sorts a relation based on one or more fields. Similar to SQL ORDER BY.

data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

ordera = ORDER data BY f1 ASC;
DUMP ordera;

orderd = ORDER data BY f1 DESC;
DUMP orderd;

https://github.com/sudar/pig-samples/order-by.pig

ORDER BY

Removes duplicates from a relation

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

unique = DISTINCT data;
DUMP unique;

https://github.com/sudar/pig-samples/distinct.pig

DISTINCT

Limits the number of tuples in the output.

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

limited = LIMIT data 3;
DUMP limited;

https://github.com/sudar/pig-samples/limit.pig

LIMIT

Joins relations based on a field. Both outer and inner joins are supported.

a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP a;

b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;

joined = JOIN a BY f1, b BY t1;
DUMP joined;

https://github.com/sudar/pig-samples/join.pig

JOIN

From Table – Load file(s)
Select – FOREACH GENERATE
Where – FILTER BY
Group By – GROUP BY + FOREACH GENERATE
Having – FILTER BY
Order By – ORDER BY
Distinct – DISTINCT

SQL vs Pig
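The mapping above can be sketched end to end. Assuming the data-bag.txt sample, this Pig pipeline mirrors the SQL query shown in the comment:

```pig
-- SQL: SELECT f1, COUNT(*) FROM data WHERE f3 > 3 GROUP BY f1;
data     = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
filtered = FILTER data BY f3 > 3;                           -- WHERE
grouped  = GROUP filtered BY f1;                            -- GROUP BY
counted  = FOREACH grouped GENERATE group, COUNT(filtered); -- SELECT with aggregate
DUMP counted;
```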

Count the number of words in a text file

Let’s see a complete example

https://github.com/sudar/pig-samples/count-words.pig
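A minimal word count in the spirit of the linked count-words.pig (the exact sample script may differ):

```pig
lines   = LOAD 'data/dropbox-policy.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- one row per word
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```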

Extending Pig - UDF

Do operations on more than one field
Do more than grouping and filtering
Work in a language the programmer is comfortable with
Want to reuse existing logic

Traditionally, UDFs could be written only in Java. Now other languages like Python are also supported.

Why UDF?

Eval functions
Filter functions
Load functions
Store functions

Different types of UDFs

Can be used in a FOREACH statement
Most common type of UDF
Can return simple types or tuples

b = FOREACH a generate udf.Function($0);

b = FOREACH a generate udf.Function($0, $1);

Eval Functions

Extend the EvalFunc<T> abstract class
The generic <T> should be the return type
Input comes in as a Tuple
Should check for empty and null input
Override exec(); it should return the value
Override getArgToFuncMapping() to tell Pig about the argument mapping
Override outputSchema() to tell Pig about the output schema

Eval Functions

Create a jar file which contains your UDF classes

Register the jar at the top of the Pig script
Register other jars if needed
Define the UDF function
Use your UDF function

Using Java UDF in Pig Scripts
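Those steps look roughly like this in a script; the jar and class names here are hypothetical placeholders:

```pig
REGISTER 'myudfs.jar';                          -- jar containing the UDF classes
DEFINE StripQuote com.example.pig.StripQuote(); -- hypothetical UDF class
data    = LOAD 'data/tweets.csv' USING PigStorage(',');
cleaned = FOREACH data GENERATE StripQuote($6); -- use the UDF
DUMP cleaned;
```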

Let’s see an example which returns a string

https://github.com/sudar/pig-samples/strip-quote.pig

Let’s see an example which returns a Tuple

https://github.com/sudar/pig-samples/get-twitter-names.pig

Can be used in FILTER statements
Returns a boolean value

Eg: vim_tweets = FILTER data BY FromVim(StripQuote($6));

Filter Functions

Extends FilterFunc, which is an EvalFunc<Boolean>
Should return a boolean
Input is the same as for EvalFunc<T>
Should check for empty and null input
Override getArgToFuncMapping() to tell Pig about the argument mapping

Filter Functions

Let’s see an example which returns a Boolean

https://github.com/sudar/pig-samples/from-vim.pig

If the error affects only a particular row, then return null.

If the error affects other rows but you can recover, then throw an IOException.

If the error affects other rows and you can't recover, also throw an IOException. Pig and Hadoop will quit if there are too many IOExceptions.

Error Handling in UDF

Can we try to write some more UDFs?

Writing UDF in other languages

Streaming

The entire data set is passed through an external task
The external task can be in any language
Even a shell script works
Uses the `STREAM` operator

Streaming

data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `cut -f6,8`;

DUMP filtered;

https://github.com/sudar/pig-samples/stream-shell-script.pig

Stream through shell script

data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `strip.py`;

DUMP filtered;

https://github.com/sudar/pig-samples/stream-python.pig

Stream through Python

DUMP is your friend, but use it with LIMIT
DESCRIBE – will print the schema
ILLUSTRATE – will show the structure of the schema
In UDFs, we can use the warn() function; it supports up to 15 different debug levels
Use Penny - https://cwiki.apache.org/PIG/pennytoollibrary.html

Debugging Pig Scripts
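For example, combining DESCRIBE with a LIMITed DUMP keeps debugging output manageable (a sketch against the data-bag.txt sample):

```pig
data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DESCRIBE data;         -- prints the schema: data: {f1: int, f2: int, f3: int}
sample = LIMIT data 5;
DUMP sample;           -- dump only a handful of tuples, not the whole relation
```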

Project early and often
Filter early and often
Drop nulls before a join
Prefer DISTINCT over GROUP BY
Use the right data structure

Optimizing Pig Scripts
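"Project early" and "filter early" in practice: push FOREACH and FILTER as close to the LOAD as possible, so later (expensive) operators see less data. A sketch, assuming the tweets.csv sample:

```pig
data = LOAD 'data/tweets.csv' USING PigStorage(',');
slim = FOREACH data GENERATE $0, $6;   -- project early: keep only the needed columns
kept = FILTER slim BY $0 IS NOT NULL;  -- filter early: drop nulls before any join/group
```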

-p key=value – substitutes a single key/value pair
-m file.ini – substitutes using an ini file
default – provide default values

http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts

Using Param substitution
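A sketch of both substitution styles; the parameter names are illustrative:

```pig
-- run as: pig -x local -p input=data/data-bag.txt params.pig
%default output 'data/output/params'   -- used when -p output=... is not given
data = LOAD '$input' USING PigStorage(',');
STORE data INTO '$output';
```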

Anything data related

Problems that can be solved using Pig

A lot of custom logic needs to be implemented
Need to do a lot of cross lookups
Data is mostly binary (e.g. processing image files)
Real-time processing of data is needed

When not to use Pig?

PiggyBank - https://cwiki.apache.org/PIG/piggybank.html

DataFu – LinkedIn's Pig library - https://github.com/linkedin/datafu

Elephant Bird – Twitter's Pig library - https://github.com/kevinweil/elephant-bird

External Libraries

Pig homepage - http://pig.apache.org/
My blog about Pig - http://sudarmuthu.com/blog/category/hadoop-pig
Sample code – https://github.com/sudar/pig-samples
Slides – http://slideshare.net/sudar

Useful Links

Thank you