Introduction to Pig Latin
A. Soner Balkir
What’s wrong with Map Reduce?
[Diagram: a chain of alternating Map and Reduce stages]
• Long chains of map reduce jobs, one following another, are hard to deal with.
• For each map reduce job, you need to implement at least 3 classes.
• Input/output specifications, compilation errors, creating a jar file for each job.
• How about JOINS?
Remember Word Count?
• You wrote over 150 lines of code but you care
about at most 15 lines.
• What if?
A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';
Pig Automatically…
• Parses the script and tries to optimize it.
– No guarantees! Be careful!
• Compiles the Java code and creates the jar file.
• Submits the job and monitors it.
• No major upgrades to the existing code. Just update your Pig binaries!
– Remember the changes from Hadoop 0.1x to 0.2x
What's more?
• Pig can run a script on a small subset of the
data.
• Create a random sample of the input.
• Local mode vs. map reduce mode.
• Extensible: write your own UDFs!
What Pig is not
• It can't be used as a DBMS replacement.
– No single-record fetching.
– No SELECT keyword.
• Pure Hadoop implementations may be faster. Pig is evolving.
• Pig and Hive are almost sisters with minor differences.
• Carefully decide when you need Hadoop and when you need Pig. Life is all about trade-offs.
Installing and Running
• Download Pig at: http://hadoop.apache.org/pig/
• Set environment variables:
– export PIGDIR=/home/soner/pig-0.5.0
– export PATH=$PATH:$PIGDIR/bin
– export PIG_HADOOP_VERSION=20
– export PIG_CLASSPATH=$HADOOPDIR/conf
• Start the grunt shell:
pig -x local
pig -x mapreduce
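• A quick sanity check once the environment is set (a minimal sketch; the file name is just a placeholder):
grunt> records = LOAD 'sample.txt' USING PigStorage(',');
grunt> DUMP records;
grunt> quit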
Three ways to run a Pig script
• Grunt shell
• Embedded
– Just like using JDBC to execute SQL commands.
• Scripts
– pig -x local myscript.pig
Before Starting…
• If you are using Eclipse, take a look at PigPen:
– http://wiki.apache.org/pig/PigPen
• If you are more comfortable in vi or vim, go to:
– http://www.vim.org/scripts/script.php?script_id=2186
• Syntax highlighters and IDEs always make things easier.
• Examples will be online.
Hello World Example
• sample.txt contains earthquake data:
(date, magnitude, location id)
example: head -n 4 sample.txt
2005-05-02,4.2,3
2005-02-15,5.3,3
2002-08-17,7.1,5
2004-09-01,4.0,3
• records = LOAD 'sample.txt' USING PigStorage(',') AS
(date:chararray, magnitude:float, location:int);
– Specify a schema
– Identify the field names and types
• DUMP records;
– Display the tuples
(2005-05-02,4.2F,3)
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,2)
Filters
• Filter some of the records using Boolean expressions.
• filtered_records = FILTER records BY magnitude > 5;
• DUMP filtered_records;
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2006-11-21,5.6F,2)
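• Conditions can be combined with AND, OR, and NOT. A sketch on the same data (the relation name is made up):
• strong_at_3 = FILTER records BY magnitude > 5 AND location == 3;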
Grouping Data
• This is where reducers kick in.
• grouped_records = GROUP filtered_records BY location;
• Observe that location id is our grouping key.
• DUMP grouped_records;
(2,{(2006-11-21,5.6F,2)})
(3,{(2005-02-15,5.3F,3),(2007-01-04,5.8F,3)})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)})
Each group consists of a KEY and a bag containing the matching tuples. Notice the curly braces.
Processing the Groups
• averages = FOREACH grouped_records GENERATE
group, AVG(filtered_records.magnitude);
• DUMP averages;
(1,7.5)
(2,5.599999904632568)
(3,5.550000190734863)
(5,6.8999998569488525)
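• A single FOREACH can compute several aggregates per group. A sketch using the built-in COUNT and MAX (stats is a made-up name):
• stats = FOREACH grouped_records GENERATE group, COUNT(filtered_records), MAX(filtered_records.magnitude);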
Storing the results on disk
• STORE averages INTO 'script1-output' USING PigStorage('\t');
• To run the script in local mode:
– pig -x local script1.pig
• To run in map reduce mode:
– hadoop dfs -copyFromLocal sample.txt /input/earthquake/sample.txt
– pig -x mapreduce script1.pig
– hadoop dfs -cat /script1-output/part-00000
• NOTE: script1-output is interpreted as an output file in local mode and as an output directory in a mapreduce job.
• This is because in a mapreduce job, multiple reducers may generate separate part-xxxxx files, all written to a common output directory in HDFS.
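• From the grunt shell, cat on the output directory prints the contents of the part files (assuming mapreduce mode):
grunt> cat /script1-output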
Displaying Schema information
• DESCRIBE records;
– records: {date: chararray, magnitude: float, location: int}
• DESCRIBE averages;
– averages: {group: int,double}
• ILLUSTRATE averages;
------------------------------------------------------------------------------
| records | date: bytearray | magnitude: bytearray | location: bytearray |
------------------------------------------------------------------------------
|         | 2005-02-15      | 5.3                  | 3                   |
|         | 2004-09-01      | 4.0                  | 3                   |
|         | 2007-01-04      | 5.8                  | 3                   |
------------------------------------------------------------------------------
| records | date: chararray | magnitude: float | location: int |
----------------------------------------------------------------
|         | 2005-02-15     | 5.3              | 3             |
|         | 2004-09-01     | 4.0              | 3             |
|         | 2007-01-04     | 5.8              | 3             |
----------------------------------------------------------------
| filtered_records | date: chararray | magnitude: float | location: int |
-------------------------------------------------------------------------
|                  | 2005-02-15     | 5.3              | 3             |
|                  | 2007-01-04     | 5.8              | 3             |
-------------------------------------------------------------------------
| grouped_records | group: int | filtered_records: bag({date: chararray,magnitude: float,location: int}) |
-----------------------------------------------------------------------------------------------------------
|                 | 3          | {(2005-02-15, 5.3, 3), (2007-01-04, 5.8, 3)}                             |
-----------------------------------------------------------------------------------------------------------
| averages | group: int | double            |
---------------------------------------------
|          | 3          | 5.550000190734863 |
---------------------------------------------
Dealing with corrupt data
• DUMP records;
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,)
(2001-07-12,6.7F,5)
(2004-03-01,,)
(2007-01-04,5.8F,3)
• good_records = FILTER records BY location IS
NOT NULL;
Filtering corrupt data
• DUMP good_records;
(2004-09-01,4.0F,3)
(2001-07-12,6.7F,5)
(2007-01-04,5.8F,3)
Splitting data
• Split the data into partitions as good and bad
records.
• SPLIT records INTO bad_records IF location IS NULL, good_records IF location IS NOT NULL;
• DUMP bad_records;
(2006-11-21,5.6F,)
(2004-03-01,,)
Counting
• Count the number of bad records.
• temp_group = GROUP bad_records ALL;
• bad_count = FOREACH temp_group GENERATE
COUNT(bad_records);
• DUMP bad_count;
(2L)
• NOTE: Unlike SQL, Pig Latin doesn't have a SELECT COUNT(*)
operator. We have to manually create the groups and count
the bags inside each group.
Malformed Records
• Eliminate records with missing fields.
• DUMP records;
(2004-09-01,4.0,3)
(2006-11-21,5.6)
(2001-07-12)
(2007-01-04,5.8,3)
• filtered_records = FILTER records BY SIZE(*) == 3;
• DUMP filtered_records;
(2004-09-01,4.0,3)
(2007-01-04,5.8,3)
User Defined Functions
• REGISTER udf.jar;
• DEFINE isHigh src.pig.examples.MagnitudeFilter();
• DEFINE round src.pig.examples.OutputFormatter();
• records = LOAD 'sample.txt' USING PigStorage(',') AS (date:chararray, magnitude:float, location:int);
• filtered_records = FILTER records BY isHigh(magnitude);
• grouped_records = GROUP filtered_records BY location;
• averages = FOREACH grouped_records GENERATE group, AVG(filtered_records.magnitude);
• rounded_averages = FOREACH averages GENERATE round($1);
• STORE rounded_averages INTO 'udf-output' USING PigStorage('\t');
Filter UDF Example
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class MagnitudeFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple t) throws IOException {
        if (t == null || t.size() == 0) {
            return false;
        }
        try {
            // Get the first field (the magnitude passed in from the script)
            Object o = t.get(0);
            if (o == null) {
                return false;
            }
            // Cast it to a Float object
            float i = (Float) o;
            return i > 5;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Eval UDF Example
import java.io.IOException;
import java.text.DecimalFormat;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class OutputFormatter extends EvalFunc<String> {
    @Override
    public String exec(Tuple t) throws IOException {
        if (t == null || t.size() == 0) {
            return null;
        }
        try {
            // Get the first field (the average computed by AVG)
            Object o = t.get(0);
            if (o == null) {
                return null;
            }
            // AVG produces a Double; format it with two decimal places
            DecimalFormat f = new DecimalFormat("0.00");
            return f.format((Double) o);
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
JOINS
• locations = LOAD 'locations.txt' USING PigStorage(',') AS (location:int,name:chararray);
• DUMP locations;
(1,Hong Kong)
(2,Venice)
(5,Istanbul)
(6,Mumbai)
(7,Los Angeles)
• DUMP earthquakes;
(2005-05-02,4.2F,3)
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,2)
…
Inner Joins
• Each row in the resulting relation is a match between the two input relations.
• temp = JOIN earthquakes BY $2, locations BY $0;
• DUMP temp;
(2004-03-01,3.6F,1,1,Hong Kong)
(2006-11-14,7.5F,1,1,Hong Kong)
(2006-10-12,4.1F,1,1,Hong Kong)
(2006-11-21,5.6F,2,2,Venice)
(2003-06-19,3.8F,2,2,Venice)
(2002-08-17,7.1F,5,5,Istanbul)
(2001-07-12,6.7F,5,5,Istanbul)
(2007-05-24,4.7F,6,6,Mumbai)
(2009-01-13,7.3F,7,7,Los Angeles)
Inner Joins cont.
• Project away the fields we don't need.
• temp_clean = FOREACH temp GENERATE $0, $1, $4;
• DUMP temp_clean;
(2004-03-01,3.6F,Hong Kong)
(2006-11-14,7.5F,Hong Kong)
(2006-10-12,4.1F,Hong Kong)
(2006-11-21,5.6F,Venice)
(2003-06-19,3.8F,Venice)
(2002-08-17,7.1F,Istanbul)
(2001-07-12,6.7F,Istanbul)
(2007-05-24,4.7F,Mumbai)
(2009-01-13,7.3F,Los Angeles)
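• Since the join keeps both input schemas, the same projection can be written with disambiguated field names instead of positions (a sketch; the :: prefix says which input a field came from):
• temp_clean = FOREACH temp GENERATE earthquakes::date, earthquakes::magnitude, locations::name;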
Remarks and Optimization Tricks
• The general JOIN operator should be used when both of the data sets are large.
• How would you join two large datasets if you were told to write a Map Reduce application using Hadoop?
• Map Side Joins: Also known as Fragment Replicate Joins. Distribute the small data set to all of the Mappers.
• The Mappers read the small data set into an in-memory data structure such as a hash table.
• Each record from the larger data set is then checked against the hash table for a matching key.
• If there is a match, emit the joined output; otherwise, simply ignore the record.
Inner JOINS with PIG
• temp_fast = JOIN earthquakes BY $2, locations
BY $0 USING "replicated";
• Make sure the first relation corresponds to the large data set.
• More advanced techniques?
• Bloom Filters, Distributed Memory Caches,
Special Encoding for Compression, etc...
COGROUP Operator
• The result of the JOIN operator is always a flat data set consisting of tuples.
• If you need more complex data structures having nested tuples, you should use COGROUP.
• temp_cogroup1 = COGROUP earthquakes BY $2, locations BY $0;
(1,{(2004-03-01,3.6F,1),(2006-11-14,7.5F,1),(2006-10-12,4.1F,1)},{(1,Hong Kong)})
(2,{(2006-11-21,5.6F,2),(2003-06-19,3.8F,2)},{(2,Venice)})
(3,{(2005-05-02,4.2F,3),(2005-02-15,5.3F,3),(2004-09-01,4.0F,3),(2007-01-04,5.8F,3)},{})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)},{(5,Istanbul)})
(6,{(2007-05-24,4.7F,6)},{(6,Mumbai)})
(7,{(2009-01-13,7.3F,7)},{(7,Los Angeles)})
(8,{(2001-07-13,6.1F,8)},{})
• Notice that non-matching keys have empty bags.
• To suppress the rows having empty bags in the second relation, use the INNER keyword.
• temp_cogroup2 = COGROUP earthquakes BY $2, locations BY $0 INNER;
• DUMP temp_cogroup2;
(1,{(2004-03-01,3.6F,1),(2006-11-14,7.5F,1),(2006-10-12,4.1F,1)},{(1,Hong Kong)})
(2,{(2006-11-21,5.6F,2),(2003-06-19,3.8F,2)},{(2,Venice)})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)},{(5,Istanbul)})
(6,{(2007-05-24,4.7F,6)},{(6,Mumbai)})
(7,{(2009-01-13,7.3F,7)},{(7,Los Angeles)})
FLATTEN Operator
• Flatten the second data set.
• temp_flattened1 = FOREACH temp_cogroup2
GENERATE FLATTEN(locations), earthquakes.$0;
• DUMP temp_flattened1;
(1,Hong Kong,{(2004-03-01),(2006-11-14),(2006-10-12)})
(2,Venice,{(2006-11-21),(2003-06-19)})
(5,Istanbul,{(2002-08-17),(2001-07-12)})
(6,Mumbai,{(2007-05-24)})
(7,Los Angeles,{(2009-01-13)})
Filter the first field
• temp_flattened_filtered = FOREACH temp_flattened1 GENERATE $1, $2;
– OR –
• DESCRIBE temp_flattened1;
• temp_flattened1: {locations::location: int,locations::name: chararray,{date: chararray}}
• temp_flattened_filtered = FOREACH temp_flattened1 GENERATE name, date;
• DUMP temp_flattened_filtered;
(Hong Kong,{(2004-03-01),(2006-11-14),(2006-10-12)})
(Venice,{(2006-11-21),(2003-06-19)})
(Istanbul,{(2002-08-17),(2001-07-12)})
(Mumbai,{(2007-05-24)})
(Los Angeles,{(2009-01-13)})
Counting
• Count the number of earthquakes in each location.
• eq_counts = FOREACH temp_flattened_filtered GENERATE $0, COUNT($1);
• DUMP eq_counts;
(Hong Kong,3L)
(Venice,2L)
(Istanbul,2L)
(Mumbai,1L)
(Los Angeles,1L)
Sorting
• Sort the results alphabetically.
• ordered_eq_counts = ORDER eq_counts BY $0 ASC;
• DUMP ordered_eq_counts;
(Hong Kong,3L)
(Istanbul,2L)
(Los Angeles,1L)
(Mumbai,1L)
(Venice,2L)
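• To rank locations by activity instead, sort by the count (a sketch; by_count is a made-up name):
• by_count = ORDER eq_counts BY $1 DESC;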
CROSS Operator
• Used for computing the Cartesian product of two relations:
• cartesian_product = CROSS earthquakes, locations;
• DUMP cartesian_product;
(2005-05-02,4.2F,3,1,Hong Kong)
(2005-05-02,4.2F,3,2,Venice)
(2005-05-02,4.2F,3,5,Istanbul)
(2005-05-02,4.2F,3,6,Mumbai)
(2005-05-02,4.2F,3,7,Los Angeles)
• If we have two data sets X and Y, the resulting relation will have |X| * |Y| tuples.
• Warning: Avoid doing this with large data sets.
LIMIT Operator
• limited = LIMIT cartesian_product 5;
• DUMP limited;
(2005-05-02,4.2F,3,1,Hong Kong)
(2005-05-02,4.2F,3,2,Venice)
(2005-05-02,4.2F,3,5,Istanbul)
(2005-05-02,4.2F,3,6,Mumbai)
(2005-05-02,4.2F,3,7,Los Angeles)
• Use it for sampling a large data set quickly.
• Note that LIMIT doesn't choose the rows randomly; it just returns the first k rows, where k is the limit.
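• If you need a truly random sample, Pig also has a SAMPLE operator (a sketch; 0.1 keeps each record with roughly 10% probability):
• sampled = SAMPLE earthquakes 0.1;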
Parallelism
• In a Map Reduce job, we can set the number of reducers to specify the degree of parallelism.
• Processing and writing the data should be done in parallel in Hadoop mode.
• groups = GROUP earthquakes BY location PARALLEL 10;
• This will invoke 10 reducers running in parallel.
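• PARALLEL can be attached to any operator that triggers a reduce phase, such as JOIN, COGROUP, or ORDER. A sketch reusing the earlier join:
• temp = JOIN earthquakes BY $2, locations BY $0 PARALLEL 10;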
Parallelism cont.
• We can't manually determine the number of
Mappers in Pig (nor in Hadoop, anymore).
• Each HDFS block will be assigned as a separate
map task.
• Ideally, each block will be a new input split.
• Default HDFS block size is 64MB.
• In Hadoop, you can modify the split size to
have more control over the map tasks.
More on Optimization
• Project early and often
• records = LOAD 'input' AS (a, b, c, x, y);
• req_records = FOREACH records GENERATE x, y;
• Filter early and often
• records = LOAD 'input' AS (a, b, c, d);
• filtered = FILTER records BY a == 1;
• Check out http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html
for more optimization tricks.
Similar Frameworks
• Pig Latin, Yahoo!
• Hive, Facebook
• Sawzall, Google
• DryadLINQ, Microsoft
Useful References
• http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_users.html - USER GUIDE
• http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html - REFERENCE MANUAL
• http://hadoop.apache.org/pig/docs/r0.5.0/tutorial.html - PIG TUTORIAL
• http://wiki.apache.org/pig/ - PIG WIKI PAGE