Introduction to Pig Latin
A. Soner Balkir
What’s wrong with Map Reduce?
[Diagram: a chain of alternating Map and Reduce stages]
• Long chains of map reduce jobs, one following another, are hard to deal with.
• For each map reduce job, you need to implement at least 3 classes.
• Input/output specifications, compilation errors, creating a jar file for each job.
• How about JOINS?
Remember Word Count?
• You wrote over 150 lines of code but you care
about at most 15 lines.
• What if?
A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';
Pig Automatically…
• Parses the script and tries to optimize it.
– No guarantees! Be careful!
• Compiles the Java code and creates the jar file.
• Submits the job and monitors it.
• No major upgrades to the existing code. Just update your Pig binaries!
– Remember the changes from Hadoop 0.1x to 0.2x
What's more?
• Pig can run a script on a small subset of the
data.
• Create a random sample of the input.
• Local mode vs. map reduce mode.
• Extensible: write your own UDFs!
What Pig is not
• It can't be used as a DBMS replacement.
– No single-record fetching.
– No SELECT keyword.
• Pure Hadoop implementations may be faster. Pig is evolving.
• Pig and Hive are almost sisters with minor differences.
• Carefully decide when you need Hadoop and when you need Pig. Life is all about trade-offs.
Installing and Running
• Download Pig at: http://hadoop.apache.org/pig/
• Set environment variables:
– export PIGDIR=/home/soner/pig-0.5.0
– export PATH=$PATH:$PIGDIR/bin
– export PIG_HADOOP_VERSION=20
– export PIG_CLASSPATH=$HADOOPDIR/conf
• Start the grunt shell:
pig -x local
pig -x mapreduce
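• A quick sanity check once the environment is set (a minimal sketch; the file name is just a placeholder):
grunt> records = LOAD 'sample.txt' USING PigStorage(',');
grunt> DUMP records;
grunt> quit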
Three ways to run a Pig script
• Grunt shell
• Embedded
– Just like using JDBC to execute SQL commands.
• Scripts
– pig -x local myscript.pig
Before Starting…
• If you are using Eclipse, take a look at PigPen:
– http://wiki.apache.org/pig/PigPen
• If you are more comfortable in vi or vim, go to:
– http://www.vim.org/scripts/script.php?script_id=2186
• Syntax highlighters and IDEs always make things easier.
• Examples will be online.
Hello World Example
• sample.txt contains earthquake data:
(date, magnitude, location id)
example: head -n 4 sample.txt
2005-05-02,4.2,3
2005-02-15,5.3,3
2002-08-17,7.1,5
2004-09-01,4.0,3
• records = LOAD 'sample.txt' USING PigStorage(',') AS
(date:chararray, magnitude:float, location:int);
– Specify a schema
– Identify the field names and types
• DUMP records;
– Display the tuples
(2005-05-02,4.2F,3)
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,2)
Filters
• Filter some of the records using Boolean expressions.
• filtered_records = FILTER records BY magnitude > 5;
• DUMP filtered_records;
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2006-11-21,5.6F,2)
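• Conditions can be combined with AND, OR, and NOT. A sketch on the same data (the relation name is made up):
• strong_at_3 = FILTER records BY magnitude > 5 AND location == 3;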
Grouping Data
• This is where reducers kick in.
• grouped_records = GROUP filtered_records BY location;
• Observe that location id is our grouping key.
• DUMP grouped_records;
(2,{(2006-11-21,5.6F,2)})
(3,{(2005-02-15,5.3F,3),(2007-01-04,5.8F,3)})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)})
Each group consists of a KEY and a bag containing the matching tuples. Notice the curly braces.
Processing the Groups
• averages = FOREACH grouped_records GENERATE
group, AVG(filtered_records.magnitude);
• DUMP averages;
(1,7.5)
(2,5.599999904632568)
(3,5.550000190734863)
(5,6.8999998569488525)
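• A single FOREACH can compute several aggregates per group. A sketch using the built-in COUNT and MAX (stats is a made-up name):
• stats = FOREACH grouped_records GENERATE group, COUNT(filtered_records), MAX(filtered_records.magnitude);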
Storing the results on disk
• STORE averages INTO 'script1-output' USING PigStorage('\t');
• To run the script in local mode:
– pig -x local script1.pig
• To run in map reduce mode:
– hadoop dfs -copyFromLocal sample.txt /input/earthquake/sample.txt
– pig -x mapreduce script1.pig
– hadoop dfs -cat /script1-output/part-00000
• NOTE: script1-output is interpreted as an output file in local mode and as an output directory in a mapreduce job.
• This is because in a mapreduce job, multiple reducers may generate separate part-xxxxx files, all written to a common output directory in HDFS.
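• From the grunt shell, cat on the output directory prints the contents of the part files (assuming mapreduce mode):
grunt> cat /script1-output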
Displaying Schema information
• DESCRIBE records;
– records: {date: chararray, magnitude: float, location: int}
• DESCRIBE averages;
– averages: {group: int,double}
• ILLUSTRATE averages;
------------------------------------------------------------------------------
| records | date: bytearray | magnitude: bytearray | location: bytearray |
------------------------------------------------------------------------------
|         | 2005-02-15      | 5.3                  | 3                   |
|         | 2004-09-01      | 4.0                  | 3                   |
|         | 2007-01-04      | 5.8                  | 3                   |
------------------------------------------------------------------------------
| records | date: chararray | magnitude: float | location: int |
----------------------------------------------------------------
|         | 2005-02-15     | 5.3              | 3             |
|         | 2004-09-01     | 4.0              | 3             |
|         | 2007-01-04     | 5.8              | 3             |
----------------------------------------------------------------
| filtered_records | date: chararray | magnitude: float | location: int |
-------------------------------------------------------------------------
|                  | 2005-02-15     | 5.3              | 3             |
|                  | 2007-01-04     | 5.8              | 3             |
-------------------------------------------------------------------------
| grouped_records | group: int | filtered_records: bag({date: chararray,magnitude: float,location: int}) |
-----------------------------------------------------------------------------------------------------------
|                 | 3          | {(2005-02-15, 5.3, 3), (2007-01-04, 5.8, 3)}                             |
-----------------------------------------------------------------------------------------------------------
| averages | group: int | double            |
---------------------------------------------
|          | 3          | 5.550000190734863 |
---------------------------------------------
Dealing with corrupt data
• DUMP records;
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,)
(2001-07-12,6.7F,5)
(2004-03-01,,)
(2007-01-04,5.8F,3)
• good_records = FILTER records BY location IS
NOT NULL;
Filtering corrupt data
• DUMP good_records;
(2004-09-01,4.0F,3)
(2001-07-12,6.7F,5)
(2007-01-04,5.8F,3)
Splitting data
• Split the data into partitions as good and bad
records.
• SPLIT records INTO bad_records IF location IS NULL, good_records IF location IS NOT NULL;
• DUMP bad_records;
(2006-11-21,5.6F,)
(2004-03-01,,)
Counting
• Count the number of bad records.
• temp_group = GROUP bad_records ALL;
• bad_count = FOREACH temp_group GENERATE
COUNT(bad_records);
• DUMP bad_count;
(2L)
• NOTE: Unlike SQL, Pig Latin doesn't have a SELECT COUNT(*)
operator. We have to manually create the groups and count
the bags inside each group.
Malformed Records
• Eliminate records with missing fields.
• DUMP records;
(2004-09-01,4.0,3)
(2006-11-21,5.6)
(2001-07-12)
(2007-01-04,5.8,3)
• filtered_records = FILTER records BY SIZE(*) == 3;
• DUMP filtered_records;
(2004-09-01,4.0,3)
(2007-01-04,5.8,3)
User Defined Functions
• REGISTER udf.jar;
• DEFINE isHigh src.pig.examples.MagnitudeFilter();
• DEFINE round src.pig.examples.OutputFormatter();
• records = LOAD 'sample.txt' USING PigStorage(',') AS (date:chararray, magnitude:float, location:int);
• filtered_records = FILTER records BY isHigh(magnitude);
• grouped_records = GROUP filtered_records BY location;
• averages = FOREACH grouped_records GENERATE group, AVG(filtered_records.magnitude);
• rounded_averages = FOREACH averages GENERATE round($1);
• STORE rounded_averages INTO 'udf-output' USING PigStorage('\t');
Filter UDF Example
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class MagnitudeFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple t) throws IOException {
        if (t == null || t.size() == 0) {
            return false;
        }
        try {
            // Get the first field (the magnitude passed in from the script)
            Object o = t.get(0);
            if (o == null) {
                return false;
            }
            // Cast it to a Float object
            float i = (Float) o;
            return i > 5;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Eval UDF Example
import java.io.IOException;
import java.text.DecimalFormat;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class OutputFormatter extends EvalFunc<String> {
    @Override
    public String exec(Tuple t) throws IOException {
        if (t == null || t.size() == 0) {
            return null;
        }
        try {
            // Get the first field (the average computed by AVG)
            Object o = t.get(0);
            if (o == null) {
                return null;
            }
            // AVG produces a Double; format it with two decimal places
            DecimalFormat f = new DecimalFormat("0.00");
            return f.format((Double) o);
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
JOINS
• locations = LOAD 'locations.txt' USING PigStorage(',') AS (location:int,name:chararray);
• DUMP locations;
(1,Hong Kong)
(2,Venice)
(5,Istanbul)
(6,Mumbai)
(7,Los Angeles)
• DUMP earthquakes;
(2005-05-02,4.2F,3)
(2005-02-15,5.3F,3)
(2002-08-17,7.1F,5)
(2004-09-01,4.0F,3)
(2006-11-21,5.6F,2)
…
Inner Joins
• Each row in the resulting relation is a match between the two input relations.
• temp = JOIN earthquakes BY $2, locations BY $0;
• DUMP temp;
(2004-03-01,3.6F,1,1,Hong Kong)
(2006-11-14,7.5F,1,1,Hong Kong)
(2006-10-12,4.1F,1,1,Hong Kong)
(2006-11-21,5.6F,2,2,Venice)
(2003-06-19,3.8F,2,2,Venice)
(2002-08-17,7.1F,5,5,Istanbul)
(2001-07-12,6.7F,5,5,Istanbul)
(2007-05-24,4.7F,6,6,Mumbai)
(2009-01-13,7.3F,7,7,Los Angeles)
Inner Joins cont.
• Project away the fields we don't need.
• temp_clean = FOREACH temp GENERATE $0, $1, $4;
• DUMP temp_clean;
(2004-03-01,3.6F,Hong Kong)
(2006-11-14,7.5F,Hong Kong)
(2006-10-12,4.1F,Hong Kong)
(2006-11-21,5.6F,Venice)
(2003-06-19,3.8F,Venice)
(2002-08-17,7.1F,Istanbul)
(2001-07-12,6.7F,Istanbul)
(2007-05-24,4.7F,Mumbai)
(2009-01-13,7.3F,Los Angeles)
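• Since the join keeps both input schemas, the same projection can be written with disambiguated field names instead of positions (a sketch; the :: prefix says which input a field came from):
• temp_clean = FOREACH temp GENERATE earthquakes::date, earthquakes::magnitude, locations::name;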
Remarks and Optimization Tricks
• The general JOIN operator should be used when both of the data sets are large.
• How would you join two large datasets if you were told to write a Map Reduce application using Hadoop?
• Map Side Joins: Also known as Fragment Replicate Joins. Distribute the small data set to all of the Mappers.
• The Mappers read the small data set into an in-memory data structure such as a hash table.
• Each record from the larger data set is then checked against the hash table for a matching key.
• If there is a match, emit the joined output; otherwise, simply ignore the record.
Inner JOINS with PIG
• temp_fast = JOIN earthquakes BY $2, locations
BY $0 USING "replicated";
• Make sure the first relation corresponds to the large data set.
• More advanced techniques?
• Bloom Filters, Distributed Memory Caches,
Special Encoding for Compression, etc...
COGROUP Operator
• The result of the JOIN operator is always a flat data set consisting of tuples.
• If you need more complex data structures having nested tuples, you should use COGROUP.
• temp_cogroup1 = COGROUP earthquakes BY $2, locations BY $0;
(1,{(2004-03-01,3.6F,1),(2006-11-14,7.5F,1),(2006-10-12,4.1F,1)},{(1,Hong Kong)})
(2,{(2006-11-21,5.6F,2),(2003-06-19,3.8F,2)},{(2,Venice)})
(3,{(2005-05-02,4.2F,3),(2005-02-15,5.3F,3),(2004-09-01,4.0F,3),(2007-01-04,5.8F,3)},{})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)},{(5,Istanbul)})
(6,{(2007-05-24,4.7F,6)},{(6,Mumbai)})
(7,{(2009-01-13,7.3F,7)},{(7,Los Angeles)})
(8,{(2001-07-13,6.1F,8)},{})
• Notice that non-matching keys have empty bags.
• To suppress the rows having empty bags in the second relation, use the INNER keyword.
• temp_cogroup2 = COGROUP earthquakes BY $2, locations BY $0 INNER;
• DUMP temp_cogroup2;
(1,{(2004-03-01,3.6F,1),(2006-11-14,7.5F,1),(2006-10-12,4.1F,1)},{(1,Hong Kong)})
(2,{(2006-11-21,5.6F,2),(2003-06-19,3.8F,2)},{(2,Venice)})
(5,{(2002-08-17,7.1F,5),(2001-07-12,6.7F,5)},{(5,Istanbul)})
(6,{(2007-05-24,4.7F,6)},{(6,Mumbai)})
(7,{(2009-01-13,7.3F,7)},{(7,Los Angeles)})
FLATTEN Operator
• Flatten the second data set.
• temp_flattened1 = FOREACH temp_cogroup2
GENERATE FLATTEN(locations), earthquakes.$0;
• DUMP temp_flattened1;
(1,Hong Kong,{(2004-03-01),(2006-11-14),(2006-10-12)})
(2,Venice,{(2006-11-21),(2003-06-19)})
(5,Istanbul,{(2002-08-17),(2001-07-12)})
(6,Mumbai,{(2007-05-24)})
(7,Los Angeles,{(2009-01-13)})
Filter the first field
• temp_flattened_filtered = FOREACH temp_flattened1 GENERATE $1, $2;
– OR –
• DESCRIBE temp_flattened1;
• temp_flattened1: {locations::location: int,locations::name: chararray,{date: chararray}}
• temp_flattened_filtered = FOREACH temp_flattened1 GENERATE name, date;
• DUMP temp_flattened_filtered;
(Hong Kong,{(2004-03-01),(2006-11-14),(2006-10-12)})
(Venice,{(2006-11-21),(2003-06-19)})
(Istanbul,{(2002-08-17),(2001-07-12)})
(Mumbai,{(2007-05-24)})
(Los Angeles,{(2009-01-13)})
Counting
• Count the number of earthquakes in each location.
• eq_counts = FOREACH temp_flattened_filtered GENERATE $0, COUNT($1);
• DUMP eq_counts;
(Hong Kong,3L)
(Venice,2L)
(Istanbul,2L)
(Mumbai,1L)
(Los Angeles,1L)
Sorting
• Sort the results alphabetically.
• ordered_eq_counts = ORDER eq_counts BY $0 ASC;
• DUMP ordered_eq_counts;
(Hong Kong,3L)
(Istanbul,2L)
(Los Angeles,1L)
(Mumbai,1L)
(Venice,2L)
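• To rank locations by activity instead, sort by the count (a sketch; by_count is a made-up name):
• by_count = ORDER eq_counts BY $1 DESC;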
CROSS Operator
• Used for computing the Cartesian product of two relations:
• cartesian_product = CROSS earthquakes, locations;
• DUMP cartesian_product;
(2005-05-02,4.2F,3,1,Hong Kong)
(2005-05-02,4.2F,3,2,Venice)
(2005-05-02,4.2F,3,5,Istanbul)
(2005-05-02,4.2F,3,6,Mumbai)
(2005-05-02,4.2F,3,7,Los Angeles)
• If we have two data sets X and Y, the resulting relation will have |X| * |Y| tuples.
• Warning: Avoid doing this with large data sets.
LIMIT Operator
• limited = LIMIT cartesian_product 5;
• DUMP limited;
(2005-05-02,4.2F,3,1,Hong Kong)
(2005-05-02,4.2F,3,2,Venice)
(2005-05-02,4.2F,3,5,Istanbul)
(2005-05-02,4.2F,3,6,Mumbai)
(2005-05-02,4.2F,3,7,Los Angeles)
• Use it for sampling a large data set quickly.
• Note that LIMIT doesn't choose the rows randomly; it just returns the first k rows, where k is the limit.
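• If you need a truly random sample, Pig also has a SAMPLE operator (a sketch; 0.1 keeps each record with roughly 10% probability):
• sampled = SAMPLE earthquakes 0.1;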
Parallelism
• In a Map Reduce job, we can set the number of reducers to specify the degree of parallelism.
• Processing and writing the data should be done in parallel in Hadoop mode.
• groups = GROUP earthquakes BY location PARALLEL 10;
• This will invoke 10 reducers running in parallel.
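• PARALLEL can be attached to any operator that triggers a reduce phase, such as JOIN, COGROUP, or ORDER. A sketch reusing the earlier join:
• temp = JOIN earthquakes BY $2, locations BY $0 PARALLEL 10;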
Parallelism cont.
• We can't manually determine the number of
Mappers in Pig (nor in Hadoop, anymore).
• Each HDFS block will be assigned as a separate
map task.
• Ideally, each block will be a new input split.
• Default HDFS block size is 64MB.
• In Hadoop, you can modify the split size to
have more control over the map tasks.
More on Optimization
• Project early and often
• records = LOAD 'input' AS (a, b, c, x, y);
• req_records = FOREACH records GENERATE x, y;
• Filter early and often
• records = LOAD 'input' AS (a, b, c, d);
• filtered = FILTER records BY a == 1;
• Check out http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html
for more optimization tricks.
Similar Frameworks
• Pig Latin, Yahoo!
• Hive, Facebook
• Sawzall, Google
• DryadLINQ, Microsoft
Useful References
• http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_users.html - USER GUIDE
• http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html - REFERENCE MANUAL
• http://hadoop.apache.org/pig/docs/r0.5.0/tutorial.html - PIG TUTORIAL
• http://wiki.apache.org/pig/ - PIG WIKI PAGE