TriHUG November Pig Talk by Alan Gates
![Page 1: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/1.jpg)
Pig: Dataflow Scripting for Hadoop
© Hortonworks, Inc 2011
Alan F. Gates
@alanfgates
![Page 2: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/2.jpg)
Who Am I?
• Pig committer and PMC Member
• HCatalog committer and mentor
• Member of ASF and Incubator PMC
• Co-founder of Hortonworks
• Author of Programming Pig from O’Reilly
Photo credit: Steven Guarnaccia, The Three Little Pigs
![Page 3: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/3.jpg)
Who Are You?
![Page 4: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/4.jpg)
Example
For all of your registered users, you want to count how many came to your site this month. You want this count both by geography (zip code) and by demographic group (age and gender).
Dataflow: Load Logs and Load Users feed a Semi-join; its output is counted two ways (Count by zip; Count by age, gender), and each count goes to its own Store results step.
![Page 5: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/5.jpg)
In Pig Latin

-- Load web server logs
logs = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801' and date < '20110901';

-- Load users
users = load 'users' using HCatLoader();

-- Remove any users that did not visit this month
-- (cogroup names each bag after its input relation, so the test is on thismonth)
grpd = cogroup thismonth by userid, users by userid;
fltrd = filter grpd by not IsEmpty(thismonth);
visited = foreach fltrd generate flatten(users);

-- Count by zip code
grpbyzip = group visited by zip;
cntzip = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo = foreach grpbydemo generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
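For readers who want to check the logic of the dataflow without a cluster, here is a minimal plain-Python sketch of the same steps: filter, semi-join, and two counts. It is not Pig and does not use HCatalog. The sample records and the `count_visitors` helper are hypothetical, made up only for illustration.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the 'server_logs' and 'users' tables.
logs = [
    {"userid": "u1", "date": "20110803"},
    {"userid": "u1", "date": "20110815"},
    {"userid": "u2", "date": "20110720"},  # outside the month of interest
    {"userid": "u3", "date": "20110830"},
    {"userid": "u9", "date": "20110810"},  # not a registered user
]
users = [
    {"userid": "u1", "zip": "27513", "age": 34, "gender": "f"},
    {"userid": "u2", "zip": "27513", "age": 41, "gender": "m"},
    {"userid": "u3", "zip": "27701", "age": 34, "gender": "f"},
]


def count_visitors(logs, users, start, end):
    """Count registered users with at least one log record in [start, end)."""
    # Filter: keep only the user ids seen in the date range.
    active_ids = {r["userid"] for r in logs if start <= r["date"] < end}
    # Semi-join: each user is kept at most once, however many visits they made.
    visited = [u for u in users if u["userid"] in active_ids]
    # Two independent aggregations over the same intermediate result.
    by_zip = Counter(u["zip"] for u in visited)
    by_demo = Counter((u["age"], u["gender"]) for u in visited)
    return by_zip, by_demo


by_zip, by_demo = count_visitors(logs, users, "20110801", "20110901")
```

Both counts are computed from the single `visited` list, which mirrors how the two `group` statements in the script share one upstream relation.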
![Page 6: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/6.jpg)
Pig’s Place in the Data World
Data Collection
Data Factory (Pig): Pipelines, Iterative Processing, Research
Data Warehouse (Hive): BI Tools, Analysis
![Page 7: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/7.jpg)
Why not MapReduce?
• Pig provides a number of standard data operators
– Five different implementations of join (hash, fragment-replicate, merge, sparse merged, skewed)
– Order by provides total ordering across reducers in a balanced way
• Provides optimizations that are hard to do by hand
– Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• User Defined Functions provide a way to inject your code into the data transformation
– can be written in Java or Python
– can do column transformation (TOUPPER) and aggregation (SUM)
– can be written to take advantage of the combiner
• Control flow can be done via Python or Java
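To make the combiner point concrete, here is a minimal plain-Python sketch of an aggregate split into three phases, so that partial results can be merged before the reduce step. The phase names loosely follow the Initial, Intermed, and Final stages of Pig's Algebraic UDF interface, but this is not a Pig UDF. The function names and the `splits` data are hypothetical.

```python
def initial(record):
    """Map side: turn one input record into a partial count."""
    return 1


def intermed(partials):
    """Combiner: merge partial counts. It may run zero or more times."""
    return sum(partials)


def final(partials):
    """Reduce side: merge whatever partial counts arrive into the result."""
    return sum(partials)


def algebraic_count(splits):
    """Count records across splits, combining within each split first."""
    # Each split stands in for the records one map task sees for a key.
    combined = [intermed([initial(r) for r in split]) for split in splits]
    return final(combined)


splits = [["a", "b", "c"], ["d"], []]
total = algebraic_count(splits)
```

Because `intermed` and `final` give the same answer however the partial counts are grouped, the reducer receives one number per split instead of one record per input row.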
![Page 8: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/8.jpg)
Embedding Example: Compute PageRank
PageRank:
A system of linear equations (as many as there are pages on the web, yeah, a lot):
It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value.
Ref: http://en.wikipedia.org/wiki/PageRank
Slide courtesy of Julien Le Dem
![Page 9: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/9.jpg)
Or more visually
Each page sends a fraction of its PageRank to each page it links to. That fraction is inversely proportional to the page's number of outgoing links.
Slide courtesy of Julien Le Dem
![Page 10: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/10.jpg)
Slide courtesy of Julien Le Dem
![Page 11: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/11.jpg)
Let’s zoom in
Pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Pass parameters as a dictionary
Iterate 10 times
Pass parameters as a dictionary
Just run P, which was declared above
The output becomes the new input
Slide courtesy of Julien Le Dem
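The code on this slide did not survive in the transcript, so the following is only a plain-Python sketch of the iteration the callouts describe. It applies the PR(A) formula once per pass, feeds each pass's output into the next, and repeats 10 times. It does not use Pig or its embedding API. The damping factor d = 0.85, the starting rank of 1.0, and the three-page `links` graph are assumptions chosen for illustration.

```python
def pagerank_step(links, ranks, d=0.85):
    """One pass of PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    new_ranks = {page: 1.0 - d for page in links}
    for page, outgoing in links.items():
        if not outgoing:
            continue  # a page with no links sends nothing
        share = ranks[page] / len(outgoing)  # PR(T) / C(T)
        for target in outgoing:
            new_ranks[target] += d * share
    return new_ranks


def pagerank(links, iterations=10, d=0.85):
    """Start every page at 1.0, then feed each pass's output into the next."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks, d)
    return ranks


# Hypothetical graph: every link target also appears as a key.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In the embedded version described by the callouts, each pass would be a Pig job whose stored output is bound as the input of the next run. Here a dictionary plays that role.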
![Page 12: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/12.jpg)
Recently Added Features
• New in 0.9 (released July 2011):
– Embedding in Python
– Macros and Imports
• New in 0.10 (should release in Dec 2011):
– Boolean data type
– Hash-based aggregation for aggregates with low cardinality keys
– UDFs to build and apply bloom filters
– UDFs in JRuby (may slip to next release)
![Page 13: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/13.jpg)
Learn More
• Read the online documentation: http://pig.apache.org/
• Programming Pig from O’Reilly Press
• Join the mailing lists:
– [email protected] for user questions
– [email protected] for developer issues
• Follow me on Twitter, @alanfgates
![Page 14: TriHUG November Pig Talk by Alan Gates](https://reader033.fdocuments.in/reader033/viewer/2022052321/554bc570b4c90530298b54b2/html5/thumbnails/14.jpg)
Questions