Pig and HCatalog in the Hadoop Ecosystem
Alan F. Gates, @alanfgates
Who Am I?
• Pig committer and PMC member
• HCatalog committer
• Member of the ASF and the Incubator PMC
• Co-founder of Hortonworks
• Author of Programming Pig from O'Reilly
© Hortonworks 2013
Who Are You?
The Hadoop Ecosystem

HORTONWORKS DATA PLATFORM (HDP)
• PLATFORM SERVICES – Enterprise readiness: HA, DR, snapshots, security, …
• HADOOP CORE – Distributed storage & processing: HDFS, MapReduce
• DATA SERVICES – Store, process, and access data: HCatalog, Hive, Pig, HBase, Sqoop, Flume
• OPERATIONAL SERVICES – Manage & operate at scale: Oozie, Ambari
HDP runs on an appliance, in the cloud, or on an OS / VM.
Pig’s Place in the Data World
Data Collection feeds the Data Factory (Pig: pipelines, iterative processing, research), which feeds the Data Warehouse (Hive: BI tools, analysis).
Example
For all of your registered users, you want to count how many came to your site this month. You want this count both by geography (zip code) and by demographic group (age and gender).

The dataflow:
• Load logs; load users
• Semi-join to keep only users who visited
• Count by zip; store results
• Count by age and gender; store results
In Pig Latin

-- Load web server logs
logs = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801' and date < '20110901';

-- Load users
users = load 'users' using HCatLoader();

-- Remove any users that did not visit this month
grpd = cogroup thismonth by userid, users by userid;
fltrd = filter grpd by not IsEmpty(thismonth);
visited = foreach fltrd generate flatten(users);

-- Count by zip code
grpbyzip = group visited by zip;
cntzip = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo = foreach grpbydemo generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
Why not MapReduce?
• Pig provides a number of standard data operators
  – Five different implementations of join (hash, fragment-replicate, merge, merge-sparse, skewed)
  – Order by provides total ordering across reducers in a balanced way
• Provides optimizations that are hard to do by hand
  – Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• User Defined Functions provide a way to inject your code into the data transformation
  – Can be written in Java or Python
  – Can do column transformation (TOUPPER) and aggregation (SUM)
  – Can be written to take advantage of the combiner
• Control flow can be done via Python or Java
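To make the UDF point concrete, here is a minimal sketch of a Python eval function of the kind the bullets describe. The function name, the registration snippet in the comments, and the pig_util import fallback are illustrative assumptions, not code from the deck.

```python
# Illustrative Pig UDF in Python (hypothetical example).
# Under Pig you would register the file and call it from Pig Latin, e.g.:
#   register 'myudfs.py' using jython as myudfs;
#   upped = foreach users generate myudfs.to_upper(name);
try:
    from pig_util import outputSchema      # provided when run under Pig
except ImportError:
    def outputSchema(schema):              # no-op stand-in for plain Python
        def wrap(func):
            return func
        return wrap

@outputSchema('chararray')
def to_upper(s):
    """Column-transformation UDF: upper-case one field, passing nulls through."""
    if s is None:
        return None
    return s.upper()
```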
Embedding Example: Compute PageRank

PageRank: a system of linear equations (as many as there are pages on the web, so a lot). It can be approximated iteratively: compute the new PageRank based on the PageRanks of the previous iteration, starting with some initial value.

Ref: http://en.wikipedia.org/wiki/PageRank
Slide courtesy of Julien Le Dem
Or more visually
Each page sends a fraction of its PageRank to the pages it links to, inversely proportional to its number of outbound links.

Slide courtesy of Julien Le Dem
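The iterative approximation just described can be sketched in a few lines of plain Python. The three-page graph and the damping factor d = 0.85 are illustrative assumptions; the deck does not specify them.

```python
# Minimal sketch of iterative PageRank: each page sends its rank, divided
# by its outbound link count, to the pages it links to.
def pagerank(links, d=0.85, iterations=10):
    """links: dict mapping each page to the list of pages it links to."""
    ranks = {p: 1.0 for p in links}          # start with some value
    for _ in range(iterations):
        new = {p: (1 - d) for p in links}    # the (1 - d) base term
        for page, outs in links.items():
            share = ranks[page] / len(outs)  # fraction per outbound link
            for target in outs:
                new[target] += d * share
        ranks = new                          # this output feeds the next pass
    return ranks

# Tiny illustrative graph: A -> B, C; B -> C; C -> A.
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
ranks = pagerank(graph)
```

C collects rank from both A and B, so it ends up with the highest score; the total rank stays equal to the number of pages.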
(Figure: PageRank dataflow diagram; slide courtesy of Julien Le Dem)
Let’s zoom in
The driver is a Python script embedding Pig. It declares a Pig script P that computes PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), passes its parameters as a dictionary, and iterates 10 times: each pass just runs P, which was declared above, and the output of one iteration becomes the input of the next.

Slide courtesy of Julien Le Dem
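The driver code on the slide did not survive this text export; the following Jython-style pseudocode sketches the loop it describes. The relation names, d = 0.5, and file paths are illustrative assumptions, not taken from the deck.

```
# Jython-style pseudocode of the embedding loop (not runnable as-is;
# requires Pig's scripting API on the classpath).
from org.apache.pig.scripting import Pig

# Declare the Pig script P once; $d, $docs_in, $docs_out are parameters.
P = Pig.compile("""
previous = load '$docs_in' as (url, pagerank, links);
...                 -- compute PR(A) = (1-d) + d * (sum of PR(Ti)/C(Ti))
store new_pagerank into '$docs_out';
""")

params = {'d': '0.5', 'docs_in': 'pagerank_data_1'}  # parameters as a dictionary
for i in range(10):                                  # iterate 10 times
    out = 'out/pagerank_data_' + str(i + 1)
    params['docs_out'] = out
    stats = P.bind(params).runSingle()               # just run P, declared above
    params['docs_in'] = out                          # output becomes the new input
```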
Recently Added Features
• New in 0.9 (released July 2011):
  – Macros and imports
• New in 0.10 (released April 2012):
  – Boolean data type
  – Hash-based aggregation for aggregates with low-cardinality keys
  – UDFs to build and apply Bloom filters
  – UDFs in JRuby
• New in 0.11 (released February 2013):
  – Datetime type
  – RANK, CUBE, ROLLUP
  – HCatalog DDL integration
  – UDFs in Groovy
Learn More
• Read the online documentation: http://pig.apache.org/
• Programming Pig from O'Reilly Press
• Join the mailing lists:
  – [email protected] for user questions
  – [email protected] for developer issues
HCatalog: Table Management for Hadoop
Many Data Tools
• MapReduce
  – Early adopters
  – Non-relational algorithms
  – Performance-sensitive applications
• Pig
  – ETL
  – Data modeling
  – Iterative algorithms
• Hive
  – Analysis
  – Connectors to BI tools

Strength: pick the right tool for your application.
Weakness: hard for users to share their data.
Users: Data Sharing is Hard
Photo Credit: totalAldo via Flickr
This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.

Joe: "Bob, I need today's data."
Bob: "OK."
Joe: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass that alter table command."
Bob: "Dude, we need HCatalog."
Tool Comparison
Feature        | MapReduce       | Pig                                  | Hive
---------------+-----------------+--------------------------------------+---------------------
Record format  | Key-value pairs | Tuple                                | Record
Data model     | User defined    | int, float, string, bytes,           | int, float, string,
               |                 | maps, tuples, bags                   | maps, structs, lists
Schema         | Encoded in app  | Declared in script or read by loader | Read from metadata
Data location  | Encoded in app  | Declared in script                   | Read from metadata
Data format    | Encoded in app  | Declared in script                   | Read from metadata

• Pig and MR users need to know a lot to write their apps
• When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
• Hive users have to load data produced by Pig/MR users before they can access it
How Hadoop Apps Access Data
Each tool reaches the data along its own path:
• Hive: Metastore Client → Metastore (metadata); SerDe + InputFormat/OutputFormat → HDFS (data)
• MapReduce: InputFormat/OutputFormat → HDFS
• Pig: Load/Store functions → HDFS
Opening up Metadata to MR & Pig
• Hive: Metastore Client → Metastore (metadata); SerDe + InputFormat/OutputFormat → HDFS (data)
• MapReduce: HCatInputFormat/HCatOutputFormat, which wrap the SerDe and the Metastore client → HDFS
• Pig: HCatLoader/HCatStorer, built on HCatInputFormat/HCatOutputFormat → HDFS
Tools With HCatalog
Feature        | MapReduce + HCatalog | Pig + HCatalog             | Hive
---------------+----------------------+----------------------------+---------------------
Record format  | Record               | Tuple                      | Record
Data model     | int, float, string,  | int, float, string, bytes, | int, float, string,
               | maps, structs, lists | maps, tuples, bags         | maps, structs, lists
Schema         | Read from metadata   | Read from metadata         | Read from metadata
Data location  | Read from metadata   | Read from metadata         | Read from metadata
Data format    | Read from metadata   | Read from metadata         | Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users' data as soon as it is committed
Pig Example
Assume you want to count how many times each of your users went to each of your URLs:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';
Using HCatalog:

raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();

• No need to know the file location
• No need to declare the schema
• The date predicate (ds == '20120530') is a partition filter
WebHCat REST API
Hadoop/HCatalog
• FKA Templeton
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

{"tables": ["counted", "processed"], "database": "default"}

Create new table "rawevents":

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{"name": "url", "type": "string"},
             {"name": "user", "type": "string"}],
 "partitionedBy": [{"name": "ds", "type": "string"}]}

{"table": "rawevents", "database": "default"}

Describe table "rawevents":

GET http://…/v1/ddl/database/default/table/rawevents

{"columns": [{"name": "url", "type": "string"},
             {"name": "user", "type": "string"}],
 "database": "default",
 "table": "rawevents"}
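A standard-library Python sketch of calling these DDL endpoints. The base URL (WebHCat typically serves at http://<host>:50111/templeton) and the user.name query parameter are assumptions for illustration; adjust them for your cluster.

```python
# Hedged sketch: talking to WebHCat's v1 DDL resources from Python.
import json
from urllib.request import Request, urlopen

def table_url(base, db, table=None):
    """Build a URL like <base>/v1/ddl/database/<db>/table[/<table>]."""
    url = base.rstrip("/") + "/v1/ddl/database/" + db + "/table"
    if table:
        url += "/" + table
    return url

def list_tables(base, db, user):
    """GET the table list; returns the decoded JSON response."""
    req = Request(table_url(base, db) + "?user.name=" + user)
    with urlopen(req) as resp:             # needs a live WebHCat server
        return json.load(resp)

def create_table(base, db, table, columns, user):
    """PUT a table definition, e.g. columns=[{"name": "url", "type": "string"}]."""
    body = json.dumps({"columns": columns}).encode("utf-8")
    req = Request(table_url(base, db, table) + "?user.name=" + user,
                  data=body, method="PUT",
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)
```

The URL construction can be checked without a server; the two network helpers return decoded JSON shaped like the responses shown above.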
Relationship to Hive
• Opens Hive's metadata to non-Hive users; does not keep its own copy of the metadata
• If you are a Hive user, you can transition to HCatalog with no metadata modifications
  – Uses Hive's SerDes for data translation (starting in 0.4)
  – Uses Hive SQL for data definition (DDL)
• Recently graduated from the Apache Incubator and became part of the Hive project
  – Hive 0.11 will include HCatalog