Pig and HCatalog in the Hadoop Ecosystem
Alan F. Gates, @alanfgates
Who Am I?
• Pig committer and PMC member
• HCatalog committer
• Member of the ASF and the Incubator PMC
• Co-founder of Hortonworks
• Author of Programming Pig from O'Reilly
© Hortonworks 2013
Who Are You?
The Hadoop Ecosystem

HORTONWORKS DATA PLATFORM (HDP)
• PLATFORM SERVICES – Enterprise readiness: HA, DR, snapshots, security, …
• HADOOP CORE – Distributed storage & processing: HDFS, MapReduce
• DATA SERVICES – Store, process, and access data: HCatalog, Hive, Pig, HBase, Sqoop, Flume
• OPERATIONAL SERVICES – Manage & operate at scale: Oozie, Ambari
HDP runs on an appliance, in the cloud, or on an OS / VM.
Pig’s Place in the Data World
Data Collection feeds the Data Factory (Pig: pipelines, iterative processing, research), which feeds the Data Warehouse (Hive: BI tools, analysis).
Example
For all of your registered users, you want to count how many came to your site this month. You want this count both by geography (zip code) and by demographic group (age and gender).

The dataflow:
• Load logs; load users
• Semi-join to keep only users who visited
• Count by zip; store results
• Count by age and gender; store results
In Pig Latin

-- Load web server logs
logs = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801' and date < '20110901';

-- Load users
users = load 'users' using HCatLoader();

-- Remove any users that did not visit this month
grpd = cogroup thismonth by userid, users by userid;
fltrd = filter grpd by not IsEmpty(thismonth);
visited = foreach fltrd generate flatten(users);

-- Count by zip code
grpbyzip = group visited by zip;
cntzip = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo = foreach grpbydemo generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
Why not MapReduce?
• Pig provides a number of standard data operators
  – Five different implementations of join (hash, fragment-replicate, merge, merge-sparse, skewed)
  – Order by provides total ordering across reducers in a balanced way
• Provides optimizations that are hard to do by hand
  – Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• User Defined Functions provide a way to inject your code into the data transformation
  – Can be written in Java or Python
  – Can do column transformation (TOUPPER) and aggregation (SUM)
  – Can be written to take advantage of the combiner
• Control flow can be done via Python or Java
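To make the UDF point concrete, here is a minimal sketch of a Python eval function of the kind the bullets describe. The function name, the registration snippet in the comments, and the pig_util import fallback are illustrative assumptions, not code from the deck.

```python
# Illustrative Pig UDF in Python (hypothetical example).
# Under Pig you would register the file and call it from Pig Latin, e.g.:
#   register 'myudfs.py' using jython as myudfs;
#   upped = foreach users generate myudfs.to_upper(name);
try:
    from pig_util import outputSchema      # provided when run under Pig
except ImportError:
    def outputSchema(schema):              # no-op stand-in for plain Python
        def wrap(func):
            return func
        return wrap

@outputSchema('chararray')
def to_upper(s):
    """Column-transformation UDF: upper-case one field, passing nulls through."""
    if s is None:
        return None
    return s.upper()
```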
Embedding Example: Compute PageRank

PageRank: a system of linear equations (as many as there are pages on the web, so a lot). It can be approximated iteratively: compute the new PageRank based on the PageRanks of the previous iteration, starting with some initial value.

Ref: http://en.wikipedia.org/wiki/PageRank
Slide courtesy of Julien Le Dem
Or more visually
Each page sends a fraction of its PageRank to the pages it links to, inversely proportional to its number of outbound links.

Slide courtesy of Julien Le Dem
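The iterative approximation just described can be sketched in a few lines of plain Python. The three-page graph and the damping factor d = 0.85 are illustrative assumptions; the deck does not specify them.

```python
# Minimal sketch of iterative PageRank: each page sends its rank, divided
# by its outbound link count, to the pages it links to.
def pagerank(links, d=0.85, iterations=10):
    """links: dict mapping each page to the list of pages it links to."""
    ranks = {p: 1.0 for p in links}          # start with some value
    for _ in range(iterations):
        new = {p: (1 - d) for p in links}    # the (1 - d) base term
        for page, outs in links.items():
            share = ranks[page] / len(outs)  # fraction per outbound link
            for target in outs:
                new[target] += d * share
        ranks = new                          # this output feeds the next pass
    return ranks

# Tiny illustrative graph: A -> B, C; B -> C; C -> A.
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
ranks = pagerank(graph)
```

C collects rank from both A and B, so it ends up with the highest score; the total rank stays equal to the number of pages.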
(Figure: PageRank dataflow diagram; slide courtesy of Julien Le Dem)
Let’s zoom in
The driver is a Python script embedding Pig. It declares a Pig script P that computes PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), passes its parameters as a dictionary, and iterates 10 times: each pass just runs P, which was declared above, and the output of one iteration becomes the input of the next.

Slide courtesy of Julien Le Dem
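The driver code on the slide did not survive this text export; the following Jython-style pseudocode sketches the loop it describes. The relation names, d = 0.5, and file paths are illustrative assumptions, not taken from the deck.

```
# Jython-style pseudocode of the embedding loop (not runnable as-is;
# requires Pig's scripting API on the classpath).
from org.apache.pig.scripting import Pig

# Declare the Pig script P once; $d, $docs_in, $docs_out are parameters.
P = Pig.compile("""
previous = load '$docs_in' as (url, pagerank, links);
...                 -- compute PR(A) = (1-d) + d * (sum of PR(Ti)/C(Ti))
store new_pagerank into '$docs_out';
""")

params = {'d': '0.5', 'docs_in': 'pagerank_data_1'}  # parameters as a dictionary
for i in range(10):                                  # iterate 10 times
    out = 'out/pagerank_data_' + str(i + 1)
    params['docs_out'] = out
    stats = P.bind(params).runSingle()               # just run P, declared above
    params['docs_in'] = out                          # output becomes the new input
```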
Recently Added Features
• New in 0.9 (released July 2011):
  – Macros and imports
• New in 0.10 (released April 2012):
  – Boolean data type
  – Hash-based aggregation for aggregates with low-cardinality keys
  – UDFs to build and apply Bloom filters
  – UDFs in JRuby
• New in 0.11 (released February 2013):
  – Datetime type
  – RANK, CUBE, ROLLUP
  – HCatalog DDL integration
  – UDFs in Groovy
Learn More
• Read the online documentation: http://pig.apache.org/
• Programming Pig from O'Reilly Press
• Join the mailing lists:
  – [email protected] for user questions
  – [email protected] for developer issues
HCatalog: Table Management for Hadoop
Many Data Tools
• MapReduce
  – Early adopters
  – Non-relational algorithms
  – Performance-sensitive applications
• Pig
  – ETL
  – Data modeling
  – Iterative algorithms
• Hive
  – Analysis
  – Connectors to BI tools

Strength: pick the right tool for your application.
Weakness: hard for users to share their data.
Users: Data Sharing is Hard
Photo Credit: totalAldo via Flickr
This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.

Joe: "Bob, I need today's data."
Bob: "OK."
Joe: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass that alter table command."
Bob: "Dude, we need HCatalog."
Tool Comparison
Feature        | MapReduce       | Pig                                  | Hive
---------------+-----------------+--------------------------------------+---------------------
Record format  | Key-value pairs | Tuple                                | Record
Data model     | User defined    | int, float, string, bytes,           | int, float, string,
               |                 | maps, tuples, bags                   | maps, structs, lists
Schema         | Encoded in app  | Declared in script or read by loader | Read from metadata
Data location  | Encoded in app  | Declared in script                   | Read from metadata
Data format    | Encoded in app  | Declared in script                   | Read from metadata

• Pig and MR users need to know a lot to write their apps
• When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
• Hive users have to load data produced by Pig/MR users before they can access it
How Hadoop Apps Access Data
Each tool reaches the data along its own path:
• Hive: Metastore Client → Metastore (metadata); SerDe + InputFormat/OutputFormat → HDFS (data)
• MapReduce: InputFormat/OutputFormat → HDFS
• Pig: Load/Store functions → HDFS
Opening up Metadata to MR & Pig
• Hive: Metastore Client → Metastore (metadata); SerDe + InputFormat/OutputFormat → HDFS (data)
• MapReduce: HCatInputFormat/HCatOutputFormat, which wrap the SerDe and the Metastore client → HDFS
• Pig: HCatLoader/HCatStorer, built on HCatInputFormat/HCatOutputFormat → HDFS
Tools With HCatalog
Feature        | MapReduce + HCatalog | Pig + HCatalog             | Hive
---------------+----------------------+----------------------------+---------------------
Record format  | Record               | Tuple                      | Record
Data model     | int, float, string,  | int, float, string, bytes, | int, float, string,
               | maps, structs, lists | maps, tuples, bags         | maps, structs, lists
Schema         | Read from metadata   | Read from metadata         | Read from metadata
Data location  | Read from metadata   | Read from metadata         | Read from metadata
Data format    | Read from metadata   | Read from metadata         | Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users' data as soon as it is committed
Pig Example
Assume you want to count how many times each of your users went to each of your URLs:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';
Using HCatalog:

raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();

• No need to know the file location
• No need to declare the schema
• The date predicate (ds == '20120530') is a partition filter
WebHCat REST API
Hadoop/HCatalog
• FKA Templeton
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

{"tables": ["counted", "processed"], "database": "default"}

Create new table "rawevents":

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{"name": "url", "type": "string"},
             {"name": "user", "type": "string"}],
 "partitionedBy": [{"name": "ds", "type": "string"}]}

{"table": "rawevents", "database": "default"}

Describe table "rawevents":

GET http://…/v1/ddl/database/default/table/rawevents

{"columns": [{"name": "url", "type": "string"},
             {"name": "user", "type": "string"}],
 "database": "default",
 "table": "rawevents"}
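A standard-library Python sketch of calling these DDL endpoints. The base URL (WebHCat typically serves at http://<host>:50111/templeton) and the user.name query parameter are assumptions for illustration; adjust them for your cluster.

```python
# Hedged sketch: talking to WebHCat's v1 DDL resources from Python.
import json
from urllib.request import Request, urlopen

def table_url(base, db, table=None):
    """Build a URL like <base>/v1/ddl/database/<db>/table[/<table>]."""
    url = base.rstrip("/") + "/v1/ddl/database/" + db + "/table"
    if table:
        url += "/" + table
    return url

def list_tables(base, db, user):
    """GET the table list; returns the decoded JSON response."""
    req = Request(table_url(base, db) + "?user.name=" + user)
    with urlopen(req) as resp:             # needs a live WebHCat server
        return json.load(resp)

def create_table(base, db, table, columns, user):
    """PUT a table definition, e.g. columns=[{"name": "url", "type": "string"}]."""
    body = json.dumps({"columns": columns}).encode("utf-8")
    req = Request(table_url(base, db, table) + "?user.name=" + user,
                  data=body, method="PUT",
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)
```

The URL construction can be checked without a server; the two network helpers return decoded JSON shaped like the responses shown above.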
Relationship to Hive
• Opens Hive's metadata to non-Hive users; does not keep its own copy of the metadata
• If you are a Hive user, you can transition to HCatalog with no metadata modifications
  – Uses Hive's SerDes for data translation (starting in 0.4)
  – Uses Hive SQL for data definition (DDL)
• Recently graduated from the Apache Incubator and became part of the Hive project
  – Hive 0.11 will include HCatalog