Making Pig Fly: Optimizing Data Processing on Hadoop

Transcript of "Making Pig Fly: Optimizing Data Processing on Hadoop" (36 slides)

Page 1: Making Pig Fly: Optimizing Data Processing on Hadoop

© Hortonworks Inc. 2011

Daniel Dai (@daijy)
Thejas Nair (@thejasn)

Page 2: What is Apache Pig?

• Pig Latin, a high-level data processing language.

• An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Page 3: Pig-latin example

• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

Page 4: Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel

• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Page 5: Pig optimizations

• Ideally, the user should not have to bother

• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Page 6: Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)

• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Page 7: Rule based optimizer

• Column pruner

• Push up filter

• Push down flatten

• Push up limit

• Partition pruning

• Global optimizer

Page 8: Column Pruner

• Pig will do column pruning automatically
• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

With a schema, Pig will prune a2 automatically:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';

Without a schema, Pig cannot prune:

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

DIY – project the needed columns early yourself:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

Page 9: Column Pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY – project away a2 before grouping:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

Page 10: Push up filter

• Pig splits the filter condition before pushing it up

Original query:
  A, B → Join → Filter (a0>0 && b0>10)

Split filter condition:
  A, B → Join → Filter (a0>0) → Filter (b0>10)

Push up filter:
  A → Filter (a0>0), B → Filter (b0>10), then Join

Page 11: Other push up/down

• Push down flatten
  – Load → Flatten → Order becomes Load → Order → Flatten, so the sort handles fewer records

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit
  – Load → Foreach → Limit becomes Load → Limit → Foreach, and finally Load (limited) → Foreach
  – Load → Order → Limit becomes Load → Order (limited)

Page 12: Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

Without pruning: Filter (year>=2011) reads all partitions (2010, 2011, 2012).
With pruning: HCatLoader (year>=2011) reads only the 2011 and 2012 partitions.
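As a sketch, the pruning is triggered by an ordinary filter on a partition column; the table and column names below are illustrative, and the HCatLoader class path is the one used in HCatalog releases contemporary with these slides:

```pig
-- 'year' is assumed to be a partition column of the table
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;  -- only the 2011 and 2012 partitions are read
```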

Page 13: Intermediate file compression

Pig script pipeline: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3

• Intermediate file between map and reduce
  – Snappy

• Temp file between mapreduce jobs
  – No compression by default

Page 14: Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with Snappy (HADOOP-7990)
  – LZO: not an Apache license

• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, over 90% disk saving and up to 4x query speedup

Page 15: Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many

Load → three branches, each Group by ($0 / $1 / $2) → Foreach → Store

Page 16: Control multiquery

• Disable multiquery
  – Command line option: -M

• Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';

exec

B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

Page 17: Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

Map → Initial
Combiner → Intermediate
Reduce → Final

Page 18: Implement the right UDF

• Accumulator UDF
  – Reduce-side UDF
  – Normally takes a bag

• Benefit
  – Big bags are passed in chunks
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

my_accum extends Accumulator {
    public void accumulate(Tuple bag) {
        // take a chunk of the bag
    }
    public Tuple getValue() {
        // called after all bag chunks are processed
    }
}

Page 19: Memory optimization

• Control bag size on the reduce side
  – If the bag size exceeds a threshold, spill to disk
  – Control the bag size to fit the bag in memory if possible

MapReduce: reduce(Text key, Iterator<Writable> values, ...)
The iterator is materialized into bags: Bag of Input 1, Bag of Input 2, Bag of Input 3

pig.cachedbag.memusage=0.2

Page 20: Optimization starts before Pig

• Input format

• Serialization format

• Compression

Page 21: Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

Page 22: Input formats

[Bar chart: RunTime (sec), 0–140, comparing input formats]

Page 23: Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns

Page 24: Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

Page 25: RCFile test results

[Bar chart: runtime 0–140 sec for "Project 1 (sec)" and "Project all (sec)", comparing Plain Text vs RCFile]

Page 26: Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: Run query → Measure → Tune → repeat

Page 27: Cost based optimization - Aggregation

• Hash Based Agg (HBA) runs in the map task
• Use pig.exec.mapPartAgg=true to enable

Map task: Map (logic) → HBA → map output → Reduce task

Page 28: Cost based optimization – Hash Agg.

• Auto-off feature
  • Switches off HBA if the output reduction is not good enough

• Configuring Hash Agg
  • Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  • Configure the memory used: pig.cachedbag.memusage
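These properties can also be set inside a script; a minimal sketch using the parameter names from the slides (the values are illustrative, not recommendations):

```pig
set pig.exec.mapPartAgg true;             -- enable hash-based aggregation in the map task
set pig.exec.mapPartAgg.minReduction 10;  -- auto-off threshold on output reduction
set pig.cachedbag.memusage 0.2;           -- fraction of memory available to cached bags
```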

Page 29: Cost based optimization - Join

• Use the appropriate join algorithm
  • Skew on the join key – skew join
  • One input fits in memory – FR (fragment-replicate) join
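The corresponding Pig Latin join hints look like this (aliases illustrative):

```pig
-- skew join: spreads a hot join key over multiple reducers
J1 = join PVs by uid, USERS by uid using 'skewed';

-- fragment-replicate (FR) join: SMALL is shipped whole to every map task
J2 = join PVs by uid, SMALL by uid using 'replicated';
```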

Page 30: Cost based optimization – MR tuning

• Tune MR parameters to reduce IO
  • Control spills using map-side sort parameters
  • Reduce shuffle/sort-merge parameters

Page 31: Parallelism of reduce tasks

[Chart: runtime (h:mm:ss, roughly 0:14:24 to 0:25:55) vs number of reduce tasks: 4, 6, 8, 24, 48, 256]

• Number of reduce slots = 6
• Factors affecting runtime
  • Cores simultaneously used / skew
  • Cost of having additional reduce tasks
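Reduce parallelism is set in Pig Latin itself, either script-wide or per operator; a sketch with illustrative values:

```pig
set default_parallel 6;         -- default number of reducers for the whole script

A = load 'input';
B = group A by $0 parallel 48;  -- override for one reduce-side operator
```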

Page 32: Cost based optimization – keep data sorted

• Frequent join operations on the same keys
  • Keep data sorted on the keys
  • Use merge join
  • Optimized group on sorted keys
  • Works with a few load functions – needs an additional interface implementation
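When both inputs are already sorted on the join key, the merge join hint avoids the shuffle entirely (aliases illustrative):

```pig
-- A and B must both be sorted by uid
C = join A by uid, B by uid using 'merge';
```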

Page 33: Optimizations for sorted data

[Bar chart: runtime (sec), 0–90, for "sort+sort+join+join" vs "join+join", stacked as Sort 1, Sort 2, Join 1, Join 2]

Page 34: Future Directions

• Optimize using stats
  • Using historical stats with HCatalog
  • Sampling

Page 35: Questions?

Page 36