Making Pig Fly: Optimizing Data Processing on Hadoop

Transcript of "Making Pig Fly: Optimizing Data Processing on Hadoop" (36 slides)

Page 1: Making Pig Fly: Optimizing Data Processing on Hadoop

© Hortonworks Inc. 2011

Daniel Dai (@daijy)
Thejas Nair (@thejasn)

Page 2: What is Apache Pig?

• Pig Latin, a high-level data processing language.

• An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Page 3: Pig-latin example

• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

Page 4: Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel

• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Page 5: Pig optimizations

• Ideally, the user should not have to bother

• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Page 6: Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)

• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Page 7: Rule based optimizer

• Column pruner

• Push up filter

• Push down flatten

• Push up limit

• Partition pruning

• Global optimizer

Page 8: Column Pruner

• Pig will do column pruning automatically
• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

With a schema, Pig will prune a2 automatically:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';

Without a schema, Pig cannot prune:

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

DIY – project the needed columns early yourself:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

Page 9: Column Pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY – project away a2 before grouping:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

Page 10: Push up filter

• Pig splits the filter condition before pushing it up

Original query:
  A, B → Join → Filter (a0>0 && b0>10)

Split filter condition:
  A, B → Join → Filter (a0>0) → Filter (b0>10)

Push up filter:
  A → Filter (a0>0), B → Filter (b0>10), then Join

Page 11: Other push up/down

• Push down flatten
  – Load → Flatten → Order becomes Load → Order → Flatten, so the sort handles fewer records

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit
  – Load → Foreach → Limit becomes Load → Limit → Foreach, and finally Load (limited) → Foreach
  – Load → Order → Limit becomes Load → Order (limited)

Page 12: Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

Without pruning: Filter (year>=2011) reads all partitions (2010, 2011, 2012).
With pruning: HCatLoader (year>=2011) reads only the 2011 and 2012 partitions.
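As a sketch, the pruning is triggered by an ordinary filter on a partition column; the table and column names below are illustrative, and the HCatLoader class path is the one used in HCatalog releases contemporary with these slides:

```pig
-- 'year' is assumed to be a partition column of the table
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;  -- only the 2011 and 2012 partitions are read
```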

Page 13: Intermediate file compression

Pig script pipeline: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3

• Intermediate file between map and reduce
  – Snappy

• Temp file between mapreduce jobs
  – No compression by default

Page 14: Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with Snappy (HADOOP-7990)
  – LZO: not an Apache license

• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, over 90% disk saving and up to 4x query speedup

Page 15: Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many

Load → three branches, each Group by ($0 / $1 / $2) → Foreach → Store

Page 16: Control multiquery

• Disable multiquery
  – Command line option: -M

• Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';

exec

B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

Page 17: Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

Map → Initial
Combiner → Intermediate
Reduce → Final

Page 18: Implement the right UDF

• Accumulator UDF
  – Reduce-side UDF
  – Normally takes a bag

• Benefit
  – Big bags are passed in chunks
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

my_accum extends Accumulator {
    public void accumulate(Tuple bag) {
        // take a chunk of the bag
    }
    public Tuple getValue() {
        // called after all bag chunks are processed
    }
}

Page 19: Memory optimization

• Control bag size on the reduce side
  – If the bag size exceeds a threshold, spill to disk
  – Control the bag size to fit the bag in memory if possible

MapReduce: reduce(Text key, Iterator<Writable> values, ...)
The iterator is materialized into bags: Bag of Input 1, Bag of Input 2, Bag of Input 3

pig.cachedbag.memusage=0.2

Page 20: Optimization starts before Pig

• Input format

• Serialization format

• Compression

Page 21: Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

Page 22: Input formats

[Bar chart: RunTime (sec), 0–140, comparing input formats]

Page 23: Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns

Page 24: Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

Page 25: RCFile test results

[Bar chart: runtime 0–140 sec for "Project 1 (sec)" and "Project all (sec)", comparing Plain Text vs RCFile]

Page 26: Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: Run query → Measure → Tune → repeat

Page 27: Cost based optimization - Aggregation

• Hash Based Agg (HBA) runs in the map task
• Use pig.exec.mapPartAgg=true to enable

Map task: Map (logic) → HBA → map output → Reduce task

Page 28: Cost based optimization – Hash Agg.

• Auto-off feature
  • Switches off HBA if the output reduction is not good enough

• Configuring Hash Agg
  • Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  • Configure the memory used: pig.cachedbag.memusage
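These properties can also be set inside a script; a minimal sketch using the parameter names from the slides (the values are illustrative, not recommendations):

```pig
set pig.exec.mapPartAgg true;             -- enable hash-based aggregation in the map task
set pig.exec.mapPartAgg.minReduction 10;  -- auto-off threshold on output reduction
set pig.cachedbag.memusage 0.2;           -- fraction of memory available to cached bags
```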

Page 29: Cost based optimization - Join

• Use the appropriate join algorithm
  • Skew on the join key – skew join
  • One input fits in memory – FR (fragment-replicate) join
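The corresponding Pig Latin join hints look like this (aliases illustrative):

```pig
-- skew join: spreads a hot join key over multiple reducers
J1 = join PVs by uid, USERS by uid using 'skewed';

-- fragment-replicate (FR) join: SMALL is shipped whole to every map task
J2 = join PVs by uid, SMALL by uid using 'replicated';
```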

Page 30: Cost based optimization – MR tuning

• Tune MR parameters to reduce IO
  • Control spills using map-side sort parameters
  • Reduce shuffle/sort-merge parameters

Page 31: Parallelism of reduce tasks

[Chart: runtime (h:mm:ss, roughly 0:14:24 to 0:25:55) vs number of reduce tasks: 4, 6, 8, 24, 48, 256]

• Number of reduce slots = 6
• Factors affecting runtime
  • Cores simultaneously used / skew
  • Cost of having additional reduce tasks
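Reduce parallelism is set in Pig Latin itself, either script-wide or per operator; a sketch with illustrative values:

```pig
set default_parallel 6;         -- default number of reducers for the whole script

A = load 'input';
B = group A by $0 parallel 48;  -- override for one reduce-side operator
```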

Page 32: Cost based optimization – keep data sorted

• Frequent join operations on the same keys
  • Keep data sorted on the keys
  • Use merge join
  • Optimized group on sorted keys
  • Works with a few load functions – needs an additional interface implementation
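When both inputs are already sorted on the join key, the merge join hint avoids the shuffle entirely (aliases illustrative):

```pig
-- A and B must both be sorted by uid
C = join A by uid, B by uid using 'merge';
```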

Page 33: Optimizations for sorted data

[Bar chart: runtime (sec), 0–90, for "sort+sort+join+join" vs "join+join", stacked as Sort 1, Sort 2, Join 1, Join 2]

Page 34: Future Directions

• Optimize using stats
  • Using historical stats with HCatalog
  • Sampling

Page 35: Questions?

Page 36