
Advanced Topics on MapReduce with Hadoop

Jiaheng Lu

Department of Computer Science

Renmin University of China

www.jiahenglu.net


Outline

Brief Review
Chaining MapReduce Jobs
Join in MapReduce
Bloom Filter


Brief Review

A parallel programming framework
Divide and merge

[Figure: input data (split0, split1, split2) -> map tasks (mappers) -> shuffle -> reduce tasks (reducers) -> output data (output0, output1)]
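The dataflow in the figure can be imitated in a few lines of plain Java. This is only a single-process sketch of the divide-and-merge idea, not Hadoop code: the class name `MapReduceFlowSketch`, the word-count task, and the sample splits are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceFlowSketch {

    // Map phase: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, in sorted key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: merge the value list of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every split, then shuffle, then reduce over every key.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(mapped).forEach((key, values) -> output.put(key, reduce(values)));
        return output;
    }

    public static void main(String[] args) {
        // Three hypothetical splits, standing in for split0, split1, split2.
        System.out.println(run(List.of("a b a", "b c", "a")));   // prints {a=3, b=2, c=1}
    }
}
```

In a real cluster each `map` call would run as a separate map task and each key group would be handled by a reduce task; here everything runs sequentially so that the three phases are easy to follow.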


Chaining MapReduce jobs

Chaining in a sequence
Chaining with complex dependency
Chaining preprocessing and postprocessing steps


Chaining in a sequence

Simple and straightforward

[MAP | REDUCE]+; MAP+ | REDUCE | MAP*

The output of one job is the input to the next

Similar to Unix pipes


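// Classes come from org.apache.hadoop.conf, org.apache.hadoop.io,
// org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib (the old "mapred" API).
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
// in and out are the job's input and output paths, defined elsewhere.
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// Add the first mapper to the chain: <LongWritable, Text> in, <Text, Text> out.
// true = pass key/value pairs to the next stage by value; map1Conf is Map1's own configuration.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, LongWritable.class, Text.class, Text.class, Text.class, true, map1Conf);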


Chaining with complex dependency

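Jobs are not always chained in a linear fashion; one job may depend on several others

Use the addDependingJob() method to add dependency information:

x.addDependingJob(y)

means that job x will not start until job y has finished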
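The effect of x.addDependingJob(y) can be illustrated with a small plain-Java sketch. It does not use Hadoop: `SketchJob` and `DependencySketch` are hypothetical stand-ins that model only the dependency bookkeeping, on the assumption that a job becomes runnable once every job it depends on has finished.

```java
import java.util.ArrayList;
import java.util.List;

final class DependencySketch {

    // Hypothetical stand-in for a job; only dependency tracking is modelled.
    static final class SketchJob {
        final String name;
        final List<SketchJob> dependsOn = new ArrayList<>();
        boolean done = false;

        SketchJob(String name) {
            this.name = name;
        }

        // Mirrors the slide's x.addDependingJob(y): this job must wait for other.
        void addDependingJob(SketchJob other) {
            dependsOn.add(other);
        }

        boolean ready() {
            for (SketchJob dependency : dependsOn) {
                if (!dependency.done) {
                    return false;
                }
            }
            return true;
        }
    }

    // Repeatedly runs every job whose dependencies have all finished.
    static List<String> runAll(List<SketchJob> jobs) {
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (order.size() < jobs.size() && progressed) {
            progressed = false;
            for (SketchJob job : jobs) {
                if (!job.done && job.ready()) {
                    job.done = true;          // a real controller would submit the job here
                    order.add(job.name);
                    progressed = true;
                }
            }
        }
        if (order.size() < jobs.size()) {
            throw new IllegalStateException("cyclic dependency");
        }
        return order;
    }

    public static void main(String[] args) {
        SketchJob x = new SketchJob("x");
        SketchJob y = new SketchJob("y");
        SketchJob z = new SketchJob("z");
        x.addDependingJob(y);   // x waits for y
        x.addDependingJob(z);   // x also waits for z
        System.out.println(runAll(List.of(x, y, z)));   // prints [y, z, x]
    }
}
```

Even though x is listed first, it runs last, because both of its dependencies have to complete before it becomes ready.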


Chaining preprocessing and postprocessing steps

Example: removing stop words in information retrieval (IR)

Approaches:

  Run each step as a separate job: inefficient
  Chain the steps into a single job

Use ChainMapper.addMapper() and ChainReducer.setReducer()

Map+ | Reduce | Map*
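The Map+ | Reduce | Map* shape can be pictured as function composition. The sketch below is plain Java rather than the ChainMapper API: two preprocessing "mappers" (tokenizing and stop-word removal) feed one "reducer" (counting), and the optional trailing Map* stage is left empty. The class name `ChainedStepsSketch`, the stop-word list, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

final class ChainedStepsSketch {

    // Assumed stop-word list; a real job would load a much longer one.
    static final Set<String> STOP_WORDS = Set.of("a", "the", "of");

    // Preprocessing mapper 1: lower-case each line and split it into words.
    static final Function<List<String>, List<String>> TOKENIZE = lines -> {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return words;
    };

    // Preprocessing mapper 2: drop stop words.
    static final Function<List<String>, List<String>> DROP_STOP_WORDS = words -> {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!STOP_WORDS.contains(w)) {
                kept.add(w);
            }
        }
        return kept;
    };

    // Reducer: count the occurrences of each remaining word.
    static final Function<List<String>, Map<String, Integer>> COUNT = words -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    };

    // Map | Map | Reduce composed into one pipeline, with no intermediate output stored.
    static Map<String, Integer> run(List<String> lines) {
        return TOKENIZE.andThen(DROP_STOP_WORDS).andThen(COUNT).apply(lines);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("The art of the deal", "a deal")));   // prints {art=1, deal=2}
    }
}
```

Composing the steps means the output of one stage is handed directly to the next, which is the motivation for chaining the steps into a single job instead of running each as its own job.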


Join in MapReduce

Reduce-side join
Broadcast join
Map-side filtering and reduce-side join
  A given key
  A range from the dataset (broadcast)
  A Bloom filter


Reduce-side join

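Map: outputs <key, value> pairs, where the key is the join key and the value is the record tagged with its data source

Reduce: performs a full cross-product of the values from the different sources for each key and outputs the combined records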


Example

table x          table y
a  b             a  c
1  ab            1  b
1  cd            2  d
4  ef            4  c

map() output (key = join key, value = record tagged with its source table):

from table x     from table y
key  value       key  value
1    x ab        1    y b
1    x cd        2    y d
4    x ef        4    y c

shuffle() output:

key  value list
1    x ab, x cd, y b
2    y d
4    x ef, y c

reduce() output:

a  b   c
1  ab  b
1  cd  b
4  ef  c
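The example can be replayed in plain Java. This sketch simulates the three phases in a single process and is not Hadoop code; the class name `ReduceSideJoinSketch` and the representation of rows as string arrays are assumptions, while the data is that of tables x and y.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoinSketch {

    // Map phase: emit {joinKey, tag, payload} for every row of one table.
    static void map(String tag, String[][] table, List<String[]> out) {
        for (String[] row : table) {
            out.add(new String[] {row[0], tag, row[1]});
        }
    }

    // Reduce phase for one key: cross-product of the x values and the y values.
    static void reduce(String key, List<String[]> tagged, List<String> out) {
        List<String> xs = new ArrayList<>();
        List<String> ys = new ArrayList<>();
        for (String[] value : tagged) {          // value = {tag, payload}
            if (value[0].equals("x")) {
                xs.add(value[1]);
            } else {
                ys.add(value[1]);
            }
        }
        for (String x : xs) {
            for (String y : ys) {
                out.add(key + " " + x + " " + y);
            }
        }
    }

    static List<String> join(String[][] tableX, String[][] tableY) {
        List<String[]> mapped = new ArrayList<>();
        map("x", tableX, mapped);
        map("y", tableY, mapped);

        // Shuffle: group the tagged values by join key.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] kv : mapped) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(new String[] {kv[1], kv[2]});
        }

        List<String> result = new ArrayList<>();
        grouped.forEach((key, values) -> reduce(key, values, result));
        return result;
    }

    public static void main(String[] args) {
        String[][] x = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] y = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        join(x, y).forEach(System.out::println);
        // prints:
        // 1 ab b
        // 1 cd b
        // 4 ef c
    }
}
```

Key 2 produces no output because its value list holds only a record from table y, so the cross-product for that key is empty.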


Broadcast join (replicated join)

Broadcast the smaller table to every map task
Do the join in map(); no reduce phase is needed for the join

Use the distributed cache to ship the smaller table:

DistributedCache.addCacheFile()
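The map-side logic of a broadcast join can be sketched without Hadoop. In the sketch below an in-memory hash map stands in for the smaller table that each map task would read from the distributed cache; the class name `BroadcastJoinSketch`, the method names, and the sample rows are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BroadcastJoinSketch {

    // Stand-in for loading the broadcast (smaller) table into memory once per map task.
    static Map<String, List<String>> loadSmallTable(String[][] small) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return index;
    }

    // map(): probe the in-memory table with one record of the larger table.
    static void map(String[] bigRow, Map<String, List<String>> index, List<String> out) {
        List<String> matches = index.get(bigRow[0]);
        if (matches == null) {
            return;                              // no join partner: emit nothing
        }
        for (String match : matches) {
            out.add(bigRow[0] + " " + bigRow[1] + " " + match);
        }
    }

    static List<String> join(String[][] big, String[][] small) {
        Map<String, List<String>> index = loadSmallTable(small);
        List<String> out = new ArrayList<>();
        for (String[] row : big) {
            map(row, index, out);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] big = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] small = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        join(big, small).forEach(System.out::println);
        // prints:
        // 1 ab b
        // 1 cd b
        // 4 ef c
    }
}
```

Because every record of the larger table is joined where it is read, nothing has to be shuffled by join key; the price is that the smaller table must fit in the memory of each map task.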


Map-side filtering and Reduce-side join

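Join key: student IDs from the info dataset
Generate an IDs file from info
Broadcast the IDs file so that map() can filter out records that have no matching ID, then join on the reduce side

What if the IDs file is too large to fit in memory? Use a Bloom filter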


A Bloom Filter

Introduction
Implementation of Bloom filter
Use in MapReduce join


Introduction to Bloom Filter

A space-efficient data structure of constant size for testing whether an element belongs to a set

Operations: add(), contains()

No false negatives, and a small probability of false positives


Implementation of bloom filter

Allocate a bit array
Add an element
  Generate k indexes
  Set the k bits to 1
Test an element
  Generate k indexes
  All k bits are 1 >> true; not all are 1 >> false
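The steps above can be sketched with java.util.BitSet. This is a minimal illustration, not a tuned implementation: the class name `BloomFilterSketch` is hypothetical, and deriving the k indexes from String.hashCode() by double hashing is an assumption made to keep the example short.

```java
import java.util.BitSet;

final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;        // number of bits in the array
    private final int numHashes;   // k, the number of indexes per element

    BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derives the i-th index from two base hashes (an illustrative choice).
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 13) | 1;   // second hash, forced non-zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Add: generate k indexes and set those k bits to 1.
    void add(String element) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(element, i));
        }
    }

    // Test: generate k indexes; the answer is true only if all k bits are 1.
    boolean contains(String element) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(element, i))) {
                return false;    // at least one bit is 0: definitely never added
            }
        }
        return true;             // either added, or a false positive
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("x");
        filter.add("y");
        System.out.println(filter.contains("x"));   // true: added elements are always found
        System.out.println(filter.contains("m"));   // usually false; true would be a false positive
    }
}
```

The size of the filter is fixed at construction time and does not grow as elements are added, which is what makes the structure constant in size; adding more elements only raises the false-positive rate.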


Example

A 10-bit array (positions 0 to 9); each element maps to k = 3 indexes.

step                      0 1 2 3 4 5 6 7 8 9   result
① initial state           0 0 0 0 0 0 0 0 0 0
② add x(0,2,6)            1 0 1 0 0 0 1 0 0 0
③ add y(0,3,9)            1 0 1 1 0 0 1 0 0 1
④ contains m(1,3,9)       1 0 1 1 0 0 1 0 0 1   × (bit 1 is 0)
⑤ contains n(0,2,9)       1 0 1 1 0 0 1 0 0 1   √ (false positive: n was never added)
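The five steps of this example can be replayed directly on a java.util.BitSet. In this sketch the index triples are passed in explicitly instead of being computed by hash functions, so that the numbers match the slide; the class name `BloomExampleSketch` is hypothetical.

```java
import java.util.BitSet;

final class BloomExampleSketch {

    // Sets the bits at the given indexes.
    static void add(BitSet bits, int... indexes) {
        for (int i : indexes) {
            bits.set(i);
        }
    }

    // Returns true only if every bit at the given indexes is set.
    static boolean contains(BitSet bits, int... indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);                   // step 1: initial state, all zeros
        add(bits, 0, 2, 6);                             // step 2: add x
        add(bits, 0, 3, 9);                             // step 3: add y
        System.out.println(bits);                       // prints {0, 2, 3, 6, 9}
        System.out.println(contains(bits, 1, 3, 9));    // step 4: m, prints false
        System.out.println(contains(bits, 0, 2, 9));    // step 5: n, prints true (false positive)
    }
}
```

The element n is reported as present because its three bits were set by x and y together, which is how a false positive arises.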


Use in MapReduce join

A separate sub-job creates the Bloom filter

Broadcast the Bloom filter and use it in map() of the join job

Drop the records whose key is not in the filter, and do the join in reduce
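A rough single-process sketch of the filtering step follows. It assumes the filter is built from the join keys of the smaller dataset and that records are string arrays whose first field is the join key; the class name `BloomJoinFilterSketch`, the constants `SIZE` and `K`, the hash scheme, and the sample student IDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BloomJoinFilterSketch {
    static final int SIZE = 1 << 16;   // assumed bit-array size
    static final int K = 3;            // assumed number of indexes per key

    // Illustrative double hashing on String.hashCode().
    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 13) | 1;
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Sub-job: build the filter from the join keys of the smaller dataset.
    static BitSet buildFilter(List<String> joinKeys) {
        BitSet bits = new BitSet(SIZE);
        for (String key : joinKeys) {
            for (int i = 0; i < K; i++) {
                bits.set(index(key, i));
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // map() of the join job: keep only the records whose key passes the filter.
    static List<String[]> mapSideFilter(BitSet filter, List<String[]> records) {
        List<String[]> kept = new ArrayList<>();
        for (String[] record : records) {
            if (mightContain(filter, record[0])) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet filter = buildFilter(List.of("s01", "s04"));    // hypothetical student IDs
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"s01", "math"});
        records.add(new String[] {"s02", "physics"});
        records.add(new String[] {"s04", "history"});
        for (String[] record : mapSideFilter(filter, records)) {
            System.out.println(record[0] + " " + record[1]);
        }
    }
}
```

A record that fails the test certainly has no join partner, so dropping it is safe. A record that passes may still be a false positive; the reduce-side join then finds no partner for it, so correctness is preserved and only a little filtering benefit is lost.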


References

Chuck Lam, “Hadoop in Action”
Jairam Chandar, “Join Algorithms using Map/Reduce”


THANK YOU
