FlumeJava: Easy, Efficient Data-Parallel Pipelines

FlumeJava Easy, Efficient Data-Parallel Pipelines Google @PLDI’10 Mosharaf Chowdhury

Google @PLDI’10

Mosharaf Chowdhury

Problem

• Efficient data-parallel pipelines
– Chain of MapReduce programs
– Iterative jobs
– …

• FlumeJava exposes a limited set of parallel operations on immutable parallel collections

Goals

• Expressiveness
• Abstractions
– Data representation
– Implementation strategy
• Performance
– Lazy evaluation
– Dynamic optimization
• Usability & deployability
– Implemented as a Java library
– Inspired by the failure of Lumberjack

FlumeJava Workflow

1. Write a Java program using the FlumeJava library
2. FlumeJava.run();
3. Optimize
4. Execute

PCollection<String> words =
    lines.parallelDo(new DoFn<String, String>() {
      void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));
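The workflow above depends on deferred (lazy) evaluation: calls such as parallelDo() only record work, and nothing executes until run() is called. The following is a minimal single-machine sketch of that idea in plain Java. It is not the FlumeJava API: the class DeferredCollection, its of() factory, and the executedSteps counter are hypothetical names introduced for illustration, and the optimize step is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a deferred parallel collection; NOT the FlumeJava API.
final class DeferredCollection<T> {
    // Counts how many recorded steps have actually executed (for demonstration).
    static int executedSteps = 0;

    private final Supplier<List<T>> plan; // how to compute the contents
    private List<T> materialized;         // null until run() is called

    private DeferredCollection(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    // Wraps already-available data as a collection with a trivial plan.
    static <T> DeferredCollection<T> of(List<T> data) {
        return new DeferredCollection<>(() -> data);
    }

    // Records a step that maps each element to zero or more outputs.
    // Nothing is computed here; the work happens inside run().
    <R> DeferredCollection<R> parallelDo(Function<T, List<R>> fn) {
        return new DeferredCollection<>(() -> {
            executedSteps++;
            List<R> out = new ArrayList<>();
            for (T item : run()) {
                out.addAll(fn.apply(item));
            }
            return out;
        });
    }

    // Executes the recorded plan once and caches the result.
    List<T> run() {
        if (materialized == null) {
            materialized = plan.get();
        }
        return materialized;
    }

    public static void main(String[] args) {
        DeferredCollection<String> lines = of(Arrays.asList("to be", "or not"));
        DeferredCollection<String> words =
            lines.parallelDo(line -> Arrays.asList(line.split(" ")));
        System.out.println("steps executed before run(): " + executedSteps);
        System.out.println(words.run());
        System.out.println("steps executed after run(): " + executedSteps);
    }
}
```

Because the whole plan exists before anything runs, a real system has the chance to rewrite it first; this sketch simply executes the steps as recorded.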

Core Abstractions

Parallel Collections

1. PCollection<T>
2. PTable<K, V>

Data-parallel Operations

• Primitives
1. parallelDo()
2. groupByKey()
3. combineValues()
4. flatten()

• Derived operations
1. count()
2. join()
3. top()
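To make the split between primitives and derived operations concrete, here is a minimal in-memory sketch in plain Java. It is not the FlumeJava API: the class name PrimitivesSketch and the use of java.util lists and maps in place of PCollection and PTable are assumptions for illustration, and nothing here runs in parallel. It models the four primitives, then shows one plausible way to build count() from parallelDo(), groupByKey(), and combineValues().

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical in-memory model of the four primitives; NOT the FlumeJava API.
final class PrimitivesSketch {
    // parallelDo: apply fn to every element; fn may emit zero or more outputs.
    static <T, R> List<R> parallelDo(List<T> in, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : in) {
            out.addAll(fn.apply(item));
        }
        return out;
    }

    // groupByKey: gather all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return groups;
    }

    // combineValues: fold each key's values with an associative operator.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> op) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            // Every group holds at least one value, so reduce() is never empty.
            out.put(group.getKey(), group.getValue().stream().reduce(op).get());
        }
        return out;
    }

    // flatten: treat several collections as one logical collection.
    @SafeVarargs
    static <T> List<T> flatten(List<T>... inputs) {
        List<T> out = new ArrayList<>();
        for (List<T> in : inputs) {
            out.addAll(in);
        }
        return out;
    }

    // count: a derived operation composed only from the primitives above.
    static <T> Map<T, Integer> count(List<T> in) {
        List<Map.Entry<T, Integer>> ones = parallelDo(in, item -> {
            Map.Entry<T, Integer> one = new AbstractMap.SimpleEntry<>(item, 1);
            return Collections.singletonList(one);
        });
        return combineValues(groupByKey(ones), Integer::sum);
    }

    public static void main(String[] args) {
        List<String> words = flatten(Arrays.asList("a", "b"), Arrays.asList("a"));
        System.out.println(count(words));
    }
}
```

Keeping derived operations as compositions of primitives means an optimizer only has to understand the four primitives.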

MapShuffleCombineReduce (MSCR)

• Transforms combinations of the four primitives into a single MapReduce
• Generalizes MapReduce
– Multiple reducers/combiners
– Multiple outputs per reducer
– Pass-through outputs
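The generalization listed above can be pictured as a single pass whose mapper routes each emitted record to one of several output channels. Some channels group by key and apply their own reducer, while others pass records through untouched. The sketch below is a single-process illustration of that shape only, with one input and no real shuffle. It is not FlumeJava code: MscrSketch, Emit, run(), and demo() are hypothetical names, and a null reducer is used here as an assumed convention for marking a pass-through channel.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical single-process model of an MSCR-like pass; NOT FlumeJava code.
final class MscrSketch {
    // One mapper emission, tagged with the output channel it is routed to.
    static final class Emit {
        final int channel;
        final String key;
        final int value;

        Emit(int channel, String key, int value) {
            this.channel = channel;
            this.key = key;
            this.value = value;
        }
    }

    // Runs one pass. reducers.get(c) == null marks channel c as pass-through.
    static List<List<Map.Entry<String, Integer>>> run(
            List<String> input,
            Function<String, List<Emit>> mapper,
            List<BinaryOperator<Integer>> reducers) {
        List<Map<String, Integer>> grouped = new ArrayList<>();
        List<List<Map.Entry<String, Integer>>> outputs = new ArrayList<>();
        for (int c = 0; c < reducers.size(); c++) {
            grouped.add(new TreeMap<>());
            outputs.add(new ArrayList<>());
        }
        // Single scan of the input: map each record and route its emissions.
        for (String record : input) {
            for (Emit e : mapper.apply(record)) {
                BinaryOperator<Integer> reducer = reducers.get(e.channel);
                if (reducer == null) {
                    // Pass-through channel: written out unchanged, no grouping.
                    outputs.get(e.channel)
                           .add(new AbstractMap.SimpleEntry<String, Integer>(e.key, e.value));
                } else {
                    // Grouping channel: combine values per key as they arrive.
                    grouped.get(e.channel).merge(e.key, e.value, reducer);
                }
            }
        }
        for (int c = 0; c < reducers.size(); c++) {
            if (reducers.get(c) != null) {
                outputs.get(c).addAll(grouped.get(c).entrySet());
            }
        }
        return outputs;
    }

    // Example: three outputs produced from one scan of the input lines.
    static List<List<Map.Entry<String, Integer>>> demo(List<String> lines) {
        List<BinaryOperator<Integer>> reducers = new ArrayList<>();
        reducers.add(Integer::sum); // channel 0: word counts
        reducers.add(null);         // channel 1: pass-through (line, length)
        reducers.add(Integer::max); // channel 2: longest line length
        return run(lines, line -> {
            List<Emit> emits = new ArrayList<>();
            for (String word : line.split(" ")) {
                emits.add(new Emit(0, word, 1));
            }
            emits.add(new Emit(1, line, line.length()));
            emits.add(new Emit(2, "maxLineLength", line.length()));
            return emits;
        }, reducers);
    }

    public static void main(String[] args) {
        System.out.println(demo(Arrays.asList("a b", "a")));
    }
}
```

Expressed as plain MapReduce jobs, the three outputs in demo() would typically need separate jobs or hand-written multiplexing; here they share one scan of the input.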

Optimization

Optimizer Strategy

1. Sink flattens
2. Lift CombineValues
3. Insert fusion blocks
4. Fuse parallelDos
5. Fuse MSCRs

Optimizer Output

1. MSCR
2. Flatten
3. Operate
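Of the strategy steps above, "Fuse parallelDos" is the easiest to show in miniature. When one parallelDo consumes the output of another, the two functions can be composed into one, so the data is traversed once and the intermediate collection is never built. The sketch below illustrates only this producer-consumer case on in-memory lists. FusionSketch, fuse(), SPLIT, and LENGTH are hypothetical names, and this is not the optimizer's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of producer-consumer fusion; NOT the real optimizer.
final class FusionSketch {
    // In-memory stand-in for parallelDo: fn may emit zero or more outputs.
    static <T, R> List<R> parallelDo(List<T> in, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : in) {
            out.addAll(fn.apply(item));
        }
        return out;
    }

    // Composes f and g so that everything f emits is fed straight into g.
    static <A, B, C> Function<A, List<C>> fuse(
            Function<A, List<B>> f, Function<B, List<C>> g) {
        return a -> {
            List<C> out = new ArrayList<>();
            for (B b : f.apply(a)) {
                out.addAll(g.apply(b));
            }
            return out;
        };
    }

    static final Function<String, List<String>> SPLIT =
        line -> Arrays.asList(line.split(" "));
    static final Function<String, List<Integer>> LENGTH =
        word -> Collections.singletonList(word.length());

    // Two passes: builds the intermediate list of words, then maps it.
    static List<Integer> unfused(List<String> lines) {
        return parallelDo(parallelDo(lines, SPLIT), LENGTH);
    }

    // One pass: the fused function goes from lines directly to lengths.
    static List<Integer> fused(List<String> lines) {
        return parallelDo(lines, fuse(SPLIT, LENGTH));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or");
        System.out.println(unfused(lines).equals(fused(lines)));
    }
}
```

The fused and unfused versions give the same result; the difference is one traversal instead of two and no intermediate collection to materialize.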

Hit or Miss?

• Sizable reduction in SLOC
– Except for Sawzall
• 5x reduction in average number of stages
• Faster than other approaches
– Except for hand-optimized MapReduce chains
• 319 users over a one-year period