FlumeJava Easy, Efficient Data-Parallel Pipelines

FlumeJava Easy, Efficient Data-Parallel Pipelines Google @PLDI’10 Mosharaf Chowdhury

Page 1: FlumeJava Easy, Efficient Data-Parallel Pipelines

FlumeJava Easy, Efficient Data-Parallel Pipelines

Google @PLDI’10

Mosharaf Chowdhury

Page 2: FlumeJava Easy, Efficient Data-Parallel Pipelines

Problem

• Efficient data-parallel pipelines
  – Chain of MapReduce programs
  – Iterative jobs
  – …

• FlumeJava exposes a limited set of parallel operations on immutable parallel collections

Page 3: FlumeJava Easy, Efficient Data-Parallel Pipelines

Goals

• Expressiveness
• Abstractions
  – Data representation
  – Implementation strategy
• Performance
  – Lazy evaluation
  – Dynamic optimization
• Usability & deployability
  – Implemented as a Java library
  – Inspired by the failure of Lumberjack

Page 4: FlumeJava Easy, Efficient Data-Parallel Pipelines

FlumeJava Workflow

1. Write a Java program using the FlumeJava library
2. FlumeJava.run();
3. Optimize
4. Execute

Example:

    PCollection<String> words =
        lines.parallelDo(new DoFn<String, String>() {
          void process(String line, EmitFn<String> emitFn) {
            for (String word : splitIntoWords(line)) {
              emitFn.emit(word);
            }
          }
        }, collectionOf(strings()));
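The workflow above depends on deferred evaluation: calling parallelDo() only records an operation, and nothing executes until the pipeline is run. The following is a minimal single-machine sketch of that idea. The class `LazyCollection`, its `of` and `run` methods, and the lambda-based signature of `parallelDo` are hypothetical stand-ins chosen for illustration; they are not the FlumeJava API, and there is no optimizer or distributed execution here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Toy deferred collection for illustration only; this is not the FlumeJava API. */
final class LazyCollection<T> {
    private final Supplier<List<T>> plan; // deferred recipe that produces the elements
    private List<T> materialized;         // cached result, filled in by the first run()

    private LazyCollection(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    /** Wraps in-memory data as an already-known input collection. */
    static <T> LazyCollection<T> of(List<T> elements) {
        List<T> copy = new ArrayList<>(elements);
        return new LazyCollection<>(() -> copy);
    }

    /** Records the operation and returns immediately; fn is not invoked here. */
    <R> LazyCollection<R> parallelDo(BiConsumer<T, Consumer<R>> fn) {
        return new LazyCollection<>(() -> {
            List<R> out = new ArrayList<>();
            for (T element : run()) {
                fn.accept(element, out::add);
            }
            return out;
        });
    }

    /** Evaluates the recorded plan (once) and returns the elements. */
    List<T> run() {
        if (materialized == null) {
            materialized = plan.get();
        }
        return materialized;
    }

    public static void main(String[] args) {
        LazyCollection<String> lines =
            LazyCollection.of(Arrays.asList("to be or", "not to be"));
        LazyCollection<String> words = lines.parallelDo((line, emit) -> {
            for (String word : line.split(" ")) {
                emit.accept(word);
            }
        });
        System.out.println(words.run()); // prints [to, be, or, not, to, be]
    }
}
```

Because the whole chain of operations is known before anything runs, a real system has the opportunity to rewrite the plan between the "run" and "execute" steps; this sketch simply evaluates it as recorded.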

Page 5: FlumeJava Easy, Efficient Data-Parallel Pipelines

Core Abstractions

Parallel Collections

1. PCollection<T>
2. PTable<K, V>

Data-parallel Operations

• Primitives
  1. parallelDo()
  2. groupByKey()
  3. combineValues()
  4. flatten()

• Derived operations
  1. count()
  2. join()
  3. top()
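To illustrate how a derived operation can be built from the primitives, here is a small in-memory sketch in which count() is expressed as parallelDo(), then groupByKey(), then combineValues(). Everything in it is an assumption made for illustration: the class name `PrimitivesSketch`, the use of plain `List` and `Map` in place of PCollection and PTable, and the method signatures are not taken from the FlumeJava library.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;

/** In-memory stand-ins for the four primitives; hypothetical, not the FlumeJava API. */
final class PrimitivesSketch {

    /** Applies fn to every element; fn may emit zero or more outputs per input. */
    static <T, R> List<R> parallelDo(List<T> in, BiConsumer<T, Consumer<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            fn.accept(element, out::add);
        }
        return out;
    }

    /** Collects all values that share a key into one list per key. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** Reduces each key's values to a single value with an associative combiner. */
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            // groupByKey never produces an empty group, so reduce() always has a result.
            combined.put(group.getKey(), group.getValue().stream().reduce(combiner).get());
        }
        return combined;
    }

    /** Concatenates several collections into one logical collection. */
    static <T> List<T> flatten(List<List<T>> inputs) {
        List<T> out = new ArrayList<>();
        for (List<T> input : inputs) {
            out.addAll(input);
        }
        return out;
    }

    /** Derived operation: pair each element with 1, group by element, sum the ones. */
    static <T> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones = parallelDo(in,
            (x, emit) -> emit.accept(new AbstractMap.SimpleEntry<T, Long>(x, 1L)));
        return combineValues(groupByKey(ones), Long::sum);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(count(words)); // prints {to=2, be=2, or=1, not=1}
    }
}
```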

Page 6: FlumeJava Easy, Efficient Data-Parallel Pipelines

MapShuffleCombineReduce (MSCR)

• Transforms combinations of the four primitives into a single MapReduce

• Generalizes MapReduce
  – Multiple reducers/combiners
  – Multiple outputs per reducer
  – Pass-through outputs

Page 7: FlumeJava Easy, Efficient Data-Parallel Pipelines

Optimization

Optimizer Strategy

1. Sink flattens
2. Lift CombineValues
3. Insert fusion blocks
4. Fuse parallelDos
5. Fuse MSCRs

Optimizer Output

1. MSCR
2. Flatten
3. Operate
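The "Fuse parallelDos" step in the optimizer strategy can be pictured as function composition: when one parallelDo feeds another, the two per-element functions are combined so the data is traversed once and no intermediate collection is built. The sketch below shows only that producer-consumer case on in-memory lists. The `FusionSketch` class, its simplified `DoFn` interface (which uses `Consumer` instead of the `EmitFn` type from the slide's example), and the `fuse` helper are hypothetical names introduced here, not the optimizer's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Illustrates producer-consumer fusion of two element-wise functions; hypothetical names. */
final class FusionSketch {

    /** Simplified per-element function: may emit zero or more outputs per input. */
    interface DoFn<T, R> {
        void process(T input, Consumer<R> emit);
    }

    /** Builds one DoFn that pipes every output of first directly into second. */
    static <A, B, C> DoFn<A, C> fuse(DoFn<A, B> first, DoFn<B, C> second) {
        return (input, emit) -> first.process(input, mid -> second.process(mid, emit));
    }

    /** Sequential stand-in for a parallel map over a collection. */
    static <T, R> List<R> parallelDo(List<T> in, DoFn<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            fn.process(element, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        DoFn<String, String> splitWords = (line, emit) -> {
            for (String word : line.split(" ")) {
                emit.accept(word);
            }
        };
        DoFn<String, Integer> wordLength = (word, emit) -> emit.accept(word.length());

        List<String> lines = Arrays.asList("to be or", "not to be");

        // Unfused: two passes with an intermediate list of words.
        List<Integer> twoPasses = parallelDo(parallelDo(lines, splitWords), wordLength);
        // Fused: one pass, no intermediate list.
        List<Integer> onePass = parallelDo(lines, fuse(splitWords, wordLength));

        System.out.println(twoPasses.equals(onePass)); // prints true
        System.out.println(onePass);                   // prints [2, 2, 2, 3, 2, 2]
    }
}
```

Both plans produce the same result; the fused version is what makes it possible to run several logical operations inside a single map or reduce phase.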

Page 8: FlumeJava Easy, Efficient Data-Parallel Pipelines

Hit or Miss?

• Sizable reduction in SLOC
  – Except for Sawzall

• 5x reduction in average number of stages

• Faster than other approaches
  – Except for hand-optimized MapReduce chains

• 319 users over a one-year period
