Duke University //dsg.uwaterloo.ca/seminars/notes/babu-starfish.pdfPractitioners of Big Data...

Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish


Page 1:

Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu

Duke University

http://www.cs.duke.edu/starfish

Page 2:

Practitioners of Big Data Analytics

6/29/2011 Starfish 2

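[Figure: practitioners of big data analytics positioned by data size and workload complexity]

Axes: Data Size; Workload Complexity

Practitioners: Google, Yahoo!, Facebook, eBay, Journalists, Systems researchers, Biologists, Economists, Physicists

Workload labels: Counts & Aggregates; Rollups & Drilldowns; Statistical analysis, Linear Algebra; Machine learning; Text / Images / Video / Graphs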

Page 3:

MapReduce/Hadoop Ecosystem

[Diagram: the MapReduce/Hadoop stack]

Hadoop: MapReduce Execution Engine, Distributed File System

Around Hadoop: Oozie, Hive, Pig, Elastic MapReduce, Java / Ruby / Python Client, Jaql, HBase

Page 4:

MADDER Principles of Big Data Analytics

Magnetic

Agile

Deep

Data-lifecycle-aware

Elastic

Robust

Page 5:

MADDER Principles of Big Data Analytics

Magnetic: Easy to get data into the system

Page 6:

MADDER Principles of Big Data Analytics

Agile: Make change (data/requirements) easy

Page 7:

MADDER Principles of Big Data Analytics

Deep: Support the full spectrum of analytics

Write MapReduce programs in Java / Python / R or use interfaces like Pig / Jaql

Page 8:

MADDER Principles of Big Data Analytics

Data-lifecycle-aware

[Figure: Data cycle at LinkedIn]

Page 9:

MADDER Principles of Big Data Analytics

Elastic: Adapt resources/costs to actual workloads

Page 10:

MADDER Principles of Big Data Analytics

Robust: Graceful degradation under undesirable events

Page 11:

Ease-of-Use Vs. Out-of-the-box Perf.

MADDER principles: Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust

Ease of use, but hard to get good performance out of the box:

Data can be opaque until run-time

Programs are a different beast from SQL

Policies are nontrivial

Page 12:

Starfish: MADDER + Self-Tuning

Ease of use: Get good performance automatically

MADDER principles: Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust

Diagram labels: Dynamic instrumentation; Profile; What-if calls; Relative; Mix of models & simulation; Recursive random search

Page 13:

Starfish: MADDER + Self-Tuning

Goal: Provide good performance automatically

MapReduce Execution Engine

Distributed File System

Hadoop

Oozie Hive Pig Elastic MR Java Client …

Starfish

Analytics System

Page 14:

What are the Tuning Problems?

Job-level MapReduce configuration

Workflow optimization (diagram labels: J1, J2, D)

Workload management

Data layout tuning

Cluster sizing

Page 15:

Starfish Architecture

What-if Engine

Job-level tuning: Job Optimizer, Profiler, Sampler

Workflow-level tuning: Workflow Optimizer

Workload-level tuning: Workload Optimizer, Elastisizer

Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

Page 16:

Starfish Architecture

What-if Engine

Job-level tuning: Job Optimizer, Profiler, Sampler

Workflow-level tuning: Workflow Optimizer

Workload-level tuning: Workload Optimizer, Elastisizer

Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

Page 17:

MapReduce Job Execution

Map function

Reduce function

Input Splits

Run this program as a MapReduce job

job j = < program p, data d, resources r, configuration c >

Page 18:

MapReduce Job Execution

Input Splits

Map Wave 1, Map Wave 2

Reduce Wave 1, Reduce Wave 2

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

job j = < program p, data d, resources r, configuration c >
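
The wave structure on this slide follows from simple arithmetic: tasks run in waves whenever there are more tasks than task slots. A minimal sketch, assuming a fixed number of slots and tasks of roughly equal length; the function name and the numbers are illustrative and do not come from the talk.

```python
import math

def num_waves(num_tasks: int, total_slots: int) -> int:
    """Number of waves needed to run num_tasks on total_slots parallel slots."""
    if num_tasks < 0 or total_slots <= 0:
        raise ValueError("need num_tasks >= 0 and total_slots > 0")
    return math.ceil(num_tasks / total_slots)

# Illustrative numbers: 4 map tasks on 2 map slots, 3 reduce tasks on 2 reduce slots.
map_waves = num_waves(4, 2)
reduce_waves = num_waves(3, 2)
print(map_waves, reduce_waves)
```

Changing the number of splits or reduce tasks (part of configuration c) or the number of slots (part of resources r) changes the wave count, which is one reason the question on the slide has no single fixed answer.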

Page 19:

Optimizing MapReduce Job Execution

The space of configuration choices includes settings for:

Number of map tasks

Number of reduce tasks

Partitioning of map outputs to reduce tasks

Memory allocation to task-level buffers

Multiphase external sorting in the tasks

Whether output data from tasks should be compressed

Whether a combine function should be used

job j = < program p, data d, resources r, configuration c >
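
The tuple j = <p, d, r, c> and a few of the configuration choices listed above can be written down as a small data structure. This is a sketch only: the class and field names are hypothetical, the default values are placeholders, and the Hadoop property names in the comments are the classic pre-YARN names, of which only io.sort.mb is mentioned in the talk itself.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JobConfig:
    """A handful of the configuration choices c (field names are hypothetical)."""
    num_reduce_tasks: int = 1          # mapred.reduce.tasks
    sort_buffer_mb: int = 100          # io.sort.mb
    sort_merge_factor: int = 10        # io.sort.factor
    compress_map_output: bool = False  # mapred.compress.map.output
    use_combiner: bool = False         # chosen in the job's code, not via a property

@dataclass(frozen=True)
class Job:
    """job j = <program p, data d, resources r, configuration c>."""
    program: str       # p, e.g. the name of the MapReduce program
    data: str          # d, e.g. an input path
    resources: str     # r, e.g. a cluster description
    config: JobConfig  # c

j = Job("WordCount", "/data/wikipedia", "16 x c1.medium", JobConfig())

# A variant of j that keeps p, d, and r but changes part of c.
j2 = replace(j, config=replace(j.config, num_reduce_tasks=40, sort_buffer_mb=200))
print(j2.config)
```

Keeping the four components separate makes it easy to express a hypothetical job that differs from a measured one in only d, r, or c.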

Page 20:

Optimizing MapReduce Job Execution

190+ parameters in Hadoop

[Figure: 2-dim projection of 13-dim surface]

Page 21:

Case for Automated Hadoop Tuning (from wiki.apache.org/pig/PigJournal)

Hadoop has many configuration parameters that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map.

Adding this feature will greatly increase the utility of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs.
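
The io.sort.mb example in the Pig Journal excerpt can be made concrete with rough arithmetic. The sketch below assumes a simplified model in which a map task spills each time its output fills a fixed fraction of the sort buffer. It ignores the buffer space Hadoop reserves for record metadata, and the function name, the 0.8 threshold, and the 400 MB figure are assumptions made for illustration.

```python
import math

def estimated_spills(map_output_mb: float, io_sort_mb: float,
                     spill_fraction: float = 0.8) -> int:
    """Rough number of map-side spills under the simplified model above."""
    if io_sort_mb <= 0 or not 0 < spill_fraction <= 1:
        raise ValueError("io_sort_mb must be > 0 and spill_fraction in (0, 1]")
    if map_output_mb <= 0:
        return 0
    return math.ceil(map_output_mb / (io_sort_mb * spill_fraction))

# Illustrative: a map task that emits 400 MB of output, for three buffer sizes.
for buffer_mb in (100, 200, 500):
    print(buffer_mb, estimated_spills(400, buffer_mb))
```

A larger buffer means fewer spills in this model, which is the rule of thumb the excerpt describes; whether fewer spills also means a faster job depends on the program and data.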

Page 22:

Starfish’s Core Approach to Tuning

Challenges: p is an arbitrary MapReduce program; c is high-dimensional; …

$perf = F(p, d, r, c)$ for program p, data d, resources r, configuration c

$c_{opt} = \arg\min_{c \in S} F(p, d, r, c)$

Profiler: Runs MapReduce jobs to collect job profiles (concise execution summary)

What-if Engine: Given profile of j = <p,d,r,c>, estimates virtual profile for job j' = <p,d’,r’,c’>

Optimizer(s): Enumerates and searches through the optimization space efficiently

Page 23:

Profile of a MapReduce Job

Concise representation of program execution

Records information at the level of “task phases”

[Figure: timeline of a job with two map waves (splits 0 to 3) and one reduce wave (out 0, out 1)]

Page 24:

Profile of a MapReduce Job

Concise representation of program execution

Records information at the level of “task phases”

Map Task Phases: Read, Map, Collect, Spill, Merge

[Figure: the map task for split 0. Diagram labels: DFS; split 0; Map func; Serialize, Partition; Memory Buffer; Sort, [Combine], [Compress]; Merge]

Page 25:

Profile of a MapReduce Job

Concise representation of program execution

Records information at the level of “task phases”

A profile has four parts:

Dataflow: Amount of data flowing through tasks & task phases

Dataflow Statistics: Statistical info about the dataflow

Cost: Execution time at level of tasks & task phases

Cost Statistics: Statistical info about the costs

Page 26:

Fields in a Job Profile

Dataflow:
  Map output bytes
  Number of map-side spills
  Number of merge rounds
  Number of records in buffer per spill

Cost:
  Read phase time in the map task
  Map phase time in the map task
  Collect phase time in the map task
  Spill phase time in the map task

Dataflow Statistics:
  Map func’s selectivity (output / input)
  Map output compression ratio
  Combiner’s selectivity
  Size of records (keys and values)

Cost Statistics:
  I/O cost for reading from local disk per byte
  I/O cost for writing to HDFS per byte
  CPU cost for executing Map func per record
  CPU cost for uncompressing the input per byte
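
The groups of fields on this slide map naturally onto a nested record. Below is a minimal sketch with hypothetical class and field names that covers only a few of the listed fields; it is not the Starfish profile format, and the map input size is an added field needed to derive selectivity.

```python
from dataclasses import dataclass

@dataclass
class Dataflow:
    map_input_bytes: int   # assumed field, not listed on the slide
    map_output_bytes: int
    num_spills: int

@dataclass
class Cost:
    read_phase_secs: float
    map_phase_secs: float
    collect_phase_secs: float
    spill_phase_secs: float

@dataclass
class MapProfile:
    dataflow: Dataflow
    cost: Cost

    def map_selectivity(self) -> float:
        """Dataflow statistic: map output size divided by map input size."""
        return self.dataflow.map_output_bytes / self.dataflow.map_input_bytes

    def total_secs(self) -> float:
        """Sum of the four phase times recorded in this sketch."""
        c = self.cost
        return (c.read_phase_secs + c.map_phase_secs
                + c.collect_phase_secs + c.spill_phase_secs)

# Made-up measurements for one map task.
profile = MapProfile(Dataflow(64_000_000, 96_000_000, 3), Cost(2.0, 5.0, 1.5, 4.0))
print(profile.map_selectivity(), profile.total_secs())
```

Dataflow and cost fields are raw measurements, while the statistics are derived ratios; that split matters later because the ratios are what can be carried over to a hypothetical job.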

Page 27:

Generating Profiles

Concise representation of program execution

Records information at the level of “task phases”

Generated by Profiler through measurement or by the What-if Engine through estimation

Page 28:

Generating Profiles by Measurement

Goals

• Have zero overhead when off, low overhead when on

• Require no modifications to Hadoop

• Support unmodified MapReduce programs written in Java/Python/Ruby/C++

Approach: Dynamic (on-demand) instrumentation

• Event-condition-action rules are specified (in Java)

• Leads to run-time instrumentation of Hadoop internals

• Monitors task phases of MapReduce job execution

• We currently use BTrace (Hadoop internals are in Java)
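
The event-condition-action idea can be illustrated without BTrace. The sketch below is a Python analogy, not the Starfish profiler: it wraps a method at run time (the event is a call to that method), checks a condition, and runs an action that records the elapsed time. All names are hypothetical.

```python
import time
from typing import Callable, Dict, List

timings: Dict[str, List[float]] = {}

def instrument(obj, method_name: str, condition: Callable[[], bool]) -> Callable[[], None]:
    """Wrap obj.method_name at run time; return a function that undoes the wrapping."""
    original = getattr(obj, method_name)

    def wrapper(*args, **kwargs):
        if not condition():                  # condition: is profiling wanted right now?
            return original(*args, **kwargs)
        start = time.perf_counter()          # event: the method is entered
        try:
            return original(*args, **kwargs)
        finally:                             # action: record the elapsed time
            timings.setdefault(method_name, []).append(time.perf_counter() - start)

    setattr(obj, method_name, wrapper)
    return lambda: setattr(obj, method_name, original)

class MapTask:
    """Stand-in for a task with one phase worth timing."""
    def spill(self, records):
        return sorted(records)

task = MapTask()
undo = instrument(task, "spill", condition=lambda: True)
task.spill([3, 1, 2])   # timed
undo()                  # wrapper removed, so later calls are not timed
task.spill([3, 1, 2])
print(len(timings["spill"]))
```

The point of the analogy is the first goal on the slide: the target code is never edited, and once the wrapper is removed the original method runs exactly as before.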

Page 29:

Generating Profiles by Measurement

[Figure: ECA rules enable profiling inside the task JVMs; raw data collected from each task (split 0 map, split 1 map, out 0 reduce) is turned into map profiles and reduce profiles, which form the job profile]

Use of Sampling

• Profile fewer tasks

• Execute fewer tasks

JVM = Java Virtual Machine, ECA = Event-Condition-Action

Page 30:

Overhead of Profile Measurement

Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud

Page 31:

What-if Engine

Inputs:
  Job Profile for <p, d1, r1, c1>
  Input Data Properties <d2>
  Cluster Resources <r2>
  Configuration Settings <c2>
  (<d2>, <r2>, <c2> are possibly hypothetical)

Components: Job Oracle, Task Scheduler Simulator

Output: Properties of Hypothetical Job; Virtual Job Profile for <p, d2, r2, c2>

Page 32:

What-if Questions Starfish can Answer

• How will job j’s execution time change if the number of reduce tasks is changed from 20 to 40?

• What will the change in I/O be if map o/p compression is turned on, but the input data size increases by 40%?

• What will job j’s new execution time be if 5 more nodes are added to the cluster, bringing the total to 20?

• How will workload execution time & dollar cost change if we move the production cluster from m1.xlarge nodes to c1.medium nodes on Amazon EC2?

Page 33:

Virtual Profile Estimation

Given profile for job j = <p, d1, r1, c1>

Estimate profile for job j' = <p, d2, r2, c2>

Inputs: Profile for j (Dataflow Statistics, Dataflow, Cost Statistics, Cost); Input Data d2; Configuration c2; Resources r2

Models used for each part of the (virtual) profile for j':
  Dataflow Statistics: Cardinality Models
  Dataflow: White-box Models
  Cost Statistics: Relative Black-box Models
  Cost: White-box Models
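
The estimation flow on this slide can be sketched for a single map task. This is a toy model under strong assumptions: the dataflow statistic (selectivity) carries over unchanged from the measured profile, dataflow is recomputed for the new input size, and cost statistics (per-byte costs) are reused as-is instead of being adjusted by a relative black-box model. The class, function, and field names are hypothetical, and the formulas are far simpler than anything a real what-if engine would need.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    input_bytes: int
    output_bytes: int
    read_secs: float    # time spent reading the input
    write_secs: float   # time spent writing the output

def estimate_virtual_profile(p: TaskProfile, new_input_bytes: int) -> TaskProfile:
    # Dataflow statistics: assume selectivity is unchanged by the new input.
    selectivity = p.output_bytes / p.input_bytes
    # Cost statistics: per-byte costs derived from the measured run.
    read_cost_per_byte = p.read_secs / p.input_bytes
    write_cost_per_byte = p.write_secs / p.output_bytes
    # Dataflow of the hypothetical job, computed from the statistic.
    new_output_bytes = round(new_input_bytes * selectivity)
    # Cost of the hypothetical job = dataflow x cost statistics.
    return TaskProfile(
        input_bytes=new_input_bytes,
        output_bytes=new_output_bytes,
        read_secs=new_input_bytes * read_cost_per_byte,
        write_secs=new_output_bytes * write_cost_per_byte,
    )

# Made-up measured profile; the hypothetical input is 40% larger.
measured = TaskProfile(input_bytes=100, output_bytes=50, read_secs=2.0, write_secs=4.0)
virtual = estimate_virtual_profile(measured, new_input_bytes=140)
print(virtual)
```

The order of the steps mirrors the slide: statistics first, then dataflow, then cost, so that nothing has to be executed to obtain the virtual profile.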

Page 34:

Job Optimizer

Inputs:
  Job Profile for <p, d1, r1, c1>
  Input Data Properties <d2>
  Cluster Resources <r2>

Job Optimizer: Subspace Enumeration; Recursive Random Search; What-if calls

Output: Best Configuration Settings <copt> for <p, d2, r2>
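
A simplified sketch of recursive random search over a two-parameter configuration space. It uses a toy cost function as a stand-in for a what-if call and treats both parameters as continuous. The optimizer's actual subspace enumeration, parameter set, and stopping rules are not given on the slide, so every name and constant here is hypothetical. Only the core loop is shown: sample the space at random, then repeatedly re-sample a shrinking box around the best point found so far.

```python
import random
from typing import Callable, Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]
Config = Dict[str, float]

def recursive_random_search(cost: Callable[[Config], float], bounds: Bounds,
                            samples: int = 30, rounds: int = 6,
                            shrink: float = 0.5, seed: int = 0) -> Tuple[Config, float]:
    rng = random.Random(seed)
    best: Config = {}
    best_cost = float("inf")
    box = dict(bounds)
    for _ in range(rounds):
        # Explore: sample the current box uniformly at random.
        for _ in range(samples):
            c = {k: rng.uniform(lo, hi) for k, (lo, hi) in box.items()}
            v = cost(c)
            if v < best_cost:
                best, best_cost = c, v
        # Exploit: shrink the box around the best point, clipped to the original bounds.
        new_box = {}
        for k, (lo, hi) in box.items():
            half = (hi - lo) * shrink / 2
            g_lo, g_hi = bounds[k]
            new_box[k] = (max(g_lo, best[k] - half), min(g_hi, best[k] + half))
        box = new_box
    return best, best_cost

# Toy stand-in for a what-if call: predicted running time as a function of c.
def what_if(c: Config) -> float:
    return (c["reducers"] - 27) ** 2 + (c["sort_mb"] - 180) ** 2 / 100 + 60

bounds = {"reducers": (1, 100), "sort_mb": (50, 300)}
best, predicted = recursive_random_search(what_if, bounds)
print(best, predicted)
```

Because each candidate is scored by a model call instead of a real job run, the search can afford many more evaluations than trial-and-error tuning on the cluster.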

Page 35:

Experimental Setup

Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type

2 map slots & 2 reduce slots

300MB max memory per task

Cost-Based Job Optimizer Vs. Rule-Based Optimizer

Page 36:

Experimental Setup

Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type

2 map slots & 2 reduce slots

300MB max memory per task

Cost-Based Job Optimizer Vs. Rule-Based Optimizer

Abbr.  MapReduce Program   Domain              Dataset
CO     Word Co-occurrence  NLP                 30GB, Wikipedia
WC     WordCount           Text Analytics      30GB, Wikipedia
TS     TeraSort            Business Analytics  30GB, Teragen
LG     LinkGraph           Graph Processing    10GB, Wikipedia (compressed)
JO     Join                Business Analytics  30GB, TPC-H

Page 37:

Job Optimizer Evaluation

[Figure: bar chart of Speedup (y-axis, 0 to 16) per MapReduce Job (x-axis: CO, WC, TS, LG, JO). Legend labels: Default Settings; Rule-Based Optimizer; Just-in-Time Optimizer; Cost-Based Job Optimizer]

Page 38:

Insights from Profiles for WordCount

A: Rule-Based

• Few, large spills

• Combiner gave high data reduction

• Combiner made Mappers CPU bound

B: Cost-Based (2x faster)

• Many, small spills

• Combiner gave smaller data reduction

• Better resource utilization in Mappers

Page 39:

Estimates from the What-if Engine

[Figure: two panels, True surface and Estimated surface]

Page 40:

Starfish Architecture

What-if Engine

Job-level tuning: Job Optimizer, Profiler, Sampler

Workflow-level tuning: Workflow Optimizer

Workload-level tuning: Workload Optimizer, Elastisizer

Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

Page 41:

Workflow Optimization Space

Optimization Space: Physical, Logical

Job-level Configuration

Dataset-level Configuration

Vertical Packing: Intra-job, Inter-job

Partition Function Selection

Page 42:

Optimizations on TF-IDF Workflow

[Figure: the TF-IDF workflow (jobs J1 to J4, map/reduce stages M1/R1 to M4, datasets D0, D1, D2, D4) shown before and after optimization]

Logical Optimization: J1 and J2 appear as a single job (J1, J2) with Partition: {D} and Sort: {D, W}; dataset D1 no longer appears

Physical Optimization: configuration settings shown as Reducers = 50, Compress = off, Memory = 400, … and Reducers = 20, Compress = on, Memory = 300, …

Dataset schemas shown: <{D},{W}>; <{D, W},{f}>; <{D},{W, f, c}>; <{W},{D, t}>

Legend: D = docname, W = word, f = frequency, c = count, t = TF-IDF

Page 43:

Starfish Architecture

What-if Engine

Job-level tuning: Job Optimizer, Profiler, Sampler

Workflow-level tuning: Workflow Optimizer

Workload-level tuning: Workload Optimizer, Elastisizer

Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

Page 44:

Multi-objective Cluster Provisioning

Cloud enables users to provision Hadoop clusters in minutes

Can avoid the man (system administrator) in the middle

Pay only for what resources are used

[Figure: two bar charts by EC2 Instance Type (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge): Running Time (min), axis 0 to 1,200, and Cost ($), axis 0.00 to 10.00]
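
The two charts trade running time against dollar cost. Below is a minimal sketch of one way to make that multi-objective choice: pick the cheapest instance type whose predicted running time meets a deadline. The hourly prices, running times, and cluster size are made-up placeholders, not the values plotted on the slide, and billing is assumed to round every node up to whole hours.

```python
import math
from typing import Dict, Optional, Tuple

def cluster_cost(minutes: float, nodes: int, price_per_node_hour: float) -> float:
    """Dollar cost when every node is billed for each started hour (assumption)."""
    return math.ceil(minutes / 60) * nodes * price_per_node_hour

def cheapest_within_deadline(options: Dict[str, Tuple[float, float]], nodes: int,
                             deadline_minutes: float) -> Optional[Tuple[str, float]]:
    """options maps instance type -> (predicted minutes, price per node-hour)."""
    feasible = [(cluster_cost(m, nodes, price), name)
                for name, (m, price) in options.items() if m <= deadline_minutes]
    if not feasible:
        return None
    cost, name = min(feasible)
    return name, cost

# Placeholder numbers for illustration only.
options = {
    "m1.small": (900.0, 0.10),
    "m1.large": (300.0, 0.40),
    "c1.medium": (240.0, 0.20),
}
print(cheapest_within_deadline(options, nodes=10, deadline_minutes=600))
```

With predicted running times in hand, the same comparison can be made before any cluster is provisioned, which is the scenario the charts illustrate.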

Page 45:

Multi-objective Cluster Provisioning

[Figure: two bar charts comparing Actual and Predicted values by EC2 Instance Type for Target Cluster (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge): Running Time (min), axis 0 to 1,200, and Cost ($), axis 0.00 to 10.00]

Instance Type for Source Cluster: m1.large

Page 46:

More Info: www.cs.duke.edu/starfish

Job-level MapReduce configuration

Workflow optimization (diagram labels: J1, J2, D)

Workload management

Data layout tuning

Cluster sizing