MapReduce performance prediction


  • 8/12/2019 MapReduce performance prediction

    1/29

    A Hadoop MapReduce Performance Prediction Method

    Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#

    * University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
    + Ecole Centrale de Paris, France
    # Beihang University, Beijing, China

  • 2/29

    Background

    Hadoop MapReduce

    [Figure: MapReduce data flow. The input data is split and processed by parallel Map tasks; their (key, value) output is partitioned (Partition 1, Partition 2) to the Reduce tasks, and the results are written to HDFS. A job consists of its Map and Reduce tasks.]

  • 3/29

    Background

    Hadoop

    There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.

    [Figure: Map task pipeline: Read, Map, Sort, Merge, Output.]

  • 4/29

    Motivation

    Problems

    Scheduling: Hadoop takes no account of the execution time or of the different types of resources consumed, so it may, for example, co-schedule two CPU-intensive jobs.

    Parameter tuning: there are numerous parameters and the default values are not optimal, yet every job runs with the default configuration.

  • 5/29

    Motivation

    Solution

    Predict the performance of Hadoop jobs, addressing both problems:

    Scheduling: no consideration of the execution time or of the types of resources consumed.

    Parameter tuning: numerous parameters whose default values are not optimal.

  • 6/29

    Related Work

    Existing prediction method 1: black-box based. Job features are fed to statistical or machine-learning models, which output an execution time. Drawbacks: no analysis of Hadoop internals, and the right model is hard to choose.

  • 7/29

    Related Work

    Existing prediction method 2: cost-model based. Job features are plugged into analytic cost functions such as

    F(map) = f(read, map, sort, spill, merge, write)
    F(reduce) = f(read, write, merge, reduce, write)

    which output an execution time. Drawback: accuracy is difficult to ensure, because many processes run concurrently and the stages are hard to divide cleanly.

  • 8/29

    Related Work

    A brief summary of the existing prediction methods:

    Black box
      Advantages: simple and effective; high accuracy; high isomorphism.
      Shortcomings: lack of job feature extraction; lack of analysis of Hadoop.

    Cost model
      Advantages: detailed analysis of Hadoop processing; the division is flexible (by stage, by resource); multiple predictions.
      Shortcomings: hard to divide each step and resource; lack of job feature extraction; many concurrent processes make it hard to model; better suited to theoretical analysis than to prediction.

    Common to both:
    o Simple prediction only
    o No analysis of the job itself (jar package + data)

  • 9/29

    Goal

    Design a Hadoop MapReduce performance prediction system that:

    - predicts the job's consumption of various types of resources (CPU, disk I/O, network);
    - predicts the execution time of the Map phase and the Reduce phase.

    [Diagram: a job enters the prediction system, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 10/29

    Design - 1

    Cost Model

    [Diagram: the job is fed to the cost model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 11/29

  • 12/29

    Cost Model [1]

    Cost function parameters analysis:

    Type One: constants (Hadoop system consumption, initialization consumption).

    Type Two: job-related parameters (map function computational complexity, number of map input records).

    Type Three: parameters defined by the cost model (sorting coefficient, complexity factor).

    [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231-239.

  • 13/29

    Parameters Collection

    Type One: run empty map tasks and calculate the system consumption from the logs.

    Type Three: extract the sort part from the Hadoop source code and sort a certain number of records.

    Type Two: running a new job and analyzing its log has high latency and a large overhead. Instead, sample the data and analyze only the behavior of the map and reduce functions: almost no latency and very low extra overhead. This is the Job Analyzer.

  • 14/29

    Job Analyzer - Implementation

    The Job Analyzer is a Hadoop virtual execution environment that accepts the job's jar file and its input data and outputs the job features.

    Sampling module: samples the input data by a certain percentage (less than 5%).

    MR module: instantiates the user job's classes using Java reflection.

    Analyze module: measures the input data (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).
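    The three modules above can be sketched in a few lines. This is a minimal Python illustration, not the actual Job Analyzer (which runs the user's Java classes via reflection inside a Hadoop virtual execution environment); the function names are hypothetical, and only the sub-5% sampling rate and the extracted features come from the slides.

    ```python
    import random

    def sample_input(records, fraction=0.05, seed=42):
        """Sampling module: keep a small fraction (< 5%) of the input records."""
        rng = random.Random(seed)
        return [r for r in records if rng.random() < fraction]

    def analyze_job(map_fn, records, fraction=0.05):
        """Analyze module: run the user's map function on the sample only and
        derive the job features used by the cost model."""
        sample = sample_input(records, fraction)
        in_bytes = sum(len(r) for r in sample)
        out_records = []
        for r in sample:
            out_records.extend(map_fn(r))
        out_bytes = sum(len(k) + len(v) for k, v in out_records)
        return {
            "input_amount": in_bytes,        # amount of sampled input data
            "input_records": len(sample),    # number of sampled records
            "conversion_rate": out_bytes / in_bytes if in_bytes else 0.0,
        }

    # Example user job: a word-count style map function
    def wordcount_map(line):
        return [(w, "1") for w in line.split()]
    ```

    Because only a small sample is processed and the job's map function runs outside a real cluster, the features are obtained with almost no latency, as the previous slide claims.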

  • 15/29

    Job Analyzer - Feasibility

    Data similarity: logs have a uniform format.

    Execution similarity: every record is processed repeatedly by the same map and reduce functions.

    [Figure: MapReduce data flow: the split input is processed by Map tasks, then aggregated by Reduce tasks.]

  • 16/29

    Design - 2

    Parameters Collection

    [Diagram: the Job Analyzer collects the Type Two parameters; a static parameters collection module collects the Type One and Type Three parameters. Both feed the cost model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 17/29

    Prediction Model

    Problem analysis: many steps run concurrently, so the total time cannot be obtained by adding up the time of each part.

    The steps of a Map task (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge sort, write disk, serialization) are classified by the resource they use: CPU, disk or network.

  • 18/29

    Prediction Model

    Main factors (according to the performance model), Map stage:

    Tmap = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the map function) + β5 * (conversion rate of the map data)

    where MapInput is the amount of input data and N is the number of input records.
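    Evaluating the model is just a dot product between the coefficient vector β0..β5 and the job-feature vector. A minimal Python sketch, assuming the coefficients have already been fitted; the function names are illustrative, not from the paper:

    ```python
    import math

    def map_features(map_input, n_records, complexity, conversion_rate):
        """Feature vector of the Map-stage cost model:
        [1, MapInput, N, N*log(N), map complexity, map conversion rate]."""
        return [1.0, map_input, n_records,
                n_records * math.log(n_records) if n_records > 1 else 0.0,
                complexity, conversion_rate]

    def t_map(betas, features):
        """Tmap = beta0 + beta1*MapInput + beta2*N + beta3*N*log(N)
                + beta4*complexity + beta5*conversion rate."""
        return sum(b * f for b, f in zip(betas, features))
    ```

    The N*log(N) term reflects the in-memory sort of the N map output records; the remaining terms are linear in the job features.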

  • 19/29

  • 20/29

  • 21/29

    Find the nearest jobs!

    Instance-based linear regression:
    - find the samples nearest to the job to be predicted in the history logs (nearest means similar jobs; take the top K nearest, with K = 10%-15%);
    - do a linear regression on the samples found;
    - calculate the predicted value.

    "Nearest" is measured by the weighted distance over the job features (weights w). High contribution to job classification: map/reduce complexity, map/reduce data conversion rate. Low contribution: data amount, number of records.
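    The procedure above can be sketched end to end: weight the feature distance, keep the top-K nearest historical jobs, and fit an ordinary least-squares model on those neighbors only. A minimal sketch assuming NumPy; the default K fraction follows the slide's 10%-15% range, while the weight values are left to the caller:

    ```python
    import numpy as np

    def nearest_neighbors(X_hist, x_new, weights, k):
        """Indices of the top-K historical jobs by weighted Euclidean
        distance in job-feature space (higher weight = higher contribution)."""
        d = np.sqrt(((X_hist - x_new) ** 2 * weights).sum(axis=1))
        return np.argsort(d)[:k]

    def predict_instance_based(X_hist, y_hist, x_new, weights, k_frac=0.12):
        """Fit a linear model on the K nearest samples only, then predict."""
        k = max(2, int(len(X_hist) * k_frac))        # K = 10%-15% of history
        idx = nearest_neighbors(X_hist, x_new, weights, k)
        A = np.hstack([np.ones((k, 1)), X_hist[idx]])  # intercept column
        coef, *_ = np.linalg.lstsq(A, y_hist[idx], rcond=None)
        return float(np.concatenate(([1.0], x_new)) @ coef)
    ```

    Fitting on the neighbors only, rather than on the whole history, is what makes the regression "instance-based": each prediction gets its own local linear model, built from jobs that behave like the one being predicted.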

  • 22/29

    Prediction Module

    Procedure: the cost model gives the main factors, i.e. Tmap = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the map function) + β5 * (conversion rate of the map data). The job features are used to search for the nearest samples, a prediction function is fitted on them, and the prediction results are computed.

  • 23/29

    Prediction Module

    Procedure: [Diagram: the training set feeds the find-neighbor module; together with the cost model it yields the prediction function, which produces the prediction results.]

  • 24/29

    Design - 3

    [Diagram: the Job Analyzer collects the Type Two parameters and a static parameters collection module collects the Type One and Type Three parameters; both feed the cost model. The prediction module outputs the Map and Reduce execution times, while the cost model outputs the CPU, disk and network occupation times.]

  • 25/29

    Experiments

    Task execution time (error rate), for 4 kinds of jobs with input sizes from 64 MB to 8 GB, under three settings:
    - K = 12%, with a different weight w for each feature;
    - K = 12%, with the same weight w for each feature;
    - K = 25%, with a different weight w for each feature.

    [Figure: error rate (%) per job ID for 40 Map tasks and 40 Reduce tasks under the settings k=12%, k=25%, and k=12% with w=1.]

  • 26/29

    Conclusion

    Job Analyzer:
    - analyzes the job jar + input file;
    - collects the parameters.

    Prediction Module:
    - finds the main factors;
    - proposes a linear equation;
    - classifies jobs;
    - makes multiple predictions.

  • 27/29

    Thank you!

    Questions?

  • 28/29

    Cost Model [1]

    Analysis of the Reduce stage:
    - modeling the consumption of each resource (CPU, disk, network);
    - each stage involves only one type of resource.

    The steps of a Reduce task (initiation, read data, network transfer, create object, reduce function, merge sort, read/write disk, write disk, serialization, deserialization) are classified by the resource they use: CPU, disk or network.

  • 29/29

    Prediction Model

    Main factors (according to the performance model), Reduce stage:

    Treduce = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the reduce function) + β5 * (conversion rate of the map data) + β6 * (conversion rate of the reduce data)

    where MapInput is the amount of input data and N is the number of input records.
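    The Reduce-stage model extends the Map-stage formula with one extra term, the reduce data conversion rate. A short sketch in the same illustrative style (the function name and argument names are not from the paper):

    ```python
    import math

    def t_reduce(betas, map_input, n, reduce_complexity,
                 map_conversion, reduce_conversion):
        """Treduce = beta0 + beta1*MapInput + beta2*N + beta3*N*log(N)
                   + beta4*reduce complexity + beta5*map conversion rate
                   + beta6*reduce conversion rate."""
        feats = [1.0, map_input, n,
                 n * math.log(n) if n > 1 else 0.0,
                 reduce_complexity, map_conversion, reduce_conversion]
        return sum(b * f for b, f in zip(betas, feats))
    ```

    The map conversion rate appears here because it determines how much intermediate data the shuffle delivers to the reducers.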