MapReduce performance prediction


  • 8/12/2019 MapReduce performance prediction

    1/29

    A Hadoop MapReduce Performance Prediction Method

    Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#

    * University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
    + Ecole Centrale de Paris, France
    # Beihang University, Beijing, China

  • 2/29

    Background

    Hadoop MapReduce

    [Figure: MapReduce data flow. The input data is split and processed by parallel Map tasks; their (key, value) output is partitioned (Partition 1, Partition 2) to the Reduce tasks, and the results are written to HDFS. A job consists of its Map and Reduce tasks.]

  • 3/29

    Background

    Hadoop

    There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.

    [Figure: Map task pipeline: Read, Map, Sort, Merge, Output.]

  • 4/29

    Motivation

    Problems

    Scheduling: Hadoop takes no account of the execution time or of the different types of resources consumed, so it may, for example, co-schedule two CPU-intensive jobs.

    Parameter tuning: there are numerous parameters and the default values are not optimal, yet every job runs with the default configuration.

  • 5/29

    Motivation

    Solution

    Predict the performance of Hadoop jobs, addressing both problems:

    Scheduling: no consideration of the execution time or of the types of resources consumed.

    Parameter tuning: numerous parameters whose default values are not optimal.

  • 6/29

    Related Work

    Existing prediction method 1: black-box based. Job features are fed to statistical or machine-learning models, which output an execution time. Drawbacks: no analysis of Hadoop internals, and the right model is hard to choose.

  • 7/29

    Related Work

    Existing prediction method 2: cost-model based. Job features are plugged into analytic cost functions such as

    F(map) = f(read, map, sort, spill, merge, write)
    F(reduce) = f(read, write, merge, reduce, write)

    which output an execution time. Drawback: accuracy is difficult to ensure, because many processes run concurrently and the stages are hard to divide cleanly.

  • 8/29

    Related Work

    A brief summary of the existing prediction methods:

    Black box
      Advantages: simple and effective; high accuracy; high isomorphism.
      Shortcomings: lack of job feature extraction; lack of analysis of Hadoop.

    Cost model
      Advantages: detailed analysis of Hadoop processing; the division is flexible (by stage, by resource); multiple predictions.
      Shortcomings: hard to divide each step and resource; lack of job feature extraction; many concurrent processes make it hard to model; better suited to theoretical analysis than to prediction.

    Common to both:
    o Simple prediction only
    o No analysis of the job itself (jar package + data)

  • 9/29

    Goal

    Design a Hadoop MapReduce performance prediction system that:

    - predicts the job's consumption of various types of resources (CPU, disk I/O, network);
    - predicts the execution time of the Map phase and the Reduce phase.

    [Diagram: a job enters the prediction system, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 10/29

    Design - 1

    Cost Model

    [Diagram: the job is fed to the cost model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 11/29

  • 12/29

    Cost Model [1]

    Cost function parameters analysis:

    Type One: constants (Hadoop system consumption, initialization consumption).

    Type Two: job-related parameters (map function computational complexity, number of map input records).

    Type Three: parameters defined by the cost model (sorting coefficient, complexity factor).

    [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231-239.

  • 13/29

    Parameters Collection

    Type One: run empty map tasks and calculate the system consumption from the logs.

    Type Three: extract the sort part from the Hadoop source code and sort a certain number of records.

    Type Two: running a new job and analyzing its log has high latency and a large overhead. Instead, sample the data and analyze only the behavior of the map and reduce functions: almost no latency and very low extra overhead. This is the Job Analyzer.

  • 14/29

    Job Analyzer - Implementation

    The Job Analyzer is a Hadoop virtual execution environment that accepts the job's jar file and its input data and outputs the job features.

    Sampling module: samples the input data by a certain percentage (less than 5%).

    MR module: instantiates the user job's classes using Java reflection.

    Analyze module: measures the input data (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).
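    The three modules above can be sketched in a few lines. This is a minimal Python illustration, not the actual Job Analyzer (which runs the user's Java classes via reflection inside a Hadoop virtual execution environment); the function names are hypothetical, and only the sub-5% sampling rate and the extracted features come from the slides.

    ```python
    import random

    def sample_input(records, fraction=0.05, seed=42):
        """Sampling module: keep a small fraction (< 5%) of the input records."""
        rng = random.Random(seed)
        return [r for r in records if rng.random() < fraction]

    def analyze_job(map_fn, records, fraction=0.05):
        """Analyze module: run the user's map function on the sample only and
        derive the job features used by the cost model."""
        sample = sample_input(records, fraction)
        in_bytes = sum(len(r) for r in sample)
        out_records = []
        for r in sample:
            out_records.extend(map_fn(r))
        out_bytes = sum(len(k) + len(v) for k, v in out_records)
        return {
            "input_amount": in_bytes,        # amount of sampled input data
            "input_records": len(sample),    # number of sampled records
            "conversion_rate": out_bytes / in_bytes if in_bytes else 0.0,
        }

    # Example user job: a word-count style map function
    def wordcount_map(line):
        return [(w, "1") for w in line.split()]
    ```

    Because only a small sample is processed and the job's map function runs outside a real cluster, the features are obtained with almost no latency, as the previous slide claims.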

  • 15/29

    Job Analyzer - Feasibility

    Data similarity: logs have a uniform format.

    Execution similarity: every record is processed repeatedly by the same map and reduce functions.

    [Figure: MapReduce data flow: the split input is processed by Map tasks, then aggregated by Reduce tasks.]

  • 16/29

    Design - 2

    Parameters Collection

    [Diagram: the Job Analyzer collects the Type Two parameters; a static parameters collection module collects the Type One and Type Three parameters. Both feed the cost model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk and network occupation times.]

  • 17/29

    Prediction Model

    Problem analysis: many steps run concurrently, so the total time cannot be obtained by adding up the time of each part.

    The steps of a Map task (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge sort, write disk, serialization) are classified by the resource they use: CPU, disk or network.

  • 18/29

    Prediction Model

    Main factors (according to the performance model), Map stage:

    Tmap = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the map function) + β5 * (conversion rate of the map data)

    where MapInput is the amount of input data and N is the number of input records.
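    Evaluating the model is just a dot product between the coefficient vector β0..β5 and the job-feature vector. A minimal Python sketch, assuming the coefficients have already been fitted; the function names are illustrative, not from the paper:

    ```python
    import math

    def map_features(map_input, n_records, complexity, conversion_rate):
        """Feature vector of the Map-stage cost model:
        [1, MapInput, N, N*log(N), map complexity, map conversion rate]."""
        return [1.0, map_input, n_records,
                n_records * math.log(n_records) if n_records > 1 else 0.0,
                complexity, conversion_rate]

    def t_map(betas, features):
        """Tmap = beta0 + beta1*MapInput + beta2*N + beta3*N*log(N)
                + beta4*complexity + beta5*conversion rate."""
        return sum(b * f for b, f in zip(betas, features))
    ```

    The N*log(N) term reflects the in-memory sort of the N map output records; the remaining terms are linear in the job features.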

  • 19/29

  • 20/29

  • 21/29

    Find the nearest jobs!

    Instance-based linear regression:
    - find the samples nearest to the job to be predicted in the history logs (nearest means similar jobs; take the top K nearest, with K = 10%-15%);
    - do a linear regression on the samples found;
    - calculate the predicted value.

    "Nearest" is measured by the weighted distance over the job features (weights w). High contribution to job classification: map/reduce complexity, map/reduce data conversion rate. Low contribution: data amount, number of records.
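    The procedure above can be sketched end to end: weight the feature distance, keep the top-K nearest historical jobs, and fit an ordinary least-squares model on those neighbors only. A minimal sketch assuming NumPy; the default K fraction follows the slide's 10%-15% range, while the weight values are left to the caller:

    ```python
    import numpy as np

    def nearest_neighbors(X_hist, x_new, weights, k):
        """Indices of the top-K historical jobs by weighted Euclidean
        distance in job-feature space (higher weight = higher contribution)."""
        d = np.sqrt(((X_hist - x_new) ** 2 * weights).sum(axis=1))
        return np.argsort(d)[:k]

    def predict_instance_based(X_hist, y_hist, x_new, weights, k_frac=0.12):
        """Fit a linear model on the K nearest samples only, then predict."""
        k = max(2, int(len(X_hist) * k_frac))        # K = 10%-15% of history
        idx = nearest_neighbors(X_hist, x_new, weights, k)
        A = np.hstack([np.ones((k, 1)), X_hist[idx]])  # intercept column
        coef, *_ = np.linalg.lstsq(A, y_hist[idx], rcond=None)
        return float(np.concatenate(([1.0], x_new)) @ coef)
    ```

    Fitting on the neighbors only, rather than on the whole history, is what makes the regression "instance-based": each prediction gets its own local linear model, built from jobs that behave like the one being predicted.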

  • 22/29

    Prediction Module

    Procedure: the cost model gives the main factors, i.e. Tmap = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the map function) + β5 * (conversion rate of the map data). The job features are used to search for the nearest samples, a prediction function is fitted on them, and the prediction results are computed.

  • 23/29

    Prediction Module

    Procedure: [Diagram: the training set feeds the find-neighbor module; together with the cost model it yields the prediction function, which produces the prediction results.]

  • 24/29

    Design - 3

    [Diagram: the Job Analyzer collects the Type Two parameters and a static parameters collection module collects the Type One and Type Three parameters; both feed the cost model. The prediction module outputs the Map and Reduce execution times, while the cost model outputs the CPU, disk and network occupation times.]

  • 25/29

    Experiments

    Task execution time (error rate), for 4 kinds of jobs with input sizes from 64 MB to 8 GB, under three settings:
    - K = 12%, with a different weight w for each feature;
    - K = 12%, with the same weight w for each feature;
    - K = 25%, with a different weight w for each feature.

    [Figure: error rate (%) per job ID for 40 Map tasks and 40 Reduce tasks under the settings k=12%, k=25%, and k=12% with w=1.]

  • 26/29

    Conclusion

    Job Analyzer:
    - analyzes the job jar + input file;
    - collects the parameters.

    Prediction Module:
    - finds the main factors;
    - proposes a linear equation;
    - classifies jobs;
    - makes multiple predictions.

  • 27/29

    Thank you!

    Questions?

  • 28/29

    Cost Model [1]

    Analysis of the Reduce stage:
    - modeling the consumption of each resource (CPU, disk, network);
    - each stage involves only one type of resource.

    The steps of a Reduce task (initiation, read data, network transfer, create object, reduce function, merge sort, read/write disk, write disk, serialization, deserialization) are classified by the resource they use: CPU, disk or network.

  • 29/29

    Prediction Model

    Main factors (according to the performance model), Reduce stage:

    Treduce = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the reduce function) + β5 * (conversion rate of the map data) + β6 * (conversion rate of the reduce data)

    where MapInput is the amount of input data and N is the number of input records.
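    The Reduce-stage model extends the Map-stage formula with one extra term, the reduce data conversion rate. A short sketch in the same illustrative style (the function name and argument names are not from the paper):

    ```python
    import math

    def t_reduce(betas, map_input, n, reduce_complexity,
                 map_conversion, reduce_conversion):
        """Treduce = beta0 + beta1*MapInput + beta2*N + beta3*N*log(N)
                   + beta4*reduce complexity + beta5*map conversion rate
                   + beta6*reduce conversion rate."""
        feats = [1.0, map_input, n,
                 n * math.log(n) if n > 1 else 0.0,
                 reduce_complexity, map_conversion, reduce_conversion]
        return sum(b * f for b, f in zip(betas, feats))
    ```

    The map conversion rate appears here because it determines how much intermediate data the shuffle delivers to the reducers.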