Location-aware MapReduce in Virtual Cloud 2011 IEEE computer society International Conference on Parallel Processing Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1 Reporter Yu Chih Lin

Location-aware MapReduce in Virtual Cloud

2011 IEEE Computer Society

International Conference on Parallel Processing

Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1

Reporter: Yu Chih Lin

Outline

Introduction

Background

Model and New Strategy

Implementation

Experiment

Conclusion

Introduction

MapReduce is an important programming model for

• Processing large data sets

• Generating large data sets

Commonly used in applications

• Web indexing

• Data mining

• Machine learning

Introduction

Multi-core CPUs support virtualization technology

• Two or more virtual machines (VMs) run simultaneously

• I/O resources are shared among users

MapReduce is set up on a distributed file system

• Google uses GFS

• Hadoop uses HDFS

Introduction

Running MapReduce in a virtual environment raises three major problems

• Disk sharing results in unbalanced data distribution and unbalanced workload

• I/O interference is caused by data imbalance and load imbalance

• Disk sharing reduces data redundancy

Introduction

Purpose of this paper

• Abstract a model

• Define evaluation metrics

• Analyze the data pattern and task pattern

For Hadoop

• Propose a location-aware file block allocation strategy

Introduction

Three main benefits of the strategy proposed in the paper

• MapReduce’s workload is more balanced

• I/O interference is reduced and HDFS’s performance improves

• Data redundancy is retained

Background

Traditional I/O interference comes in two kinds

• Disk interference: arises when multiple processes try to access the same disk simultaneously

• Network interference: mainly concerns latency and throughput

Background

I/O virtualization comes in two kinds

• KVM

• Paravirtualization

Virtual machines share CPUs and memory well, but not I/O.

Background

Virtualized Hadoop architecture

Model and New Strategy

Build a generation model to analyze different allocation strategies

• Data pattern

• Task pattern

To simplify the analysis, make four assumptions

Model and New Strategy

All physical machines have the same I/O devices and host the same number of virtual machines

All the virtual machines are in one local area network and the network topology is flat

Workload can be randomly assigned to any virtual machine without restriction

All file blocks have the same size

Model and New Strategy

actualReplicaNum (a) :

average number of distinct physical machines that hold a replica of a file block

Ideal value is 3 (when the replica number is 3)

Model and New Strategy

maxBlockNum (b) :

shows the maximum number of blocks stored on any one physical machine

Model and New Strategy

blockNumSigma (c) :

shows the variation in the number of blocks across physical machines

Ideal value is 0
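
The three data-pattern metrics above can be illustrated with a minimal sketch. It assumes that actualReplicaNum averages, over all blocks, the number of distinct physical machines holding a replica, that maxBlockNum counts every stored replica, and that blockNumSigma is a population standard deviation. The function name and the sample placement are hypothetical, not taken from the paper.

```python
from statistics import pstdev

def data_pattern_metrics(placement, num_machines):
    """placement[i] lists the physical machine index of each replica of block i."""
    # Replicas that land on the same physical machine count once toward redundancy.
    actual_replica_num = sum(len(set(reps)) for reps in placement) / len(placement)
    # Count every stored replica per physical machine.
    blocks_per_machine = [0] * num_machines
    for reps in placement:
        for machine in reps:
            blocks_per_machine[machine] += 1
    # Assumption: "sigma" is the population standard deviation of those counts.
    return actual_replica_num, max(blocks_per_machine), pstdev(blocks_per_machine)

# Hypothetical placement: 4 blocks, 3 replicas each, 4 physical machines.
# The last block has two replicas on machine 3 (two VMs sharing one disk).
placement = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 3, 1]]
a, b, c = data_pattern_metrics(placement, num_machines=4)
print(a, b, c)
```

In this sample the shared machine lowers actualReplicaNum below 3 and raises maxBlockNum above the balanced value, which is the effect the metrics are meant to expose.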

Model and New Strategy

maxAssignedNum (d) :

shows the maximum number of tasks assigned to a physical machine

Model and New Strategy

assignedNumSigma (e) :

reveals the load balance of the task pattern
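
A rough sketch of the two task-pattern metrics follows. The slides do not state how tasks are mapped to machines, so this assumes one task per block, assigned to a physical machine picked uniformly at random among those holding a replica, and again uses a population standard deviation. All names here are hypothetical.

```python
import random
from statistics import pstdev

def sample_task_pattern(placement, num_machines, rng):
    """Assign one task per block to a random machine that holds a replica."""
    assigned = [0] * num_machines
    for reps in placement:
        # Assumption: every machine holding the block is equally likely.
        assigned[rng.choice(sorted(set(reps)))] += 1
    return assigned

def task_pattern_metrics(assigned):
    # maxAssignedNum and assignedNumSigma for one sampled task pattern.
    return max(assigned), pstdev(assigned)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
placement = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
assigned = sample_task_pattern(placement, num_machines=4, rng=rng)
d, e = task_pattern_metrics(assigned)
print(assigned, d, e)
```

Averaging these two values over many sampled task patterns would give figures comparable in kind to the averages reported later in the deck, though not necessarily the same numbers.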

Model and New Strategy

A new allocation strategy

• Place the replicas of a file block on different physical machines

• Keep the number of blocks on each physical machine balanced

Present two intuitive ways

• Round-robin allocation

• Serpentine allocation

Model and New Strategy

For example, take p = 8, n = 8 (p: physical machines, n: file blocks)

An example of round-robin allocation
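
Since the figure for this slide is not available, here is one plausible reading of round-robin allocation as a sketch: replicas are handed out to physical machines in cyclic order, three per block. The exact ordering used in the paper may differ, and the function name is hypothetical.

```python
def round_robin_allocation(num_blocks, num_machines, replicas=3):
    """Hand out replicas to machines 0, 1, ..., p-1, 0, 1, ... in turn."""
    placement = []
    slot = 0
    for _ in range(num_blocks):
        reps = []
        for _ in range(replicas):
            reps.append(slot % num_machines)
            slot += 1
        placement.append(reps)
    return placement

# The slide's example size: p = 8 physical machines, n = 8 file blocks.
placement = round_robin_allocation(num_blocks=8, num_machines=8)
counts = [sum(reps.count(m) for reps in placement) for m in range(8)]
for block, reps in enumerate(placement):
    print(block, reps)
print(counts)
```

Under this reading every block lands on three different physical machines and every machine stores the same number of replicas, as long as the replica count does not exceed the number of machines.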

Model and New Strategy

For example, take p = 8, n = 8 (p: physical machines, n: file blocks)

An example of serpentine allocation

Model and New Strategy

Evaluation metrics for the data pattern

actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0

Average results over the enumerated task patterns

Round-robin allocation:

maxAssignedNum = 2.2724, assignedNumSigma = 0.7943

Serpentine allocation:

maxAssignedNum = 2.2705, assignedNumSigma = 0.79323

Implementation

Choose serpentine allocation

Add the location information of virtual node into the network topology

For example, with a single rack of physical machines

• the path may change from /default-rack to /Phy0

For example, with several racks of physical machines

• the path may change from /rack1 to /rack1/Phy0
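
The two path rewrites above can be sketched as a small helper. The function name and the integer index for the physical machine are assumptions made for illustration; only the example paths come from the slide.

```python
def location_aware_path(rack_path, phy_index):
    """Extend a rack path with a label for the hosting physical machine."""
    label = f"Phy{phy_index}"
    # Single-rack case from the slide: /default-rack becomes /Phy0.
    if rack_path == "/default-rack":
        return "/" + label
    # Multi-rack case from the slide: /rack1 becomes /rack1/Phy0.
    return rack_path.rstrip("/") + "/" + label

print(location_aware_path("/default-rack", 0))
print(location_aware_path("/rack1", 0))
```

Virtual machines on the same host would then share one path, so a placement policy that spreads replicas across paths also spreads them across physical machines.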

Implementation

The mechanism is easy to adopt in Hadoop

• It keeps compatibility with native Hadoop

• It uses a special label starting with “Phy”

• The label identifies the locations of virtual machines
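
A minimal sketch of how such a label could be read back from a topology path, assuming the label is the last path component. The function name is hypothetical; paths without the label return None, which is one way to stay compatible with an unmodified topology.

```python
def physical_machine_of(topology_path):
    """Return the 'Phy...' label of a path, or None for a native Hadoop path."""
    last = topology_path.rstrip("/").split("/")[-1]
    # Assumption: the physical-machine label is the final path component.
    if last.startswith("Phy"):
        return last
    return None

print(physical_machine_of("/rack1/Phy0"))
print(physical_machine_of("/default-rack"))
```

A plain prefix test would also match a rack that happens to be named with the same prefix, so a real deployment would need to reserve the prefix or use a stricter pattern.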

Implementation

To maintain the block information for each virtual node

• In the NameNode of Hadoop, add a list sorted by the number of blocks

On each update

• First, update the block number of the virtual node

• Second, update its position in the sorted list
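
The two-step update can be sketched with Python's bisect module. This is an illustration of the idea only, not the NameNode code; the class and method names are hypothetical.

```python
import bisect

class BlockCountIndex:
    """Keeps virtual nodes ordered by how many blocks each one stores."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        # Sorted list of (block count, node name) pairs.
        self.sorted_nodes = sorted((0, node) for node in nodes)

    def add_blocks(self, node, delta=1):
        old = self.count[node]
        # Step 1: update the block number of the virtual node.
        self.count[node] = old + delta
        # Step 2: update its position in the sorted list.
        i = bisect.bisect_left(self.sorted_nodes, (old, node))
        del self.sorted_nodes[i]
        bisect.insort(self.sorted_nodes, (old + delta, node))

    def least_loaded(self, k):
        # The first k entries are the nodes with the fewest blocks.
        return [node for _, node in self.sorted_nodes[:k]]

index = BlockCountIndex(["vm0", "vm1", "vm2"])
index.add_blocks("vm0")
index.add_blocks("vm0")
index.add_blocks("vm1")
print(index.least_loaded(2))
```

Keeping the list sorted makes it cheap to find the nodes with the fewest blocks when choosing where the next replica goes.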

Evaluation

Simulation to compare

• The new strategy (serpentine allocation) and Hadoop’s original strategy

Parameters

n = 256

p = [8,16,32,64,128,256]

Sampling number is set to 1,000,000
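
A sampling run of this kind might look like the sketch below. It stands in for the original strategy with a simple assumption: each block's 3 replicas go to 3 distinct virtual machines chosen uniformly at random. The number of VMs per physical machine (4) and the small sample count are hypothetical, so the output is not expected to reproduce the figures reported on the slides.

```python
import random

def sample_random_placement(num_blocks, num_machines, vms_per_machine, rng, replicas=3):
    """Pick distinct random VMs per block and map each VM to its physical host."""
    total_vms = num_machines * vms_per_machine
    placement = []
    for _ in range(num_blocks):
        vms = rng.sample(range(total_vms), replicas)
        placement.append([vm // vms_per_machine for vm in vms])
    return placement

def average_actual_replica_num(num_blocks, num_machines, vms_per_machine, samples, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        placement = sample_random_placement(num_blocks, num_machines, vms_per_machine, rng)
        # Distinct physical machines per block, averaged over the blocks.
        total += sum(len(set(reps)) for reps in placement) / num_blocks
    return total / samples

# n = 256 and p = 8 as on the slide; far fewer samples than 1,000,000.
avg = average_actual_replica_num(256, 8, vms_per_machine=4, samples=200)
print(avg)
```

The same loop can collect the other four metrics per sample and average them, and can be repeated for each value of p in the list above.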

Evaluation

Comparison of maxBlockNum for Hadoop’s original strategy and the new strategy, using sampling

Evaluation

Comparison of actualReplicaNum for the original and new strategies

Evaluation

Comparison of blockNumSigma for the original and new strategies

Evaluation

Comparison of maxAssignedNum for the original and new strategies

Evaluation

Comparison of assignedNumSigma for the original and new strategies

Experiment

n = 224, p = 8, sampling number = 1,000,000

Metric (average)     Original    New
actualReplicaNum     2.0657      3
maxBlockNum          90.5798     84
blockNumSigma        4.1722      0
maxAssignedNum       33.7660     34.5946
assignedNumSigma     3.6256      4.14939
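
As a quick sanity check on the "New" column, the reported maxBlockNum of 84 matches a perfectly even spread of replicas, assuming the replica number of 3 mentioned earlier in the deck. The variable names are illustrative.

```python
n, p, replicas = 224, 8, 3  # blocks, physical machines, replicas per block

# Total replicas divided evenly across the physical machines.
total_replicas = n * replicas
balanced_blocks_per_machine = total_replicas // p
print(total_replicas, balanced_blocks_per_machine)
```

An even spread also explains the blockNumSigma of 0 in the same column, since every machine then stores the same number of replicas.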

Experiment

Experimental results: RandomWriter execution time

Red: SC off; Blue: SC on

Experiment

Experimental results: TextSort execution time

Red: SC off; Blue: SC on

Experiment

Experimental results: WordCount execution time

Red: SC off; Blue: SC on

Conclusion

Address the problems of data allocation and their impact on a MapReduce system

Build a model and evaluation metrics to evaluate the data and task patterns

Propose a new strategy for file block allocation in Hadoop

Simulation and real experiment results

• Show that the new allocation strategy is effective