Transcript of: M3R: Increased Performance for In-Memory Hadoop Jobs. VLDB, August 2012 (to appear). Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat.

Page 1:

M3R: INCREASED PERFORMANCE FOR IN-MEMORY HADOOP JOBS

VLDB, August 2012 (to appear)

Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat

Page 2:

BACKGROUND

The Hadoop MapReduce (HMR) engine has had a transformational effect on the practice of Big Data computing. It is based on HDFS (a resilient distributed filesystem): data is automatically partitioned across nodes, and operations are applied in parallel.

Its remarkable properties: a simple, widely applicable, parallelizable, scalable, and resilient framework.

Page 3:

DISADVANTAGES

Design point: offline, long-lived, resilient computations.

The HMR API supports only single-job execution. Jobs incur I/O and (de-)serialization costs. The mappers and reducers for each job are started in new JVMs (JVM startup is typically expensive), and an out-of-core shuffle implementation is used.

These choices have a substantial effect on performance, yet we need interactive analytics.

Page 4:

M3R --- A NEW DESIGN POINT

M3R (Main Memory MapReduce) is a new implementation of the HMR API.

M3R/Hadoop: an implementation of the HMR API written in managed X10. Existing Hadoop applications just work. It reuses HDFS (and some other parts of Hadoop). In-memory: the problem size must fit in cluster RAM. Not resilient: if any node goes down, the job fails. But it is considerably faster (closer to HPC speeds).

Page 5:

X10

A type-safe, object-oriented, multi-threaded, multi-node, garbage-collected programming language.

X10 is built on two fundamental notions: places and asynchrony.

Place: roughly, a process. A place supplies memory and worker threads, and holds a collection of resident mutable data objects together with the activities that operate on that data.

Asynchrony: asynchrony is used both within a place and for communication across places.

Page 6:

ADVANTAGES OF M3R

Reducing disk I/O.

Reducing network communication.

Reducing serialization/deserialization costs.

M3R affords significant benefits for job pipelines.

Page 7:

OUTLINE

HMR engine execution flow. M3R engine execution flow. Evaluation. Conclusions. Future work.

Page 8:

BASIC FLOW FOR A HADOOP JOB

[Figure: basic flow of a Hadoop job]
File System (HDFS) -> Input (InputFormat / RecordReader / InputSplit) -> Map (Mapper) -> Shuffle (backed by the file system) -> Reduce (Reducer) -> Output (OutputFormat / RecordWriter / OutputCommitter) -> File System (HDFS).
Costs annotated on the figure: disk I/O and deserialization cost on input; network and disk I/O with (de)serialization cost through the shuffle; serialization cost and disk I/O on output.
How can we eliminate these I/Os? M3R.
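For reference, here is a minimal, generic Hadoop driver (illustrative, not code from the paper) showing where each box in the flow above is configured. It uses Hadoop's built-in identity Mapper and Reducer so that it is self-contained; a real job would substitute its own classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "flow-example");
        job.setJarByClass(ExampleDriver.class);

        // Input: InputFormat/RecordReader/InputSplit, reading from the file system (HDFS).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Map and Reduce; a real job substitutes its own Mapper/Reducer subclasses here.
        job.setMapperClass(Mapper.class);    // identity mapper
        job.setReducerClass(Reducer.class);  // identity reducer; the shuffle runs between the two
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Output: OutputFormat/RecordWriter/OutputCommitter, writing back to the file system (HDFS).
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}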

Page 9:

M3R EXECUTION FLOW

The general flow of M3R is similar to the flow of the HMR engine, but an M3R instance is associated with a fixed set of JVMs. It offers significant benefits by avoiding network, file I/O, and (de-)serialization costs, especially for job pipelines, through four mechanisms: input/output cache, co-location, partition stability, and de-duplication.

Page 10:

INPUT/OUTPUT CACHE

M3R introduces an in-memory key/value cache. On input, it caches key/value pairs in memory before passing them to the mapper; on output, it caches them before serializing and writing them to disk. Subsequent jobs can obtain the required key/value sequence directly from the cache. Because the data is stored in memory, there are no attendant (de)serialization costs or disk/network I/O activity.
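A minimal conceptual sketch of such a cache, with illustrative names only (M3R's actual cache integrates with the Hadoop input/output formats and is managed quite differently):

import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch: an on-heap key/value cache keyed by input path. On a hit,
// the cached pairs are fed straight to the mappers; on a miss, the pairs read
// from HDFS are retained so later jobs in the pipeline can skip the read.
public final class KeyValueCache<K, V> {
    private final Map<String, List<SimpleImmutableEntry<K, V>>> byPath =
            new ConcurrentHashMap<>();

    /** Returns the cached pairs for a path, or null to fall back to HDFS + RecordReader. */
    public List<SimpleImmutableEntry<K, V>> lookup(String path) {
        return byPath.get(path);
    }

    /** Retains pairs on-heap: no disk I/O or (de)serialization when they are reused. */
    public void retain(String path, List<SimpleImmutableEntry<K, V>> pairs) {
        byPath.put(path, pairs);
    }
}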

Page 11:

BASIC FLOW FOR AN M3R/HADOOP JOB

[Figure: basic flow of an M3R/Hadoop job]
File System (HDFS) and Cache -> Input (InputFormat / RecordReader / InputSplit) -> Map (Mapper) -> Shuffle -> Reduce (Reducer) -> Output (OutputFormat / RecordWriter / OutputCommitter) -> Cache and File System (HDFS).
Disk and network I/O and (de)serialization costs are eliminated, especially for the shuffle.
Single job: disk I/O is eliminated by getting rid of the file system backing for the two sides of the shuffle (no disk I/O on either side).
Job pipelines: no network or disk I/O, and no (de)serialization costs.

Page 12:

SHUFFLE

Shuffle describes the process by which data moves from the output of a map task to the input of a reduce task.

Most map tasks and reduce tasks run on different nodes, so a reduce task must pull the results of map tasks from other nodes across the network (network I/O).

Goals of the shuffle:
Pull the data completely from the map side to the reduce side.
When pulling data across nodes, minimize unnecessary bandwidth consumption.
Reduce the impact of disk I/O on task execution.

The main opportunities for optimization are reducing the amount of data pulled and using memory rather than disk wherever possible.

Page 13:

ELIMINATE NETWORK I/O AND DISK I/O

Co-location: multiple mappers and reducers are started in each place. Some of the data a mapper is sending is destined for a reducer running in the same JVM, and for that data the M3R engine guarantees that no network or disk I/O is involved, as sketched below.
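A conceptual sketch of that decision (illustrative names, not M3R internals): pairs whose partition maps to the local place are kept on-heap and handed to the co-located reducer, while only the remaining pairs would need serialization and network transfer.

import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: a shuffle sender running at place `herePlace`.
public final class ColocatedShuffle<K, V> {
    private final int herePlace;
    private final int numPlaces;
    private final Map<Integer, List<SimpleImmutableEntry<K, V>>> localQueues = new HashMap<>();
    private long remotePairs = 0;  // pairs that would require serialization + network I/O

    public ColocatedShuffle(int herePlace, int numPlaces) {
        this.herePlace = herePlace;
        this.numPlaces = numPlaces;
    }

    // Deterministic partition -> place mapping (see partition stability, later).
    private int placeOf(int partition) {
        return partition % numPlaces;
    }

    public void emit(K key, V value, int partition) {
        if (placeOf(partition) == herePlace) {
            // The reducer for this partition runs in the same JVM: hand over an
            // in-memory reference, with no serialization, network, or disk I/O.
            localQueues.computeIfAbsent(partition, p -> new ArrayList<>())
                       .add(new SimpleImmutableEntry<>(key, value));
        } else {
            remotePairs++;  // in a real engine: serialize and send to the remote place
        }
    }
}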

Page 14:

MINIMIZE THE AMOUNT THAT NEEDS TO BE COMMUNICATED

We cannot entirely avoid the time and space overhead of (de)serialization in the shuffle, because the nodes need to communicate. But we can reduce the amount that needs to be communicated.

Page 15:

MAPPERS/SHUFFLE/REDUCERS

[Figure: Mapper1 through Mapper6 feeding Reducer1 through Reducer6 via the shuffle]
Through the shuffle, the mappers send data to various reducers.

Page 16:

M3R---PARTITION STABILITY

M3R provides a partition stability guarantee: the mapping from partitions to places is deterministic. This allows job sequences to use a consistent partitioner to route data locally, as sketched below. The reducer associated with a given partition number will always be run at the same place; same place means same memory, so existing data structures can be reused. This avoids a significant amount of communication.
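A hedged sketch of what this enables at the API level: both jobs of a pipeline install the same partitioner class (RowPartitioner here is a hypothetical class, sketched concretely after the next slide), so under partition stability the data for a given partition keeps landing at the same place across jobs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PipelineSetup {
    public static void configure(Configuration conf) throws Exception {
        // Both jobs of an iteration route by the same partitioner, so the reducer
        // for partition p (and the data it already holds) stays at one place.
        Job multiply = Job.getInstance(conf, "matvec-multiply");
        multiply.setPartitionerClass(RowPartitioner.class);

        Job sum = Job.getInstance(conf, "matvec-sum");
        sum.setPartitionerClass(RowPartitioner.class);  // same partitioner => same routing
    }
}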

Page 17:

PARTITIONER: CONNECTING MAPPERS AND REDUCERS

[Figure: the Partitioner connects Mapper1 through Mapper6 to Reducer1 through Reducer6 via the shuffle]

For each map-output pair, the Partitioner computes the partition number that selects the destination reducer:

int partitionNumber = getPartition(key, value);
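Fleshed out in the standard Hadoop style (a generic hash partitioner, not code from the paper), the partition number returned here is what M3R maps deterministically to a place:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// A hash partitioner over long (e.g. row-index) keys. The partition number it
// returns selects the reducer, and under M3R also the place that reducer runs at.
public class RowPartitioner extends Partitioner<LongWritable, DoubleWritable> {
    @Override
    public int getPartition(LongWritable key, DoubleWritable value, int numPartitions) {
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}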

Page 18:

DE-DUPLICATION

M3R co-locates reducers. It coalesces duplicate keys and duplicate values and sends only one copy; on deserialization at the destination, the other occurrences become aliases of that copy. This also works if multiple mappers at a single place send the same data.
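A conceptual sketch of the sender-side bookkeeping (illustrative only, not M3R's wire format): each distinct object gets an id; the first occurrence is written in full, and later occurrences send only a back-reference, which the receiver resolves to an alias of the single deserialized copy.

import java.util.IdentityHashMap;
import java.util.Map;

// Conceptual sketch: coalesce duplicate objects before sending, so each distinct
// key/value crosses the wire once and receivers see aliases to one copy.
public final class DedupWriter {
    private final Map<Object, Integer> seen = new IdentityHashMap<>();

    /** Returns -1 if the object must be written in full, else the back-reference id. */
    public int registerOrBackref(Object o) {
        Integer id = seen.get(o);
        if (id != null) {
            return id;                 // duplicate: send only the id
        }
        seen.put(o, seen.size());      // first occurrence: send the full object
        return -1;
    }
}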

Page 19:

HADOOP BROADCAST

[Figure: broadcast in Hadoop: through the shuffle, the mappers send the broadcast data to every one of Reducer1 through Reducer6]

Page 20:

M3R BROADCAST VIA DE-DUPLICATION

[Figure: broadcast in M3R: with de-duplication, each distinct value crosses the shuffle to a place only once, and the co-located reducers at that place share it]

Page 21:

M3R BROADCAST VIA DE-DUPLICATION

[Figure: the same broadcast-via-de-duplication diagram as the previous slide]

Page 22:

EXAMPLE: ITERATED MATRIX-VECTOR MULTIPLICATION IN HADOOP

[Figure: one iteration as two Hadoop jobs, with all inputs and outputs on the File System (HDFS)]
Job 1: Input (G) -> Map/Pass (G) and Input (V) -> Map/Bcast (V) -> Shuffle -> Reducer (*) -> Output V#.
Job 2: Input (V#) -> Map/Pass (V#) -> Shuffle -> Reducer (+) -> Output V'.
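To make the two-stage structure concrete, here is a minimal sketch of one standard MapReduce formulation of sparse matrix-vector multiplication: stage 1 pairs each matrix entry g_ij with the vector component v_j and emits the partial products keyed by row i; stage 2 sums them into v'_i. The paper's variant broadcasts V to the reducers rather than joining on the column index, but the multiply-then-add job pipeline is the same. The tab-separated Text encoding of the values is an assumption made only for this sketch.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Stage 1 (Reducer (*)): the values for column j are either the vector component
// ("V<tab>v_j") or matrix entries ("G<tab>i<tab>g_ij"); emit g_ij * v_j keyed by row i.
class MultiplyReducer extends Reducer<LongWritable, Text, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable j, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double vj = 0.0;
        List<Long> rows = new ArrayList<>();   // buffered because the vector entry may
        List<Double> gs = new ArrayList<>();   // arrive anywhere in the value stream
        for (Text t : values) {
            String[] f = t.toString().split("\t");
            if (f[0].equals("V")) { vj = Double.parseDouble(f[1]); }
            else { rows.add(Long.parseLong(f[1])); gs.add(Double.parseDouble(f[2])); }
        }
        for (int k = 0; k < rows.size(); k++) {
            ctx.write(new LongWritable(rows.get(k)), new DoubleWritable(gs.get(k) * vj));
        }
    }
}

// Stage 2 (Reducer (+)): sum the partial products for each row i to obtain v'_i.
class SumReducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable i, Iterable<DoubleWritable> parts, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable p : parts) { sum += p.get(); }
        ctx.write(i, new DoubleWritable(sum));
    }
}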

Page 23:

ITERATED MATRIX VECTOR MULTIPLICATION IN M3R

[Figure: the same iteration in M3R, with the Cache alongside the File System (HDFS)]
Job 1: Input (G) -> Map/Pass (G) and Input (V) -> Map/Bcast (V) -> Shuffle -> Reducer (*).
Job 2: Map/Pass (V#) -> Shuffle -> Reducer (+) -> Output V'.
Annotations: do not communicate G; do no communication.

Page 24:

Page 25:

EVALUATION

A 20-node cluster of IBM LS-22 blades connected by Gigabit Ethernet.

Each node has two quad-core AMD 2.3 GHz Opteron processors and 16 GB of memory, and runs Red Hat Enterprise Linux 6.2. The JVM used is IBM J9 1.6.0.

When running M3R on this cluster, we used one process per host, with 8 worker threads to exploit the 8 cores.

Page 26:

[Figure: per-iteration running times]
No partition stability, no cache: every iteration takes the same amount of time.
Performance changes drastically according to the amount of remote shuffling.

Page 27:

Page 28:

CONCLUSIONS

We sacrifice resilience and out-of-core execution, and gain performance. We used X10 to build a fast map/reduce engine, and used X10 features to implement the distributed cache, avoiding serialization, disk, and network I/O costs.

50x faster for a Hadoop application designed for M3R.

Page 29:

Thank you for your time!