SEMem: deployment of MPI-based in-memory storage for ... · Running Hadoop on modern supercomputers...

Thanh-Chung Dao and Shigeru ChibaThe University of Tokyo, Japan

1Thanh-Chung Dao

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

Running Hadoop on modern supercomputers

• Hadoop assumes every compute node has a local disk drive

• Modern supercomputers do not have local disk drives• It only has a central file server using e.g. Lustre• For example, K computer, Cray Titan, and IBM Sequoia

Thanh-Chung Dao 2SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

From Fujitsu

Why supercomputers do not have local disk drives

• Local disk• Not scalable• Hard to maintain• Physical space is limited

• It cannot be shared among all users

• SSD is available on some supercomputers• But data should be erased after a job finishes

3SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao

Using in-memory storage to provide efficient virtual local disks

• Research question: How to deploy in-memory storage on supercomputers• Choose the best deployment strategy in context of MapReduce

• Using in-memory storage is natural approach to avoid expensive disk I/O to central file server• Data is kept in memory

• Memcached-like separate in-memory server is also an option• Typical deployment of Memcached software is installing its daemon on

dedicated nodes

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 4Thanh-Chung Dao

Our approach: SEMemin-memory file system

• Users can choose three deployment strategies• RamDisk: data is stored only in local memory• Every-node: data can be stored in remote memory• Dedicated-node: data is stored on dedicated nodes

• Our in-memory storage, SEMem:• Easily configurable to select appropriate deployment strategy• Tightly integrated with Hadoop• Using MPI communication [Dao & Chiba, CCGRID’16]


RamDisk: data is stored only in local memory

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 6

Not available

Node

Available

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data

RamDisk: data is stored only in local memory


In-memory storage

Node 2

In-memory storage

Node 3

Not available

Node

Available

• Out of memory can happenv Since each node has limited amount of memory

In-memory storage

Node 4

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data

Node 1 cannot use memory of Node 2, 3, & 4

The original Hadoop workflow

8SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

Data sending on the same node

Inter-node communication

Thanh-Chung Dao

ShuffleServerMappers Reducers

Map Output

Local Disk

RamDisk deployment on Hadoop workflow


• Mappers are modified to send their output directly to shuffle server

• In-memory storage is set up at shuffle server

Thanh-Chung Dao

Mappers ReducersMap Output

Local Disk

Shuffle server

In-memory storage


Fetch

Every-node: deployed on every node and data can be stored in remote memory


Full

Node

Available

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data



Full

Node

Available

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data



In-memory storage

Node 2

In-memory storage

Node 3

• More complex in-memory management is required

In-memory storage

Node 4

Thanh-Chung Dao

When storage on Node 1 is full, it can use memory of Node 2, 3, & 4

Full

Node

Available

In-memory storage

Node 1

Tasks

data


Every-node deployment on Hadoop workflow13SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

MappersReducers

Shuffle server

…

In-memory storage


Shuffle server

Daemon

Node

Thanh-Chung Dao


MappersReducers

Shuffle server

SEMem daemon

…

In-memory storage


Shuffle server

SEMem daemonDaemon

Node

Thanh-Chung Dao


MappersReducers

Map Output

Shuffle server

SEMem daemon

…

In-memory storage


Shuffle server

SEMem daemonDaemon

Node

Thanh-Chung Dao


MappersReducers

Map Output

Shuffle server

SEMem daemon

…

In-memory storage


Shuffle server

SEMem daemonDaemon

Node

Thanh-Chung Dao

Get metadata


MappersReducers

Map Output

Shuffle server

SEMem daemon

…

In-memory storage


Shuffle server

SEMem daemonDaemon

Node

Thanh-Chung Dao

Get metadata

Fetch

Dedicated-node: deployed only on dedicated nodes that are used only for storage


In-memory storage

Node 1

Tasks

data

Full

Node

Available



In-memory storage

Node 1

Tasks

data

Full

Node

Available



Not available

Node

Available

• Dedicated nodes• There is no computation task

In-memory storage

Dedicated node

In-memory storage

Dedicated node

In-memory storage

Node 2

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data



Not available

Node

Available

In-memory storage

Dedicated node

In-memory storage

Dedicated node

In-memory storage

Node 2

Thanh-Chung Dao

In-memory storage

Node 1

Tasks

data

• Dedicated nodes• There is no computation task

• It might slow since only 2 of 4 nodes compute the task

When storage on Node 1 is full, it can use memory of dedicate nodes

Dedicated-node deployment on Hadoop workflow22SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

MappersReducers

Shuffle server

In-memory storage

Inter-node communicationDaemon

Node

Thanh-Chung Dao


MappersReducers

Shuffle server

SEMem daemon

SEMem daemon

Memory nodeManager

…In-memory storage


Node

Thanh-Chung Dao


MappersReducers

Shuffle server

SEMem daemon

SEMem daemon

Memory nodeManager



Node

Thanh-Chung Dao


MappersReducers

Shuffle server

SEMem daemon

SEMem daemon

Memory nodeManager



Node

Thanh-Chung Dao

Map Output


MappersReducers

Shuffle server

SEMem daemon

SEMem daemon

Memory nodeManager



Node

Thanh-Chung Dao

Map Output

Get metadata


MappersReducers

Shuffle server

SEMem daemon

SEMem daemon

Memory nodeManager



Node

Thanh-Chung Dao

Map Output

Get metadata

Fetch

Technical issue 1: communication protocol on SEMem

• MPI communication on SEMem• Fast communication protocol is required

• Since every-node and dedicated-node are network-intensive• MPI is the de facto communication on modern supercomputers

• HPC-Reuse is used• Enable MPI over Hadoop processes

• MPI-friendly dynamic process creation is required

• Multiplexing non-blocking MPI on memory nodes• Since we want to avoid MPI_THREAD_MULTIPLE

• Handling multiple requests from clients

• Direct memory is used• Since memory copying between JVM’s heap and native MPI is slow

• Current MPI implementation is written in C


MPI over Hadoop processes [Dao, CCGrid 2016]

• Using our HPC-Reuse• MPI-friendly dynamic process creation

• Hadoop requires dynamic process creation• Minimizing the cost of changes in architecture

• Gang scheduling (of processes) more favorable in MPI• All-or-nothing scheduling strategy

• Statically creating all processes at the beginning• Minimizing communication delay• Since resizing running jobs might affect performance and fairness

• MPI-Spawn is slow due to collective operation

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao 29

Avoiding MPI_THREAD_MULTIPLE


Chapter 4. E�ciently Virtualizing Local Disks 82

while true doif req == null then

req = MPI.iRecvendif there is a new request then

Add req to sendingPool’s waitList;Reset req = null

endfor slot in sendingPool’s slots do

if data reading finishes thenMPI.iSend to the client

endif iSend finishes then

free the slotend

endAssign req in waitList to free slots;

endAlgorithm 1: Multiplexing non-blocking MPI on the memory nodes

To multiplex non-blocking communication, SEMem runs a dedicated thread on every

node to handle it. Note that most of the supercomputers we know, for example TSUB-

AME and Fujitsu FX10, do not support MPI THREAD MULTIPLE mode. This ap-

proach helps avoid calling MPI send/recv in di↵erent threads.

On each memory node, we run a waiting daemon to listen incoming requests. The

algorithm implemented in the daemon is shown in Algorithm 1. We use progress strategy

to check iRecv and iSend status and data is queued as well. A loop is called to check if

there is any coming message or not (req variable). When there is a new request, it is

added to a queue (waitList). This waiting list contains requests that have been processed.

A sending pool, namely sendingPool, is created to process queued requests. A slot of

the sending pool is responsible for reading data and calling iSend. When data is read to

the sending bu↵er, iSend is called to send data to the client. Data reading must be also

implemented to use an asynchronous mechanism, for example AsynchronousFileChannel

while any MapOutput doWait for (host, MapOutputs) ;for each MapOutput do

MPI.Send to the host ;if MPI.Recv from the host then

Data in heap ;end

end

endAlgorithm 2: Receiving data at reducers of Hadoop MapReduce

Chapter 4. E�ciently Virtualizing Local Disks 82

while true doif req == null then

req = MPI.iRecvendif there is a new request then

Add req to sendingPool’s waitList;Reset req = null

endfor slot in sendingPool’s slots do

if data reading finishes thenMPI.iSend to the client

endif iSend finishes then

free the slotend

endAssign req in waitList to free slots;

endAlgorithm 1: Multiplexing non-blocking MPI on the memory nodes

To multiplex non-blocking communication, SEMem runs a dedicated thread on every

node to handle it. Note that most of the supercomputers we know, for example TSUB-

AME and Fujitsu FX10, do not support MPI THREAD MULTIPLE mode. This ap-

proach helps avoid calling MPI send/recv in di↵erent threads.

On each memory node, we run a waiting daemon to listen incoming requests. The

algorithm implemented in the daemon is shown in Algorithm 1. We use progress strategy

to check iRecv and iSend status and data is queued as well. A loop is called to check if

there is any coming message or not (req variable). When there is a new request, it is

added to a queue (waitList). This waiting list contains requests that have been processed.

A sending pool, namely sendingPool, is created to process queued requests. A slot of

the sending pool is responsible for reading data and calling iSend. When data is read to

the sending bu↵er, iSend is called to send data to the client. Data reading must be also

implemented to use an asynchronous mechanism, for example AsynchronousFileChannel

while any MapOutput doWait for (host, MapOutputs) ;for each MapOutput do

MPI.Send to the host ;if MPI.Recv from the host then

Data in heap ;end

end

endAlgorithm 2: Receiving data at reducers of Hadoop MapReduce

At SEMem daemon

• Multiplexing non-blocking MPI

At clients

Thanh-Chung Dao

Technical issue 2: storage size in dedicated-node SEMem

• How to estimate number of memory nodes• Need to minimize number of memory nodes

• Since we trade computation resource for data storage in dedicated-node

• Our approach: number of memory nodes is estimated roughly based on the size of input data


Experiment configuration• Benchmarks

• Puma: WordCount, InvertedIndex, and SequenceCount• Tera-sort: up to 1 TB of input data

• Supercomputers• K computer-like FX10 at the University of Tokyo• TSUBAME at Tokyo Institute of Technology

• Hadoop v2.2.0



Dedicated-node is faster than every-node in some benchmarks

0200

400

600

8001000

Data size

Tim

e (s

econ

d)

10G 20G 40G 80G 120G

Dedicated-nodeRamDiskEvery-nodeCentral-disk

0200

400

600

8001000

Data size

Tim

e (s

econ

d)

10G 20G 40G 80G 120G


0200

400

600

8001000

Data size

Tim

e (s

econ

d)

10G 20G 40G 80G 120G


WordcountInvertedIndex SequenceCount

Dedicated-node 13% faster than every-node

Out of memory

Thanh-Chung Dao

Configuration #computation nodesRamDisk 36

Every-node 36Central-disk 36

Dedicated-node 32

• In every-node deployment, the more complex SEMem daemon disturbs computation tasks at several places

• 4 memory nodes in dedicated-node

200

400

600

800

1000

1200

Tota

l Exe

cutio

n Ti

me

(sec

ond)

64G 128G 256G 512G 1TB

SEMemCentral diskSSD

#40

#40#48

#48

#64

Data size


Dedicated-node SEMem is faster than central disk and SSD

SEMem reduces execution time by 20% and 5% on average

• Total number of nodes is the same

• SEMem has lesscomputation nodes

• Tera-sort application on TSUBAME

Thanh-Chung Dao


Dedicated-node SEMem and SSD-backed storage can be an alternative for each other

200

400

600

800

1000

1200

Tota

l Exe

cutio

n Ti

me

(sec

ond)

64G 128G 256G 512G 1TB

SEMem (32 nodes + memory nodes)SSD (32 nodes + SSD storage)

#40 #40#48

#48

#64

Data size

SEMem is faster 4~13%

• Both have the samenumber of computation nodes

• SEMem has memory nodes

• Tera-sort application on TSUBAME

• SSD size of each node is 120 GB

Thanh-Chung Dao


100

200

300

400

500

600

Data size

Tota

l Exe

cutio

n Ti

me

(sec

ond)

64G 128G 256G 512G

MPI-based in memory HadoopTCP-based in-memory Hadoop

Up to 10% fasterthan TCP communication

MPI-based Hadoop is faster than TCP• Tera-sort on TSUBAME

(64 nodes)

• Both MPI and TCP test cases using in-memory storage

Thanh-Chung Dao

• Experiment purpose:• Measure performance impact of storage size (left figure)• When out of memory happens (right figure)

• Number of memory nodes is estimated based on size of input data• 64GB of input data• 8GB each memory node


Number of memory nodes should not be large1000

1500

2000

2500

3000

3500

Tim

e (M

illis

econ

ds)

4 6 8 10 12 14 16

Data copying timeNumber of memory nodes 050

100

150

200

Number of computation/storage nodes

Exe

cutio

n tim

e (S

econ

ds)

32/16 34/14 36/12 38/10 40/8 42/6 44/4 46/2 48/0

Tera-sort 64GB input data

Out of memory happens

8 memory nodes show the best result

Tera-sort on TSUBAME

Thanh-Chung Dao

Related work• M3R (VLDB 2012)

• In-memory storage by providing a shared heap-state• Data is stored through places and activities operators • Did not mention storage deployment explicitly and also no evaluation

• HaLoop (VLDB 2010)• Caching preferences by providing efficient hash algorithms for reading and

writing• Deployment strategies are not relevant in this context

• Spark (NSDI 2012)• Use in-memory storage• Choose a location for Resilient Distributed Datasets (RDD) through

preferredLocations() operator• Does not provide deployment strategies in general• There was no evaluation of RDD deployment



Limitations• No fault-tolerance in MPI

• Multiple levels of storage

• Preferred locations

Future work• Estimating number of memory nodes

• Topology of memory nodes


Summary• Goal: Using in-memory storage to provide efficient virtual local disks

• Challenge: choose the best deployment strategy of in-memory storage (or virtual local disks)

• Our approach: SEMem• Dedicated-node strategy showed a good performance in some

benchmarks • Easily configure the system to investigate an appropriate

strategy for applications• MPI communication


SEMem: deployment of MPI-based in-memory storage for ... · Running Hadoop on modern supercomputers...

Documents

Transcript of SEMem: deployment of MPI-based in-memory storage for ... · Running Hadoop on modern supercomputers...