The Value of the Modern Data Architecture with Apache Hadoop and Teradata
SEMem: deployment of MPI-based in-memory storage for ... · Running Hadoop on modern supercomputers...
Transcript of SEMem: deployment of MPI-based in-memory storage for ... · Running Hadoop on modern supercomputers...
Thanh-Chung Dao and Shigeru ChibaThe University of Tokyo, Japan
1Thanh-Chung Dao
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Running Hadoop on modern supercomputers
• Hadoop assumes every compute node has a local disk drive
• Modern supercomputers do not have local disk drives• It only has a central file server using e.g. Lustre• For example, K computer, Cray Titan, and IBM Sequoia
Thanh-Chung Dao 2SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
From Fujitsu
Why supercomputers do not have local disk drives
• Local disk• Not scalable• Hard to maintain• Physical space is limited
• It cannot be shared among all users
• SSD is available on some supercomputers• But data should be erased after a job finishes
3SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao
Using in-memory storage to provide efficient virtual local disks
• Research question: How to deploy in-memory storage on supercomputers• Choose the best deployment strategy in context of MapReduce
• Using in-memory storage is natural approach to avoid expensive disk I/O to central file server• Data is kept in memory
• Memcached-like separate in-memory server is also an option• Typical deployment of Memcached software is installing its daemon on
dedicated nodes
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 4Thanh-Chung Dao
Our approach: SEMemin-memory file system
• Users can choose three deployment strategies• RamDisk: data is stored only in local memory• Every-node: data can be stored in remote memory• Dedicated-node: data is stored on dedicated nodes
• Our in-memory storage, SEMem:• Easily configurable to select appropriate deployment strategy• Tightly integrated with Hadoop• Using MPI communication [Dao & Chiba, CCGRID’16]
5SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao
RamDisk: data is stored only in local memory
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 6
Not available
Node
Available
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
RamDisk: data is stored only in local memory
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 7
In-memory storage
Node 2
In-memory storage
Node 3
Not available
Node
Available
• Out of memory can happenv Since each node has limited amount of memory
In-memory storage
Node 4
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Node 1 cannot use memory of Node 2, 3, & 4
The original Hadoop workflow
8SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Data sending on the same node
Inter-node communication
Thanh-Chung Dao
ShuffleServerMappers Reducers
Map Output
Local Disk
RamDisk deployment on Hadoop workflow
9SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
• Mappers are modified to send their output directly to shuffle server
• In-memory storage is set up at shuffle server
Thanh-Chung Dao
Mappers ReducersMap Output
Local Disk
Shuffle server
In-memory storage
Inter-node communication
Fetch
Every-node: deployed on every node and data can be stored in remote memory
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 10
Full
Node
Available
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Every-node: deployed on every node and data can be stored in remote memory
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 11
Full
Node
Available
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Every-node: deployed on every node and data can be stored in remote memory
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 12
In-memory storage
Node 2
In-memory storage
Node 3
• More complex in-memory management is required
In-memory storage
Node 4
Thanh-Chung Dao
When storage on Node 1 is full, it can use memory of Node 2, 3, & 4
Full
Node
Available
In-memory storage
Node 1
Tasks
data
Inter-node communication
Every-node deployment on Hadoop workflow13SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
…
In-memory storage
Inter-node communication
Shuffle server
Daemon
Node
Thanh-Chung Dao
Every-node deployment on Hadoop workflow14SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
…
In-memory storage
Inter-node communication
Shuffle server
SEMem daemonDaemon
Node
Thanh-Chung Dao
Every-node deployment on Hadoop workflow15SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Map Output
Shuffle server
SEMem daemon
…
In-memory storage
Inter-node communication
Shuffle server
SEMem daemonDaemon
Node
Thanh-Chung Dao
Every-node deployment on Hadoop workflow16SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Map Output
Shuffle server
SEMem daemon
…
In-memory storage
Inter-node communication
Shuffle server
SEMem daemonDaemon
Node
Thanh-Chung Dao
Get metadata
Every-node deployment on Hadoop workflow17SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Map Output
Shuffle server
SEMem daemon
…
In-memory storage
Inter-node communication
Shuffle server
SEMem daemonDaemon
Node
Thanh-Chung Dao
Get metadata
Fetch
Dedicated-node: deployed only on dedicated nodes that are used only for storage
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 18Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Full
Node
Available
Dedicated-node: deployed only on dedicated nodes that are used only for storage
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 19Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Full
Node
Available
Dedicated-node: deployed only on dedicated nodes that are used only for storage
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 20
Not available
Node
Available
• Dedicated nodes• There is no computation task
In-memory storage
Dedicated node
In-memory storage
Dedicated node
In-memory storage
Node 2
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
Dedicated-node: deployed only on dedicated nodes that are used only for storage
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 21
Not available
Node
Available
In-memory storage
Dedicated node
In-memory storage
Dedicated node
In-memory storage
Node 2
Thanh-Chung Dao
In-memory storage
Node 1
Tasks
data
• Dedicated nodes• There is no computation task
• It might slow since only 2 of 4 nodes compute the task
When storage on Node 1 is full, it can use memory of dedicate nodes
Dedicated-node deployment on Hadoop workflow22SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Dedicated-node deployment on Hadoop workflow23SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
SEMem daemon
Memory nodeManager
…In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Dedicated-node deployment on Hadoop workflow24SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
SEMem daemon
Memory nodeManager
…In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Dedicated-node deployment on Hadoop workflow25SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
SEMem daemon
Memory nodeManager
…In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Map Output
Dedicated-node deployment on Hadoop workflow26SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
SEMem daemon
Memory nodeManager
…In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Map Output
Get metadata
Dedicated-node deployment on Hadoop workflow27SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
MappersReducers
Shuffle server
SEMem daemon
SEMem daemon
Memory nodeManager
…In-memory storage
Inter-node communicationDaemon
Node
Thanh-Chung Dao
Map Output
Get metadata
Fetch
Technical issue 1: communication protocol on SEMem
• MPI communication on SEMem• Fast communication protocol is required
• Since every-node and dedicated-node are network-intensive• MPI is the de facto communication on modern supercomputers
• HPC-Reuse is used• Enable MPI over Hadoop processes
• MPI-friendly dynamic process creation is required
• Multiplexing non-blocking MPI on memory nodes• Since we want to avoid MPI_THREAD_MULTIPLE
• Handling multiple requests from clients
• Direct memory is used• Since memory copying between JVM’s heap and native MPI is slow
• Current MPI implementation is written in C
28SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao
MPI over Hadoop processes [Dao, CCGrid 2016]
• Using our HPC-Reuse• MPI-friendly dynamic process creation
• Hadoop requires dynamic process creation• Minimizing the cost of changes in architecture
• Gang scheduling (of processes) more favorable in MPI• All-or-nothing scheduling strategy
• Statically creating all processes at the beginning• Minimizing communication delay• Since resizing running jobs might affect performance and fairness
• MPI-Spawn is slow due to collective operation
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao 29
Avoiding MPI_THREAD_MULTIPLE
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 30
Chapter 4. E�ciently Virtualizing Local Disks 82
while true doif req == null then
req = MPI.iRecvendif there is a new request then
Add req to sendingPool’s waitList;Reset req = null
endfor slot in sendingPool’s slots do
if data reading finishes thenMPI.iSend to the client
endif iSend finishes then
free the slotend
endAssign req in waitList to free slots;
endAlgorithm 1: Multiplexing non-blocking MPI on the memory nodes
To multiplex non-blocking communication, SEMem runs a dedicated thread on every
node to handle it. Note that most of the supercomputers we know, for example TSUB-
AME and Fujitsu FX10, do not support MPI THREAD MULTIPLE mode. This ap-
proach helps avoid calling MPI send/recv in di↵erent threads.
On each memory node, we run a waiting daemon to listen incoming requests. The
algorithm implemented in the daemon is shown in Algorithm 1. We use progress strategy
to check iRecv and iSend status and data is queued as well. A loop is called to check if
there is any coming message or not (req variable). When there is a new request, it is
added to a queue (waitList). This waiting list contains requests that have been processed.
A sending pool, namely sendingPool, is created to process queued requests. A slot of
the sending pool is responsible for reading data and calling iSend. When data is read to
the sending bu↵er, iSend is called to send data to the client. Data reading must be also
implemented to use an asynchronous mechanism, for example AsynchronousFileChannel
while any MapOutput doWait for (host, MapOutputs) ;for each MapOutput do
MPI.Send to the host ;if MPI.Recv from the host then
Data in heap ;end
end
endAlgorithm 2: Receiving data at reducers of Hadoop MapReduce
Chapter 4. E�ciently Virtualizing Local Disks 82
while true doif req == null then
req = MPI.iRecvendif there is a new request then
Add req to sendingPool’s waitList;Reset req = null
endfor slot in sendingPool’s slots do
if data reading finishes thenMPI.iSend to the client
endif iSend finishes then
free the slotend
endAssign req in waitList to free slots;
endAlgorithm 1: Multiplexing non-blocking MPI on the memory nodes
To multiplex non-blocking communication, SEMem runs a dedicated thread on every
node to handle it. Note that most of the supercomputers we know, for example TSUB-
AME and Fujitsu FX10, do not support MPI THREAD MULTIPLE mode. This ap-
proach helps avoid calling MPI send/recv in di↵erent threads.
On each memory node, we run a waiting daemon to listen incoming requests. The
algorithm implemented in the daemon is shown in Algorithm 1. We use progress strategy
to check iRecv and iSend status and data is queued as well. A loop is called to check if
there is any coming message or not (req variable). When there is a new request, it is
added to a queue (waitList). This waiting list contains requests that have been processed.
A sending pool, namely sendingPool, is created to process queued requests. A slot of
the sending pool is responsible for reading data and calling iSend. When data is read to
the sending bu↵er, iSend is called to send data to the client. Data reading must be also
implemented to use an asynchronous mechanism, for example AsynchronousFileChannel
while any MapOutput doWait for (host, MapOutputs) ;for each MapOutput do
MPI.Send to the host ;if MPI.Recv from the host then
Data in heap ;end
end
endAlgorithm 2: Receiving data at reducers of Hadoop MapReduce
At SEMem daemon
• Multiplexing non-blocking MPI
At clients
Thanh-Chung Dao
Technical issue 2: storage size in dedicated-node SEMem
• How to estimate number of memory nodes• Need to minimize number of memory nodes
• Since we trade computation resource for data storage in dedicated-node
• Our approach: number of memory nodes is estimated roughly based on the size of input data
31SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao
Experiment configuration• Benchmarks
• Puma: WordCount, InvertedIndex, and SequenceCount• Tera-sort: up to 1 TB of input data
• Supercomputers• K computer-like FX10 at the University of Tokyo• TSUBAME at Tokyo Institute of Technology
• Hadoop v2.2.0
32SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao
33SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Dedicated-node is faster than every-node in some benchmarks
0200
400
600
8001000
Data size
Tim
e (s
econ
d)
10G 20G 40G 80G 120G
Dedicated-nodeRamDiskEvery-nodeCentral-disk
0200
400
600
8001000
Data size
Tim
e (s
econ
d)
10G 20G 40G 80G 120G
Dedicated-nodeRamDiskEvery-nodeCentral-disk
0200
400
600
8001000
Data size
Tim
e (s
econ
d)
10G 20G 40G 80G 120G
Dedicated-nodeRamDiskEvery-nodeCentral-disk
WordcountInvertedIndex SequenceCount
Dedicated-node 13% faster than every-node
Out of memory
Thanh-Chung Dao
Configuration #computation nodesRamDisk 36
Every-node 36Central-disk 36
Dedicated-node 32
• In every-node deployment, the more complex SEMem daemon disturbs computation tasks at several places
• 4 memory nodes in dedicated-node
200
400
600
800
1000
1200
Tota
l Exe
cutio
n Ti
me
(sec
ond)
64G 128G 256G 512G 1TB
SEMemCentral diskSSD
#40
#40#48
#48
#64
Data size
34SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Dedicated-node SEMem is faster than central disk and SSD
SEMem reduces execution time by 20% and 5% on average
• Total number of nodes is the same
• SEMem has lesscomputation nodes
• Tera-sort application on TSUBAME
Thanh-Chung Dao
35SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Dedicated-node SEMem and SSD-backed storage can be an alternative for each other
200
400
600
800
1000
1200
Tota
l Exe
cutio
n Ti
me
(sec
ond)
64G 128G 256G 512G 1TB
SEMem (32 nodes + memory nodes)SSD (32 nodes + SSD storage)
#40 #40#48
#48
#64
Data size
SEMem is faster 4~13%
• Both have the samenumber of computation nodes
• SEMem has memory nodes
• Tera-sort application on TSUBAME
• SSD size of each node is 120 GB
Thanh-Chung Dao
36SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
100
200
300
400
500
600
Data size
Tota
l Exe
cutio
n Ti
me
(sec
ond)
64G 128G 256G 512G
MPI-based in memory HadoopTCP-based in-memory Hadoop
Up to 10% fasterthan TCP communication
MPI-based Hadoop is faster than TCP• Tera-sort on TSUBAME
(64 nodes)
• Both MPI and TCP test cases using in-memory storage
Thanh-Chung Dao
• Experiment purpose:• Measure performance impact of storage size (left figure)• When out of memory happens (right figure)
• Number of memory nodes is estimated based on size of input data• 64GB of input data• 8GB each memory node
37SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
Number of memory nodes should not be large1000
1500
2000
2500
3000
3500
Tim
e (M
illis
econ
ds)
4 6 8 10 12 14 16
Data copying timeNumber of memory nodes 050
100
150
200
Number of computation/storage nodes
Exe
cutio
n tim
e (S
econ
ds)
32/16 34/14 36/12 38/10 40/8 42/6 44/4 46/2 48/0
Tera-sort 64GB input data
Out of memory happens
8 memory nodes show the best result
Tera-sort on TSUBAME
Thanh-Chung Dao
Related work• M3R (VLDB 2012)
• In-memory storage by providing a shared heap-state• Data is stored through places and activities operators • Did not mention storage deployment explicitly and also no evaluation
• HaLoop (VLDB 2010)• Caching preferences by providing efficient hash algorithms for reading and
writing• Deployment strategies are not relevant in this context
• Spark (NSDI 2012)• Use in-memory storage• Choose a location for Resilient Distributed Datasets (RDD) through
preferredLocations() operator• Does not provide deployment strategies in general• There was no evaluation of RDD deployment
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 38Thanh-Chung Dao
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 39Thanh-Chung Dao
Limitations• No fault-tolerance in MPI
• Multiple levels of storage
• Preferred locations
Future work• Estimating number of memory nodes
• Topology of memory nodes
SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers 40Thanh-Chung Dao
Summary• Goal: Using in-memory storage to provide efficient virtual local disks
• Challenge: choose the best deployment strategy of in-memory storage (or virtual local disks)
• Our approach: SEMem• Dedicated-node strategy showed a good performance in some
benchmarks • Easily configure the system to investigate an appropriate
strategy for applications• MPI communication
41SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers Thanh-Chung Dao