The Future of Supercomputer Software Libraries
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the HPC Advisory Council Israel Supercomputing Conference
High-End Computing (HEC): PetaFlop to ExaFlop
Expected to have an ExaFlop system in 2019!
[Roadmap: 20-30 PFlops in 2013, 100 PFlops in 2016]
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
Nov. 1996: 0/500 (0%) Jun. 2002: 80/500 (16%) Nov. 2007: 406/500 (81.2%)
Jun. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Jun. 2008: 400/500 (80.0%)
Nov. 1997: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Nov. 2008: 410/500 (82.0%)
Jun. 1998: 1/500 (0.2%) Nov. 2003: 208/500 (41.6%) Jun. 2009: 410/500 (82.0%)
Nov. 1998: 2/500 (0.4%) Jun. 2004: 291/500 (58.2%) Nov. 2009: 417/500 (83.4%)
Jun. 1999: 6/500 (1.2%) Nov. 2004: 294/500 (58.8%) Jun. 2010: 424/500 (84.8%)
Nov. 1999: 7/500 (1.4%) Jun. 2005: 304/500 (60.8%) Nov. 2010: 415/500 (83%)
Jun. 2000: 11/500 (2.2%) Nov. 2005: 360/500 (72.0%) Jun. 2011: 411/500 (82.2%)
Nov. 2000: 28/500 (5.6%) Jun. 2006: 364/500 (72.8%) Nov. 2011: 410/500 (82.0%)
Jun. 2001: 33/500 (6.6%) Nov. 2006: 361/500 (72.2%)
Nov. 2001: 43/500 (8.6%) Jun. 2007: 373/500 (74.6%)
• 209 IB Clusters (41.8%) in the November‘11 Top500 list
(http://www.top500.org)
• Installations in the Top 30 (13 systems):
Large-scale InfiniBand Installations
120,640 cores (Nebulae) in China (4th)
73,278 cores (Tsubame 2.0) in Japan (5th)
111,104 cores (Pleiades) at NASA Ames (7th)
138,368 cores (Tera-100) in France (9th)
122,400 cores (RoadRunner) at LANL (10th)
137,200 cores (Sunway Blue Light) in China (14th)
46,208 cores (Zin) at LLNL (15th)
33,072 cores (Lomonosov) in Russia (18th)
29,440 cores (Mole-8.5) in China (21st)
42,440 cores (Red Sky) at Sandia (24th)
62,976 cores (Ranger) at TACC (25th)
20,480 cores (Bull Benchmarks) in France (27th)
20,480 cores (Helios) in Japan (28th)
More are getting installed!
• Scientific Computing
– Message Passing Interface (MPI) is the Dominant Programming
Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of
momentum
– Memcached is also used
Two Major Categories of Applications
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: layered software stack, from applications/libraries and programming models (MPI, Sockets, PGAS such as UPC and Global Arrays, Hadoop and MapReduce) through the library or runtime for programming models (point-to-point communication, collective communication, synchronization and locks, I/O and file systems, fault tolerance, QoS) down to commodity computing system architectures (single/dual/quad sockets, multi-/many-core and accelerators) and networking technologies (InfiniBand, 1/10/40 GigE, RNICs and intelligent NICs)]
• Scalability for millions to billions of processors
– Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
– Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, …); a sketch follows this slide
• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable collective communication
– Offload
– Non-blocking
– Topology-aware
– Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
Designing (MPI+X) at Exascale
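As a minimal illustration of the hybrid (MPI + OpenMP) item above, the sketch below is generic example code (not MVAPICH2-specific): each MPI process spawns OpenMP threads, which is the usual starting point for MPI+X programs.

/* hybrid_hello.c: minimal MPI + OpenMP sketch (compile with mpicc and the compiler's OpenMP flag) */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request thread support so OpenMP threads can coexist with MPI calls
       made by the master thread */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}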
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), Available since 2002
– Used by more than 1,840 organizations (HPC Centers, Industry and Universities) in
65 countries
– More than 95,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
• 7th ranked 111,104-core cluster (Pleiades) at NASA
• 25th ranked 62,976-core cluster (Ranger) at TACC
• and many others
– Available with software stacks of many InfiniBand, High-speed Ethernet and server
vendors including Open Fabrics Enterprise Distribution (OFED) and Linux Distros
(RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) System
MVAPICH/MVAPICH2 Software
• High Performance and Scalable Inter-node Communication with Hybrid UD-
RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective Communication
– Multi-core Aware, Topology-aware and Power-aware
– Exploiting Collective Offload
• Support for GPGPUs and Accelerators
• PGAS Support (Hybrid MPI + UPC)
Approaches being used in MVAPICH2 for Exascale
One-way Latency: MPI over IB
[Figure: small-message and large-message latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR; small-message latencies range from 1.56 to 1.82 us]
2.4 GHz Quad-core (Westmere) Intel with IB switch
Bandwidth: MPI over IB
[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same four stacks; peak unidirectional bandwidths of 1706, 1917, 3280 and 3385 MBytes/sec and peak bidirectional bandwidths of 3341, 3704, 4407 and 6521 MBytes/sec]
2.4 GHz Quad-core (Westmere) Intel with IB switch
IB Transport Services

Service Type                  Connection-Oriented   Acknowledged   Transport
Reliable Connection (RC)      Yes                   Yes            IBA
Unreliable Connection (UC)    Yes                   No             IBA
Reliable Datagram (RD)        No                    Yes            IBA
Unreliable Datagram (UD)      No                    No             IBA
Raw Datagram                  No                    No             Raw
• eXtended Reliable Connection (XRC) has been added later for
multi-core platforms
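As a rough sketch of how a runtime picks among these transports, the example below creates a queue pair with the standard verbs API and selects IBV_QPT_RC or IBV_QPT_UD at creation time. The protection domain pd and completion queue cq are assumed to exist already; this is illustrative code, not MVAPICH2's internals.

/* Sketch: choosing the IB transport when creating a queue pair (verbs API) */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq, int use_ud)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.cap.max_send_wr  = 128;   /* outstanding send work requests */
    attr.cap.max_recv_wr  = 128;   /* outstanding receive work requests */
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    /* RC gives reliable, connected service (one QP per peer);
       UD is unreliable datagram service (one QP can reach many peers) */
    attr.qp_type = use_ud ? IBV_QPT_UD : IBV_QPT_RC;

    return ibv_create_qp(pd, &attr);
}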
UD vs. RC: Performance and Scalability (SMG2000 Application)
[Figure: normalized SMG2000 execution time vs. number of processes (128-4096) for RC and UD]

Memory usage (MB/process):
              RC (MVAPICH 0.9.8)                    UD Design
Processes  Conn.  Buffers  Struct.  Total      Buffers  Struct.  Total
512        22.9   65.0     0.3      88.2       37.0     0.2      37.2
1024       29.5   65.0     0.6      95.1       37.0     0.4      37.4
2048       42.4   65.0     1.2      107.4      37.0     0.9      37.9
4096       66.7   65.0     2.4      134.1      37.0     1.7      38.7
M. Koop, S. Sur, Q. Gao and D. K. Panda, “High Performance MPI Design using Unreliable Datagram for Ultra-
Scale InfiniBand Clusters,” ICS ‘07
• Both UD and RC/XRC have benefits
• User transparent, automatically adjusts between UD-RC/XRC
• Runtime options for advanced users to extract highest performance and
scalability
• Available in MVAPICH 1.1 and 1.2 for some time (as a separate interface)
• Available since MVAPICH2 1.7 in an integrated manner with Gen2 interface
Hybrid Transport Design (UD-RC/XRC)
M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over
InfiniBand,” IPDPS ‘08
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support)
Latest MVAPICH2 1.8a2 on Intel Westmere:
- Intra-socket latency: 0.19 microseconds for 4 bytes
- Inter-socket latency: 0.45 microseconds for 4 bytes
- Peak bandwidth of about 10,000 MB/s and peak bi-directional bandwidth of about 18,500 MB/s
[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size for intra-socket and inter-socket communication with the LiMIC and shared-memory designs]
• High Performance and Scalable Inter-node Communication with Hybrid UD-
RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective Communication
– Multi-core Aware, Topology-aware and Power-aware
– Exploiting Collective Offload
• Support for GPGPUs and Accelerators
• PGAS Support (Hybrid MPI + UPC)
MPI Design Challenges
Shared-memory Aware Collectives (4K cores on TACC Ranger with MVAPICH2)
[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) at 4,096 cores, comparing the original and shared-memory-aware designs]
Non-contiguous Allocation of Jobs
• Supercomputing systems organized
as racks of nodes interconnected
using complex network
architectures
• Job schedulers used to allocate
compute nodes to various jobs
[Figure: spine and line-card switches above racks of nodes, showing busy cores, idle cores and a new job]
Non-contiguous Allocation of Jobs
• Supercomputing systems organized
as racks of nodes interconnected
using complex network
architectures
• Job schedulers used to allocate
compute nodes to various jobs
• Individual processes belonging to
one job can get scattered
• Primary responsibility of scheduler is
to keep system throughput high
[Figure: spine and line-card switches above racks of nodes; the processes of the new job end up scattered across racks]
Topology-Aware Collectives
Default (Binomial) vs. Topology-Aware Algorithms with 296 Processes
K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather," CAC '10
[Figures: Scatter and Gather latency (usec) vs. message size (2K to 512K bytes) for default and topology-aware algorithms, and estimated small-message latency of default and topology-aware algorithms as the system grows from 2 to 32 racks, with improvements of 22% and 54% highlighted]
Impact of Network-Topology Aware Algorithms on Broadcast Performance
[Figures: broadcast latency (ms) vs. message size (128K to 1M bytes) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT designs, and normalized latency vs. job size (128 to 1K processes)]
• Impact of topology-aware schemes becomes more pronounced as message size and system size increase
• Up to 14% improvement in performance at scale
H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11
Power-Aware Collectives
[Figures: MPI_Alltoall latency (usec) vs. message size for the Default, DVFS and Proposed designs; power (KW) over time for MPI_Alltoall with 64 processes on 8 nodes; and estimated energy consumption (KJ) during a 128K-byte MPI_Alltoall at system sizes of 8K to 256K processes, with energy savings of 30% and 32% highlighted]
K. Kandalla, E. Mancini, S. Sur and D. K. Panda, "Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters," ICPP '10
Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send)
• Sender creates a task-list consisting of only
send and wait WQEs
– One send WQE is created for each registered
receiver and is appended to the rear of a
singly linked task-list
– A wait WQE is added to make the ConnectX-2
HCA wait for ACK packet from the receiver
[Figure: application posting to the InfiniBand HCA through send/receive queues, send/receive completion queues, a management queue (MQ) and an MCQ; the task list is Send, Wait, Send, Send, Send, Wait]
P3DFFT Application Performance with Non-Blocking Alltoall based on CX-2 Collective Offload
[Figure: P3DFFT application run-time (s) vs. data size (512-800) with 64 and 128 processes, comparing the blocking, host-test and offload versions]
P3DFFT application run-time comparison: the overlap version with Offload-Alltoall performs up to 17% better than the default blocking version.
K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable
Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, Int'l
Supercomputing Conference (ISC), June 2011.
Non-Blocking Broadcast with Collective
Offload and Impact on HPL Performance
[Figure: normalized HPL performance vs. problem size (N) as a percentage of total memory (10-70%) for HPL-Offload, HPL-1ring and HPL-Host with 512 processes; a 4.5% improvement is highlighted]
K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast
with Collective Offload on InfiniBand Clusters: A Case Study with HPL, Hot Interconnect '11, Aug. 2011.
Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload
[Figure: run-time (s) vs. number of processes (64-512) for PCG-Default and Modified-PCG-Offload]
Experimental setup: 64 nodes, each with 8-core Intel Xeon (2.53 GHz) processors, 12 MB L3 cache and 12 GB memory per node; MT26428 QDR ConnectX-2 HCAs with PCI-Express interfaces; 171-port Mellanox switch
64,000 unknowns per process. The modified PCG with Offload-Allreduce has a run-time about 21.8% lower than the default PCG.
K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, Accepted for publication at IPDPS ‘12.
• High Performance and Scalable Inter-node Communication with Hybrid UD-
RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective Communication
– Multi-core Aware, Topology-aware and Power-aware
– Exploiting Collective Offload
• Support for GPGPUs and Accelerators
• PGAS Support (Hybrid MPI + UPC)
MPI Design Challenges
Data movement in GPU+IB clusters
• Many applications today want to run on systems that have both GPUs and high-speed networks such as InfiniBand
• Steps in Data movement in InfiniBand clusters with GPUs
– From GPU device memory to main memory at source process, using CUDA
– From source to destination process, using MPI
– From main memory to GPU device memory at destination process, using CUDA
• Earlier, GPU device and InfiniBand device required separate memory registration
• GPU-Direct (collaboration between NVIDIA and Mellanox) supported common registration between these devices
• However, GPU-GPU communication is still costly and programming is harder
[Figure: at both sender and receiver, the GPU and CPU are connected over PCIe, and the CPU connects through the NIC to the switch]
At Sender:
cudaMemcpy(sbuf, sdev, . . .);
MPI_Send(sbuf, size, . . .);
At Receiver:
MPI_Recv(rbuf, size, . . .);
cudaMemcpy(rdev, rbuf, . . .);
Sample Code - Without MPI integration
• Naïve implementation with standard MPI and CUDA
• High Productivity and Poor Performance
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
for (j = 0; j < pipeline_len; j++) {
    result = cudaErrorNotReady;
    while (result != cudaSuccess) {
        result = cudaStreamQuery(. . .);
        if (j > 0) MPI_Test(. . .);
    }
    MPI_Isend(sbuf + j * blksz, blksz, . . .);
}
MPI_Waitall(. . .);
Sample Code – User Optimized Code
• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code at Sender side (and repeated at Receiver side)
• User-level copying may not match with internal MPI design
• High Performance and Poor Productivity
Can this be done within MPI Library?
• Support GPU to GPU communication through standard MPI
interfaces
– e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low level details
to the programmer
– Pipelined data transfer which automatically provides optimizations
inside MPI library without user tuning
• A new design has been incorporated in MVAPICH2 to support this functionality
At Sender:
MPI_Send(s_device, size, …);
At Receiver:
MPI_Recv(r_device, size, …);
inside MVAPICH2
Sample Code – MVAPICH2-GPU
• MVAPICH2-GPU: standard MPI interfaces used
• High Performance and High Productivity
Design considerations
• Memory detection
– CUDA 4.0 introduces Unified Virtual Addressing (UVA)
– MPI library can differentiate between device memory and
host memory without any hints from the user
• Overlap CUDA copy and RDMA transfer
– Data movement from GPU and RDMA transfer are DMA
operations
– Allow for asynchronous progress
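A minimal sketch of the memory-detection point above, assuming CUDA 4.x UVA: cudaPointerGetAttributes lets a library decide whether a user buffer lives in device or host memory without hints from the user. The helper below is illustrative, not MVAPICH2's internal code.

/* Sketch: deciding whether a buffer is in GPU or host memory under UVA (CUDA 4.x) */
#include <cuda_runtime.h>

static int buffer_is_on_device(void *buf)
{
    struct cudaPointerAttributes attr;

    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        /* Plain host memory that CUDA does not know about; clear the sticky error */
        cudaGetLastError();
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}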
MPI-Level Two-sided Communication
• 45% and 38% improvements compared to Memcpy+Send, with and without GPUDirect respectively, for 4MB messages
• 24% and 33% improvements compared to MemcpyAsync+Isend, with and without GPUDirect respectively, for 4MB messages
[Figure: MPI-level two-sided communication time (us) vs. message size (32K bytes to 4M bytes), with and without GPUDirect, comparing Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC ‘11
Other MPI Operations from GPU Buffers
• Similar approaches can be used for
– One-sided
– Collectives
– Communication with Datatypes
• Designs can also be extended for multi-GPUs per node
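For example, with a GPU-aware MPI library the application can pass device pointers straight to a collective. The sketch below is illustrative (buffer names and counts are assumptions, and the GPU-aware behavior depends on the MPI library in use):

/* Sketch: calling a standard MPI collective directly on GPU buffers */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, nprocs, count = 1024;
    float *d_send, *d_recv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Device buffers: one block of `count` floats per peer */
    cudaMalloc((void **)&d_send, (size_t)nprocs * count * sizeof(float));
    cudaMalloc((void **)&d_recv, (size_t)nprocs * count * sizeof(float));
    cudaMemset(d_send, 0, (size_t)nprocs * count * sizeof(float));

    /* A GPU-aware MPI library recognizes the device pointers (via UVA)
       and handles the GPU-host-network movement internally */
    MPI_Alltoall(d_send, count, MPI_FLOAT, d_recv, count, MPI_FLOAT, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}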
H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011.
A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011.
S. Potluri et al. Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems(ASHES) to be held in conjunction with IPDPS 2012
MVAPICH2 1.8a2 Release
• Supports point-to-point and collective communication
• Supports communication between GPU devices and
between GPU device and host
• Supports communication using contiguous and non-
contiguous MPI Datatypes
• Supports GPU-Direct through CUDA support (from 4.0)
• Takes advantage of CUDA IPC for intra-node (intra-I/OH)
communication (from CUDA 4.1)
• Provides flexibility in tuning performance of both RDMA
and shared-memory based designs based on predominant
message sizes in applications
OSU MPI Micro-Benchmarks (OMB) 3.5.1 Release
• OSU MPI Micro-Benchmarks provides a comprehensive suite of
benchmarks to compare performance of different MPI stacks
and networks
• Enhancements done for three benchmarks:
– Latency
– Bandwidth
– Bi-directional Bandwidth
• Flexibility for using buffers in NVIDIA GPU device (D) and host
memory (H)
• Flexibility for selecting data movement between D->D, D->H and
H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with MVAPICH2 stack
MVAPICH2 vs. OpenMPI (Device-Device, Inter-node)
[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte to 1M bytes) for MVAPICH2 and OpenMPI]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1
MVAPICH2 vs. OpenMPI (Device-Host, Inter-node)
[Figure: latency, bandwidth and bi-directional bandwidth vs. message size for MVAPICH2 and OpenMPI; Host-Device performance is similar]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1
MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU)
[Figure: latency, bandwidth and bi-directional bandwidth vs. message size for MVAPICH2 and OpenMPI]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1
Application-Level Evaluation (Lattice Boltzmann Method (LBM))
[Figure: time per LB step (us) for matrix sizes 128x512x64 through 1024x512x64 with MVAPICH2 1.7 and MVAPICH2 1.8a2; MVAPICH2 1.8a2 reduces the step time by 23.5-24.2%]
• LBM-CUDA (courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• One process per node with one GPU (8-node cluster)
Application-Level Evaluation (AWP-ODC)
[Figure: total execution time (sec) with 4 and 8 processes for MVAPICH2 1.7 and MVAPICH2 1.8a2; MVAPICH2 1.8a2 is 12.5% and 13.0% faster]
• AWP-ODC simulates the dynamic rupture and wave propagation that occur during an earthquake; a Gordon Bell Prize finalist at SC 2010
• Originally a Fortran code; a new version is being written in C and CUDA
• NVIDIA Tesla C2050, Mellanox QDR IB, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• One process per node with one GPU; 128x128x1024 data grid per process/GPU
• High Performance and Scalable Inter-node Communication with Hybrid UD-
RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective Communication
– Multi-core Aware, Topology-aware and Power-aware
– Exploiting Collective Offload
• Support for GPGPUs and Accelerators
• PGAS Support (Hybrid MPI + UPC)
MPI Design Challenges
• Partitioned Global Address Space (PGAS) models provide a complementary interface to message passing
– The idea is to decouple data movement from process synchronization
– Processes should have asynchronous access to globally distributed
data
– Well suited for irregular applications and kernels that require
dynamic access to different data
• Different libraries and compilers exist that provide this
model
– Global Arrays (library), UPC (compiler), CAF (compiler)
– HPCS languages: X10, Chapel, Fortress
PGAS Models
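The PGAS languages themselves are not shown here, but the core idea of one-sided, asynchronous access to remote data can be sketched with MPI-2 one-sided operations (this is standard MPI, not UPC or Global Arrays; window size and buffer names are illustrative, and the program assumes at least two processes):

/* Sketch: one-sided access to remote data with MPI-2 RMA */
#include <mpi.h>

#define N 16

int main(int argc, char **argv)
{
    int rank, i;
    double local[N], buf[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) { local[i] = rank; buf[i] = -1.0; }

    /* Expose `local` so other processes can access it one-sidedly */
    MPI_Win_create(local, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* Rank 0 writes into rank 1's window without rank 1 posting a receive */
        MPI_Put(buf, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}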
• Currently UPC and MPI do not share runtimes
– Duplication of lower level communication mechanisms
– GASNet unable to leverage advanced buffering mechanisms developed for MVAPICH2
• Our novel approach is to enable a truly unified communication library
Unifying UPC and MPI Runtimes: Experience with MVAPICH2
[Figure: today the UPC compiler runs over the GASNet runtime and MPI over its own runtime, each with separate buffers, queue pairs and other resources on the network interface; in the proposed design, the MPI and GASNet interfaces share a single unified MVAPICH + GASNet runtime with common buffers, queue pairs and other resources]
• BUPC micro-benchmarks from latest release 2.10.2
• UPC performance is identical with both native IBV layer and new UCR
layer
• Performance of the GASNet-MPI conduit is not very good
– Mismatch between the MPI specification and Active Messages
• GASNet-UCR is more scalable than the native IBV conduit
UPC Micro-benchmark Performance
[Figures: UPC memput latency (us) and bandwidth (MBps) vs. message size, and memory footprint (MB) vs. number of processes (16-256), for GASNet-UCR, GASNet-IBV and GASNet-MPI]
J. Jose, M. Luo, S. Sur and D. K. Panda, “Unifying UPC and MPI Runtimes: Experience with MVAPICH”, International
Conference on Partitioned Global Address Space (PGAS), 2010
Evaluation using UPC NAS Benchmarks
• GASNet-UCR performs equal or better than GASNet-IBV
• 10% improvement for CG (B, 128)
• 23% improvement for MG (B, 128)
[Figures: execution time (sec) of the UPC NAS MG, FT and CG benchmarks, Classes B and C with 64 and 128 processes, for GASNet-UCR, GASNet-IBV and GASNet-MPI]
Evaluation of Hybrid MPI+UPC NAS-FT
[Figure: NAS-FT execution time (sec), Classes B and C with 64 and 128 processes, for GASNet-UCR, GASNet-IBV, GASNet-MPI and the hybrid MPI+UPC version]
• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• 34% improvement for FT (C, 128)
Graph500 Results with new UPC Queue Design
• Workload – Scale:24, Edge Factor:16 (16 million vertices, 256 million edges)
• 44% Improvement over base version for 512 UPC-Threads
• 30% Improvement over base version for 1024 UPC-Threads
J. Jose, S. Potluri, M. Luo, S. Sur and D. K. Panda, UPC Queues for Scalable Graph Traversals: Design and Evaluation
on InfiniBand Clusters, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct.
2011.
• Performance and Memory scalability toward 500K-1M cores
• Unified Support for PGAS Models and Languages (UPC, OpenShmem, etc.)
• Support for Hybrid Programming Models (being discussed in MPI 3.0)
• Enhanced Optimization for GPU Support and Accelerators – Extending the GPGPU support and adding Intel MIC support
• Taking advantage of Collective Offload framework in ConnectX-2 – Including support for non-blocking collectives (MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Enhanced Multi-rail Designs
• Automatic optimization of collectives – LiMIC2, XRC, Hybrid (UD-RC/XRC) and Multi-rail
• Checkpoint-Restart and migration support with incremental checkpointing
• Fault-tolerance with run-through stabilization (being discussed in MPI 3.0)
• QoS-aware I/O and checkpointing
MVAPICH2 – Future Plans
• Scientific Computing
– Message Passing Interface (MPI) is the Dominant Programming
Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of
momentum
– Memcached is also used
Two Major Categories of Applications
Can High-Performance Interconnects Benefit Enterprise Computing?
• Beginning to draw interest from the enterprise domain
– Oracle, IBM, Google are working along these directions
• Performance in the enterprise domain remains a concern
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs?
Common Protocols using OpenFabrics
[Figure: options beneath the sockets and verbs application interfaces: 1/10/40 GigE with kernel-space TCP/IP over an Ethernet adapter and switch; 10/40 GigE-TOE with hardware offload; IPoIB (TCP/IP over an InfiniBand adapter and switch); SDP over InfiniBand; iWARP over an iWARP adapter and Ethernet switch; RDMA over Converged Ethernet (RoCE); and native IB verbs over InfiniBand, with the RDMA paths running in user space]
Can New Data Analysis and Management Systems be designed with High-Performance Networks and Protocols?
[Figure: the current design runs the application over sockets on a 1/10 GigE network; enhanced designs run the application over accelerated sockets with verbs/hardware offload on 10 GigE or InfiniBand]
• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
– Zero-copy not available for non-blocking sockets
Our approach: run the application over an OSU design on the verbs interface of 10 GigE or InfiniBand
Memcached Architecture
• Distributed Caching Layer
– Allows spare memory from multiple nodes to be aggregated
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
• Native IB-verbs-level Design and evaluation with
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52
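For reference, a minimal client-side sketch using the libmemcached C API (server address, port, key and value are illustrative); the transport underneath (sockets, IPoIB or the native IB design) is transparent to this code:

/* Sketch: basic set/get with the libmemcached client API */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* Illustrative server address and port */
    memcached_server_add(memc, "192.168.1.10", 11211);

    const char *key = "query:42";
    const char *val = "cached result";

    /* Store, then fetch the value back */
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0, (uint32_t)0);
    if (rc == MEMCACHED_SUCCESS) {
        size_t len;
        uint32_t flags;
        char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (out) {
            printf("GET %s -> %.*s\n", key, (int)len, out);
            free(out);
        }
    }

    memcached_free(memc);
    return 0;
}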
[Figure: web frontend servers (Memcached clients) connected over high-performance networks to Memcached servers, which pool main memory, and to database servers with CPUs, main memory, SSDs and HDDs]
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster
Memcached Get Latency (Small Message)
[Figure: Get latency (us) vs. message size (1 byte to 2K bytes) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR) for SDP, IPoIB, OSU-RC-IB, 1GigE, 10GigE and OSU-UD-IB]
Memcached Set Latency (Large Message)
• Memcached Get latency
– 8K bytes RC/UD – DDR: 19.8/21.2 us; QDR: 11.7/12.7 us
– 512K bytes RC/UD -- DDR: 366/413 us; QDR: 181/206 us
• Almost a factor-of-two improvement over 10GigE (TOE) for 512K bytes on the DDR cluster
[Figure: latency (us) vs. message size (2K to 512K bytes) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR) for SDP, IPoIB, OSU-RC-IB, 1GigE, 10GigE and OSU-UD-IB]
Memcached Get TPS (4 bytes)
• Memcached Get transactions per second for 4 bytes
– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
[Figure: thousands of transactions per second (TPS) vs. number of clients (1 to 1K) for SDP, IPoIB, OSU-RC-IB, 1GigE and OSU-UD-IB]
Memcached - Memory Scalability
• Steady Memory Footprint for UD Design
– ~ 200MB
• RC memory footprint increases with the number of clients
– ~500 MB for 4K clients
[Figure: memory footprint (MB) vs. number of clients (1 to 4K) for SDP, IPoIB, OSU-RC-IB, 1GigE, OSU-UD-IB and OSU-Hybrid-IB]
Application Level Evaluation – Olio Benchmark
• Olio Benchmark
– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design
[Figure: Olio benchmark time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
Application Level Evaluation – Real Application Workloads
• Real application workload: RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• Hybrid design achieves performance comparable to that of the pure RC design
[Figure: response time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design
on High Performance RDMA Capable Interconnects, CCGrid’12
Overview of HBase Architecture
• An open source database project based on Hadoop framework for hosting very large tables
• Major components: HBaseMaster, HRegionServer and HBaseClient
• HBase and HDFS are deployed in the same cluster to get better data locality
HBase Put/Get – Detailed Analysis
• HBase 1KB Put: communication time of 8.9 us, a 6X improvement over 10GigE
• HBase 1KB Get: communication time of 8.9 us, a 6X improvement over 10GigE
[Figure: breakdown of HBase 1KB Put and Get time (us) into communication, communication preparation, server processing, server serialization, client processing and client serialization for 1GigE, IPoIB, 10GigE and OSU-IB]
W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12
HBase Single Server-Multi-Client Results
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GE
[Figure: Get latency (us) and throughput (Ops/sec) vs. number of clients (1-16) for IPoIB, OSU-IB, 1GigE and 10GigE]
HBase – YCSB Read-Write Workload
• HBase Get latency (Yahoo! Cloud Service Benchmark)
– 64 clients: 2.0 ms; 128 Clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
[Figure: YCSB read latency and write latency (us) vs. number of clients (8-128) for IPoIB, OSU-IB, 1GigE and 10GigE]
J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
Hadoop Architecture
• Underlying Hadoop Distributed File System (HDFS)
• Fault-tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host Map-reduce computation
• JobTracker: track jobs and detect failure
• The model scales, but there is a high amount of communication during the intermediate phases
RDMA-based Design for Native HDFS – Preliminary Results
• HDFS File Write Experiment using one data node on IB-DDR Cluster
• HDFS File Write Time
– 64 MB – 344ms, 128 MB – 669ms, 256 MB – 1.3s,
512 MB – 2.7s, 1GB – 6.7s
– 20% improvement over IPoIB and 13% over 10GigE for a 1 GB file
[Figure: HDFS file write time (ms) vs. file size (64 MB to 1 GB) for 1GigE, IPoIB, 10GigE and OSU-IB-DDR]
• InfiniBand with its RDMA feature is gaining momentum in HPC systems, with the best performance and growing usage
• As the HPC community moves to Exascale, new solutions are
needed in the MPI stack for supporting PGAS, GPU, collectives
(topology-aware, power-aware), one-sided and QoS, etc.
• Demonstrated how such solutions can be designed with
MVAPICH2 and their performance benefits
• New solutions are also needed to re-design software libraries for
enterprise environments to take advantage of modern networks
• Allow application scientists and engineers to take advantage of
modern supercomputers
Concluding Remarks
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students
– V. Dhanraj (M.S.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Luo (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– M. Rahman (Ph.D.)
– A. Singh (Ph.D.)
– H. Subramoni (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph. D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– S. Pai (M.S.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Docs
– J. Vienne
– H. Wang
Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins
– D. Sharma
Past Post-Docs – X. Besseron
– H.-W. Jin
– E. Mancini
– S. Marcarelli
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu