Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC-Switzerland (Mar ‘16)
• Scalability for million to billion processors
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Integrated Support for GPGPUs
   – CUDA-Aware MPI
– GPUDirect RDMA (GDR) Support
– CUDA-aware Non-blocking Collectives
– Support for Managed Memory
– Efficient datatype Processing
– Supporting Streaming applications with GDR
– Efficient Deep Learning with MVAPICH2-GDR
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
[Diagram: sender and receiver nodes, each with CPU, GPU, and NIC connected over PCIe, linked through a switch]
At Sender:
cudaMemcpy(s_hostbuf, s_devbuf, . . .);
MPI_Send(s_hostbuf, size, . . .);
At Receiver:
MPI_Recv(r_hostbuf, size, . . .);
cudaMemcpy(r_devbuf, r_hostbuf, . . .);
• Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
MPI + CUDA - Naive
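For reference, a compilable sketch of this naive staging pattern follows; the pinned host staging buffers, the MPI_BYTE type, and tag 0 are illustrative assumptions rather than details from the slide.

#include <stddef.h>
#include <mpi.h>
#include <cuda_runtime.h>

void naive_send(const void *s_devbuf, size_t size, int dst, MPI_Comm comm)
{
    void *s_hostbuf;
    cudaMallocHost(&s_hostbuf, size);                        /* pinned host staging buffer */
    cudaMemcpy(s_hostbuf, s_devbuf, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_hostbuf, (int)size, MPI_BYTE, dst, 0, comm);  /* MPI only ever sees host memory */
    cudaFreeHost(s_hostbuf);
}

void naive_recv(void *r_devbuf, size_t size, int src, MPI_Comm comm)
{
    void *r_hostbuf;
    cudaMallocHost(&r_hostbuf, size);
    MPI_Recv(r_hostbuf, (int)size, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(r_devbuf, r_hostbuf, size, cudaMemcpyHostToDevice);
    cudaFreeHost(r_hostbuf);
}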
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
}
MPI_Waitall();
<<Similar at receiver>>
• Pipelining at user level with non-blocking MPI and CUDA interfaces
Low Productivity and High Performance
MPI + CUDA - Advanced
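For comparison, a fuller sketch of such a hand-written pipeline, assuming one CUDA stream and one MPI request per chunk; the per-chunk streams, tags, and MPI_BYTE type are illustrative choices, not taken from the slide.

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(const char *s_devbuf, char *s_hostbuf, size_t blksz,
                    int pipeline_len, int dst, MPI_Comm comm)
{
    cudaStream_t *streams = malloc(pipeline_len * sizeof(cudaStream_t));
    MPI_Request  *reqs    = malloc(pipeline_len * sizeof(MPI_Request));

    /* Stage each chunk from device to host asynchronously, one stream per chunk */
    for (int j = 0; j < pipeline_len; j++) {
        cudaStreamCreate(&streams[j]);
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz,
                        blksz, cudaMemcpyDeviceToHost, streams[j]);
    }

    /* Send each chunk once its copy completes, progressing earlier sends meanwhile */
    for (int j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(streams[j]) != cudaSuccess) {
            int flag;
            if (j > 0) MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
        }
        MPI_Isend(s_hostbuf + j * blksz, (int)blksz, MPI_BYTE, dst, j, comm, &reqs[j]);
    }
    MPI_Waitall(pipeline_len, reqs, MPI_STATUSES_IGNORE);

    for (int j = 0; j < pipeline_len; j++) cudaStreamDestroy(streams[j]);
    free(streams);
    free(reqs);
}

The extra bookkeeping (streams, requests, progress calls) is exactly the productivity cost the next slide removes by moving the pipeline inside the MPI library.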
At Sender:
   MPI_Send(s_devbuf, size, …);
At Receiver:
   MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware MPI Library: MVAPICH2-GPU
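A minimal two-rank sketch of what this looks like from the application with a CUDA-aware MPI such as MVAPICH2-GPU; the message size, ranks, and tag are placeholders.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;
    float *devbuf;
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
    else if (rank == 1)
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}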
• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
   – Hybrid design using GPU-Direct RDMA
      • GPUDirect RDMA and host-based pipelining
      • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
   – Support for communication using multi-rail
   – Support for Mellanox Connect-IB and ConnectX VPI adapters
   – Support for RoCE with Mellanox ConnectX VPI adapters
GPU-Direct RDMA (GDR) with CUDA
[Diagram: IB adapter, GPU, CPU, chipset, system memory, and GPU memory within a node. Peer-to-peer limits: SNB E5-2670 — P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2 — P2P write 6.4 GB/s, P2P read 3.5 GB/s]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
Setup: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct RDMA
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. Small-message latency drops to 2.18 µs (about 10x better than MV2 without GDR and 2x better than MV2-GDR 2.0b); bandwidth improves by about 11x and 2x over the same baselines]
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average time steps per second (TPS) vs. number of processes (4–32) for 64K-particle and 256K-particle runs, comparing MV2 and MV2+GDR; roughly 2X higher TPS with GDR in both cases]
CUDA-Aware Non-Blocking Collectives
[Charts: medium/large-message overlap (%) vs. message size (4K–1M bytes) on 64 GPU nodes for Ialltoall and Igather, each with 1 process/node and with 2 processes/node (1 process/GPU)]
Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
Available since MVAPICH2-GDR 2.2a
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
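An illustrative sketch of issuing such a CUDA-aware non-blocking collective directly on device buffers and overlapping independent work with it; the function name, counts, and communicator are placeholders, not MVAPICH2-GDR internals.

#include <mpi.h>

void overlapped_alltoall(float *d_sendbuf, float *d_recvbuf,
                         int count_per_rank, MPI_Comm comm)
{
    MPI_Request req;

    /* Device pointers passed directly; the CUDA-aware runtime handles staging/GDR */
    MPI_Ialltoall(d_sendbuf, count_per_rank, MPI_FLOAT,
                  d_recvbuf, count_per_rank, MPI_FLOAT, comm, &req);

    /* ... independent CPU (or GPU) work overlaps with the collective here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}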
Communication Runtime with GPU Managed Memory
● In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, which allows a common memory allocation for GPU or CPU through the cudaMallocManaged() call
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
● OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers; available in OMB 5.2
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop (held in conjunction with PPoPP 2016), Barcelona, Spain
[Charts: OSU micro-benchmark latency (µs) and bandwidth (MB/s) vs. message size (1 byte–16K); latency compares host-to-host (H-H) against managed-to-managed (MH-MH) buffers, and bandwidth compares device-to-device (D-D) against managed-to-managed (MD-MD) buffers]
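A minimal sketch of communicating directly from a managed allocation, as the extended runtime supports; the message size, ranks, and fill loop are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;
    float *buf;
    /* One allocation usable from both CPU code and GPU kernels */
    cudaMallocManaged((void **)&buf, count * sizeof(float), cudaMemAttachGlobal);

    if (rank == 0) {
        for (int i = 0; i < count; i++) buf[i] = (float)i;         /* or fill via a GPU kernel */
        MPI_Send(buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);     /* managed pointer passed directly */
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}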
MPI Datatype Processing (Communication Optimization)
Common scenario (a waste of computing resources on CPU and GPU):
   MPI_Isend(A, ..., Datatype, ...);
   MPI_Isend(B, ..., Datatype, ...);
   MPI_Isend(C, ..., Datatype, ...);
   MPI_Isend(D, ..., Datatype, ...);
   ...
   MPI_Waitall(...);
*Buf1, Buf2, … contain non-contiguous MPI Datatypes
[Timeline diagram, existing vs. proposed design: in the existing design, each Isend initiates its packing kernel on a stream and the CPU then waits for that kernel (WFK) before starting the send, serializing Isend(1)–Isend(3); in the proposed design, the kernels are initiated up front and each send starts as soon as its kernel completes, overlapping progress so the proposed design finishes earlier. Expected benefit: less idle time on both CPU and GPU]
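As an illustration of this scenario, a sketch that posts several non-blocking sends of a strided (vector) datatype from a device matrix; the column layout, the count of four sends, and the buffer names are assumptions for the example.

#include <mpi.h>

/* Send the first four columns of a row-major rows x cols device matrix as
 * separate non-blocking sends, mirroring the Isend/Waitall scenario above. */
void send_columns(float *d_matrix, int rows, int cols, int dst, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(rows, 1, cols, MPI_FLOAT, &column);   /* one strided column */
    MPI_Type_commit(&column);

    MPI_Request reqs[4];
    for (int c = 0; c < 4; c++)
        /* A CUDA-aware MPI can pack each strided column with a GPU kernel */
        MPI_Isend(d_matrix + c, 1, column, dst, c, comm, &reqs[c]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    MPI_Type_free(&column);
}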
Application-Level Evaluation (HaloExchange - Cosmo)
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16–96 GPUs) and the Wilkes GPU cluster (4–32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16
Nature of Streaming Applications
• Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
• Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
Courtesy: Agarwalla, Bikash, et al., "Streamline: A scheduling heuristic for streaming applications on the grid," Electronic Imaging 2006
• The broadcast operation is a key dictator of the throughput of streaming applications
• The current broadcast operation on GPU clusters does not take advantage of
   • IB hardware MCAST
   • GPUDirect RDMA
SGL-based design for Efficient Broadcast Operation on GPU Systems
• Current design is limited by the expensive copies from/to GPUs
• Proposed several alternative designs to avoid the overhead of the copy
   • Loopback, GDRCOPY and hybrid
   • High performance and scalability
   • Still uses PCIe resources for Host-GPU copies
• Proposed SGL-based design
   • Combines IB MCAST and GPUDirect RDMA features
   • High performance and scalability for D-D broadcast
   • Direct code path between HCA and GPU
   • Frees PCIe resources
• 3X improvement in latency
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int’l Conf. on High Performance Computing (HiPC ’14)
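From the application side, the operation being accelerated is simply a repeated broadcast of a device buffer; a hedged sketch follows, with the frame size, frame count, and root rank as placeholders.

#include <mpi.h>
#include <cuda_runtime.h>

void stream_frames(int num_frames, size_t frame_bytes, MPI_Comm comm)
{
    void *d_frame;
    cudaMalloc(&d_frame, frame_bytes);

    for (int i = 0; i < num_frames; i++) {
        /* Rank 0 produces each frame on its GPU; all ranks receive it in device memory */
        MPI_Bcast(d_frame, (int)frame_bytes, MPI_BYTE, 0, comm);
        /* ... each rank then processes the frame on its own GPU ... */
    }
    cudaFree(d_frame);
}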
Accelerating Deep Learning with MVAPICH2-GDR
• Caffe: A flexible and layered Deep Learning framework
• Benefits and Weaknesses
   – Multi-GPU training within a single node
   – Performance degradation for GPUs across different sockets
• Can we enhance Caffe with MVAPICH2-GDR?
   – Caffe-Enhanced: A CUDA-Aware MPI version
   – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
   – Initial evaluation suggests up to 8X reduction in training time on the CIFAR-10 dataset
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
MPI Applications on MIC Clusters
[Diagram: spectrum of execution modes from multi-core centric (Xeon) to many-core centric (Xeon Phi): Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on one side with computation offloaded to the other), Symmetric (MPI programs on both Xeon and Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only)]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
   • Coprocessor-only and Symmetric Mode
• Internode Communication
   • Coprocessor-only and Symmetric Mode
• Multi-MIC Node Configurations
• Running on three major systems
   • Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Charts: intra-socket and inter-socket MIC-remote-MIC P2P results — large-message latency (µs, 8K–2M bytes) and bandwidth (MB/s, 1 byte–1M bytes), with peak bandwidths of 5236 and 5594 MB/s]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014
[Charts: MV2-MIC vs. MV2-MIC-Opt latency for 32-node Allgather with 16 hosts + 16 MICs (small messages), 32-node Allgather with 8H + 8M (large messages), and 32-node Alltoall with 8H + 8M (large messages), with annotated improvements of 76%, 58%, and 55%; P3DFFT performance (communication and computation time) on 32 nodes (8H + 8M), problem size 2K×2K×1K]
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
Can HPC and Virtualization be Combined?
• Virtualization has many benefits
   – Fault-tolerance
   – Job migration
   – Compaction
• Virtualization has not been very popular in HPC due to the overhead associated with it
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
• How about Containers support?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar ’14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC ’14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid ’15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
• Redesign MVAPICH2 to make it virtual machine aware
   – SR-IOV shows near-to-native performance for inter-node point-to-point communication
   – IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
   – Locality Detector: maintains the locality information of co-resident virtual machines
   – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
[Diagram: host environment with two guests; each guest runs an MPI process whose VF driver attaches to an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel), while an IV-Shmem channel through /dev/shm connects co-resident VMs; the hypervisor holds the PF driver for the physical function]
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, Euro-Par 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
   – Supporting SR-IOV configuration
   – Supporting IVSHMEM configuration
   – Virtual machine aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack
J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid 2015
Application-Level Performance on Chameleon
[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (22,20)–(26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 Procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
• Large-scale instrument
– Targeting Big Data, Big Compute, Big Instrument research
– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
• Reconfigurable instrument
– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
• Connected instrument
– Workload and Trace Archive
– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
– Partnerships with users
• Complementary instrument
– Complementing GENI, Grid’5000, and other testbeds
• Sustainable instrument
– Industry connections
http://www.chameleoncloud.org/
Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon
[Charts: intra-node inter-container latency (µs) and bandwidth (MBps) vs. message size (1 byte–64K bytes), comparing Container-Def, Container-Opt, and Native]
• Intra-node inter-container communication
• Compared to Container-Def, up to 81% and 191% improvement on latency and BW
• Compared to Native, minor overhead on latency and BW
Containers Support: Application-Level Performance on Chameleon
[Charts: Graph 500 execution time (ms) for problem sizes (22,16)–(26,20) and NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time (s), comparing Container-Def, Container-Opt, and Native]
• 64 Containers across 16 nodes, pinning 4 cores per Container
• Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
• Optimized Container support will be available with the next release of MVAPICH2-Virt
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
Designing Energy-Aware (EA) MPI Runtime
[Diagram: overall application energy expenditure divides into energy spent in computation routines and energy spent in communication routines (point-to-point, collective, and RMA routines). The MVAPICH2-EA designs cover MPI two-sided and collectives (e.g., MVAPICH2) and impact MPI-3 RMA implementations (e.g., MVAPICH2), one-sided runtimes (e.g., ComEx), and other PGAS implementations (e.g., OSHMPI)]
• MVAPICH2-EA 2.1 (Energy-Aware)
• A white-box approach
• New Energy-Efficient communication protocols for pt-pt and collective operations
• Intelligently apply the appropriate Energy saving techniques
• Application oblivious energy saving
• OEMT
• A library utility to measure energy consumption for MPI applications
• Works with all MPI runtimes
• PRELOAD option for precompiled applications
• Does not require ROOT permission:
• A safe kernel module to read only a subset of MSRs
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• Uses the best energy lever automatically and transparently
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing ’15, Nov 2015 [Best Student Paper Finalist]
MPI-3 RMA Energy Savings with Proxy-Applications
[Charts: Graph500 execution time (s) and energy usage (J) at 128, 256, and 512 processes, comparing the optimistic, pessimistic, and EAM-RMA runtimes; EAM-RMA saves up to 46% energy]
• MPI_Win_fence dominates application execution time in Graph500
• Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime
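For context, a minimal fence-based RMA epoch of the kind whose MPI_Win_fence wait time dominates here, and where an energy-aware runtime can apply levers while ranks wait; the buffer types, counts, and target rank are illustrative.

#include <mpi.h>

void rma_exchange(long *local, long *remote_copy, int count, int target, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(local, count * sizeof(long), sizeof(long),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                       /* open the access/exposure epoch */
    MPI_Get(remote_copy, count, MPI_LONG, target, 0, count, MPI_LONG, win);
    MPI_Win_fence(0, win);                       /* close the epoch; ranks block here */

    MPI_Win_free(&win);
}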
MPI-3 RMA Energy Savings with Proxy-Applications
[Charts: SCF energy usage (J) and execution time (s) at 128, 256, and 512 processes, comparing the optimistic, pessimistic, and EAM-RMA runtimes]
• The SCF (self-consistent field) calculation spends nearly 75% of its total time in the MPI_Win_unlock call
• With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
• 128 processes is an exception due to the interaction of 2-sided and 1-sided operations
• MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• MPI runtime has many parameters
• Tuning a set of parameters can help you to extract higher performance
• Compiled a list of such contributions through the MVAPICH Website
   – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
   – Amber
– HoomdBlue
– HPCG
– Lulesh
– MILC
– MiniAMR
– Neuron
– SMG2000
• Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu. We will link these results with credits to you.
Applications-Level Tuning: Compilation of Best Practices
MVAPICH2 – Plans for Exascale
• Performance and Memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Support for task-based parallelism (UPC++)*
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
   • On-Demand Paging (ODP)
   • Switch-IB2 SHArP
   • GID-based support
• Enhanced Inter-node and Intra-node communication schemes for upcoming architectures
   • OpenPower*
   • OmniPath-PSM2*
   • Knights Landing
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
• Support for * features will be available in MVAPICH2-2.2 RC1
• Exascale systems will be constrained by
   – Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes for HPC need to be designed for
– Scalability
– Performance
– Fault-resilience
– Energy-awareness
– Programmability
– Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Looking into the Future ….
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Current Post-Doc
– J. Lin
– D. Banerjee
Current Programmer
– J. Perkins
Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Research Scientists
– H. Subramoni
– X. Lu
Current Senior Research Associate
– K. Hamidouche
Past Programmers
– D. Bureddy
Current Research Specialist
– M. Arnold
International Workshop on Communication Architectures at Extreme Scale (Exacomm)
ExaComm 2015 was held with the Int’l Supercomputing Conference (ISC ’15), in Frankfurt, Germany, on Thursday, July 16th, 2015
One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers
ExaComm 2016 will be held in conjunction with ISC ’16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/