Page 1:

The Future of MPI

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

A Presentation at the HPC Advisory Council Workshop, Malaga 2012

Page 2:

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Overview of new MPI-3 features

• Addressing Exascale Challenges in MVAPICH2 Project

• Conclusions

Presentation Overview

Page 3:

High-End Computing (HEC): PetaFlop to ExaFlop

Expected to have an ExaFlop system in 2019-2021!

(Chart annotations: 100 PFlops in 2015; 1 EFlops in 2019)

Page 4:


Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%) Jun. 2002: 80/500 (16%) Nov. 2007: 406/500 (81.2%)

Jun. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Jun. 2008: 400/500 (80.0%)

Nov. 1997: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Nov. 2008: 410/500 (82.0%)

Jun. 1998: 1/500 (0.2%) Nov. 2003: 208/500 (41.6%) Jun. 2009: 410/500 (82.0%)

Nov. 1998: 2/500 (0.4%) Jun. 2004: 291/500 (58.2%) Nov. 2009: 417/500 (83.4%)

Jun. 1999: 6/500 (1.2%) Nov. 2004: 294/500 (58.8%) Jun. 2010: 424/500 (84.8%)

Nov. 1999: 7/500 (1.4%) Jun. 2005: 304/500 (60.8%) Nov. 2010: 415/500 (83%)

Jun. 2000: 11/500 (2.2%) Nov. 2005: 360/500 (72.0%) Jun. 2011: 411/500 (82.2%)

Nov. 2000: 28/500 (5.6%) Jun. 2006: 364/500 (72.8%) Nov. 2011: 410/500 (82.0%)

Jun. 2001: 33/500 (6.6%) Nov. 2006: 361/500 (72.2%) Jun. 2012: 407/500 (81.4%)

Nov. 2001: 43/500 (8.6%) Jun. 2007: 373/500 (74.6%)


Page 5:

• 209 IB Clusters (41.8%) in the June 2012 Top500 list (http://www.top500.org)

• Installations in the Top 40 (13 systems):

Large-scale InfiniBand Installations

147,456 cores (SuperMUC) in Germany (4th)
77,184 cores (Curie thin nodes) at France/CEA (9th)
120,640 cores (Nebulae) at China/NSCS (10th)
125,980 cores (Pleiades) at NASA/Ames (11th)
70,560 cores (Helios) at Japan/IFERC (12th)
73,278 cores (Tsubame 2.0) at Japan/GSIC (14th)
138,368 cores (Tera-100) at France/CEA (17th)
122,400 cores (Roadrunner) at LANL (19th)
78,660 cores (Lomonosov) in Russia (22nd)
137,200 cores (Sunway Blue Light) in China (26th)
46,208 cores (Zin) at LLNL (27th)
29,440 cores (Mole-8.5) in China (37th)
42,440 cores (Red Sky) at Sandia (39th)
More are getting installed!

Page 6:

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Overview of new MPI-3 features

• Addressing Exascale Challenges in MVAPICH2 Project

• Conclusions

Presentation Overview

Page 7:


Parallel Programming Models Overview

(Figure: three abstract machine models)

• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 share a single memory

• Distributed Memory Model (MPI, the Message Passing Interface): P1, P2, P3 each with their own memory

• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories presented as a logical shared memory

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• In this presentation series, we concentrate on MPI first and then focus on PGAS

Page 8:

• Message Passing Library standardized by MPI Forum

– C, C++ and Fortran

• Goal: portable, efficient and flexible standard for writing parallel applications

• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications

• Evolution of MPI

– MPI-1: 1994

– MPI-2: 1997

– MPI-3: on-going effort (2008 – current)

MPI Overview and History

Page 9:

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

Major MPI Features

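To make the two-sided feature above concrete, here is a minimal, self-contained sketch (illustrative only, not from the deck) in which rank 0 sends four integers to rank 1 with the blocking point-to-point calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {0, 1, 2, 3};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocking send of 4 ints to rank 1, tag 0 */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Matching blocking receive from rank 0 */
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}

The collective, one-sided and I/O features listed above build on the same process/communicator model shown here.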

Page 10:

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

(Figure: layered software stack)

• Applications/Libraries

• Programming Models: Message Passing Interface (MPI), Sockets and PGAS (UPC, OpenSHMEM, Global Arrays)

• Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Synchronization & Locks, I/O & File Systems, QoS, Fault Tolerance

• Networking Technologies (InfiniBand, 1/10/40GigE, RoCE & Intelligent NICs)

• Commodity Computing System Architectures (single, dual, quad, ..), Multi/Many-core architectures and Accelerators

Page 11:

Exascale System Targets

Systems                      2010        2018                                               Difference Today & 2018
System peak                  2 PFlop/s   1 EFlop/s                                          O(1,000)
Power                        6 MW        ~20 MW (goal)
System memory                0.3 PB      32 - 64 PB                                         O(100)
Node performance             125 GF      1.2 or 15 TF                                       O(10) - O(100)
Node memory BW               25 GB/s     2 - 4 TB/s                                         O(100)
Node concurrency             12          O(1k) or O(10k)                                    O(100) - O(1,000)
Total node interconnect BW   3.5 GB/s    200 - 400 GB/s (1:4 or 1:8 from memory BW)         O(100)
System size (nodes)          18,700      O(100,000) or O(1M)                                O(10) - O(100)
Total concurrency            225,000     O(billion) + [O(10) to O(100) for latency hiding]  O(10,000)
Storage capacity             15 PB       500 - 1000 PB (>10x system memory is min)          O(10) - O(100)
I/O rates                    0.2 TB/s    60 TB/s                                            O(100)
MTTI                         Days        O(1 day)                                           -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra

Page 12:

• DARPA Exascale Report – Peter Kogge, Editor and Lead

• Energy and Power Challenge

– Hard to solve power requirements for data movement

• Memory and Storage Challenge

– Hard to achieve high capacity and high data rate

• Concurrency and Locality Challenge

– Management of very large amounts of concurrency (billions of threads)

• Resiliency Challenge

– Low voltage devices (for low power) introduce more faults

Basic Design Challenges for Exascale Systems

Page 13:

• Power required for data movement operations is one of the main challenges

• Non-blocking collectives

– Overlap computation and communication

• Much improved One-sided interface

– Reduce synchronization of sender/receiver

• Manage concurrency

– Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM)

• Resiliency

– New interface for detecting failures

How does MPI plan to meet these challenges?

Page 14:

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Overview of new MPI-3 features

• Addressing Exascale Challenges in MVAPICH2 Project

• Conclusions

Presentation Overview

Page 15:

• Major features

– Non-blocking Collectives

– Improved One-Sided (RMA) Model

– MPI Tools Interface

• Draft version available: http://meetings.mpi-forum.org/MPI_3.0_main_page.php

• Targeted to be released during the next few weeks

Major New Features in MPI-3

Page 16:

• Enables overlap of computation with communication

• Removes synchronization effects of collective operations (exception: barrier)

• Completion when the local part is complete

• Completion of one non-blocking collective does not imply completion of other non-blocking collectives

• No "tag" for the collective operation

• Issuing many non-blocking collectives may exhaust resources

– Quality implementations will ensure that this happens only in pathological cases

Non-blocking Collective Operations

Page 17:

• Non-blocking calls do not match blocking collective calls

– The MPI implementation may use different algorithms for blocking and non-blocking collectives

– Therefore, non-blocking collectives cannot match blocking collectives

– Blocking collectives: optimized for latency

– Non-blocking collectives: optimized for overlap

• User must call collectives in the same order on all ranks

• Progress rules are the same as those for point-to-point

• Example new calls: MPIX_Ibarrier, MPIX_Iallreduce, … (a usage sketch follows below)

Non-blocking Collective Operations (cont'd)
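As an illustration (not part of the original slides), a minimal sketch of the overlap pattern these calls enable: start an iallreduce, compute on independent data, then complete it with MPI_Wait. The MPIX_ prefix follows the pre-release naming used above; libraries tracking the final MPI-3 standard expose the same routine as MPI_Iallreduce.

#include <mpi.h>

void overlapped_allreduce(double *local_sum, double *global_sum, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the reduction; the call returns immediately. */
    MPIX_Iallreduce(local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... computation that does not depend on global_sum runs here ... */

    /* Complete the collective; global_sum is valid only after this. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}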

Page 18:

• Easy to express irregular patterns of communication

– Easier than a request-response pattern using two-sided calls

• Decouples data transfer from synchronization

• Better overlap of communication with computation

Recap of One-Sided Communication

(Figure: an RMA window spanning memory (mem) regions exposed by Rank 0, Rank 1, Rank 2 and Rank 3)

Page 19:

• Remote Memory Access (RMA)

• The new proposal has major improvements

• MPI-2: public and private windows

– Synchronization of windows is explicit

• MPI-2: works for non-cache-coherent systems

• MPI-3: two types of windows

– Unified and Separate

– The Unified window leverages hardware cache coherence

Improved One-sided Model

(Figure: separate-window model: each processor has a private window updated by local memory operations and a public window updated by incoming RMA operations, kept consistent through explicit synchronization)

Page 20:

• MPI_Win_create_dynamic

– Window without memory attached

– MPI_Win_attach to attach memory to a window

• MPI_Win_allocate_shared

– Windows with shared memory

– Allows direct load/store accesses by remote processes

• MPI_Rput, MPI_Rget, MPI_Raccumulate

– Local completion by using MPI_Wait on request objects

• MPI_Get_accumulate, MPI_Fetch_and_op

– Accumulate into target memory, return old data to origin

• MPI_Compare_and_swap

– Atomic compare and swap

Highlights of New MPI-3 RMA Calls
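For illustration only (not from the deck), a sketch that combines two of the calls listed above, assuming an MPI-3 (or MPIX_-prefixed draft) implementation: allocate a window, issue a request-based MPI_Rput inside a lock epoch, and use MPI_Wait for local completion.

#include <mpi.h>

void rput_example(MPI_Comm comm, int target_rank)
{
    double *base;
    MPI_Win win;
    MPI_Request req;
    double value = 3.14;

    /* Allocate a 1-element window on every process. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     comm, &base, &win);

    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);

    /* Request-based put: local completion can be checked with MPI_Wait. */
    MPI_Rput(&value, 1, MPI_DOUBLE, target_rank, 0, 1, MPI_DOUBLE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* local buffer may now be reused */

    MPI_Win_unlock(target_rank, win);    /* remote completion */
    MPI_Win_free(&win);
}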

Page 21:

• MPI_Win_lock_all

– Faster way to lock all members of win

• MPI_Win_flush / MPI_Win_flush_all

– Flush all RMA ops to target / window

• MPI_Win_flush_local / MPI_Win_flush_local_all

– Locally complete RMA ops to target / window

• MPI_Win_sync

– Synchronize public and private copies of window

• Overlapping accesses were “erroneous” in MPI-2

– They are “undefined” in MPI-3

Highlights of New MPI-3 RMA Calls (Cont'd)
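A second hedged sketch, under the same assumptions: a passive-target epoch opened with MPI_Win_lock_all, where MPI_Win_flush forces remote completion of outstanding operations to one target without closing the epoch.

#include <mpi.h>

void flush_example(MPI_Win win, int target_rank, double *value)
{
    /* Open a shared-lock epoch on all ranks of the window at once. */
    MPI_Win_lock_all(0, win);

    MPI_Put(value, 1, MPI_DOUBLE, target_rank, 0, 1, MPI_DOUBLE, win);

    /* Force completion of all outstanding RMA ops to this target,
       while keeping the epoch open for further communication. */
    MPI_Win_flush(target_rank, win);

    MPI_Win_unlock_all(win);
}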

Page 22:


MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface

• Provides a standardized interface (MPI_T) to access MPI-internal information

– Configuration and control information: eager limit, buffer sizes, . . .

– Performance information: time spent in blocking calls, memory usage, . . .

– Debugging information: packet counters, thresholds, . . .

• External tools can build on top of this standard interface
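A hedged sketch (not from the slides) of how a tool might use this interface, assuming an implementation of the MPI-3 MPI_T routines: initialize the tools interface and query how many control variables the library exposes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    /* The tools interface has its own init/finalize, independent of MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    /* Number of control variables (e.g. eager limits, buffer sizes) exposed. */
    MPI_T_cvar_get_num(&num_cvars);
    printf("MPI library exposes %d control variables\n", num_cvars);

    MPI_T_finalize();
    return 0;
}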

Page 23:


Miscellaneous features

• New FORTRAN bindings

• Non-blocking File I/O

• Removed C++ bindings from standard

• Support for large data counts

• Scalable sparse collectives on process topologies

• Distributed graphs

• Topology aware communicator creation

• Non-collective (group-collective) communicator creation

• Enhanced Datatype support - MPI_TYPE_CREATE_HINDEXED_BLOCK

• Mprobe for hybrid programming on clusters of SMP nodes

• Numerous fixes in the existing proposal

Page 24:

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Overview of new MPI-3 features

• Addressing Exascale Challenges in MVAPICH2 Project

• Conclusions

Presentation Overview

Page 25:

• High-Performance open-source MPI Library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002

– MVAPICH2-X (MPI + PGAS), available since 2012

– Used by more than 1,960 organizations (HPC centers, industry and universities) in 70 countries

– More than 129,000 downloads from the OSU site directly

– Empowering many TOP500 clusters

• 11th-ranked 125,980-core cluster (Pleiades) at NASA

• 14th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology

• 40th-ranked 62,976-core cluster (Ranger) at TACC

• and many others

– Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system

MVAPICH2/MVAPICH2-X Software


Page 26:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)

– Extremely small memory footprint

• Hybrid programming (MPI + OpenMP, MPI + OpenSHMEM, MPI + UPC, …)

• Balancing intra-node and inter-node communication for next-generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Support for GPGPUs and Accelerators (Intel MIC)

• Scalable Collective communication – Offload

– Non-blocking

– Topology-aware

– Power-aware

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• MPI-3 Support will be available soon

Challenges being Addressed by MVAPICH2 for Exascale


Page 27:

• High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport

• Kernel-based zero-copy intra-node communication

• Collective Communication

– Multi-core Aware, UD-Multicast-Based, Topology-aware and Power-aware

– Exploiting Collective Offload for Non-blocking Collective

• Support for GPGPUs and Accelerators

• Support for MPI-3 RMA Model

• PGAS Support (Hybrid MPI + UPC, Hybrid MPI + OpenSHMEM)

• QoS Support with Virtual Lanes

• Fault Tolerance and Fault Resilience

Overview of MVAPICH2 Designs


Page 28:


One-way Latency: MPI over IB

(Charts: small-message and large-message one-way latency (us) vs. message size for MVAPICH with Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR and ConnectX3-PCIe3-FDR; annotated small-message latencies range from 0.99 us to 1.82 us)

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

Page 29:


Bandwidth: MPI over IB

(Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for the same five HCA configurations; annotated unidirectional peaks of 1706, 1917, 3280, 3385 and 6343 MBytes/sec, and bidirectional peaks of 3341, 3704, 4407, 6521 and 11643 MBytes/sec)

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

Page 30:

• Both UD and RC/XRC have benefits

• Evaluate characteristics of all of them and use two sets of transports in the same application to get the best of both

• Available in MVAPICH (since 1.1), as a separate interface

• Available since MVAPICH2 1.7 as an integrated interface

Hybrid Transport Design (UD/RC/XRC)

M. Koop, T. Jones and D. K. Panda, "MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand," IPDPS '08

Page 31:

• Experimental Platform: LLNL Hyperion Cluster (Intel Harpertown + QDR IB)

• RC: Default Parameters

• UD: MV2_USE_UD=1

• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1


HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)

(Chart: HPCC Random Ring latency (us) for 128, 256, 512 and 1024 processes with the UD, Hybrid and RC transports; annotated improvements of 38%, 26%, 40% and 30% at the four scales)

Page 32:

MVAPICH2 Two-Sided Intra-Node Performance (Shared-memory and Kernel-based Zero-copy Support)

(Charts: latency, bandwidth and bi-directional bandwidth vs. message size for intra-socket and inter-socket communication with LiMIC and shared-memory channels, using the latest MVAPICH2 1.8 on Intel Westmere; 0.19 us intra-socket and 0.45 us inter-socket latency, ~10,000 MB/s bandwidth and ~18,500 MB/s bi-directional bandwidth)

Page 33:

• High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport

• Kernel-based zero-copy intra-node communication

• Collective Communication

– Multi-core Aware, UD-Multicast-Based, Topology-aware and Power-aware

– Exploiting Collective Offload for Non-Blocking Collective

• Support for GPGPUs and Accelerators

• MPI-3 RMA Support

• PGAS Support (Hybrid MPI + UPC, Hybrid MPI + OpenSHMEM)

• QoS Support with Virtual Lanes

• Fault Tolerance and Fault Resilience

Overview of MVAPICH2 Designs


Page 34:

Shared-memory Aware Collectives

(Charts: MPI_Reduce and MPI_Allreduce latency (us) vs. message size at 4,096 cores, Original vs. Shared-memory designs, and Pt2Pt Barrier vs. Shared-Mem Barrier latency for 128-1024 processes)

Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1, MV2_USE_SHMEM_BARRIER=0/1

• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)

• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB

Page 35:


Hardware Multicast-aware MPI_Bcast

(Charts: 16-byte and 64 KB MPI_Bcast time (us) vs. number of processes (32-1024), and small-message (1 B-1 KB) and large-message (2 KB-512 KB) MPI_Bcast time at 1,024 cores, Default vs. Multicast designs)

Available in MVAPICH2 1.9

Page 36:

Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures

• Job schedulers are used to allocate compute nodes to various jobs

(Figure: spine and line-card switches connecting nodes, with busy cores, idle cores and a new job's cores marked)

Page 37:

Non-contiguous Allocation of Jobs (cont'd)

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures

• Job schedulers are used to allocate compute nodes to various jobs

• Individual processes belonging to one job can get scattered across the system

• The primary responsibility of the scheduler is to keep system throughput high

(Figure: the same switch topology, with the new job's cores scattered across non-contiguous nodes)

Page 38:

Impact of Network-Topology Aware Algorithms on Broadcast Performance

(Charts: broadcast latency (ms) vs. message size (128 KB-1 MB) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT schemes, and normalized latency vs. job size (128-1024 processes))

• Impact of topology-aware schemes is more pronounced as

• Message size increases

• System size scales

• Up to 14% improvement in performance at scale

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11

Page 39:


Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?

• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?

• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

(Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes, and Default vs. Topo-Aware communication breakdown for a 2048-core run, with a 15% improvement)

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist

• Reduces network topology discovery time from O(N^2) to O(N) in the number of hosts

• 15% improvement in MILC execution time @ 2048 cores

• 15% improvement in Hypre execution time @ 1024 cores

Page 40:

Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)

• Modified Preconditioned Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version

(Chart: normalized performance of HPL-Offload, HPL-1ring and HPL-Host for HPL problem sizes (N) from 10% to 70% of total memory)

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Page 41:

Non-blocking Neighborhood Collectives to Minimize Communication Overheads of Irregular Graph Algorithms

• Critical to improve the communication overheads of irregular graph algorithms to cater to the graph-theoretic demands of emerging "big-data" applications

• What are the challenges involved in re-designing irregular graph algorithms, such as the 2D BFS, to achieve fine-grained computation/communication overlap?

• Can we design network-offload-based non-blocking neighborhood collectives in a high-performance, scalable manner?

(Charts: CombBLAS (2D BFS) application run-time on Hyperion (Scale=29), and communication overheads in CombBLAS with 1,936 processes on Hyperion; annotated values of 70%, 20% and 6.5%)

Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, accepted at Cluster (IWPAPS) 2012

Page 42:

Power and Energy Savings with Power-Aware Collectives

(Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings, with annotated gains of 5% and 32%)

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10

Page 43:

• High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport

• Kernel-based zero-copy intra-node communication

• Collective Communication

– Multi-core Aware, UD-Multicast-Based, Topology-aware and Power-aware

– Exploiting Collective Offload for Non-blocking Collective

• Support for GPGPUs and Accelerators

• Support for MPI-3 RMA

• PGAS Support (Hybrid MPI + UPC, Hybrid MPI + OpenSHMEM)

• QoS Support with Virtual Lanes

• Fault Tolerance and Fault Resilience

Overview of MVAPICH2 Designs


Page 44:

(Figure: data path from GPU memory over PCIe to the CPU/host memory, then through the NIC to the switch)

At Sender:
  cudaMemcpy(sbuf, sdev, . . .);
  MPI_Send(sbuf, size, . . .);

At Receiver:
  MPI_Recv(rbuf, size, . . .);
  cudaMemcpy(rdev, rbuf, . . .);

Sample Code - Without MPI integration

• Naïve implementation with standard MPI and CUDA

• High Productivity and Poor Performance
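For reference, a self-contained version of the host-staging pattern the slide abbreviates (an illustrative sketch; the buffer names and the 1 MB payload size are assumptions, not from the deck):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define NBYTES (1 << 20)   /* 1 MB payload, chosen arbitrarily */

void naive_gpu_exchange(int rank)
{
    char *host_buf = (char *) malloc(NBYTES);
    char *dev_buf;
    cudaMalloc((void **) &dev_buf, NBYTES);

    if (rank == 0) {
        /* Stage device data into host memory, then send it. */
        cudaMemcpy(host_buf, dev_buf, NBYTES, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive into host memory, then copy down to the device. */
        MPI_Recv(host_buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, NBYTES, cudaMemcpyHostToDevice);
    }

    cudaFree(dev_buf);
    free(host_buf);
}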

Page 45:

(Figure: the same GPU-PCIe-CPU-NIC-switch data path)

At Sender:
  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);

  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(…);
          if (j > 0) MPI_Test(…);
      }
      MPI_Isend(sbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall();

Sample Code – User Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces

• Code at Sender side (and repeated at Receiver side)

• User-level copying may not match with internal MPI design

• High Performance and Poor Productivity


Page 46:

Can this be done within MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces

– e.g. enable MPI_Send, MPI_Recv from/to GPU memory

• Provide high performance without exposing low-level details to the programmer

– Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning

• A new design incorporated in MVAPICH2 to support this functionality

Page 47:

At Sender:
  MPI_Send(s_device, size, …);

At Receiver:
  MPI_Recv(r_device, size, …);

(data movement handled inside MVAPICH2)

Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: standard MPI interfaces used

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from the GPU with RDMA transfers

• High Performance and High Productivity
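An illustrative sketch of this usage (assuming a CUDA-aware MPI build such as MVAPICH2 1.8 or later and CUDA >= 4.0 with UVA; the buffer name and size are hypothetical): device pointers are passed directly to MPI.

#include <mpi.h>
#include <cuda_runtime.h>

void gpu_direct_exchange(int rank, int nbytes)
{
    char *dev_buf;
    cudaMalloc((void **) &dev_buf, nbytes);

    if (rank == 0) {
        /* The library detects the device pointer via UVA and pipelines
           the staging copies with RDMA internally. */
        MPI_Send(dev_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dev_buf);
}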

Page 48:

MPI-Level Two-sided Communication

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4 MB messages

• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4 MB messages

(Chart: transfer time (us) vs. message size (32 KB-4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU)

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11

Page 49:

Other MPI Operations and Optimizations for GPU Buffers

• Overlap optimizations for

– One-sided Communication

– Collectives

– Communication with Datatypes

• Optimized Designs for multi-GPUs per node

– Use CUDA IPC (in CUDA 4.1), to avoid copy through host memory


• H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011


• A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011

• S. Potluri et al. Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems(ASHES), to be held in conjunction with IPDPS 2012, May 2012

Page 50:

MVAPICH2 1.8 and 1.9a Release

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers


Page 51:

Applications-Level Evaluation

• LBM-CUDA (Courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios; one process/GPU per node, 16 nodes

• AWP-ODC (Courtesy: Yifeng Cui, SDSC): a seismic modeling code, Gordon Bell Prize finalist at SC 2010; 128x256x512 data grid per process, 8 nodes

• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

(Charts: LBM-CUDA step time (s) for domain sizes 256x256x256 to 512x512x512, MPI vs. MPI-GPU, with improvements of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC total execution time (sec) with 1 GPU/process and 2 GPUs/processes per node, MPI vs. MPI-GPU, with improvements of 11.1% and 7.9%)

Page 52:

Many Integrated Core (MIC) Architecture

• Intel’s Many Integrated Core (MIC) architecture geared for HPC

• Critical part of their solution for exascale computing

• Knights Ferry (KNF) and Knights Corner (KNC)

• Many low-power x86 processor cores with hardware threads and wider vector units

• Applications and libraries can run with minor or no modifications

• Using MVAPICH2 for communication within a KNF coprocessor

– out-of-the box (Default)

– shared memory designs tuned for KNF (Optimized V1)

– designs enhanced using lower level API (Optimized V2)


Page 53:

Point-to-Point Communication

Knights Ferry - D0 1.20 GHz card - 32 cores; Alpha 9 MIC software stack with an additional pre-alpha patch

(Charts: normalized intra-MIC latency and bandwidth vs. message size for the Default, Optimized V1 and Optimized V2 designs; annotated improvements of 13% and 17% for small-message latency, 70% and 80% for large-message latency, and 15%, 4X and 9.5X for bandwidth)

Page 54:

Collective Communication

(Charts: normalized intra-MIC latency for Broadcast (small messages: 17% improvement; 2 KB-512 KB messages: 20% and 26%) and Scatter (64 KB-1 MB messages: 72% and 80%) with the Default, Optimized V1 and Optimized V2 designs)

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda – Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 – Best Student Paper Award

Page 55:

• High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport

• Kernel-based zero-copy intra-node communication

• Collective Communication

– Multi-core Aware, UD-Multicast-Based, Topology-aware and Power-aware

– Exploiting Collective Offload for Non-blocking collective

• Support for GPGPUs and Accelerators

• Support for MPI-3 RMA

• PGAS Support (Hybrid MPI + UPC, Hybrid MPI + OpenSHMEM)

• QoS Support with Virtual Lanes

• Fault Tolerance and Fault Resilience

MPI Design Challenges


Page 56:

Support for MPI-3 RMA Model

(Figure: MPI-3 One-Sided Communication feature map)

• Window Creation: Win_allocate; Win_allocate_shared; Win_create_dynamic, Win_attach, Win_detach

• Synchronization: Lock_all, Unlock_all; Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all; Win_sync

• Communication: Get_accumulate; Rput, Rget, Raccumulate, Rget_accumulate; Fetch_and_op, Compare_and_swap

• Semantics: accumulate ordering, undefined conflicting accesses, separate and unified windows

Page 57:


Support for MPI-3 RMA Model

• Flush Operations

– Local and remote completions are bundled in MPI-2

– Considerable overhead on networks like IB, where the semantics and cost of local and remote completions are different

– Flush operations allow a more efficient check for completions

• Request-based Operations

– Current semantics provide bulk synchronization

– Request-based operations return an MPI Request and can be polled individually

– Allow for better overlap with the direct one-sided design in MVAPICH2

• Dynamic Windows

– Allow users to attach and detach memory dynamically; the key is to hide the overheads (such as the exchange of RDMA keys)
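A hedged sketch (not from the deck) of the dynamic-window usage described above: create a window with no memory attached, attach a local buffer, and publish its address so peers can target it with RMA operations. The buffer and variable names are hypothetical.

#include <mpi.h>
#include <stdlib.h>

void dynamic_window_example(MPI_Comm comm, int nelems)
{
    MPI_Win win;
    double *buf = (double *) malloc(nelems * sizeof(double));
    MPI_Aint my_disp, *all_disps;
    int nprocs;

    MPI_Comm_size(comm, &nprocs);
    all_disps = (MPI_Aint *) malloc(nprocs * sizeof(MPI_Aint));

    /* Window starts with no memory attached. */
    MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);

    /* Attach local memory and publish its address to all peers;
       Put/Get on a dynamic window are addressed with these values. */
    MPI_Win_attach(win, buf, nelems * sizeof(double));
    MPI_Get_address(buf, &my_disp);
    MPI_Allgather(&my_disp, 1, MPI_AINT, all_disps, 1, MPI_AINT, comm);

    /* ... RMA epochs using all_disps[target] as the target displacement ... */

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(all_disps);
    free(buf);
}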

Page 58:

(Charts: time (usec) vs. message size (1 B-4 KB) for Put+Unlock, Put+Flush_local and Put+Flush, and percentage overlap vs. message size (2 KB-2 MB) for Lock-Unlock vs. Request Ops)

• Flush_local allows for a faster check on local completions

• Request-based operations provide much better overlap than Lock-Unlock in a Get-Compute-Put communication pattern

Support for MPI-3 RMA Model

S. Potluri, S. Sur, D. Bureddy and D. K. Panda – Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand, EuroMPI 2011, September 2011

Page 59:

• High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport

• Kernel-based zero-copy intra-node communication

• Collective Communication

– Multi-core Aware, UD-Multicast-Based, Topology-aware and Power-aware

– Exploiting Collective Offload for Non-blocking collective

• Support for GPGPUs and Accelerators

• Support for MPI-3 RMA

• PGAS Support (Hybrid MPI + UPC, Hybrid MPI + OpenSHMEM)

• QoS Support with Virtual Lanes

• Fault Tolerance and Fault Resilience

MPI Design Challenges


Page 60:

• Library-based

– Global Arrays

– OpenSHMEM

• Compiler-based

– Unified Parallel C (UPC)

– Co-Array Fortran (CAF)

• HPCS Language-based

– X10

– Chapel

– Fortress


Different Approaches for Supporting PGAS Models

Page 61:

Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) Designs

• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/), which provides a design over GASNet and does not take advantage of all OFED features

• Design a scalable and high-performance OpenSHMEM over OFED

• Designing a Hybrid MPI + OpenSHMEM Model

– Current model: separate runtimes for OpenSHMEM and MPI; possible deadlock if both runtimes are not progressed, and consumes more network resources

– Our approach: a single runtime for MPI and OpenSHMEM

(Figure: a hybrid (OpenSHMEM+MPI) application issues OpenSHMEM calls and MPI calls through the OpenSHMEM and MPI interfaces of a single OSU-designed runtime over the InfiniBand network)

Available in MVAPICH2-X
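An illustrative sketch of the hybrid style (assumptions: a unified MPI + OpenSHMEM runtime such as MVAPICH2-X and the OpenSHMEM 1.0 start_pes/shmem_* API; the initialization order shown is one common convention and may vary by implementation):

#include <mpi.h>
#include <shmem.h>

/* Global (symmetric) data object: every PE holds a copy at the same address. */
int target;

int main(int argc, char **argv)
{
    int mype, npes, value, sum;

    /* With a unified runtime the two initializations share one set of
       network resources; the required order can vary by implementation. */
    MPI_Init(&argc, &argv);
    start_pes(0);

    mype = _my_pe();
    npes = _num_pes();

    /* One-sided OpenSHMEM put into the right neighbour's copy of 'target'. */
    value = mype;
    shmem_int_put(&target, &value, 1, (mype + 1) % npes);
    shmem_barrier_all();

    /* MPI collective over the same set of processes. */
    MPI_Allreduce(&target, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}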

Page 62:

Micro-Benchmark Performance (OpenSHMEM)

(Charts: OpenSHMEM-GASNet vs. OpenSHMEM-OSU: atomic operation latency (us) for fadd, finc, cswap, swap, add and inc, and Broadcast, Collect and Reduce time (ms) for 256-byte messages on 32-256 processes; OpenSHMEM-OSU shows annotated improvements of 16% to 41%)

Page 63:

Performance of Hybrid (OpenSHMEM+MPI) Applications

(Charts: Hybrid-GASNet vs. Hybrid-OSU: 2D Heat Transfer Modeling time (s) for 32-512 processes (34% improvement), Graph-500 time (s) for 32-256 processes (45% improvement), and network resource usage (MB) for 32-512 processes (27% less))

• Improved performance for hybrid applications

– 34% improvement for 2D Heat Transfer Modeling with 512 processes

– 45% improvement for Graph500 with 256 processes

• Our approach with a single runtime consumes 27% less network resources

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Page 64:

Evaluation of Hybrid MPI+UPC NAS-FT

(Chart: NAS-FT time (sec) for Class B and C at 64 and 128 processes with GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid design)

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall

• Truly hybrid program

• 34% improvement for FT (C, 128)

Hybrid MPI + UPC Support

Will be available in a future MVAPICH2-X release

Page 65:

• Performance and memory scalability toward 500K-1M cores

– Dynamically Connected Transport (DCT) service with Connect-IB

• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)

• Enhanced optimization for GPU support and accelerators

– Extending the GPGPU support (GPU-Direct RDMA)

– Support for Intel MIC

• Taking advantage of the Collective Offload framework

– Including support for non-blocking collectives (MPI 3.0)

• RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Power-aware collectives

• Support for MPI Tools Interface (as in MPI 3.0)

• Checkpoint-Restart and migration support with in-memory checkpointing

• QoS-aware I/O and checkpointing

MVAPICH2 – Plans for Exascale

Page 66:

• Presented challenges for designing programming models for Exascale systems

• Overview of MPI and MPI-3 Features

• How MVAPICH2 is addressing some of these challenges

• MPI-3 Plans for MVAPICH2

• MVAPICH2 plans for Exascale systems

Conclusions


Page 67:

Funding Acknowledgments

Funding Support by

Equipment Support by

Page 68:


Personnel Acknowledgments

Current Students

– J. Chen (Ph.D.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– M. Luo (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

– M. Rahman (Ph.D.)

– H. Subramoni (Ph.D.)

– A. Venkatesh (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)


Past Research Scientist – S. Sur

Current Post-Docs

– X. Lu

– J. Vienne

– H. Wang

Current Programmers

– M. Arnold

– D. Bureddy

– J. Perkins

Past Post-Docs

– X. Besseron

– H.-W. Jin

– E. Mancini

– S. Marcarelli