Transcript of: Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters

Page 1

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters

Dhabaleswar K. (DK) Panda The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Khaled Hamidouche The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~hamidouc

Presentation at IXPUG Meeting, July 2014

Page 2

2

Parallel Programming Models Overview

[Figure: Three abstract machine models. Shared Memory Model: processes P1–P3 share one memory (e.g., DSM). Distributed Memory Model: each process has its own memory (e.g., MPI, Message Passing Interface). Partitioned Global Address Space (PGAS): per-process memories combined into a logical shared memory (e.g., Global Arrays, UPC, Chapel, X10, CAF, …).]

• Programming models provide abstract machine models

• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.

• Additionally, OpenMP can be used to parallelize computation within the node

• Each model has strengths and drawbacks, and suits different problems or applications

IXPUG14-OSU-PGAS

Page 3

3

Partitioned Global Address Space (PGAS) Models

IXPUG14-OSU-PGAS

• Key features - Simple shared memory abstractions

- Lightweight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages • Unified Parallel C (UPC)

• Co-Array Fortran (CAF)

• X10

• Chapel

- Libraries • OpenSHMEM

• Global Arrays

Page 4

SHMEM

• SHMEM: Symmetric Hierarchical MEMory library

• One-sided communication library – has been around for a while

• Similar to MPI, processes are called PEs and data movement is explicit through library calls

• Provides globally addressable memory using symmetric memory objects (more in later slides)

• Library routines for

– Symmetric object creation and management

– One-sided data movement

– Atomics

– Collectives

– Synchronization

IXPUG14-OSU-PGAS 4
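To make the routine categories above concrete, here is a minimal sketch (not from the deck) that creates a symmetric object, performs a one-sided put, and synchronizes, using only standard OpenSHMEM 1.0 routines:

    #include <stdio.h>
    #include <shmem.h>

    int main(void)
    {
        start_pes(0);                        /* initialize the library */
        int me   = _my_pe();                 /* this PE's id           */
        int npes = _num_pes();               /* total number of PEs    */

        /* symmetric object: the same allocation is made on every PE */
        int *dst = (int *) shmalloc(sizeof(int));
        *dst = -1;
        shmem_barrier_all();                 /* all PEs have allocated */

        /* one-sided data movement: write my id into the next PE's buffer */
        int src = me;
        shmem_int_put(dst, &src, 1, (me + 1) % npes);

        shmem_barrier_all();                 /* synchronization */
        printf("PE %d of %d received %d\n", me, npes, *dst);

        shfree(dst);
        return 0;
    }

With MVAPICH2-X, such a program is built with the OpenSHMEM compiler wrapper shipped with the library (the exact wrapper name depends on the installation).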

Page 5

OpenSHMEM

• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM

• Subtle differences in API, across versions – example:

                 SGI SHMEM      Quadrics SHMEM   Cray SHMEM
Initialization   start_pes(0)   shmem_init       start_pes
Process ID       _my_pe         my_pe            shmem_my_pe

• These differences made application codes non-portable

• OpenSHMEM is an effort to address this:

“A new, open specification to consolidate the various extant SHMEM versions

into a widely accepted standard.” – OpenSHMEM Specification v1.0

by University of Houston and Oak Ridge National Lab

SGI SHMEM is the baseline

IXPUG14-OSU-PGAS 5
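One way applications coped with these differences was build-time shims; a purely illustrative sketch follows (the USE_* guard macros are hypothetical switches, not vendor-defined symbols; only the routine names come from the table above):

    /* Hypothetical pre-OpenSHMEM portability shim. */
    #if defined(USE_CRAY_SHMEM)
      #define SHMEM_INIT()   start_pes(0)      /* table: start_pes  */
      #define SHMEM_MY_PE()  shmem_my_pe()
    #elif defined(USE_QUADRICS_SHMEM)
      #define SHMEM_INIT()   shmem_init()      /* table: shmem_init */
      #define SHMEM_MY_PE()  my_pe()
    #else  /* SGI SHMEM, the OpenSHMEM baseline */
      #define SHMEM_INIT()   start_pes(0)
      #define SHMEM_MY_PE()  _my_pe()
    #endif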

Page 6

• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:

– Introduction to UPC and Language Specification, 1999 – UPC Language Specifications, v1.0, Feb 2001 – UPC Language Specifications, v1.1.1, Sep 2004 – UPC Language Specifications, v1.2, June 2005 – UPC Language Specifications, v1.3, Nov 2013

• UPC Consortium – Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland… – Government Institutions: ARSC, IDA, LBNL, SNL, US DOE… – Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …

• Supported by several UPC compilers – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC

• Aims for: high performance, coding efficiency, irregular applications, …

IXPUG14-OSU-PGAS 6

Compiler-based: Unified Parallel C
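As a concrete, minimal illustration of the language (not code from the deck): the sketch below declares a shared array and uses upc_forall so that each UPC thread updates only the elements it has affinity to, using only standard UPC 1.2 constructs.

    #include <upc.h>        /* THREADS, MYTHREAD, upc_forall, upc_barrier */
    #include <stdio.h>

    #define N 1024

    shared int a[N];         /* shared array, distributed cyclically by default */

    int main(void)
    {
        /* the affinity expression &a[i] makes each thread touch only the
           elements placed in its own partition of the shared space */
        upc_forall (int i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;

        if (MYTHREAD == 0)
            printf("a[0] = %d, a[1] = %d, THREADS = %d\n", a[0], a[1], THREADS);
        return 0;
    }

Such a program is compiled with one of the UPC compilers listed above (e.g., Berkeley UPC or GCC UPC).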

Page 7

• Hierarchical architectures with multiple address spaces

• (MPI + PGAS) Model – MPI across address spaces

– PGAS within an address space

• MPI is good at moving data between address spaces

• Within an address space, MPI can interoperate with other shared memory programming models

• Applications can have kernels with different communication patterns

• Can benefit from different models

• Re-writing complete applications can be a huge effort

• Port critical kernels to the desired model instead

IXPUG14-OSU-PGAS 7

MPI+PGAS for Exascale Architectures and Applications

Page 8

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics

• Benefits: – Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*: – “Hybrid Programming is a practical way to

program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: An HPC application composed of Kernels 1 through N. In the pure MPI version every kernel uses MPI; in the hybrid version selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS while the others remain MPI.]

IXPUG14-OSU-PGAS 8
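A minimal sketch of what such a kernel mix can look like in a single source file (illustrative only; the MPI and OpenSHMEM calls are standard, but the required initialization order, and whether the PE id matches the MPI rank, are properties of the unified runtime and should be checked against the library's documentation):

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    static int flag = 0;          /* global/static data is symmetric in OpenSHMEM */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* assumption: this order is accepted; some
                                     runtimes expect start_pes() to come first */
        start_pes(0);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Kernel 1": regular collective communication in MPI */
        int sum = 0;
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* "Kernel 2": irregular one-sided communication in OpenSHMEM
           (assumes PE numbering matches the MPI rank) */
        int one = 1;
        shmem_int_put(&flag, &one, 1, (rank + 1) % size);
        shmem_barrier_all();

        if (rank == 0)
            printf("sum = %d, flag = %d\n", sum, flag);

        MPI_Finalize();
        return 0;
    }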

Page 9

• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2012

– Support for GPGPUs and MIC

– Used by more than 2,150 organizations (HPC Centers, Industry and Universities) in 72 countries

– More than 218,000 downloads from OSU site directly

– Empowering many TOP500 clusters • 7th ranked 519,640-core cluster (Stampede) at TACC

• 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology

• 16th ranked 96,192-core cluster (Pleiades) at NASA and many others

– Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Partner in the U.S. NSF-TACC Stampede System

MVAPICH2/MVAPICH2-X Software

9 IXPUG14-OSU-PGAS

Page 10

MVAPICH2-X for Hybrid MPI + PGAS Applications

IXPUG14-OSU-PGAS 10

[Figure: MVAPICH2-X architecture. MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, and UPC calls into a unified MVAPICH2-X runtime running over InfiniBand, RoCE, and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM available with MVAPICH2-X 1.9 onwards! – http://mvapich.cse.ohio-state.edu

• Feature Highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3)
  – Scalable inter-node and intra-node communication – point-to-point and collectives

Page 11

• OpenSHMEM and UPC for Host

• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host

• OpenSHMEM for MIC

• UPC for MIC

IXPUG14-OSU-PGAS

Outline

11

Page 12

OpenSHMEM Design in MVAPICH2-X

• OpenSHMEM Stack based on OpenSHMEM Reference Implementation

• OpenSHMEM Communication over MVAPICH2-X Runtime
  – Uses active messages, atomic and one-sided operations, and a remote registration cache

[Figure: OpenSHMEM design in MVAPICH2-X. The OpenSHMEM API (data movement, collectives, atomics, memory management) is implemented over a minimal set of internal APIs (communication API, symmetric memory management API), which map onto the MVAPICH2-X runtime (active messages, one-sided operations, remote atomic ops, enhanced registration cache) over InfiniBand, RoCE, and iWARP.]

IXPUG14-OSU-PGAS 12

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

Page 13

OpenSHMEM Collective Communication Performance

13

[Figure: OpenSHMEM collective latency at 1,024 processes comparing MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM: Reduce, Broadcast, and Collect latency (us) vs. message size, and Barrier latency (us) vs. number of processes (128–2,048). MV2X-SHMEM shows improvements of up to 180X, 18X, 12X, and 3X across the four operations.]

IXPUG14-OSU-PGAS

Page 14

OpenSHMEM Application Evaluation

14

[Figure: Execution time (s) of Heat Image and Daxpy at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM.]

• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2D Heat Image at 512 processes (sec):
  – UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
• Execution time for DAXPY at 512 processes (sec):
  – UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2

J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM'14), March 2014

IXPUG14-OSU-PGAS

Page 15

MVAPICH2-X Support for Berkeley UPC Runtime

• GASNet (Global-Address Space Networking) is a language-independent, low-level networking layer that provides support for PGAS languages

• Supports multiple networks through different conduits: the MVAPICH2-X conduit, available in the MVAPICH2-X release, supports UPC/OpenMP/MPI on InfiniBand

IXPUG14-OSU-PGAS 15

[Figure: Berkeley UPC / GASNet software stack. UPC applications run as compiler-generated code over a compiler-specific runtime on top of GASNet. GASNet's Extended APIs provide high-level operations such as remote memory access and various collective operations; the Core APIs are heavily based on active messages and sit directly on top of each individual network architecture through conduits (SMP, MPI, IB, MXM, MVAPICH2-X, …).]

Page 16

IXPUG14-OSU-PGAS 16

UPC Collectives Performance

[Figure: UPC collective latency for Broadcast, Scatter, Gather, and Exchange at 2,048 processes vs. message size (64B–256KB), comparing UPC-GASNet and UPC-OSU; improvements of up to 25X, 2X, 2X, and 35% are marked.]

J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC (HiPS’14, in association with IPDPS’14)

Page 17

• OpenSHMEM and UPC for Host

• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host

• OpenSHMEM for MIC

• UPC for MIC

IXPUG14-OSU-PGAS

Outline

17

Page 18

Unified Runtime for Hybrid MPI + OpenSHMEM Applications

IXPUG14-OSU-PGAS 18

[Figure: Two runtime organizations for hybrid (MPI + OpenSHMEM) applications: separate OpenSHMEM and MPI runtimes, each over InfiniBand/RoCE/iWARP, versus a single unified MVAPICH2-X runtime that serves MPI, OpenSHMEM, and hybrid applications over InfiniBand, RoCE, and iWARP.]

• Optimal network resource usage
• No deadlock because of single runtime
• Better performance

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

Page 19

IXPUG14-OSU-PGAS 19

Hybrid MPI+OpenSHMEM Graph500 Design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013

[Figure: Graph500 performance of the MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs: strong and weak scalability in billions of traversed edges per second (TEPS) vs. problem scale (26–29) and number of processes (1,024–8,192), and execution time (s) at 4K, 8K, and 16K processes, where improvements of 7.6X and 13X are marked for the hybrid design.]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

Page 20

Hybrid MPI+UPC NAS-FT

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC (GASNet)
  – 30% improvement over UPC (MV2-X)

20

[Figure: NAS-FT execution time (s) for problem sizes B and C at 64 and 128 processes (B-64, C-64, B-128, C-128), comparing UPC (GASNet), UPC (MV2-X), and MPI+UPC (MV2-X); a 34% improvement is marked for C-128.]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

IXPUG14-OSU-PGAS

Page 21

• OpenSHMEM and UPC for Host

• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host

• OpenSHMEM for MIC

• UPC for MIC

IXPUG14-OSU-PGAS

Outline

21

Page 22

HPC Applications on MIC Clusters

• Flexibility in launching HPC jobs on Xeon Phi Clusters

[Figure: Execution modes on Xeon and Xeon Phi, from multi-core centric to many-core centric: Host-only (OpenSHMEM program on the Xeon only), Offload / reverse offload (OpenSHMEM program with offloaded computation), Symmetric (OpenSHMEM programs on both Xeon and Xeon Phi), and Coprocessor-only (OpenSHMEM program on the Xeon Phi only).]

IXPUG14-OSU-PGAS 22

Page 23

Data Movement on Intel Xeon Phi Clusters

[Figure: A two-node Xeon Phi cluster – each node has two CPUs connected by QPI, MICs attached over PCIe, and an IB HCA connecting the nodes – with OpenSHMEM processes on both hosts and MICs. Data movement paths include:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB on remote socket
and more.]

• Critical for runtimes to optimize data movement, hiding the complexity

IXPUG14-OSU-PGAS 23

Page 24

Need for Non-Uniform Memory Allocation in OpenSHMEM

• MIC cores have limited memory per core

• OpenSHMEM relies on symmetric memory, allocated using shmalloc()

• shmalloc() allocates same amount of memory on all PEs

• For applications running in symmetric mode, this limits the total heap size

• Similar issues for applications (even host-only) with memory load imbalance (Graph500, Out-of-Core Sort, etc.)

• How to allocate different amounts of memory on host and MIC cores, and still be able to communicate?


IXPUG14-OSU-PGAS 24
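The constraint follows directly from shmalloc()'s symmetric semantics; a small illustrative sketch (standard OpenSHMEM 1.0 calls, sizes made up):

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        start_pes(0);

        /* shmalloc() is collective and symmetric: every PE allocates the
           same number of bytes.  The request therefore has to fit on the
           PE with the least memory per core (e.g., a MIC core), which caps
           the usable symmetric heap for the host PEs as well. */
        size_t bytes = 64UL * 1024 * 1024;
        double *work = (double *) shmalloc(bytes);

        if (work == NULL)
            printf("PE %d: symmetric heap allocation failed\n", _my_pe());
        else
            shfree(work);

        return 0;
    }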

Page 25

OpenSHMEM Design for MIC Clusters

• Non-Uniform Memory Allocation:
  – Team-based Memory Allocation (Proposed Extensions)
  – Address Structure for non-uniform memory allocations

IXPUG14-OSU-PGAS 25

Page 26

Proxy-based designs for OpenSHMEM

[Figure: OpenSHMEM Put and Get using an Active Proxy. Data moves between MIC1 on HOST1 and MIC2 on HOST2 through the hosts' HCAs in three steps: (1) IB REQ, (2) SCIF Read on the host pipelined with an IB Write to the remote side, (3) IB FIN.]

• Current generation architectures impose limitations on read bandwidth when the HCA reads from MIC memory
  – Impacts both put and get operation performance
• Solution: pipelined data transfer by a proxy running on the host, using the IB and SCIF channels (a conceptual sketch follows below)
• Improves latency and bandwidth!

IXPUG14-OSU-PGAS 26
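A conceptual sketch of the double-buffered pipelining a host proxy can perform (this is not the MVAPICH2-X implementation: the two channels are simulated with memcpy and the calls are issued sequentially, whereas a real proxy would use SCIF reads from MIC memory and RDMA writes over IB, posted asynchronously so the two stages truly overlap):

    #include <stddef.h>
    #include <string.h>
    #include <stdio.h>

    #define TOTAL (4u * 1024 * 1024)     /* message size            */
    #define CHUNK (256u * 1024)          /* pipeline unit (tunable) */

    static char mic_mem[TOTAL];          /* stands in for MIC memory      */
    static char remote_mem[TOTAL];       /* stands in for the remote node */

    /* stage 1: proxy pulls a chunk out of MIC memory (SCIF read in reality) */
    static void channel_scif_read(char *stage, size_t off, size_t len)
    {
        memcpy(stage, mic_mem + off, len);
    }

    /* stage 2: proxy pushes the staged chunk to the remote node (IB write) */
    static void channel_ib_write(const char *stage, size_t off, size_t len)
    {
        memcpy(remote_mem + off, stage, len);
    }

    /* double buffering: the read of chunk i+1 is issued before the write of
       chunk i, so with asynchronous channels the two stages would overlap */
    static void proxy_put_pipelined(size_t total)
    {
        static char stage[2][CHUNK];
        size_t off = 0;
        int cur = 0;
        size_t len = total < CHUNK ? total : CHUNK;

        channel_scif_read(stage[cur], off, len);            /* prime */
        while (len > 0) {
            size_t next_off = off + len;
            size_t next_len = next_off < total
                ? (total - next_off < CHUNK ? total - next_off : CHUNK) : 0;
            if (next_len > 0)
                channel_scif_read(stage[1 - cur], next_off, next_len);
            channel_ib_write(stage[cur], off, len);
            off = next_off;
            len = next_len;
            cur = 1 - cur;
        }
    }

    int main(void)
    {
        memset(mic_mem, 7, TOTAL);
        proxy_put_pipelined(TOTAL);
        printf("transfer %s\n", memcmp(mic_mem, remote_mem, TOTAL) ? "failed" : "ok");
        return 0;
    }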

Page 27

OpenSHMEM Put/Get Performance

[Figure: OpenSHMEM Put and Get latency (us) vs. message size (1 byte – 4 MB), comparing MV2X-Def and MV2X-Opt; the optimized design is roughly 4.5–4.6X faster at large messages.]

• Proxy-based designs alleviate hardware limitations
• Put latency of a 4M message – Default: 3911us, Optimized: 838us
• Get latency of a 4M message – Default: 3889us, Optimized: 837us

IXPUG14-OSU-PGAS 27

Page 28

OpenSHMEM Collectives Performance

[Figure: Broadcast and Reduce latency (us) at 1,024 processes vs. message size, comparing MV2X-Def and MV2X-Opt; improvements of 2.6X (Broadcast) and 1.9X (Reduce) are marked.]

• Optimized designs for OpenSHMEM collective operations
• Broadcast latency of a 256KB message at 1,024 processes – Default: 5955us, Optimized: 2268us
• Reduce latency of a 256KB message at 1,024 processes – Default: 6581us, Optimized: 3294us

IXPUG14-OSU-PGAS 28

Page 29

Performance Evaluations using Graph500

Native Mode (8 procs/MIC) Symmetric Mode (16 Host+16MIC)

• Graph500 Execution Time (Native Mode):
  – At 512 processes – Default: 5.17s, Optimized: 4.96s
  – Performance improvement from MIC-aware collectives design
• Graph500 Execution Time (Symmetric Mode):
  – At 1,024 processes – Default: 15.91s, Optimized: 12.41s
  – Performance improvement from MIC-aware collectives and proxy-based designs

[Figure: Graph500 execution time (s) for MV2X-Def and MV2X-Opt in native mode (64–512 processes) and symmetric mode (128–1,024 processes); a 28% improvement is marked for the optimized design.]

IXPUG14-OSU-PGAS 29

Page 30

Graph500 Evaluations with Extensions

• Redesigned Graph500 using MIC to overlap computation/communication
  – Data is transferred to MIC memory; MIC cores pre-process received data
  – Host processes traverse vertices and send out new vertices
• Graph500 execution time at 1,024 processes:
  – Host-only: 0.33s, Host+MIC with Extensions: 0.26s
• Orders-of-magnitude improvement compared to the default symmetric mode:
  – Default Symmetric Mode: 12.1s, Host+MIC Extensions: 0.16s

[Figure: Graph500 execution time (s) vs. number of processes (128–1,024) for the Host-only and Host+MIC (extensions) designs; a 26% improvement is marked.]

J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko and D. K. Panda, High Performance OpenSHMEM for Intel MIC Clusters: Extensions, Runtime Designs and Application Co-Design, IEEE International Conference on Cluster Computing (CLUSTER '14) (Best Paper Nominee)

IXPUG14-OSU-PGAS 30

Page 31

• OpenSHMEM and UPC for Host

• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host

• OpenSHMEM for MIC

• UPC for MIC

IXPUG14-OSU-PGAS

Outline

31

Page 32

HPC Applications on MIC Clusters

• Flexibility in launching HPC jobs on Xeon Phi Clusters

[Figure: Execution modes on Xeon and Xeon Phi, from multi-core centric to many-core centric: Host-only (UPC program on the Xeon only), Offload / reverse offload (UPC program with offloaded computation), Symmetric (UPC programs on both Xeon and Xeon Phi), and Coprocessor-only (UPC program on the Xeon Phi only).]

IXPUG14-OSU-PGAS 32

Page 33

UPC on MIC clusters

33

• Process Model
  – Communication via shared memory – single copy / double copy
  – One connection per process
  – Drawbacks: communication overheads, network connections
• Thread Model
  – Direct access to the shared array: same memory space
  – No extra copy
  – No shared memory mapping entries
  – Drawback: sharing of network connections
  – Our work on multi-endpoint addresses this issue

IXPUG14-OSU-PGAS

M. Luo, J. Jose, S. Sur and D. K. Panda, Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures, Int'l Conference on High Performance Computing (HiPC '11), Dec. 2011

Page 34

UPC Runtime on MIC: Remote Memory Access between Host and MIC

34

[Figure: Leader-to-all connection mode – a single UPC thread pins down the whole global memory region on the MIC, which the host CPUs access through it.]

• Original connection mode: each UPC thread pins down the global memory region that has affinity to it; all-to-all connections are established between every pair of UPC threads on the MIC and the host

• Leader-to-all connection mode: only one UPC thread pins down the whole global memory space on the MIC; all UPC threads on the host access it through the leader UPC thread

• Connections reduced from N_MIC × N_HOST to N_MIC + N_HOST (worked example below)
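As a worked example: assuming the per-node thread counts used in the native-mode evaluation later in this deck (16 UPC threads on the host and 60 on the MIC), the original mode needs 16 × 60 = 960 connections between host and MIC threads, while the leader-to-all mode needs only 16 + 60 = 76.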

[Figure: Original all-to-all connection mode – each UPC thread pins down its own portion of the global memory region on the MIC and connects to every UPC thread on the host.]

IXPUG14-OSU-PGAS

Page 35

UPC Runtime on MIC: Remote Memory Access between Host and MIC

35

• Process-based Runtime: the whole global memory region needs to be mapped as a shared memory region (e.g., PSHM)

• Thread-based Runtime: since the threads share the same memory space, the leader can access the other threads' global memory regions on the MIC without mapping them to shared memory

[Figure: Global memory region on the MIC, with the Intel Xeon host CPUs and the Intel Xeon Phi co-processor.]

IXPUG14-OSU-PGAS

Page 36

Application Benchmark Evaluation for Native Mode

36

[Figure: Native-mode execution time of the NAS benchmarks CG.B, EP.B, FT.B, IS.B, and MG.B, and of RandomAccess and UTS, comparing native-host and native-MIC; gaps of 2.6X and 4X are marked.]

• Host and MIC are occupied with 16 and 60 UPC threads, respectively
• MIC performs at 80%, 67%, and 54% of the 16-CPU host for MG, EP, and FT, respectively
• For the communication-intensive benchmarks CG and IS, MIC takes 7x and 3x the execution time
• Default applications without modification – not representative of optimal performance

IXPUG14-OSU-PGAS

M. Luo, M. Li, A. Venkatesh, X. Lu and D. K. Panda, UPC on MIC: Early Experiences with Native and Symmetric Modes, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

Page 37

Concluding Remarks

• HPC Systems are evolving to support growing needs of exascale applications

• Hybrid programming (MPI + PGAS) is a practical way to program exascale systems

• Presented designs to demonstrate performance potential of PGAS and hybrid MPI+PGAS models

• MIC support for OpenSHMEM and UPC will be available in future MVAPICH2-X releases

IXPUG14-OSU-PGAS 37

Page 38

Funding Acknowledgments

Funding Support by

Equipment Support by

38 IXPUG14-OSU-PGAS

Page 39

IXPUG14-OSU-PGAS

Personnel Acknowledgments

Current Students

– S. Chakraborthy (Ph.D.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– M. Li (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

Past Students – P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

39

Past Research Scientist – S. Sur

Current Post-Docs – K. Hamidouche

– J. Lin

Current Programmers – M. Arnold

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– R. Shir (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates – H. Subramoni

– X. Lu

Past Programmers – D. Bureddy