Parallel Programming Languages and Accelerations

Transcript of Parallel Programming Languages and Accelerations

Page 1: Parallel Programming Languages and Accelerations

Parallel Programming Languages and Accelerations

[email protected]

Page 2: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC

Page 3: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC

Offer high-performing and scalable parallel programming libraries for HPC

Support a comprehensive set of MPIs and PGAS languages
• Integrate Mellanox acceleration technology into a broad list of languages
• Provide our own language library package when there is no open-source alternative

Integrate Mellanox acceleration components into MPI/PGAS languages
• MXM – MellanoX Messaging Accelerator
• FCA – Mellanox Fabric Collective Accelerator

Page 4: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm

[Figure: two nodes, each running an Application over Communication Libraries, connected through the Network to Server/Storage; a 'Bottleneck' label marks the I/O path.]

Page 5: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm

[Figure: the same two-node I/O stack; the network provides the highest throughput, the lowest latency and message rate, low CPU overhead, and hardware accelerations.]

Page 6: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm – Scaling Issues

[Figure: the same two-node I/O stack; even with a fast network, a second 'Bottleneck' appears higher in the software stack (the communication libraries) as the system scales.]

Page 7: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm – Co-Design Architecture

[Figure: the same two-node I/O stack; co-design extends I/O communications (RDMA, collectives, synchronization, etc.) from the network into the communication libraries, removing the software bottleneck.]

Page 8: Parallel Programming Languages and Accelerations

MPI/SHMEM/PGAS Architecture

[Figure: software stack, top to bottom: Application, MPI/SHMEM/PGAS, InfiniBand Verbs, InfiniBand Network.]

Page 9: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC Architecture

[Figure: software stack, top to bottom: Application; MPI/SHMEM/PGAS; a Mellanox acceleration layer consisting of Mellanox Collectives (collectives accelerations via FCA with CORE-Direct), Mellanox Messaging (MXM) for one-sided/two-sided communication, and intra-node shared memory; InfiniBand Verbs; InfiniBand Network (with hardware offloading).]

Page 10: Parallel Programming Languages and Accelerations

Mellanox Fabric Collective Accelerations (FCA)

High-performing and scalable accelerations for collective operations
• Topology-aware collectives take advantage of optimized message coalescing
• Makes use of the network's powerful multicast capabilities for one-to-many communications
• Runs collectives on a separate service level so they do not interfere with other communications
• Utilizes Mellanox CORE-Direct collective hardware offload to minimize system noise
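As a usage illustration only: FCA was typically switched on through MPI runtime parameters. The parameter names below are an assumption drawn from FCA deployment guides of this period, not from this deck:

    # Hypothetical example: enable the FCA collective component in Open MPI
    mpirun -np 1024 -mca coll_fca_enable 1 -mca coll_fca_np 0 ./my_mpi_app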

Page 11: Parallel Programming Languages and Accelerations

MellanoX Messaging (MXM)

High performance and scalability for send/receive (or put/get) messages
• Proper management of HCA resources and memory structures
• Optimized intra-node communication
• Hybrid transport technology for large-scale deployments
• Efficient memory registration
• Connection management
• Receive-side tag matching
• Fully utilizes hardware offloads and capabilities
• Incorporated in MLNX_OFED-1.5.3-300 and later
  - Also provided as a stand-alone package
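As a usage illustration only: Open MPI of this era could route point-to-point traffic through MXM by selecting the CM PML with the MXM MTL. Treat the component names as assumptions based on contemporary Open MPI/MXM documentation:

    # Hypothetical example: send MPI point-to-point traffic through MXM
    mpirun -np 512 -mca pml cm -mca mtl mxm ./my_mpi_app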

Page 12: Parallel Programming Languages and Accelerations

ScalableHPC Communication Libraries

Page 13: Parallel Programming Languages and Accelerations

HPC Parallel Programming Models

MPI - Message Passing Interface
• Based on send/receive and collective communication semantics

SHMEM - Shared Memory
• Provides a logically shared memory model and one-sided put/get communications

PGAS - Partitioned Global Address Space
• Message passing abstracted into a partitioned global address space
• UPC (Unified Parallel C) is one example of a PGAS language

[Figure: three memory models. Distributed Memory Model: processes P1-P3 each own private memory and exchange messages (MPI). Shared Memory Model: P1-P3 share one memory (SHMEM, DSM). Partitioned Global Address Space: P1-P3 each keep local memory plus a logically shared space (Global Arrays, UPC, Chapel, CAF, …).]
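To make the message-passing model concrete, here is a minimal two-sided send/receive sketch in C (an illustration, not from the slides; run with at least two ranks):

    /* sendrecv.c: rank 0 sends an integer to rank 1 using MPI's
       two-sided send/receive semantics. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }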

Page 14: Parallel Programming Languages and Accelerations

SHMEM Details

SHared MEMory library
• Library of functions somewhat similar to MPI (e.g. shmem_get())
• …but SHMEM supports one-sided communication (puts/gets vs. MPI's send/receive)

SHMEM and PGAS both allow a unique combination of a 'Distributed Memory Model' (like MPI) and a 'Shared Memory Model' (like SMP machines)

Cray first introduced SHMEM in 1993

The OpenSHMEM consortium was formed to consolidate the various SHMEM versions into a widely accepted standard

Mellanox ScalableSHMEM is based on the OpenSHMEM 1.0 specification, with FCA/MXM integration
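A minimal one-sided put sketch against the OpenSHMEM 1.0 API described above (an illustration, not from the deck; the variable names are hypothetical):

    /* put.c: PE 0 writes into a symmetric variable on every other PE
       without the targets ever posting a receive. */
    #include <stdio.h>
    #include <shmem.h>

    static int dest = -1;   /* symmetric: same address on every PE */

    int main(void) {
        start_pes(0);                /* OpenSHMEM 1.0 initialization */
        int me = _my_pe();
        int npes = _num_pes();
        int src = 42;

        if (me == 0)
            for (int pe = 1; pe < npes; pe++)
                shmem_int_put(&dest, &src, 1, pe);   /* one-sided put */

        shmem_barrier_all();         /* synchronizes and completes the puts */
        printf("PE %d of %d: dest = %d\n", me, npes, dest);  /* PE 0 keeps -1 */
        return 0;
    }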

Page 15: Parallel Programming Languages and Accelerations

UPC Details

UPC, or 'Unified Parallel C', is another PGAS language

Higher-level abstraction than MPI or SHMEM

Allows programmers to directly represent and manipulate distributed data structures

Commercial compilers are available for Cray, SGI and HP machines

An open-source compiler from LBNL/UCB (Berkeley UPC) is available on InfiniBand

Mellanox ScalableUPC is based on Berkeley UPC (BUPC) with FCA/MXM integration
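A minimal UPC sketch (illustrative, not from the slides) of directly manipulating a distributed shared data structure:

    /* fill.c: a shared array is distributed round-robin across all
       threads; each thread writes only the elements it owns. */
    #include <stdio.h>
    #include <upc.h>

    #define N 16
    shared int a[N];   /* one logically shared array, physically distributed */

    int main(void) {
        /* The affinity expression &a[i] restricts each thread to the
           iterations whose element is local to it. */
        upc_forall (int i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;   /* wait for all threads */

        if (MYTHREAD == 0)
            for (int i = 0; i < N; i++)
                printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }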

Page 16: Parallel Programming Languages and Accelerations

FCA Details

Page 17: Parallel Programming Languages and Accelerations

What are Collective Operations?

Collective operations are group communications involving all processes in a job

Synchronous operations
• By nature consume many 'wait' cycles on large clusters

Popular examples
• Barrier
• Reduce
• Allreduce
• Gather
• Allgather
• Bcast
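For example, a minimal Allreduce in C (an illustration, not from the slides): every rank contributes one integer and every rank receives the global sum, which is why the operation synchronizes the whole job:

    /* allreduce.c: global sum of per-rank values. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank, sum = 0;
        /* Every process contributes 'local' and receives the total. */
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: sum = %d\n", rank, size, sum);
        MPI_Finalize();
        return 0;
    }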

Page 18: Parallel Programming Languages and Accelerations

Collective Operation Challenges at Large Scale

Collective algorithms are not topology aware and can be inefficient

Congestion due to many-to-many communications

Slow nodes and OS jitter affect scalability and increase variability

[Figure: ideal vs. actual collective scaling diverge as the system grows.]

Page 19: Parallel Programming Languages and Accelerations

Mellanox Fabric Collective Accelerations (FCA)

Mellanox InfiniBand Switches
• High-performance IB multicast for result distribution

FCA Manager
• Topology-based collective tree
• Separate virtual network
• IB multicast for result distribution

FCA Agents
• Library integrated with MPI
• Intra-node optimizations
• CORE-Direct integration

Page 20: Parallel Programming Languages and Accelerations

Collective Example – Allreduce using Recursive Doubling

Collective operations are group communications involving all processes in the job

[Figure: recursive-doubling Allreduce on 8 processes; in each stage partners exchange and add partial sums, so after log2(8) = 3 stages every process holds the global result.]

A 4000-process Allreduce using recursive doubling takes 12 stages
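A sketch of the recursive-doubling algorithm named on this slide, written with MPI point-to-point calls for clarity (assumes a power-of-two number of processes; illustrative, not Mellanox code):

    /* rd_allreduce.c: recursive-doubling Allreduce. In stage k each
       rank exchanges its partial sum with the partner whose rank
       differs in bit k, so log2(size) stages suffice; for 4000 ranks
       that is ceil(log2(4000)) = 12 stages. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int sum = rank + 1;   /* each rank contributes rank + 1 */

        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask, recv;
            MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;      /* fold in the partner's partial sum */
        }

        printf("rank %d: sum = %d\n", rank, sum);  /* size*(size+1)/2 */
        MPI_Finalize();
        return 0;
    }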

Page 21: Parallel Programming Languages and Accelerations

Scalable Collectives with FCA

[Figure: an FCA reduction and result distribution across 324 hosts (8 processes per host, 18 hosts per switch, switches SW-1 through SW-18): intra-node processing reduces each host's 8 values to one, 1st-tier coalescing combines per-host results at each switch, 2nd-tier coalescing produces the final result at the root, and IB multicast distributes the result back to all hosts (Host1 through Host324).]

Page 22: Parallel Programming Languages and Accelerations

Performance Results

Page 23: Parallel Programming Languages and Accelerations

FCA collective performance with Open MPI

Page 24: Parallel Programming Languages and Accelerations

FCA collective scalability for SHMEM

[Figure: three charts plotted against the number of processes (PPN=8, up to ~2500 processes), each comparing runs with and without FCA: Barrier collective latency (us), Reduce collective latency (us), and 8-byte Broadcast bandwidth (KB*processes).]

Page 25: Parallel Programming Languages and Accelerations

Mellanox MXM – HPCC Random Ring Latency

Page 26: Parallel Programming Languages and Accelerations

Thank You [email protected]