Parallel Programming Languages and Accelerations

Transcript of Parallel Programming Languages and Accelerations

Page 1: Parallel Programming Languages and Accelerations

Parallel Programming Languages and Accelerations

[email protected]

Page 2: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC

Page 3: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC

Offer high-performing and scalable parallel programming libraries for HPC

Support a comprehensive set of MPIs and PGAS languages
• Integrate Mellanox acceleration technology into a broad list of languages
• Provide our own language library package when there is no open-source alternative

Integrate Mellanox acceleration components into MPI/PGAS languages
• MXM – MellanoX Messaging Accelerator
• FCA – Mellanox Fabric Collective Accelerator

Page 4: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm

[Figure: two nodes, each running an Application over Communication Libraries, connected through the Network to Server/Storage; a 'Bottleneck' label marks the I/O path.]

Page 5: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm

[Figure: the same two-node I/O stack; the network provides the highest throughput, the lowest latency and message rate, low CPU overhead, and hardware accelerations.]

Page 6: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm – Scaling Issues

[Figure: the same two-node I/O stack; even with a fast network, a second 'Bottleneck' appears higher in the software stack (the communication libraries) as the system scales.]

Page 7: Parallel Programming Languages and Accelerations

The I/O Bottleneck Paradigm – Co-Design Architecture

[Figure: the same two-node I/O stack; co-design extends I/O communications (RDMA, collectives, synchronization, etc.) from the network into the communication libraries, removing the software bottleneck.]

Page 8: Parallel Programming Languages and Accelerations

MPI/SHMEM/PGAS Architecture

[Figure: software stack, top to bottom: Application, MPI/SHMEM/PGAS, InfiniBand Verbs, InfiniBand Network.]

Page 9: Parallel Programming Languages and Accelerations

Mellanox ScalableHPC Architecture

[Figure: software stack, top to bottom: Application; MPI/SHMEM/PGAS; a Mellanox acceleration layer consisting of Mellanox Collectives (collectives accelerations via FCA with CORE-Direct), Mellanox Messaging (MXM) for one-sided/two-sided communication, and intra-node shared memory; InfiniBand Verbs; InfiniBand Network (with hardware offloading).]

Page 10: Parallel Programming Languages and Accelerations

Mellanox Fabric Collective Accelerations (FCA)

High-performing and scalable accelerations for collective operations
• Topology-aware collectives take advantage of optimized message coalescing
• Makes use of the network's powerful multicast capabilities for one-to-many communications
• Runs collectives on a separate service level so they do not interfere with other communications
• Utilizes Mellanox CORE-Direct collective hardware offload to minimize system noise
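As a usage illustration only: FCA was typically switched on through MPI runtime parameters. The parameter names below are an assumption drawn from FCA deployment guides of this period, not from this deck:

    # Hypothetical example: enable the FCA collective component in Open MPI
    mpirun -np 1024 -mca coll_fca_enable 1 -mca coll_fca_np 0 ./my_mpi_app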

Page 11: Parallel Programming Languages and Accelerations

MellanoX Messaging (MXM)

High performance and scalability for send/receive (or put/get) messages
• Proper management of HCA resources and memory structures
• Optimized intra-node communication
• Hybrid transport technology for large-scale deployments
• Efficient memory registration
• Connection management
• Receive-side tag matching
• Fully utilizes hardware offloads and capabilities
• Incorporated in MLNX_OFED-1.5.3-300 and later
  - Also provided as a stand-alone package
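As a usage illustration only: Open MPI of this era could route point-to-point traffic through MXM by selecting the CM PML with the MXM MTL. Treat the component names as assumptions based on contemporary Open MPI/MXM documentation:

    # Hypothetical example: send MPI point-to-point traffic through MXM
    mpirun -np 512 -mca pml cm -mca mtl mxm ./my_mpi_app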

Page 12: Parallel Programming Languages and Accelerations

ScalableHPC Communication Libraries

Page 13: Parallel Programming Languages and Accelerations

HPC Parallel Programming Models

MPI - Message Passing Interface
• Based on send/receive and collective communication semantics

SHMEM - Shared Memory
• Provides a logically shared memory model and one-sided put/get communications

PGAS - Partitioned Global Address Space
• Message passing abstracted into a partitioned global address space
• UPC (Unified Parallel C) is one example of a PGAS language

[Figure: three memory models. Distributed Memory Model: processes P1-P3 each own private memory and exchange messages (MPI). Shared Memory Model: P1-P3 share one memory (SHMEM, DSM). Partitioned Global Address Space: P1-P3 each keep local memory plus a logically shared space (Global Arrays, UPC, Chapel, CAF, …).]
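To make the message-passing model concrete, here is a minimal two-sided send/receive sketch in C (an illustration, not from the slides; run with at least two ranks):

    /* sendrecv.c: rank 0 sends an integer to rank 1 using MPI's
       two-sided send/receive semantics. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }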

Page 14: Parallel Programming Languages and Accelerations

SHMEM Details

SHared MEMory library
• Library of functions somewhat similar to MPI (e.g. shmem_get())
• …but SHMEM supports one-sided communication (puts/gets vs. MPI's send/receive)

SHMEM and PGAS both allow a unique combination of a 'Distributed Memory Model' (like MPI) and a 'Shared Memory Model' (like SMP machines)

Cray first introduced SHMEM in 1993

The OpenSHMEM consortium was formed to consolidate the various SHMEM versions into a widely accepted standard

Mellanox ScalableSHMEM is based on the OpenSHMEM 1.0 specification, with FCA/MXM integration
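A minimal one-sided put sketch against the OpenSHMEM 1.0 API described above (an illustration, not from the deck; the variable names are hypothetical):

    /* put.c: PE 0 writes into a symmetric variable on every other PE
       without the targets ever posting a receive. */
    #include <stdio.h>
    #include <shmem.h>

    static int dest = -1;   /* symmetric: same address on every PE */

    int main(void) {
        start_pes(0);                /* OpenSHMEM 1.0 initialization */
        int me = _my_pe();
        int npes = _num_pes();
        int src = 42;

        if (me == 0)
            for (int pe = 1; pe < npes; pe++)
                shmem_int_put(&dest, &src, 1, pe);   /* one-sided put */

        shmem_barrier_all();         /* synchronizes and completes the puts */
        printf("PE %d of %d: dest = %d\n", me, npes, dest);  /* PE 0 keeps -1 */
        return 0;
    }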

Page 15: Parallel Programming Languages and Accelerations

UPC Details

UPC, or 'Unified Parallel C', is another PGAS language

Higher-level abstraction than MPI or SHMEM

Allows programmers to directly represent and manipulate distributed data structures

Commercial compilers are available for Cray, SGI and HP machines

An open-source compiler from LBNL/UCB (Berkeley UPC) is available on InfiniBand

Mellanox ScalableUPC is based on Berkeley UPC (BUPC) with FCA/MXM integration
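A minimal UPC sketch (illustrative, not from the slides) of directly manipulating a distributed shared data structure:

    /* fill.c: a shared array is distributed round-robin across all
       threads; each thread writes only the elements it owns. */
    #include <stdio.h>
    #include <upc.h>

    #define N 16
    shared int a[N];   /* one logically shared array, physically distributed */

    int main(void) {
        /* The affinity expression &a[i] restricts each thread to the
           iterations whose element is local to it. */
        upc_forall (int i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;   /* wait for all threads */

        if (MYTHREAD == 0)
            for (int i = 0; i < N; i++)
                printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }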

Page 16: Parallel Programming Languages and Accelerations

FCA Details

Page 17: Parallel Programming Languages and Accelerations

What are Collective Operations?

Collective operations are group communications involving all processes in a job

Synchronous operations
• By nature consume many 'wait' cycles on large clusters

Popular examples
• Barrier
• Reduce
• Allreduce
• Gather
• Allgather
• Bcast
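For example, a minimal Allreduce in C (an illustration, not from the slides): every rank contributes one integer and every rank receives the global sum, which is why the operation synchronizes the whole job:

    /* allreduce.c: global sum of per-rank values. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank, sum = 0;
        /* Every process contributes 'local' and receives the total. */
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: sum = %d\n", rank, size, sum);
        MPI_Finalize();
        return 0;
    }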

Page 18: Parallel Programming Languages and Accelerations

Collective Operation Challenges at Large Scale

Collective algorithms are not topology aware and can be inefficient

Congestion due to many-to-many communications

Slow nodes and OS jitter affect scalability and increase variability

[Figure: ideal vs. actual collective scaling diverge as the system grows.]

Page 19: Parallel Programming Languages and Accelerations

Mellanox Fabric Collective Accelerations (FCA)

Mellanox InfiniBand Switches
• High-performance IB multicast for result distribution

FCA Manager
• Topology-based collective tree
• Separate virtual network
• IB multicast for result distribution

FCA Agents
• Library integrated with MPI
• Intra-node optimizations
• CORE-Direct integration

Page 20: Parallel Programming Languages and Accelerations

Collective Example – Allreduce using Recursive Doubling

Collective operations are group communications involving all processes in the job

[Figure: recursive-doubling Allreduce on 8 processes; in each stage partners exchange and add partial sums, so after log2(8) = 3 stages every process holds the global result.]

A 4000-process Allreduce using recursive doubling takes 12 stages
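A sketch of the recursive-doubling algorithm named on this slide, written with MPI point-to-point calls for clarity (assumes a power-of-two number of processes; illustrative, not Mellanox code):

    /* rd_allreduce.c: recursive-doubling Allreduce. In stage k each
       rank exchanges its partial sum with the partner whose rank
       differs in bit k, so log2(size) stages suffice; for 4000 ranks
       that is ceil(log2(4000)) = 12 stages. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int sum = rank + 1;   /* each rank contributes rank + 1 */

        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask, recv;
            MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;      /* fold in the partner's partial sum */
        }

        printf("rank %d: sum = %d\n", rank, sum);  /* size*(size+1)/2 */
        MPI_Finalize();
        return 0;
    }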

Page 21: Parallel Programming Languages and Accelerations

Scalable Collectives with FCA

[Figure: an FCA reduction and result distribution across 324 hosts (8 processes per host, 18 hosts per switch, switches SW-1 through SW-18): intra-node processing reduces each host's 8 values to one, 1st-tier coalescing combines per-host results at each switch, 2nd-tier coalescing produces the final result at the root, and IB multicast distributes the result back to all hosts (Host1 through Host324).]

Page 22: Parallel Programming Languages and Accelerations

Performance Results

Page 23: Parallel Programming Languages and Accelerations

FCA collective performance with Open MPI

Page 24: Parallel Programming Languages and Accelerations

FCA collective scalability for SHMEM

[Figure: three charts plotted against the number of processes (PPN=8, up to ~2500 processes), each comparing runs with and without FCA: Barrier collective latency (us), Reduce collective latency (us), and 8-byte Broadcast bandwidth (KB*processes).]

Page 25: Parallel Programming Languages and Accelerations

Mellanox MXM – HPCC Random Ring Latency

Page 26: Parallel Programming Languages and Accelerations

Thank You [email protected]