Smart Interconnect for Next Generation HPC Platforms
Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting
Interconnect Your Future
© 2016 Mellanox Technologies 2
Mellanox Connects the World’s Fastest Supercomputer
93 Petaflop performance, 3X higher versus #2 on the TOP500
41K nodes, 10 million cores, 260 cores per CPU
Mellanox adapter and switch solutions
National Supercomputing Center in Wuxi, China
#1 on the TOP500 Supercomputing List
The TOP500 list has evolved; it now includes HPC and Cloud / Web2.0 hyperscale systems
Mellanox connects 41.2% of overall TOP500 systems
Mellanox connects 70.4% of the TOP500 HPC platforms
Mellanox connects 46 Petascale systems, nearly 50% of all Petascale systems
InfiniBand is the Interconnect of Choice for
HPC Compute and Storage Infrastructures
Accelerating HPC Leadership Systems
“Summit” System “Sierra” System
Proud to Pave the Path to Exascale
The Ever Growing Demand for Higher Performance
Performance Development timeline, 2000–2020: Terascale → Petascale → Exascale
“Roadrunner” – 1st Petascale system
SMP to Clusters; Single-Core to Many-Core
Co-Design: Hardware (HW), Software (SW), Application (APP)
The Interconnect is the Enabling Technology
The Intelligent Interconnect to Enable Exascale Performance
CPU-Centric:
• Must wait for the data
• Creates performance bottlenecks
• Limited to main CPU usage
• Results in performance limitation
Co-Design:
• Work on the data as it moves
• Creating synergies
• Enables higher performance and scale
Breaking the Application Latency Wall
Today: Network device latencies are on the order of 100 nanoseconds
Challenge: Enabling the next order of magnitude improvement in application performance
Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: Network ~10 microseconds; Communication Framework ~100 microseconds
Today: Network ~0.1 microsecond; Communication Framework ~10 microseconds
Future (Co-Design): Network ~0.05 microsecond; Communication Framework ~1 microsecond
Mellanox Smart Interconnect Solutions: Switch-IB 2 and ConnectX-5
Switch-IB 2 and ConnectX-5 Smart Interconnect Solutions
Delivering 10X Performance Improvement for MPI and SHMEM/PGAS Communications
Barrier, Reduce, All-Reduce, Broadcast
Sum, Min, Max, MinLoc, MaxLoc, OR, XOR, AND
Integer and Floating-Point, 32 / 64 bit
SHArP Enables Switch-IB 2 to Execute Data
Aggregation / Reduction Operations in the Network
PCIe Gen3 and Gen4
Integrated PCIe Switch
Advanced Dynamic Routing
MPI Collectives in Hardware
MPI Tag Matching in Hardware
In-Network Memory
100Gb/s Throughput
0.6 usec Latency (end-to-end)
200M Messages per Second
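The SHArP idea above can be sketched as a reduction tree: each switch combines the partial results arriving from its children and forwards a single aggregated value upward, instead of shipping every endpoint's data to one CPU. A minimal Python simulation (illustrative only, not Mellanox code; the function name and radix are hypothetical):

```python
# Illustrative sketch of SHArP-style in-network aggregation: each switch in
# the tree reduces its inputs and sends ONE combined message up, so the
# message count grows with the number of switches, not the number of nodes.
from functools import reduce
import operator

def switch_reduce(values, radix, op=operator.add):
    """Reduce `values` level by level through a tree of radix-`radix` switches.
    Returns (result, uplink_messages): one aggregated message per switch."""
    messages = 0
    level = list(values)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), radix):
            group = level[i:i + radix]
            next_level.append(reduce(op, group))  # aggregate inside the switch
            messages += 1                         # one message on the uplink
        level = next_level
    return level[0], messages

# 64 nodes, radix-8 switches: an allreduce-style sum needs only 9 aggregated
# messages (8 leaf switches + 1 root) instead of 64 point-to-point sends.
total, msgs = switch_reduce(range(64), radix=8)
print(total, msgs)  # 2016 9
```

The same tree serves Min/Max/MinLoc-style reductions by swapping the operator, which is why the slide lists them as switch-executable operations.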
ConnectX-5 EDR 100G Architecture
Connects x86, OpenPOWER, GPU, ARM, and FPGA hosts
Offload Engines; RDMA Transport; eSwitch & Routing; Memory
Multi-Host Technology; PCIe Switch
InfiniBand / Ethernet: 10, 20, 40, 50, 56, 100G
PCIe Gen4
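One of the offload engines listed for ConnectX-5 is MPI tag matching in hardware. The rule being offloaded is standard MPI semantics: an arriving message is matched, in order, against the posted-receive queue, honoring wildcard source and tag. A small Python sketch of that rule (illustrative only; names and constants are stand-ins, not the adapter's implementation):

```python
# Illustrative sketch of the MPI tag-matching rule that ConnectX-5 offloads:
# scan the posted-receive queue in posting order and take the first entry
# whose (source, tag) pair matches, allowing MPI's wildcards.
ANY_SOURCE = -1  # stands in for MPI_ANY_SOURCE
ANY_TAG = -1     # stands in for MPI_ANY_TAG

def match(posted, source, tag):
    """Return the index of the first matching posted receive (removing it),
    or None if the message must go to the unexpected-message queue."""
    for i, (rsrc, rtag) in enumerate(posted):
        if rsrc in (ANY_SOURCE, source) and rtag in (ANY_TAG, tag):
            del posted[i]  # the receive is consumed by this message
            return i
    return None  # no posted receive matches

posted = [(0, 7), (ANY_SOURCE, 42), (3, ANY_TAG)]
print(match(posted, 3, 42))  # 1: the wildcard-source receive matches first
print(match(posted, 3, 99))  # 1: (3, ANY_TAG) has shifted to index 1
print(match(posted, 5, 5))   # None: unexpected message
```

Doing this scan in the NIC rather than in the MPI library is what frees the host CPU from per-message matching work.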
Switch-IB 2 SHArP Performance Advantage
MiniFE is a Finite Element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X Performance Improvement
Allreduce MPI Collective
SHArP Performance Advantage with Intel Xeon Phi Knight Landing
Maximizing KNL Performance – 50% Reduction in Run Time
(Customer Results)
Run time, lower is better
OSU – OSU MPI Benchmark; IMB – Intel MPI Benchmark
SHArP Technology Performance
SHArP Delivers 2.2X Higher Performance
OpenFOAM is a popular computational fluid dynamics application
BlueField System-on-a-Chip (SoC) Solution
Integration of ConnectX-5 + multicore ARM
State-of-the-art capabilities
• 10 / 25 / 40 / 50 / 100G Ethernet & InfiniBand
• PCIe Gen3 / Gen4
• Hardware acceleration offloads: RDMA, RoCE, NVMe-oF, RAID
Family of products
• Range of ARM core counts and I/O ports/speeds
• Price/performance points
NVMe Flash Storage Arrays
Scale-Out Storage (NVMe over Fabric)
Accelerating & Virtualizing VNFs
Open vSwitch (OVS), SDN
Overlay networking offloads
InfiniBand Router Solutions
SB7780 Router 1U
Supports up to 6 Different Subnets
Isolation Between Different InfiniBand Networks (Each Network can be Managed Separately)
Native InfiniBand Connectivity Between Different Network Topologies (Fat-Tree, Torus, Dragonfly, etc.)
Technology Roadmap
Roadmap 2016–2021: 100G (Petascale; #1 TOP500, 100 Petaflop) → 200G → 400G → 1000G (Exascale)
World-wide programs
Standard-based Interconnect, Same Software, Backward and Future Compatible
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
GPUs are Everywhere!
Performance of MPI with GPUDirect RDMA (Source: Prof. DK Panda)
GPU-GPU Internode MPI Latency (lower is better): 88% lower latency, 2.18 usec
GPU-GPU Internode MPI Bandwidth (higher is better): 10X increase in throughput (9.3X)
Mellanox GPUDirect RDMA Performance Advantage
HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs
GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload from all GPU communications across the network
2X application performance (102% improvement)
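The latency gain comes from removing the host staging copies: without GPUDirect RDMA, data hops GPU → host buffer → HCA on the sender and back again on the receiver, while GPUDirect RDMA lets the HCA read and write GPU memory directly. A toy latency model of that difference (the hop counts follow the description above, but the per-hop overhead and bandwidth figures are assumed placeholders, not measured data):

```python
# Toy model (assumed numbers, not measurements): fewer copy hops means lower
# GPU-GPU latency, which is the effect GPUDirect RDMA exploits.
def transfer_latency(msg_bytes, hops, per_hop_overhead_us, bandwidth_gbps):
    """Latency of moving a message through `hops` sequential copy stages."""
    copy_us = msg_bytes * 8 / (bandwidth_gbps * 1e3)  # bytes -> microseconds
    return hops * (per_hop_overhead_us + copy_us)

MSG = 4096  # bytes
# Staged path: GPU -> host buffer -> wire -> host buffer -> GPU (4 copy hops)
staged = transfer_latency(MSG, hops=4, per_hop_overhead_us=1.0, bandwidth_gbps=100)
# GPUDirect RDMA path: HCA accesses GPU memory directly (1 hop)
direct = transfer_latency(MSG, hops=1, per_hop_overhead_us=1.0, bandwidth_gbps=100)
print(f"staged {staged:.2f} us, direct {direct:.2f} us")  # staged 5.31 us, direct 1.33 us
```

Under these placeholder parameters the direct path is 4x faster; the slide's measured 88% reduction reflects the same mechanism on real hardware.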
GPUDirect Sync (GPUDirect 4.0)
GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect Sync (GPUDirect 4.0)
• Both data path and control path go directly between the GPU and the Mellanox interconnect
2D stencil benchmark: average time per iteration (us) vs. number of nodes/GPUs
RDMA only vs. RDMA+PeerSync: 27% faster on 2 nodes, 23% faster on 4 nodes
Maximum Performance for GPU Clusters
Unified Communication X (UCX) Framework
www.openucx.org
Background
MXM
● Developed by Mellanox Technologies
● HPC communication library for InfiniBand devices and shared memory
● Primary focus: MPI, PGAS
UCCS
● Developed by ORNL, UH, UTK
● Originally based on Open MPI BTL and OPAL layers
● HPC communication library for InfiniBand, Cray Gemini/Aries, and shared memory
● Primary focus: OpenSHMEM, PGAS
● Also supports: MPI
PAMI
● Developed by IBM on BG/Q, PERCS, IB VERBS
● Network devices and shared memory
● MPI, OpenSHMEM, PGAS, CHARM++, X10
● C++ components
● Aggressive multi-threading with contexts
● Active Messages
● Non-blocking collectives with hw acceleration support
High-level Overview
Applications: MPICH, Open-MPI, etc.; OpenSHMEM, UPC, CAF, X10, Chapel, etc.; Parsec, OCR, Legions, etc.; Burst buffer, ADIOS, etc.
UCP (Protocols): Message Passing API domain; PGAS API domain; Task-based API domain; I/O API domain
UCT (Transports):
● InfiniBand VERBs: RC, UD, XRC, DCT
● Host memory: SYSV, POSIX, KNEM, CMA, XPMEM
● Accelerator memory: CUDA
● Gemini/Aries: GNI
UCS (Services): utilities, platforms
Hardware/Driver
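The layering above — UCP protocols sitting on top of interchangeable UCT transports — can be sketched in a few lines. This is a conceptual illustration with hypothetical names and thresholds, not the real openucx API; real UCX tunes the eager/rendezvous crossover at runtime:

```python
# Conceptual sketch of the UCX split: UCT describes individual transports,
# UCP picks one per message (eager for small, rendezvous/zero-copy for large).
from dataclasses import dataclass

@dataclass
class UctTransport:  # UCT layer: one hardware transport's characteristics
    name: str
    latency_us: float
    bandwidth_gbps: float

class UcpEndpoint:   # UCP layer: protocol logic over the available transports
    RNDV_THRESHOLD = 8192  # assumed eager/rendezvous crossover, in bytes

    def __init__(self, transports):
        self.transports = transports

    def send(self, nbytes):
        if nbytes <= self.RNDV_THRESHOLD:
            # small message: eager protocol on the lowest-latency transport
            t = min(self.transports, key=lambda t: t.latency_us)
            return ("eager", t.name)
        # large message: rendezvous / zero-copy on the highest-bandwidth pipe
        t = max(self.transports, key=lambda t: t.bandwidth_gbps)
        return ("rendezvous", t.name)

ep = UcpEndpoint([UctTransport("posix_shm", 0.2, 40),
                  UctTransport("rc_verbs", 0.7, 100)])
print(ep.send(64))       # ('eager', 'posix_shm')
print(ep.send(1 << 20))  # ('rendezvous', 'rc_verbs')
```

Keeping this selection logic in one protocol layer is what lets MPI, OpenSHMEM, and task-based runtimes share the same transports without reimplementing them.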
Thank You