GPU-InfiniBand Accelerations for Hybrid Compute Systems

GPU-based clusters are being adopted at a rapid pace in high performance computing clusters to perform compute-intensive …

Page 1

Pak Lui

March 2013, GTC

GPU-InfiniBand Accelerations for Hybrid Compute Systems

Page 2


Leading Supplier of End-to-End Interconnect Solutions

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

[Diagram: ICs, adapter cards, switches/gateways, cables, and host/fabric software; Virtual Protocol Interconnect connecting server/compute and switch/gateway tiers to storage front/back-end over 56G InfiniBand, FCoIB, 10/40/56GbE, FCoE, and Fibre Channel]

Page 3


Virtual Protocol Interconnect (VPI) Technology

[Diagram: VPI Switch — 64 ports 10GbE; 36 ports 40/56GbE; 48 ports 10GbE + 12 ports 40/56GbE; 36 ports InfiniBand up to 56Gb/s; 8 VPI subnets; switch OS layer. VPI Adapter (adapter card, LOM, or mezzanine card) — Ethernet 10/40/56 Gb/s; InfiniBand 10/20/40/56 Gb/s. Unified Fabric Manager and acceleration engines for networking, storage, clustering, and management applications]

From data center to campus and metro connectivity

Page 4


Connect-IB: The Exascale Foundation

A new interconnect architecture for compute-intensive applications

World’s fastest server and storage interconnect solution

Enables unlimited clustering (compute and storage) scalability

Accelerates compute-intensive and parallel-intensive applications

Optimized for multi-tenant environments of 100s of Virtual Machines per server

Page 5


Connect-IB Performance Highlights

Unparalleled Throughput and Message Injection Rates

World’s first 100Gb/s InfiniBand interconnect adapter
• PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s

Highest InfiniBand message rate: 137 million messages per second
• 4X higher than other InfiniBand solutions

Page 6


GPUDirect History

The GPUDirect project was announced Nov 2009
• “NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks”,
  http://www.nvidia.com/object/io_1258539409179.html

GPUDirect was developed together by Mellanox and NVIDIA
• New interface (API) within the Tesla GPU driver
• New interface within the Mellanox InfiniBand drivers
• Linux kernel modification to allow direct communication between drivers

GPUDirect 1.0 was announced Q2’10
• “Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency”
• “Mellanox was the lead partner in the development of NVIDIA GPUDirect”

GPUDirect RDMA will be released Q2’13

Page 7


GPU-InfiniBand Bottleneck (pre-GPUDirect)

GPU communication uses “pinned” buffers for data movement
• A section in the host memory that is dedicated for the GPU
• Allows optimizations such as write-combining and overlapping GPU computation and data transfer for best performance

InfiniBand uses “pinned” buffers for efficient RDMA transactions
• Zero-copy data transfers, kernel bypass
• Reduces CPU overhead

[Diagram: CPU, chipset, system memory, GPU, GPU memory, and InfiniBand adapter]
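As a concrete illustration of the pinned-buffer mechanics described above, here is a minimal CUDA sketch (buffer sizes and names are illustrative, not from the slides): the host buffer is allocated page-locked and write-combined, which lets the copy engine overlap the host-to-device transfer with kernel execution on a stream. InfiniBand pins its buffers analogously by registering them with the HCA (ibv_reg_mr), which is what enables zero-copy RDMA and kernel bypass.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)

// Placeholder kernel standing in for the real computation.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    float *h_buf, *d_buf;
    cudaStream_t stream;

    // Pinned (page-locked) host buffer; write-combining speeds up
    // host-to-device transfers (host reads of this buffer are slow).
    cudaHostAlloc((void **)&h_buf, N * sizeof(float), cudaHostAllocWriteCombined);
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaStreamCreate(&stream);

    for (int i = 0; i < N; i++) h_buf[i] = 1.0f;

    // Because h_buf is pinned, this copy is truly asynchronous and can
    // overlap with GPU work queued on the same or other streams.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(N + 255) / 256, 256, 0, stream>>>(d_buf, N);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```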

Page 8


GPUDirect 1.0

[Diagrams: transmit and receive data paths through CPU, chipset, system memory, GPU, GPU memory, and InfiniBand, comparing Non-GPUDirect with GPUDirect 1.0]
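The diagrams reduce to a difference in host-side copies. Below is a rough sketch of the transmit side (buffer names are illustrative and the actual InfiniBand verbs send is elided): without GPUDirect the data is copied from the CUDA pinned buffer into a separately registered InfiniBand buffer before the send is posted, while GPUDirect 1.0 lets both drivers share one pinned region, removing the host-to-host copy.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

#define SZ (1 << 20)   /* bytes */

int main(void) {
    float *d_src;          /* data produced on the GPU                       */
    float *gpu_pinned;     /* pinned host buffer used by the GPU driver      */
    float *ib_buf;         /* buffer registered with the HCA                 */
                           /* (via ibv_reg_mr in real code)                  */
    cudaMalloc((void **)&d_src, SZ);
    cudaMallocHost((void **)&gpu_pinned, SZ);
    ib_buf = (float *)malloc(SZ);

    /* Without GPUDirect: two copies through system memory before the send. */
    cudaMemcpy(gpu_pinned, d_src, SZ, cudaMemcpyDeviceToHost);   /* copy 1 */
    memcpy(ib_buf, gpu_pinned, SZ);                              /* copy 2 */
    /* ... ibv_post_send() from ib_buf ... */

    /* With GPUDirect 1.0: the CUDA and InfiniBand drivers can share the
     * same pinned region, so the host-to-host copy disappears.             */
    cudaMemcpy(gpu_pinned, d_src, SZ, cudaMemcpyDeviceToHost);   /* one copy */
    /* ... ibv_post_send() directly from gpu_pinned ... */

    free(ib_buf);
    cudaFreeHost(gpu_pinned);
    cudaFree(d_src);
    return 0;
}
```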

Page 9


GPUDirect 1.0 – Application Performance

LAMMPS
• 3 nodes, 10% gain

Amber – Cellulose
• 8 nodes, 32% gain

Amber – FactorIX
• 8 nodes, 27% gain

[Charts: 3 nodes, 1 GPU per node vs. 3 nodes, 3 GPUs per node]

Page 10


GPUDirect RDMA

[Diagrams: transmit and receive data paths through CPU, chipset, system memory, GPU, GPU memory, and InfiniBand, comparing GPUDirect 1.0 with GPUDirect RDMA]
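With GPUDirect RDMA the HCA accesses GPU memory directly, so at the verbs level a buffer obtained from cudaMalloc can be registered and used for RDMA much like ordinary host memory. The following is a minimal sketch, assuming a GPUDirect-RDMA-capable driver stack (queue-pair setup and the actual RDMA operations are omitted); it is illustrative rather than the exact interface of the preliminary 2013 driver.

```cuda
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *d_buf;
    size_t len = 1 << 20;
    cudaMalloc(&d_buf, len);            /* GPU memory, not host memory */

    /* Registering the device pointer pins the GPU pages for the HCA,
     * letting the adapter DMA to/from GPU memory without staging in
     * system memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, d_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr on GPU memory"); return 1; }
    printf("registered GPU buffer, lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    cudaFree(d_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```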

Page 11

Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA

• Preliminary driver for GPU-Direct is under work by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
  – Hybrid design
  – Takes advantage of GPU-Direct-RDMA for short messages
  – Uses host-based buffered design in current MVAPICH2 for large messages
  – Alleviates the Sandy Bridge chipset bottleneck
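In practice the hybrid design is invisible to the application: with a CUDA-aware MVAPICH2, device pointers are handed straight to MPI calls and the library picks the transfer path per message size (GPU-Direct-RDMA for short messages, host-staged pipelining for large ones). A minimal sketch, with message sizes and buffer names chosen only for illustration:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_small, *d_large;
    cudaMalloc((void **)&d_small, 1024);      /* short message            */
    cudaMalloc((void **)&d_large, 4 << 20);   /* large message            */

    /* Device pointers go directly into MPI; no explicit cudaMemcpy to a
     * host buffer is needed in the application. */
    if (rank == 0) {
        MPI_Send(d_small, 1024, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(d_large, 4 << 20, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_small, 1024, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(d_large, 4 << 20, MPI_BYTE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_small);
    cudaFree(d_large);
    MPI_Finalize();
    return 0;
}
```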

Page 12


Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

GPU-GPU Internode MPI Latency

[Plots: Small Message Latency (1 B – 4 KB) and Large Message Latency (16 KB – 4 MB); latency (µs) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]
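A latency plot like the one above is typically produced by a GPU-to-GPU ping-pong between two nodes. Below is a minimal sketch of such a micro-benchmark using device buffers with a CUDA-aware MPI; it is illustrative, not the benchmark actually used for these slides.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 4096, iters = 1000;
    char *d_buf;
    cudaMalloc((void **)&d_buf, size);   /* message lives in GPU memory */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* Round-trip time divided by two gives the one-way latency. */
    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```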

Page 13


Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

GPU-GPU Internode MPI Uni-directional Bandwidth

[Plots: Small Message Bandwidth (1 B – 4 KB) and Large Message Bandwidth (16 KB – 4 MB); bandwidth (MB/s) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]

Page 14


Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

GPU-GPU Internode MPI Bi-directional Bandwidth

[Plots: Small Message Bi-Bandwidth (1 B – 4 KB) and Large Message Bi-Bandwidth (16 KB – 4 MB); bandwidth (MB/s) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]

Page 15


Remote GPU Access through rCUDA

rCUDA provides remote access from every node to any GPU in the system

[Diagram: client side — the CUDA application running on top of the rCUDA library and a network interface; server side ("GPU servers", GPU as a Service) — the rCUDA daemon running on top of the CUDA driver and runtime and its network interface]
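Because rCUDA interposes at the CUDA API level, an unmodified CUDA program such as the sketch below can run on a client node with no local GPU; its CUDA calls are forwarded by the rCUDA library over the network to the rCUDA daemon on a GPU server. Nothing in the code itself is rCUDA-specific.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add_one(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void) {
    int count = 0;
    // Under rCUDA, the devices reported here live on remote GPU servers.
    cudaGetDeviceCount(&count);
    printf("visible CUDA devices: %d\n", count);

    const int n = 1024;
    float h[1024], *d_buf;
    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemcpy(d_buf, h, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    printf("h[0] = %.1f\n", h[0]);   // expect 1.0
    return 0;
}
```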

Page 16


GPU as a Service

GPUs as a network-resident service
• Little to no overhead when using FDR InfiniBand

Virtualize and decouple GPU services from CPU services
• A new paradigm in cluster flexibility
• Lower cost, lower power, and ease of use with shared GPU resources
• Remove difficult physical requirements of the GPU for standard compute servers

[Diagram: "GPUs in every server" (a GPU attached to each CPU node) vs. "GPUs as a Service" (CPU nodes with virtual GPUs backed by a shared pool of GPUs)]

Page 17


Thank You