Multi-GPU Programming | GTC 2013


Transcript of Multi-GPU Programming | GTC 2013

Page 1: Multi-GPU Programming | GTC 2013

Levi Barnes Developer Technology, NVIDIA

Multi-GPU Programming

© 2013, NVIDIA

Page 2: Multi-GPU Programming | GTC 2013

Hands-on Multi-GPU training

S3529

Hands-on Lab: Multi-GPU Acceleration Example

Justin Luitjens

Wednesday 17:00-17:50

Rm 230A1


Page 3: Multi-GPU Programming | GTC 2013

Where can I find multi-GPU nodes?

GPU Test Drive

http://www.nvidia.com/GPUTestDrive


Page 4: Multi-GPU Programming | GTC 2013

Where can I find multi-GPU nodes?

GPU Test Drive

http://www.nvidia.com/GPUTestDrive

Commercial/Academic Clusters

— Emerald (8 x M2090)

— Keeneland (3/2 x M2070)

— Tsubame (3 x M2050)

— JSCC RAS (8x M2090)

— Amazon Cloud (2 x C2050)


Page 5: Multi-GPU Programming | GTC 2013

Outline

Why More GPUs?

Executing on multiple GPUs

Hiding inter-GPU communication

Tuning for node topology

CUDA C Case study

— One Host, Many GPUs

— Many Hosts, Many GPUs


Page 6: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Diagram: a single CPU driving a single GPU]

Page 7: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: CPU, MemCpy, and GPU rows over time]

Page 8: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: serial CPU computation shown on the CPU, MemCpy, and GPU rows]

Page 9: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: serial computation plus CPU/GPU communication]

Page 10: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: serial computation, communication, and GPU idle time]

Page 11: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Diagram: two CPU+GPU systems; the extra system adds cost ($$)]

Page 12: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Diagram: three CPU+GPU systems, each additional one adding cost ($$ $$)]

Page 13: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Diagram: four CPU+GPU systems, each additional one adding cost ($$)]

Page 14: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Diagram: a single CPU hosting four GPUs]

Page 15: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: CPU, MemCpy, and GPU 1 rows over time]


Page 17: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: work split across GPU 1 and GPU 2]

Page 18: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: work split across GPU 1, GPU 2, and GPU 3]

Page 19: Multi-GPU Programming | GTC 2013

Move to multiple GPUs

[Timeline diagram: work split across GPU 1 through GPU 4]

Page 20: Multi-GPU Programming | GTC 2013

Nvidia K10 – multiple GPUs on a board

[Diagram: a K10 board carrying two GPUs]

Page 21: Multi-GPU Programming | GTC 2013

Nvidia K10 – multiple GPUs on a board

[Diagram: a CPU hosting two dual-GPU K10 boards, four GPUs in total]

Page 22: Multi-GPU Programming | GTC 2013

Executing on multiple GPUs

Page 23: Multi-GPU Programming | GTC 2013

Managing multiple GPUs from a single CPU thread

cudaSetDevice() - sets the current GPU


Page 24: Multi-GPU Programming | GTC 2013

Managing multiple GPUs from a single CPU thread

cudaSetDevice() - sets the current GPU


cudaSetDevice( 0 );

kernel<<<...>>>(...);

cudaMemcpyAsync(...);

cudaSetDevice( 1 );

kernel<<<...>>>(...);
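A minimal sketch of the full pattern, driving every visible GPU from one host thread; kernel, grid, block, d_buf, h_buf, and num_bytes are placeholders assumed to be set up elsewhere:

int num_gpus = 0;
cudaGetDeviceCount( &num_gpus );

for( int d = 0; d < num_gpus; d++ )
{
    cudaSetDevice( d );                      // make device d current
    kernel<<<grid, block>>>( d_buf[d] );     // launch is asynchronous
    cudaMemcpyAsync( h_buf[d], d_buf[d], num_bytes,
                     cudaMemcpyDeviceToHost, 0 );  // h_buf[d] ideally pinned
}

for( int d = 0; d < num_gpus; d++ )          // wait for every GPU to finish
{
    cudaSetDevice( d );
    cudaDeviceSynchronize();
}

Because launches and async copies return immediately, the first loop only queues work on each GPU; the second loop then waits on each device in turn.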

Page 25: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )


Page 26: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )


Page 27: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )

[Diagram: GPU 0 to GPU 1 copy staged through CPU memory, ~5 GB/s]

Page 28: Multi-GPU Programming | GTC 2013

Enabling GPUDirect

cudaDeviceEnablePeerAccess( peer_device, 0 )

Enables current GPU to access addresses on peer_device GPU

cudaDeviceCanAccessPeer( &accessible, dev_X, dev_Y )

Checks whether dev_X can access memory of dev_Y
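A small sketch tying these calls to the peer copy; d_0, d_1, num_bytes, and stream are placeholders, with d_0 and d_1 assumed allocated on devices 0 and 1:

int accessible = 0;
cudaDeviceCanAccessPeer( &accessible, 1, 0 );   // can device 1 reach device 0?
if( accessible )
{
    cudaSetDevice( 1 );
    cudaDeviceEnablePeerAccess( 0, 0 );         // current GPU (1) may now access GPU 0
}
cudaMemcpyPeerAsync( d_1, 1, d_0, 0, num_bytes, stream );

Without peer access the same copy still works, but it is staged through host memory at the lower rate shown on the previous slide.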


Page 29: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

[Diagram: direct GPU 0 to GPU 1 copy, 6.6 GB/s* (* PCIe gen. 2; 12 GB/s for gen. 3)]

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )

Page 30: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

[Diagram: simultaneous copies in both directions, 6 + 6 = 12 GB/s* (* PCIe gen. 2; 22 GB/s for gen. 3)]

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )

Page 31: Multi-GPU Programming | GTC 2013

Unified Virtual Addressing (UVA)

Driver/GPU can determine from an address where data resides


Page 32: Multi-GPU Programming | GTC 2013

Unified Virtual Addressing (UVA)

Driver/GPU can determine from an address where data resides

Requires:

64-bit Linux or 64-bit Windows with TCC driver

Fermi or later architecture GPUs (compute capability 2.0 or higher)

CUDA 4.0 or later


Page 33: Multi-GPU Programming | GTC 2013

Peer-to-peer memcopy

cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,

void* src_addr, int src_dev,

size_t num_bytes, cudaStream_t stream )

cudaMemcpyAsync( void* dst_addr, void* src_addr,
                 size_t num_bytes, cudaMemcpyDefault,
                 cudaStream_t stream )
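A one-line sketch of the UVA form; d_on_gpu0 and d_on_gpu1 are placeholder buffers assumed to have been cudaMalloc'd on their respective devices:

cudaMemcpyAsync( d_on_gpu1, d_on_gpu0, num_bytes,
                 cudaMemcpyDefault, stream );   // runtime infers the devices from the addresses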


Page 34: Multi-GPU Programming | GTC 2013

Hiding inter-GPU communication


Page 35: Multi-GPU Programming | GTC 2013

Overlap kernel and memory copy

[Timeline diagram: memcopy and kernel issued back-to-back on the CPU, MemCpy, and GPU rows]


Page 37: Multi-GPU Programming | GTC 2013

Overlap kernel and memory copy

Requirements:

— D2H or H2D memcopy from pinned memory

— Device with compute capability ≥ 1.1 (G84 and later)

— Kernel and memcopy in different, non-0 streams

[Timeline diagram: the memcopy overlapped with kernel execution]

Page 38: Multi-GPU Programming | GTC 2013

Pinned memory

cudaHostAlloc(void** pHost, size_t size, int flags)

cudaFreeHost(void* pHost)

allocates / frees pinned (page-locked) CPU memory

cudaHostRegister(void* pHost, size_t size, int flags)

cudaHostUnregister(void* pHost)

pins/unpins previously allocated memory
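A short sketch of both paths, with num_bytes as a placeholder size:

float *h_a = NULL, *h_b = NULL;

cudaHostAlloc( (void**)&h_a, num_bytes, cudaHostAllocDefault );  // allocate pinned memory
// ... use h_a for overlappable H2D/D2H copies ...
cudaFreeHost( h_a );

h_b = (float*)malloc( num_bytes );       // ordinary pageable allocation
cudaHostRegister( h_b, num_bytes, 0 );   // pin it in place
// ... h_b now behaves like pinned memory for copies ...
cudaHostUnregister( h_b );               // unpin before freeing
free( h_b );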


Page 40: Multi-GPU Programming | GTC 2013

Overlap kernel and memory copy


cudaStream_t stream1, stream2;

cudaStreamCreate(&stream1);

cudaStreamCreate(&stream2);

cudaHostAlloc(&src, size, 0);

cudaMemcpyAsync( dst, src, size, dir, stream1 );

kernel<<<grid, block, 0, stream1>>>(…);

cudaMemcpyAsync( dst, src, size, dir, stream2 );

kernel<<<grid, block, 0, stream2>>>(…);

Page 41: Multi-GPU Programming | GTC 2013

Overlap kernel and memory copy

(The copies and kernels issued to stream1 and stream2 below are potentially overlapped.)

cudaStream_t stream1, stream2;

cudaStreamCreate(&stream1);

cudaStreamCreate(&stream2);

cudaHostAlloc(&src, size, 0);

cudaMemcpyAsync( dst, src, size, dir, stream1 );

kernel<<<grid, block, 0, stream1>>>(…);

cudaMemcpyAsync( dst, src, size, dir, stream2 );

kernel<<<grid, block, 0, stream2>>>(…);


Page 42: Multi-GPU Programming | GTC 2013

Creating CUDA Events

cudaEventCreate(cudaEvent_t *event)

cudaEventDestroy(cudaEvent_t event)

cudaEventRecord(cudaEvent_t event,

cudaStream_t stream=0)

Page 43: Multi-GPU Programming | GTC 2013

Using CUDA Events

cudaEventSynchronize(cudaEvent_t event)

cudaStreamWaitEvent(cudaStream_t stream,

cudaEvent_t event)

Page 44: Multi-GPU Programming | GTC 2013

Expressing dependency through events


cudaEvent_t ev;
cudaEventCreate(&ev);
…
cudaMemcpyAsync( dst, src, size, dir, stream1 );
cudaMemcpyAsync( dst2, src2, size, dir, stream2 );
cudaEventRecord( ev, stream2 );
cudaStreamWaitEvent(stream1, ev);
kernel<<<grid, block, 0, stream1>>>(…);
kernel<<<grid, block, 0, stream2>>>(…);


Page 45: Multi-GPU Programming | GTC 2013

Multi-GPU Streams and Events


Page 46: Multi-GPU Programming | GTC 2013

The Rules

CUDA streams and events are per device (GPU)

— Each device has its own default stream (aka 0- or NULL-stream)

Streams and:

— Kernels: can be launched to a stream only if the stream’s GPU is current

— Memcopies: can be issued to any stream

— Events: can be recorded to a stream only if the stream’s GPU is current

Synchronization/query:

— It is OK to query or synchronize with any event/stream



Page 51: Multi-GPU Programming | GTC 2013

Example 1


cudaStream_t streamA, streamB;

cudaEvent_t eventA, eventB;

cudaSetDevice( 0 );

cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0

cudaEventCreate( &eventA );

cudaSetDevice( 1 );

cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1

cudaEventCreate( &eventB );

kernel<<<..., streamB>>>(...);

cudaEventRecord( eventB, streamB );

cudaEventSynchronize( eventB );

OK:

• device 1 is current

• eventB and streamB belong to device 1

Page 52: Multi-GPU Programming | GTC 2013

Example 2


cudaStream_t streamA, streamB;

cudaEvent_t eventA, eventB;

cudaSetDevice( 0 );

cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0

cudaEventCreate( &eventA );

cudaSetDevice( 1 );

cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1

cudaEventCreate( &eventB );

kernel<<<..., streamA>>>(...);

cudaEventRecord( eventB, streamB );

cudaEventSynchronize( eventB );

ERROR:

• device 1 is current

• streamA belongs to device 0

Page 53: Multi-GPU Programming | GTC 2013

Example 3


cudaStream_t streamA, streamB;

cudaEvent_t eventA, eventB;

cudaSetDevice( 0 );

cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0

cudaEventCreate( &eventA );

cudaSetDevice( 1 );

cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1

cudaEventCreate( &eventB );

kernel<<<..., streamB>>>(...);

cudaEventRecord( eventA, streamB );

cudaEventSynchronize( eventB );

ERROR:

• eventA belongs to device 0

• streamB belongs to device 1

Page 54: Multi-GPU Programming | GTC 2013

Example 4


cudaStream_t streamA, streamB;

cudaEvent_t eventA, eventB;

cudaSetDevice( 0 );

cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0

cudaEventCreate( &eventA );

cudaSetDevice( 1 );

cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1

cudaEventCreate( &eventB );

kernel<<<..., streamB>>>(...);

cudaEventRecord( eventB, streamB );

cudaSetDevice( 0 );

cudaEventSynchronize( eventB );

kernel<<<..., streamA>>>(...);

device-1 is current

device-0 is current

Page 55: Multi-GPU Programming | GTC 2013

Example 4


cudaStream_t streamA, streamB;

cudaEvent_t eventA, eventB;

cudaSetDevice( 0 );

cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0

cudaEventCreate( &eventA );

cudaSetDevice( 1 );

cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1

cudaEventCreate( &eventB );

kernel<<<..., streamB>>>(...);

cudaEventRecord( eventB, streamB );

cudaSetDevice( 0 );

cudaEventSynchronize( eventB );

kernel<<<..., streamA>>>(...);

OK:

• device-0 is current

• synchronizing/querying events/streams of other devices is allowed

• here, device-0 won’t start executing the kernel until device-1 finishes its kernel

Page 56: Multi-GPU Programming | GTC 2013

Example 5


int gpu_A = 0;

int gpu_B = 1;

cudaSetDevice( gpu_A );

cudaMalloc( &d_A, num_bytes );

int accessible = 0;

cudaDeviceCanAccessPeer( &accessible, gpu_B, gpu_A );

if( accessible )

{

cudaSetDevice(gpu_B );

cudaDeviceEnablePeerAccess( gpu_A, 0 );

kernel<<<...>>>( d_A);

}

Even though the kernel executes on gpu_B, it will access (via PCIe) memory allocated on gpu_A

Page 57: Multi-GPU Programming | GTC 2013

Subdividing the problem

Page 58: Multi-GPU Programming | GTC 2013

Stage 1

Compute halo regions

Page 59: Multi-GPU Programming | GTC 2013

Stage 2

Compute internal regions

Page 60: Multi-GPU Programming | GTC 2013

Stage 2

Compute internal regions

Exchange halo regions

Page 61: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

}

Compute halos

Page 62: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

}

Compute internal

Compute halos

Page 63: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

for( int i=0; i<num_gpus-1; i++ ) {

cudaStreamWaitEvent(stream_copy[i], event_i[i]);

cudaMemcpyPeerAsync( ..., stream_compute[i] );

}

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_compute[i] );

}

Compute halos

Exchange halos

Compute internal

Page 64: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

for( int i=0; i<num_gpus-1; i++ ) {

cudaStreamWaitEvent(stream_copy[i], event_i[i]);

cudaMemcpyPeerAsync( ..., stream_compute[i] );

}

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_compute[i] );

}

Compute halos

Exchange halos

Compute internal

[Timeline: halo, internal, halo, internal kernels running back-to-back in stream_compute]

Page 65: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

for( int i=0; i<num_gpus-1; i++ ) {

cudaStreamWaitEvent(stream_copy[i], event_i[i]);

cudaMemcpyPeerAsync( ..., stream_copy[i] );

}

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_copy[i] );

}

Compute halos

Exchange halos

Compute internal

Page 66: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

for( int i=0; i<num_gpus-1; i++ ) {

cudaStreamWaitEvent(stream_copy[i], event_i[i]);

cudaMemcpyPeerAsync( ..., stream_copy[i] );

}

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_copy[i] );

}

Compute halos

Exchange halos

Compute internal

[Timeline: stream_compute runs the halo and internal kernels; stream_copy overlaps the halo exchanges with the internal compute]



Page 69: Multi-GPU Programming | GTC 2013

Code Pattern


for( int istep=0; istep<nsteps; istep++)

{

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

kernel_halo<<<..., stream_compute[i]>>>( ... );

cudaEventRecord(event_i[i], stream_compute[i] );

kernel_int<<<..., stream_compute[i]>>>( ... );

}

for( int i=0; i<num_gpus-1; i++ ) {

cudaStreamWaitEvent(stream_copy[i], event_i[i]);

cudaMemcpyPeerAsync( ..., stream_copy[i] );

}

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_copy[i] );

for( int i=0; i<num_gpus; i++ )

{

cudaSetDevice( gpu[i] );

cudaDeviceSynchronize();

// swap input/output pointers

}

}

Synchronize

Compute halos

Exchange halos

Compute internal
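The loop above assumes the per-GPU streams and events already exist; a minimal setup sketch for the gpu[], stream_compute[], stream_copy[], and event_i[] arrays used in the pattern, with peer access enabled as on the earlier slides:

for( int i = 0; i < num_gpus; i++ )
{
    cudaSetDevice( gpu[i] );                  // streams and events belong to this GPU
    cudaStreamCreate( &stream_compute[i] );   // halo + internal kernels
    cudaStreamCreate( &stream_copy[i] );      // halo exchanges
    cudaEventCreate( &event_i[i] );           // marks "halo kernel done"
    // assuming cudaDeviceCanAccessPeer() succeeded for each neighbor:
    if( i > 0 )            cudaDeviceEnablePeerAccess( gpu[i-1], 0 );
    if( i < num_gpus - 1 ) cudaDeviceEnablePeerAccess( gpu[i+1], 0 );
}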

Page 70: Multi-GPU Programming | GTC 2013

OpenACC in 1 slide


#pragma acc data copyin(in[0:SIZE+2])

#pragma acc kernels loop present(out, in)

for (int i=1; i< SIZE+1; i++) {

out[i] = 0.5*in[i] +

0.25*in[i-1] +

0.25*in[i+1];

}

#pragma end kernels

#pragma acc update host(out)


Page 72: Multi-GPU Programming | GTC 2013

Multiple GPUs with OpenACC


#pragma omp parallel num_threads(4)

{

int t = omp_get_thread_num();

acc_set_device_num(t, acc_device_nvidia);

#pragma acc data copyin(in)

#pragma acc kernels loop present(in)

for (int i=1; i< 1+SIZE; i++) {

out[i] = 0.5*in[i] +

0.25*in[i-1] +

0.25*in[i+1]; }

#pragma end kernels

#pragma acc update host(out)

}

// compile with -mp -acc


Page 74: Multi-GPU Programming | GTC 2013

Multiple GPUs with OpenACC


#pragma omp parallel num_threads(4)

{

int t = omp_get_thread_num();

acc_set_device_num(t, acc_device_nvidia);

#pragma acc data copyin(in)

#pragma acc kernels loop present(in)

for (int i=1+t*SIZE/4; i< 1+(t+1)*SIZE/4; i++) {

out[i] = 0.5*in[i] +

0.25*in[i-1] +

0.25*in[i+1]; }

#pragma end kernels

#pragma acc update host(out)

}

// compile with -mp -acc

Page 75: Multi-GPU Programming | GTC 2013

Multiple GPUs with OpenACC


#pragma omp parallel num_threads(4)

{

int t = omp_get_thread_num();

acc_set_device_num(t, acc_device_nvidia);

#pragma acc data copyin(in[t*SIZE/4:SIZE/4+2])

#pragma acc kernels loop present(in[t*SIZE/4:SIZE/4+2])

for (int i=1+t*SIZE/4; i< 1+(t+1)*SIZE/4; i++) {

out[i] = 0.5*in[i] +

0.25*in[i-1] +

0.25*in[i+1]; }

#pragma end kernels

#pragma acc update host(out[t*SIZE/4+1:SIZE/4])

}

// compile with -mp -acc

Page 76: Multi-GPU Programming | GTC 2013

Tuning for your topology

Page 77: Multi-GPU Programming | GTC 2013

CPU/GPU interface

[Diagram: four GPUs (GPU-0 through GPU-3) attached to the host over PCIe]

Page 78: Multi-GPU Programming | GTC 2013

Example: 4-GPU Topology

[Diagram: GPU-0/GPU-1 behind one PCIe switch, GPU-2/GPU-3 behind another; both switches connect through a third PCIe switch to the host system]

Page 79: Multi-GPU Programming | GTC 2013

Example: 4-GPU Topology

• Two ways to exchange

– Left-right approach



Page 81: Multi-GPU Programming | GTC 2013

Example: 4-GPU Topology

• Two ways to exchange

– Left-right approach

– Pairwise approach



Page 83: Multi-GPU Programming | GTC 2013

Example: Left-Right Approach for 4 GPUs

[Diagram: 4-GPU tree topology]

Stage 1: send “right” / receive from “left”

Page 84: Multi-GPU Programming | GTC 2013

Example: Left-Right Approach for 4 GPUs

[Diagrams: Stage 1: send “right” / receive from “left”; Stage 2: send “left” / receive from “right”]

Page 85: Multi-GPU Programming | GTC 2013

Example: Left-Right Approach for 4 GPUs

Achieved throughput: ~15 GB/s

[Diagrams: both stages of the left-right exchange on the 4-GPU tree]

Page 86: Multi-GPU Programming | GTC 2013

Example: Left-Right Approach for 8 GPUs

Achieved aggregate throughput: ~34 GB/s

[Diagram: dual-IOH host; two CPUs, each with four GPUs (GPU-0..3 and GPU-4..7) behind PCIe switches]

Page 87: Multi-GPU Programming | GTC 2013

Example: Pairwise Approach for 4 GPUs

No contention for PCIe links

[Diagrams: Stage 1: even-odd pairs; Stage 2: odd-even pairs]

Page 88: Multi-GPU Programming | GTC 2013

Example: Even-Odd Stage of Pairwise Approach for 8 GPUs

[Diagram: dual-IOH host with 8 GPUs behind PCIe switches]

Page 89: Multi-GPU Programming | GTC 2013

Example: Even-Odd Stage of Pairwise Approach for 8 GPUs

[Diagram: dual-IOH host with 8 GPUs; shared PCIe links see contention]

Page 90: Multi-GPU Programming | GTC 2013

Example: Even-Odd Stage of Pairwise Approach for 8 GPUs

[Diagram: dual-IOH host with 8 GPUs behind PCIe switches]

• Odd-even stage:

– Will always have contention for 8 or more GPUs

• Even-odd stage:

– Will not have contention


Page 91: Multi-GPU Programming | GTC 2013

1D Communication

For 2 GPUs, the approaches are equivalent

Left-Right approach better for the other cases

In all cases, duplex improves bandwidth


Page 92: Multi-GPU Programming | GTC 2013

Code for the Left-Right Approach


for( int i=0; i<num_gpus-1; i++ ) // “right” stage

cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );

gpu[] = device ids

d_out[], d_in[] = gpu memory

Page 93: Multi-GPU Programming | GTC 2013

Code for the Left-Right Approach


for( int i=0; i<num_gpus-1; i++ ) // “right” stage

cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream[i] );

for( int i=1; i<num_gpus; i++ ) // “left” stage

cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );

Page 94: Multi-GPU Programming | GTC 2013

PCIe contention

[Diagram: 4-GPU tree topology]

for( int i=0; i<num_gpus-1; i++ ) // “right” stage

cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream[i] );

for( int i=1; i<num_gpus; i++ ) // “left” stage

cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );

Page 95: Multi-GPU Programming | GTC 2013

PCIe contention

[Diagram: 4-GPU tree topology]

for( int i=0; i<num_gpus-1; i++ ) // “right” stage

cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );

//for( int i=0; i<num_gpus; i++ )

// cudaStreamSynchronize( stream[i] );

for( int i=1; i<num_gpus; i++ ) // “left” stage

cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );

Page 96: Multi-GPU Programming | GTC 2013

PCIe contention

[Diagrams: with the synchronize removed, the “right”-stage and “left”-stage copies overlap and contend for the shared PCIe links]

Page 97: Multi-GPU Programming | GTC 2013

Fix


for( int i=0; i<num_gpus-1; i++ ) // “right” stage

cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream[i] );

for( int i=1; i<num_gpus; i++ ) // “left” stage

cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );

Page 98: Multi-GPU Programming | GTC 2013

Better performance with Synchronize

[Diagrams: the synchronize separates the two stages, so each stage’s copies get the PCIe links to themselves]

Page 99: Multi-GPU Programming | GTC 2013

Determining Topology/Locality of a System

Hardware Locality tool:

— http://www.open-mpi.org/projects/hwloc/

— Cross-OS, cross-platform

lspci -tvv
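To match CUDA device numbers against what hwloc or lspci report, each device’s PCIe address can also be queried from the CUDA runtime; a small sketch (printf needs <stdio.h>):

int n = 0;
cudaGetDeviceCount( &n );
for( int d = 0; d < n; d++ )
{
    char busid[32];
    cudaDeviceGetPCIBusId( busid, sizeof(busid), d );   // e.g. "0000:0a:00.0"
    printf( "CUDA device %d : PCIe %s\n", d, busid );
}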


Page 100: Multi-GPU Programming | GTC 2013

Hiding inter-node communication

Page 101: Multi-GPU Programming | GTC 2013

Multiple GPUs, multiple nodes


cudaMemcpyAsync( ..., stream_halo[i] );

cudaStreamSynchronize( stream_halo[i] );

MPI_Sendrecv( ... );

cudaMemcpyAsync( ..., stream_halo[i] );

[Diagram: a node with two CPUs driving several GPUs]
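Filled out a little, that staging pattern might look like the sketch below; h_send/h_recv are assumed pinned host buffers, d_send/d_recv are device halo buffers, halo_bytes is their size, and left_rank/right_rank are the neighboring MPI ranks:

cudaSetDevice( gpu[i] );
cudaMemcpyAsync( h_send, d_send, halo_bytes,
                 cudaMemcpyDeviceToHost, stream_halo[i] );   // GPU -> host
cudaStreamSynchronize( stream_halo[i] );                     // host buffer now valid

MPI_Sendrecv( h_send, halo_bytes, MPI_BYTE, right_rank, 0,   // swap halos with neighbors
              h_recv, halo_bytes, MPI_BYTE, left_rank,  0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

cudaMemcpyAsync( d_recv, h_recv, halo_bytes,
                 cudaMemcpyHostToDevice, stream_halo[i] );   // host -> GPU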

Page 102: Multi-GPU Programming | GTC 2013

GPU-aware MPI

MVAPICH

OpenMPI

Cray
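With a CUDA-aware build of one of these libraries, device pointers can be handed to MPI directly, so the explicit host staging can be dropped; a hedged sketch, reusing the placeholder names from the previous slide:

// d_send / d_recv are device allocations; the CUDA-aware library does the
// staging (or GPUDirect RDMA) internally.
MPI_Sendrecv( d_send, halo_bytes, MPI_BYTE, right_rank, 0,
              d_recv, halo_bytes, MPI_BYTE, left_rank,  0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );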


Page 103: Multi-GPU Programming | GTC 2013

GPU-aware MPI

S3047

Introduction to CUDA-aware MPI and NVIDIA GPUDirect™

Jiri Kraus

Wednesday 16:00-16:50

Rm 230C


Page 104: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

Page 105: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

Page 106: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

Page 107: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

MPI_Sendrecv( ... );

Page 108: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

MPI_Sendrecv( ... );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

MPI_Sendrecv( ... );

Page 109: Multi-GPU Programming | GTC 2013

Overlapping MPI and PCIe Transfers


for( int i=0; i<num_gpus-1; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

for( int i=1; i<num_gpus; i++ )

cudaMemcpyPeerAsync( ..., stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

MPI_Sendrecv( ... );

for( int i=0; i<num_gpus; i++ )

cudaStreamSynchronize( stream_halo[i] );

cudaSetDevice( gpu[0] );

cudaMemcpyAsync( ..., stream_halo[0] );

MPI_Sendrecv( ... );

cudaSetDevice( gpu[num_gpus-1] );

cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );

Page 110: Multi-GPU Programming | GTC 2013

NVIDIA® GPUDirect™ Support for RDMA Direct Communication Between GPUs and PCIe devices

[Diagram: two nodes connected by a network; in each node, two GPUs with GDDR5 memory, a CPU with system memory, and a network card all share the PCIe bus]

Page 111: Multi-GPU Programming | GTC 2013

Case Study


Page 112: Multi-GPU Programming | GTC 2013

Case Study: TTI FWM

• TTI Forward Wave Modeling

– Fundamental part of TTI RTM

– 3DFD, 8th order in space, 2nd order in time

– 1D domain decomposition

• Experiments:

– Throughput increase over 1 GPU

– Single node, 4-GPU “tree”


Page 113: Multi-GPU Programming | GTC 2013

Case Study: TTI FWM

• Experiments:

– 512x512x512 cube

– Requires ~7 GB working set

– Single node, 4-GPU “tree”

– 95% scaling to 8 GPUs/node



Page 115: Multi-GPU Programming | GTC 2013

Case Study: Time Breakdown

Single step (single 8-GPU node):

Halo computation: 1.1 ms

Internal computation: 12.7 ms

Halo-exchange: 5.9 ms

Total: 13.8 ms

Communication is completely hidden (the 5.9 ms halo exchange fits under the 12.7 ms of internal computation)

— ~95% scaling: halo+internal: 13.8 ms (13.0 ms if done without splitting computation into halo and internal)

— Time enough for slower communication (network)


Page 116: Multi-GPU Programming | GTC 2013

Case Study: Multiple Nodes

Test system:

— 3 servers, each with 2 GPUs, Infiniband DDR interconnect

Performance:

— 512x512x512 domain:

1 node x 2 GPUs: 1.98x

2 nodes x 1 GPU: 1.97x

2 nodes x 2 GPUs: 3.98x

3 nodes x 2 GPUs: 4.50x

— 768x768x768 domain:

3 nodes x 2 GPUs: 5.60x

Conclusions

— Communication (PCIe and IB DDR2) is hidden when each GPU gets ~100 slices

Network is ~68% of all communication time

— IB QDR hides communication when each GPU gets ~70 slices




Page 119: Multi-GPU Programming | GTC 2013

Case Study: Multiple Nodes

Test system:

— 3 servers, each with 2 GPUs, Infiniband DDR interconnect

Performance:

— 512x512x512 domain:

1 node x 2 GPUs: 1.98x

2 nodes x 1 GPU: 1.97x

2 nodes x 2 GPUs: 3.98x

3 nodes x 2 GPUs: 4.50x

— 768x768x768 domain:

3 nodes x 2 GPUs: 5.60x

Conclusions

— Communication (PCIe and IB DDR2) is hidden when each GPU gets ~100 slices

Network is ~68% of all communication time

— IB QDR hides communication when each GPU gets ~70 slices

Communication takes longer than internal computation



Page 122: Multi-GPU Programming | GTC 2013

Case Study: Multiple Nodes

Test system:

— 3 servers, each with 2 GPUs, Infiniband DDR interconnect

Performance:

— 512x512x512 domain:

1 node x 2 GPUs: 1.98x

2 nodes x 1 GPU: 1.97x

2 nodes x 2 GPUs: 3.98x

3 nodes x 2 GPUs: 4.50x

— 768x768x768 domain:

3 nodes x 2 GPUs: 5.60x

Conclusions

— Communication (PCIe and IB DDR2) is hidden when each GPU gets ~100 slices

Network is ~68% of all communication time

— IB QDR hides communication when each GPU gets ~70 slices

Communication takes longer than internal computation

Page 123: Multi-GPU Programming | GTC 2013

Summary

Multiple GPUs can stretch your compute dollar

Peer-to-peer copies can move data directly between GPUs

Streams and asynchronous kernels/copies facilitate concurrent execution

Many apps can scale to 8 GPUs and beyond, with great scaling to many GPUs


Page 124: Multi-GPU Programming | GTC 2013

Get started

Hands on Training

S3529 Wednesday 5pm

GPU Test Drive

http://www.nvidia.com/GPUTestDrive

Visit us at the CUDA expert table
