THE GOD, THE BAD AND THE UGLY

Transcript of THE GOD, THE BAD AND THE UGLY
D.Rossetti, 8/16/2016

Page 1: THE GOD, THE BAD AND THE UGLY

Page 2

DUELING FOR GOLD

Page 3

DUELING FOR POWER

Page 4

The three chaps still around

IBM POWER9 CPU + NVIDIA Volta GPU + Mellanox Connect HCA

NVLink High Speed Interconnect

>40 TFLOPS per Node, >3,400 Nodes

2017: SUMMIT and SIERRA, peak performance 150-300 PFLOPS (>100 PFLOPS each)

Page 5

ENERGY SPENT MOVING DATA

power budget is fixed by thermals, ~300 W

keep data close to FP/INT ops, share data, caching, hierarchical design

integration, multi chip packages, HBM

optimize data movement, NIC on package

B.Dally, 2015
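The fixed budget translates directly into an energy ceiling per operation. A back-of-envelope bound (the 10 TFLOPS figure is illustrative, not from the slide): a node delivering 10 TFLOPS inside 300 W may spend at most

$$\frac{300\,\mathrm{W}}{10^{13}\,\mathrm{flop/s}} = 30\,\mathrm{pJ/flop}$$

on each flop, including every byte of data movement it causes; hence the pressure to keep data close to the FP/INT units and to pull the NIC on package.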

Page 6

KEEP DATA CLOSE …

UVA, single address space, still needs cudaMemcpy

UVM lite, move data closer to accessor, with limitations

UVM full, simultaneous access, atomics

goals:

usability

performance, e.g. transparent data movement

broaden design space, e.g. platform dependent optimizations, cache coherency

SW programming model can contribute
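The first two steps above can be contrasted in code. A minimal sketch (illustrative, not from the slides; kernel and sizes are made up), showing the same kernel used under UVA-style explicit copies and under UVM managed memory:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

void uva_style(float *host, int n) {
    // UVA: one address space, but data movement is still explicit
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

void uvm_style(int n) {
    // UVM: managed memory migrates on access, no explicit cudaMemcpy
    float *v;
    cudaMallocManaged(&v, n * sizeof(float));
    for (int i = 0; i < n; i++) v[i] = 1.0f;  // CPU touches the pages
    scale<<<(n + 255) / 256, 256>>>(v, n);    // pages migrate to the GPU
    cudaDeviceSynchronize();                  // then the CPU can read v again
    cudaFree(v);
}
```

Under full UVM, CPU and GPU could access `v` simultaneously, with atomics; the lite variant only moves data closer to the accessor, with the limitations noted above.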

Page 7

[Diagram: Node 0 … Node N-1, each node with a CPU and a GPU behind a PCIe switch, node-local MEM attached to both, and an IB HCA linking the nodes]

FOR CLUSTERS ?
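On such a cluster, the conventional path routes every GPU-to-GPU transfer through host memory. A minimal sketch of that detour (illustrative, assuming plain MPI + CUDA; the function is made up), which the following slides set out to eliminate:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// CPU-staged send of a GPU buffer to another node over IB
void send_gpu_buffer(float *dev_buf, int n, int peer) {
    float *host_buf = NULL;
    cudaMallocHost(&host_buf, n * sizeof(float));   // pinned staging buffer
    cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
               cudaMemcpyDeviceToHost);             // GPU -> CPU
    MPI_Send(host_buf, n, MPI_FLOAT, peer, 0,
             MPI_COMM_WORLD);                       // CPU -> NIC -> network
    cudaFreeHost(host_buf);
}
```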

Page 8

GETTING RID OF THE CPU

APEnet+, NaNet, NaNet-10 D.Rossetti et al.

PEACH2, T.Hanawa, T.Boku, et al.

GGAS, H.Fröning, L.Oden

project Donard, S.Bates

GPUnet, GPUfs, M. Silberstein et al.

challenges: optimize data movement, be friendly to the SIMT model

past and current efforts

Page 9

GETTING RID OF THE CPU

A.  plain comm

B.  GPU-aware comm: MVAPICH2, Open MPI

C.  CPU-prepared, CUDA stream triggered: Offloaded MPI + Async

D.  CPU-prepared, CUDA SM triggered: in-kernel comm APIs

E.  CUDA SM: PGAS/SHMEM/RMA

a tentative road map …
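Step (B), for example, lets the application hand device pointers straight to MPI. A minimal sketch (illustrative), assuming a CUDA-aware build of MVAPICH2 or Open MPI, which stages or uses GPUDirect RDMA under the hood:

```cuda
#include <mpi.h>

// dev_send / dev_recv are GPU pointers; the CUDA-aware MPI library
// recognizes them and handles the data movement itself
void exchange_halos(float *dev_send, float *dev_recv, int n, int peer) {
    MPI_Request reqs[2];
    MPI_Irecv(dev_recv, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(dev_send, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```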

Page 10

(C) OFFLOADED MP(I)

while (!done) {
    mp_irecv(…, rreq);
    pack_boundary<<<…, stream1>>>(buf);
    compute_interior<<<…, stream2>>>(buf);
    mp_isend_on_stream(…, sreq, stream1);
    mp_wait_on_stream(rreq, stream1);
    unpack_boundary<<<…, stream1>>>(buf);
    compute_boundary<<<…, stream1>>>(buf);
    // synchronize between CUDA streams
}
mp_wait_all(sreq);
cudaDeviceSynchronize();

two implementations

•  baseline, CUDA Async APIs

•  HW optimized, need NIC features

helps strong scaling

Async as optimization

Page 11

(D) CUDA SM TRIGGERED COMMS

mp::mlx5::send_desc_t tx_descs[n_tx];
mp::mlx5::send_desc_t rx_descs[n_rx];
for (...) {
    mp_irecv(recv_buf, nbytes, rreq);
    mp::mlx5::get_descriptors(&rx_descs[i], rreq);
    mp_send_prepare(send_buf, nbytes, peer, reg, sreq);
    mp::mlx5::get_descriptors(&tx_descs[i], sreq);
}
fused_pack_unpack<<<…>>>(…);

__global__ void fused_pack_unpack(desc descs, txbuf, rxbuf)
{
    block_id = elect_block();
    if (block_id < n_pack_blocks) {          // packing blocks
        pack(send_buf);
        __threadfence();
        last_block = elect_block();
        if (last_block && threadIdx.x < n_tx)
            mp::device::mlx5::send(tx_descs[…]);
        __syncthreads();
    } else {                                 // unpacking blocks
        block_id -= n_pack_blocks;
        if (!block_id) {
            if (threadIdx.x < n_rx) {
                mp::device::mlx5::wait(rx_descs[…]);
                __syncthreads();
                if (threadIdx.x == 0)
                    sched.done = 1;
            }
            while (!ThreadLoad<LOAD_CG>(&sched.done))
                ;                            // spin until receives complete
            unpack(recv_buf);
        }
    }
}

(top: CPU code; bottom: CUDA code)

Page 12

(E) SHMEM-LIKE

long-running CUDA kernels, communication within the parallel region

__global__ void stencil_2d(u, v, sync, …)
{
    for (timestep = 0; …) {
        if (i+1 > nx)
            v[i+1] = shmem_float_g(v[1], rightpe);
        if (i-1 < 1)
            v[i-1] = shmem_float_g(v[nx], leftpe);
        u[i] = (u[i] + (v[i+1] + v[i-1] …
        if (i < 2) {
            shmem_int_p(sync + i, 1, peers[i]);
            shmem_quiet();
            shmem_wait_until(sync + i, EQ, 1);
        }
        // intra-kernel sync
        …
    }
}

Page 13

PROGRAMMING MODEL

A,B,C,D: pt-to-pt, RMA for coarse-grained communications

e.g. avoid many small transfers

E: RMA,SHMEM for fine-grained communications

e.g. avoid many multi-word PUT/GET

Do we have all the guarantees? e.g. ordering of I/O vs compute

GPUs have loose consistency, SW enforced
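The last point matters in practice: with loose consistency, software must fence between writing data and publishing the flag that announces it. A minimal sketch (illustrative, not from the slides; names are made up):

```cuda
__device__ volatile int flag;
__device__ float payload[256];

__global__ void producer() {
    if (threadIdx.x < 256) payload[threadIdx.x] = 42.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence_system();  // order payload writes before the flag,
                                 // as seen by other SMs / the CPU / the NIC
        flag = 1;
    }
}

__global__ void consumer() {
    if (threadIdx.x == 0) {
        while (flag == 0)
            ;                    // spin until published
        __threadfence();         // order the flag read before payload reads
    }
    __syncthreads();
    float x = payload[threadIdx.x];  // now safe to read
    (void)x;
}
```

Without the producer's fence, a weakly ordered GPU may make the flag visible before the payload, and the consumer reads stale data.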

Page 14

FAT NODES

two separate networks, more power…

why not one ?

[Diagram: fat node with eight GPUs (GPU0-GPU7) in pairs behind four PLX PCIe switches, two CPUs, and four IB 100 GB/sec links]

Page 15

CONCLUSIONS

Three chaps compete for power

Aim: maximize Flops given technology constraints

Can we get rid of (at least) one ?

Page 16

GAME OVER