Persistent RNNs: Stashing Recurrent Weights On-Chip

Transcript of Persistent RNNs: Stashing Recurrent Weights On-Chip

Persistent RNNs (stashing recurrent weights on-chip)

Presenter: Gregory Diamos

Silicon Valley AI Lab, Baidu

Jun 20, 2016

Machine learning has benefited greatly from faster computer systems.

GPUs, in particular, have delivered a step forward.

Imagine the problems that you could solve with even faster systems.

HPC is an opportunity

[Figure: roughly a 10,000x compute gap between a single TitanX GPU and the fastest supercomputer.]

Limits of data-parallelism

Hardware limits

[Figure: wall-clock time to convergence vs. mini-batch size; the small-batch region is marked inefficient hardware.]

Hardware becomes less efficient at small batch sizes.
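
A rough arithmetic-intensity estimate makes this concrete (a back-of-the-envelope of mine, not a figure from the talk). For hidden size N and mini-batch b in fp32, the recurrent matrix multiply performs 2N^2·b flops but must move at least the 4N^2 bytes of weights:

```latex
% Arithmetic intensity of one recurrent step (fp32, hidden size N, batch b);
% weight traffic dominates the memory movement when b << N.
\frac{\text{flops}}{\text{bytes}} \approx \frac{2 N^2 b}{4 N^2} = \frac{b}{2}
```

A TitanX pairs roughly 6.6 Tflop/s of fp32 with about 336 GB/s of DRAM bandwidth, so it needs on the order of 20 flops per byte to stay compute-bound; at mini-batch 4 the recurrent multiply supplies about 2, leaving the GPU mostly idle.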

Optimization limits

[Figure: wall-clock time to convergence vs. mini-batch size; the large-batch region is marked inefficient optimization.]

Optimization algorithms perform more work at large batch sizes.
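
The standard intuition behind this (my gloss; the slide does not spell it out): once the mini-batch gradient is already an accurate estimate of the true gradient, enlarging the batch further barely reduces the number of update steps to convergence, so the total number of samples processed grows roughly linearly with batch size:

```latex
% Past some critical batch size B_c, the number of SGD steps S to reach a
% target loss stays roughly constant, so total work grows with B.
\text{samples to convergence} \;=\; S \cdot B \;\propto\; B \qquad \text{for } B \gg B_c
```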

Mini-batch limits

[Figure: wall-clock time to convergence vs. mini-batch size; inefficient hardware at small batches and inefficient optimization at large batches bound the useful range.]

These effects combine to limit the maximum number of GPUs that data parallelism can use effectively.
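
Putting the two limits together (the symbols and example numbers here are illustrative, not from the talk): if hardware wants at least b_min samples per GPU and optimization tolerates at most B_max samples per global batch, the GPU count is capped:

```latex
% Data-parallel GPU count is bounded by the ratio of the two batch limits;
% e.g. hypothetical B_max = 1024 and b_min = 64 would allow at most 16 GPUs.
\#\mathrm{GPUs} \;\le\; \frac{B_{\max}}{b_{\min}}
```

Persistent kernels attack b_min: a kernel that stays efficient at a very small per-GPU batch admits many more GPUs under the same cap.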

Persistent RNNs

Open source CUDA implementation:

https://github.com/baidu-research/persistent-rnns

Persistent RNN Details

Persistent RNNs

[Figure: a GEMM-based RNN reloads the weights for each of the timesteps data0 through data4; a persistent RNN loads the weights once and reuses them across all timesteps.]

RNNs built on GEMM routines reload the weights from memory at every timestep.

However, the weights are constant across timesteps, so this traffic is wasteful.
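
As a concrete picture of the conventional approach, the forward pass might be written as the loop below (a sketch assuming cuBLAS; the function name and surrounding plumbing are mine, not Baidu's):

```cuda
// Conventional GEMM-based RNN step loop. Every cublasSgemm call re-reads
// the N x N weight matrix W from GPU DRAM, even though W never changes
// between timesteps.
#include <cublas_v2.h>

void rnn_forward_gemm(cublasHandle_t handle,
                      const float* W,   // [N x N] recurrent weights (device)
                      float* h,         // [N x batch] hidden state (device)
                      float* tmp,       // [N x batch] scratch (device)
                      int N, int batch, int timesteps) {
    const float one = 1.0f, zero = 0.0f;
    for (int t = 0; t < timesteps; ++t) {
        // tmp = W * h: the weights stream in from DRAM on every timestep.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, batch, N,
                    &one, W, N, h, N, &zero, tmp, N);
        // (input projection, bias, and nonlinearity omitted for brevity)
        float* s = h; h = tmp; tmp = s;  // the product becomes the new state
    }
}
```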

Cache weights in registers

[Figure: the recurrent weights are held in each GPU thread's registers, next to the datapath.]
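
In a deliberately tiny single-block sketch (a simplification of mine, not the released kernel, which spreads the weights across every SM), the idea looks like this:

```cuda
// Each thread owns one row of the N x N recurrent matrix and keeps it in
// registers/local memory for all timesteps, so the weights are read from
// DRAM exactly once. N is a small compile-time constant so the slice
// fits on-chip.
#define N 128  // hidden size; one thread per row, one block total

__global__ void persistent_rnn_block(const float* __restrict__ W,  // [N*N] row-major
                                     float* __restrict__ h,        // [N] hidden state
                                     int timesteps) {
    const int row = threadIdx.x;

    // Stash this thread's weight row on-chip, once.
    float w[N];
    for (int j = 0; j < N; ++j) w[j] = W[row * N + j];

    __shared__ float h_cur[N];
    h_cur[row] = h[row];
    __syncthreads();

    for (int t = 0; t < timesteps; ++t) {
        float acc = 0.0f;
        for (int j = 0; j < N; ++j) acc += w[j] * h_cur[j];
        float next = tanhf(acc);  // elementwise nonlinearity
        __syncthreads();          // all reads of h_cur have finished
        h_cur[row] = next;
        __syncthreads();          // new state visible to every thread
    }
    h[row] = h_cur[row];
}
// Launch: persistent_rnn_block<<<1, N>>>(W, h, T);
// Scaling past one thread block needs the GPU-wide barrier shown next.
```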

A global barrier

[Figure: successive timesteps (data0, data1) each run across the whole GPU, separated by a barrier.]

Because the weights stay pinned in registers spread across every SM, all thread blocks must finish timestep t before any block can begin timestep t+1, and that requires a GPU-wide barrier.
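
CUDA exposed no grid-wide barrier in 2016, so a persistent kernel rolls its own. A minimal sketch along these lines (my code with hypothetical names, not the Baidu implementation; CUDA 9's cooperative groups later made this a library feature):

```cuda
// Software grid-wide barrier on a global atomic counter. Only safe when
// every block of the grid is resident on the GPU at once, which a
// persistent kernel guarantees by launching at most one block per SM.
__device__ void global_barrier(unsigned int* counter,  // zero-initialized, global memory
                               unsigned int num_blocks,
                               unsigned int* phase) {   // caller-local counter, starts at 0
    __syncthreads();   // every thread in this block has arrived
    __threadfence();   // make this block's writes visible to other blocks
    if (threadIdx.x == 0) {
        *phase += num_blocks;                         // arrivals expected this phase
        atomicAdd(counter, 1u);                       // signal our arrival
        while (atomicAdd(counter, 0u) < *phase) { }   // spin until all blocks arrive
    }
    __syncthreads();   // release the rest of the block
}

// Usage inside a persistent kernel's timestep loop:
//   unsigned int phase = 0;
//   for (int t = 0; t < timesteps; ++t) {
//       /* ...compute timestep t... */
//       global_barrier(counter, gridDim.x, &phase);
//   }
```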

Experiments

Scaling to 128 GPUs

Exploring deep residual RNNs

Pascal and future

Future GPUs will enable bigger and faster RNN layers.

Three challenges

Close the gap with the fastest supercomputers.

Do not settle for inefficient algorithms.

Push performance to the edge of physical limits.

10 PetaFlops in 300 Watts.

150 ExaFlops in 25 MegaWatts.
