Jeff Johnson, Research Engineer, Facebook at MLconf NYC


Transcript of Jeff Johnson, Research Engineer, Facebook at MLconf NYC

Page 1

Hacking GPUs for Deep Learning

MLConf New York

Jeff Johnson

Facebook AI Research

[email protected]

Page 2

Deep (convolutional) Neural Networks

Revolution in machine learning

Convolutions: since the 1980s. Deep: enough flops since the 2000s

Avoid feature engineering

▪ With enough data, let the network discover feature representations

▪ Can work even for NLP. No word segmentation, use raw character data.

Page 3

2D Convolutional Nets (images)

LeCun, Bottou, Bengio and Haffner, 1998

Krizhevsky, Sutskever and Hinton, 2012

Page 4

2D Convolutional Nets

Progress towards smaller kernels and deeper nets

ImageNet 1000-class top-5 error by network architecture:
▪ AlexNet: ~15%
▪ OverFeat: ~13%
▪ ZeilerNet: ~11%
▪ Oxford-VGG: ~7%
▪ GoogLeNet: ~6%, ~4.5%
▪ PReLU (MSR): ~4.9%
▪ Human performance: 3-5%

Page 5

3D Convolutional Nets (videos)

C3D (Tran et al., 2014)

DeepVideo (Karpathy et al., 2014)

Page 6

1D Convolutional Nets (text, sequences)

Collobert et al., 2011

Zhang and LeCun, 2015

Page 7

RNNs and LSTMs (text, sequences)

Graves, Mohamed and Hinton, 2013

Mikolov, 2014

Page 8

Deep Neural Networks

Supervised learning. Unsupervised ???

Train with back-propagation/SGD variants

Strong scaling is unsolved

▪ Distributed parameter space exploration (e.g., Hogwild!; Niu et al. 2011)

▪ Distributed hyperparameter space exploration (e.g., Bayesian optimization; Snoek et al. 2012)

Page 9

Characteristics

Page 10

Deep nets are flop eaters

Convolutions are expensive

Pointwise calculations (log/exp, ReLU, */+, ...)

Neighborhood reductions (pooling, convolution)

Scaling network parameters:

▪ increased learning capacity, but also overfitting

▪ more training data (real or synthetic) and regularization required

Page 11

Deep nets are bandwidth eaters

More parameters = more memory, data to exchange

Barrier to cross-machine parallelism

▪ periodic exchanges, compression, quantization

Increase reuse of memory while local?

▪ interspersed reductions are resistant to fusion of computations

▪ generalized programming language problem

Page 12

Deep nets are latency sensitive

Serial dependency of training

fprop => bprop => fprop => ...

Serial dependency of multi-layer networks

layer 1 => layer 2 => layer 3 => ...

Multiple path-dependent networks (RNNs, multi-layer LSTMs)

Page 13

Deep nets are also small?

Deeper = smaller feature planes, more of them

input R^m => expand to R^n => non-lin => reduce to R^k

Problems are tiny in HPC terms

▪ compare: 4096×4096 FFTs, FE/PDE on massive grids, ...

NLP tasks can be sparse

Setup/kernel launch latency on GPU can dominate compute

Page 14

The tools

Page 15

Vector processors

SIMD: Single Instruction, Multiple Data

A serial processor with the ability to operate on more than one piece of data concurrently

Cray-1 (1976)

Page 16

Vector processors

Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time. Boundary/alignment effects. Great if your vectors are large, but...

float* a = ...; // is this aligned (a % 16 == 0)?
float* b = ...; // is this aligned (b % 16 == 0)?
for (int i = 0; i < 18; ++i) { // how to handle [16, 17]?
  b[i] += a[i]; // SIMD this?!? masking/loop epilogue
}
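A minimal sketch of one way this gets handled by hand (assuming SSE intrinsics and possibly unaligned fp32 data; the names a, b and the length 18 follow the snippet above):

#include <immintrin.h> // x86 SSE intrinsics

// b[i] += a[i] for n elements: 4-wide vector body plus scalar epilogue.
void add_inplace(float* b, const float* a, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {          // vector body: 4 floats per register
    __m128 va = _mm_loadu_ps(a + i);    // unaligned loads sidestep the a % 16 question
    __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(b + i, _mm_add_ps(va, vb));
  }
  for (; i < n; ++i) {                  // scalar epilogue: leftover elements (here [16, 17])
    b[i] += a[i];
  }
}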

Page 17

“Vector cores”?

SIMD variant: NVIDIA calls it “SIMT”

Lots of simple cores (CM)

Hide latency through many threads + switching (Tera)

“Pixel/vertex shaders” in 2000s

GPUs => GPGPU

CM-1 (1983)

Tera MTA (1995)

Page 18

GPU versus CPU

GPUs represent a different form of vector programming (“vector cores”)

▪ 32-wide vector of threads (“warp”)

Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512; exploit multi-level caches, deep pipelines, prefetch, ...)

Vector programming: easier with GPUs than CPUs

Sweet spot is different from that of GPU code
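For contrast with the SSE loop on Page 16, a minimal CUDA sketch of the same elementwise add (illustrative, not from the talk): each thread in a 32-wide warp handles one element, and a bounds check takes the place of the scalar epilogue.

// One thread per element; a warp of 32 threads executes this in lockstep (SIMT).
__global__ void addKernel(float* b, const float* a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {        // threads past the end simply do nothing
    b[i] += a[i];
  }
}

// Launch enough 128-thread blocks to cover n elements (device pointers assumed).
void addOnGpu(float* d_b, const float* d_a, int n) {
  int threads = 128;
  int blocks = (n + threads - 1) / threads;
  addKernel<<<blocks, threads>>>(d_b, d_a, n);
}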

Page 19

Parallelization + vectorization

The serial nature of commonly used CPU programming languages sometimes hides opportunities

Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled code

▪ DSLs like Halide (Ragan-Kelley et al. 2013) show promise but need a few more generations

Sprinkled-in parallelism (OpenMP pragmas) doesn’t cut it

Page 20

Who wins: CPU or GPU?

flops
▪ CPU ✔ (vectorize: AVX2/512 gives Tflop range)
▪ GPU ✔ (Tesla K40: 2880 fp32 ALU pipelines)

main memory b/w
▪ CPU ✖ (Xeon Phi improves)
▪ GPU ✔

latency
▪ CPU ✔ (high clock, reordering; caches are large and work if you obey them)
▪ GPU ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)

boundary effects, small/irregular sizes
▪ CPU ✔✖ (branches easy, vectorization hard)
▪ GPU ✖ (warp divergence, load imbalance)

parallel programming model
▪ CPU ✖ (vectorization hard, perf black box)
▪ GPU ✔✖ (CUDA is very different, domain knowledge)

Page 21

Tool + problem = solution?

Page 22

Dive into 2D Convolutional Nets

Somewhat computationally expensive

O(b × f × f’ × n² × k²)

1st layer AlexNet:

▪ 13.493 Gflop (1 flop here = fp32 multiply-add)

▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32)

▪ Perfect caching + reuse: 175 flop/byte of input

▪ No caching + reuse: 0.125 flop/byte of input
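As a sanity check on these numbers, a small host-side sketch (assuming the Page 23 parameters and the conventional 55×55 output grid of AlexNet’s stride-4, 11×11 first layer):

#include <stdio.h>

int main() {
  double b = 128, f = 3, fo = 96, out = 55.0 * 55, kk = 11.0 * 11;
  double flops = b * f * fo * out * kk;      // multiply-adds, counted as 1 flop each
  double bytesIn = b * f * 224 * 224 * 4;    // fp32 input activations
  printf("%.3f Gflop\n", flops / 1e9);       // ~13.493
  printf("%.1f Mbyte in\n", bytesIn / 1e6);  // ~77
  printf("%.0f flop/byte in (perfect reuse)\n", flops / bytesIn);  // ~175
  printf("%.3f flop/byte in (no reuse: 8 bytes read per multiply-add)\n", 1.0 / 8);
  return 0;
}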

Page 23

The problem

Programmable caches (shared memory, registers, ...) not large enough for perfect reuse

Space of all possible square 2D convolution problems is 5/6-dimensional

Parameter sizes (AlexNet layer 1 example):
▪ minibatch size (b): 128
▪ input feature maps (f): 3
▪ output feature maps (f’): 96
▪ input feature size (n × n): 224
▪ convolution kernel size (k × k): 11
▪ convolution kernel stride (S × S, optional): 4

Page 24

Converting

Space of all possible matrix multiplications is 3-dimensional (A(N×M) B(M×P) = C(N×P))

NVIDIA, Intel, and others have put lots of effort into optimizing many parts of this space

▪ Rephrase convolution as a matrix multiplication!

▪ NVIDIA’s cuDNN
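For example, with an im2col-style lowering (one common approach; the slide doesn’t say which lowering cuDNN uses internally), the Page 23 convolution becomes a single A(N×M) B(M×P) = C(N×P) product with N = f’ = 96, M = f·k·k = 3·11·11 = 363, and P = b × (number of output positions) = 128·55·55.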

Page 25

But:

Sgemm originally optimized for large problems

13×13 * 3×3 is a small convolution. Unrolled 192 times, it might be enough to feed the GPU

Large convolutions are intractable?

Small feature maps/convolutions = boundary effects, bad for GPUs

Page 26

Facebook AI Research work

2D convolution via FFT

Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., ICLR 2015 conference track oral)

Convolution => pointwise × in Fourier basis

Choice of basis is wide open! Power-of-two (2^i) sizes give great perf

O(b·f·f’·n²·k²) => O(b·f·f’·n² + (b·f + f·f’ + b·f’)·n²·log n)

▪ For kernels >= 5×5, faster than cuDNN
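The heart of the Fourier-basis approach is that the spatial convolution becomes an elementwise complex product. A minimal CUDA sketch of just that pointwise step (fbfft’s real kernels also batch over b, f, f’ and accumulate over input feature maps; this is only illustrative):

#include <cuComplex.h>

// out[i] = x[i] * w[i] in the Fourier basis: the cheap step that replaces the
// O(n^2 k^2) spatial convolution; the FFTs contribute the n^2 log n terms.
__global__ void pointwiseMul(cuComplex* out, const cuComplex* x,
                             const cuComplex* w, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = cuCmulf(x[i], w[i]);
  }
}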

Page 27

fbfft

cuFFT optimized for large FFT sizes

fbfft: smaller data, fit in registers, focus on warp

Page 28

Data layout

Different problem sizes => different data layout

▪ cudaconv: DHWB (optimal for large b)

▪ deeper layers: HWBD/BHWD (many feature maps)

▪ b=1 faster convergence?

▪ b=128 better compute utilization

Smaller problems, exploit different layout/batching

▪ fbcunn 1D convolution
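As an illustration of what a layout choice means in practice (a sketch, not fbcunn’s actual indexing code): the only difference between DHWB and BHWD is which dimension lands innermost and therefore gets contiguous, coalesced accesses.

// Linear offsets for two layouts of a b x d x h x w activation tensor
// (b = batch, d = feature map). DHWB puts the batch innermost (contiguous over b,
// good when b is large); BHWD puts the feature maps innermost (good when deeper
// layers have many feature maps).
__host__ __device__ inline int offsetDHWB(int d, int h, int w, int b,
                                          int H, int W, int B) {
  return ((d * H + h) * W + w) * B + b;
}
__host__ __device__ inline int offsetBHWD(int b, int h, int w, int d,
                                          int H, int W, int D) {
  return ((b * H + h) * W + w) * D + d;
}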

Page 29

Latency hiding: what holds you back?

▪ Compute bound? (math)

▪ Memory b/w bound? (streaming)

▪ Memory latency bound? (sparse)

Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity!

cuBLAS: Sgemm b/w bound, Dgemm compute bound

Page 30

Kernel fusion: CPU vs GPU

Reduces memory b/w pressure

Exploits cache locality and register reuse

CPU: fusion not necessary

Kernel tiling + interleaving works due to caches

GPU: fusion necessary

Tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant
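A minimal sketch of what fusion buys on the GPU (illustrative kernels, not from the talk): the two unfused passes stream x through global memory twice, while the fused kernel keeps the intermediate in a register and touches memory once.

// Unfused: two launches, two full read+write passes over global memory.
__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}
__global__ void relu(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused: one launch, one pass; the scaled value never leaves a register.
__global__ void scaleRelu(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = fmaxf(x[i] * a, 0.0f);
}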

Page 31

Kernel fusion

CUDA kernel = hard optimization boundary on GPU

Loop interchange, lifting, better fusion on CPU

CUDA: parallelization layer not visible to optimizer.

Auto-tuning desired: HW-specific, non-linear tradeoffs

Scripting languages are a further barrier to fusion on both CPU and GPU (Torch)

Page 32

Kernel fusion

Torch: transposition is a common operation

▪ size (80, 40), stride (40, 1) => size (40, 80), stride (1, 40)

▪ Old approach: transpose in memory, perform the work, copy back

▪ New approach: rewrite kernels to handle transpositions; optimize the non-transposed case (a sketch follows below)

Runtime fusion (CUDA 7.0, Theano)
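A sketch of the “handle transpositions in the kernel” idea (illustrative, not fbcunn/Torch code): indexing through explicit strides lets one kernel run on both the contiguous tensor and its transposed view without any copy.

// Works for size (80, 40), stride (40, 1) and for the transposed view
// size (40, 80), stride (1, 40) alike; a contiguous fast path could be added.
__global__ void addScalar2d(float* data, float v, int rows, int cols,
                            int rowStride, int colStride) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  if (r < rows && c < cols) {
    data[r * rowStride + c * colStride] += v;
  }
}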

Page 33

Exploiting parallelism

Page 34

end