Transcript of "OPTIMIZED GPU KERNELS FOR DEEP LEARNING", Amir Khosrowshahi, GTC, 17 Mar 2015

Page 1

OPTIMIZED GPU KERNELS FOR DEEP LEARNING

Amir Khosrowshahi

GTC 17 Mar 2015

Page 2

Outline

• About nervana

• Optimizing deep learning at assembler level

• Limited precision for deep learning

• neon benchmarks

Page 3

About nervana

• A platform for machine intelligence

• enable deep learning at scale

• optimized from algorithms to silicon

About | Kernels | neon | Summary



Page 5

Verticals

Medical | Finance | Pharma | Oil & Gas | Agriculture

• Deep learning supplanting traditional approaches everywhere

• Small improvements have large impact

• Customers require a clear roadmap that scales with growing needs

About | Kernels | neon | Summary


Page 7

nervana platform for deep learning

[Diagram: Data → nervana framework (explore, train, deploy) → Solutions, running on the nervana cloud over GPUs, CPUs, and the nervana engine]

About | Kernels | neon | Summary


Page 10

maxas: a Maxwell Assembler (Scott Gray)

• Full control of:

• register allocation

• instruction ordering

• control codes

• barriers, stall counts

• Built-in scheduler (optional)

• Meta-programming

See GitHub repo for docs and examples

About | Kernels [ maxas ] | neon | Summary

Page 11

ptxas struggles with Instruction Level Parallelism

[Chart: distribution of the number of instructions between an LDS and its dependent FFMA operands — count vs. (FFMA line # − LDS line #), comparing ptxas-compiled code with cuBLAS; short distances (left) are bad, long distances (right) are good]

About | Kernels [ maxas ] | neon | Summary

courtesy Scott Gray
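The metric behind that chart can be sketched in a few lines of Python. This is an illustrative reconstruction, not maxas code: the listing and the operand-parsing are hypothetical, simplified SASS-like text, and the function just measures how many instructions separate each shared-memory load (LDS) from the first FFMA that consumes its destination register. Larger distances give the hardware more room to hide LDS latency.

```python
import re

def lds_to_ffma_distances(instructions):
    """For each LDS, count instructions until the first dependent FFMA."""
    distances = []
    for i, inst in enumerate(instructions):
        m = re.match(r"LDS\s+(R\d+)", inst)
        if not m:
            continue
        dest = m.group(1)
        for j in range(i + 1, len(instructions)):
            if instructions[j].startswith("FFMA"):
                # source operands are everything after the destination
                srcs = [t.strip() for t in instructions[j][4:].split(",")[1:]]
                if dest in srcs:
                    distances.append(j - i)
                    break
    return distances

# Hypothetical listing: the first load's result is consumed four
# instructions later (good), the second one immediately (bad).
listing = [
    "LDS R0, [R10]",
    "LDS R1, [R11]",
    "FFMA R2, R1, R3, R2",   # depends on R1, distance 1
    "FFMA R4, R5, R6, R4",
    "FFMA R7, R0, R8, R7",   # depends on R0, distance 4
]
print(lds_to_ffma_distances(listing))  # → [4, 1]
```

A hand-scheduled kernel pushes this distribution to the right, which is exactly what the slide's cuBLAS-vs-ptxas comparison shows.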

Page 12

Easy register allocation through maxas

About | Kernels [ maxas ] | neon | Summary

Register banking for outer products: c = a bᵀ

[Diagram: the c tile accumulates the outer product of a column fragment a and a row fragment b, with registers banked to avoid conflicts]
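The outer-product formulation on this slide can be sketched in NumPy. This is an illustration of the math, not the maxas kernel: each step of the k-loop takes one column fragment of A and one row fragment of B and accumulates their rank-1 outer product into the C tile, which is what the register tile in a GEMM kernel holds.

```python
import numpy as np

def gemm_outer_product(A, B):
    """C = A @ B computed as a sum of rank-1 (outer-product) updates,
    mirroring how a GEMM kernel accumulates its register tile."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        # one column of A times one row of B: a rank-1 update of the tile
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.allclose(gemm_outer_product(A, B), A @ B)
```

Because every element of the a and b fragments is reused across a whole row and column of C, banking the registers so a-operands and b-operands never collide is what keeps the FFMA pipeline fed.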


Page 19

Example GEMM code in maxas

About | Kernels [ maxas ] | neon | Summary

[Annotated kernel listing highlighting: control codes, dual-issued instructions, fused fp32 multiply-adds (FFMA), loads from shared memory (LDS), setting a barrier, and barrier synchronization]

Page 20

Convolution kernels for deep learning

Input * Filters = Output

[Diagram: an input volume of C channels with H × W spatial dims is convolved with K filters of size C × R × S each, producing an output of K channels with P × Q spatial dims]

C: number of input channels · H × W: input spatial dims · R × S: filter spatial dims · K: number of filters · P × Q: output spatial dims · N: mini-batch dim (not shown)

About | Kernels [ Convolution ] | neon | Summary
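With these dimensions, lowering convolution to matrix multiply — the approach the next slides walk through — turns the forward pass into a single GEMM. Below is a minimal NumPy sketch of the idea, assuming stride 1 and no padding; the function name and layout are illustrative, not neon's API.

```python
import numpy as np

def conv_fprop_lowered(x, w):
    """Forward convolution via matrix lowering, stride 1, no padding.
    x: input (C, H, W); w: filters (K, C, R, S); returns output (K, P, Q)."""
    C, H, W = x.shape
    K, C2, R, S = w.shape
    assert C == C2
    P, Q = H - R + 1, W - S + 1
    # Lower each receptive field into one column of shape (C*R*S,)
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # One GEMM: (K, C*R*S) @ (C*R*S, P*Q) -> (K, P*Q)
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, P, Q)

x = np.random.rand(3, 5, 5).astype(np.float32)
w = np.random.rand(2, 3, 3, 3).astype(np.float32)
print(conv_fprop_lowered(x, w).shape)  # → (2, 3, 3)
```

The point of doing the lowering inside the kernel (rather than materializing `cols` in memory) is that each input element is read many times, so computing the access pattern on the fly saves bandwidth.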


Page 22

Access patterns for matrix lowering

• Convolution kernels: fprop

About | Kernels [ Convolution ] | neon | Summary

Page 23

Access patterns for matrix lowering

• Convolution kernels: bprop

Backprop Step 1

[Diagram with N = 3, C = 3, H = W = 3; filters C = 3, K = 2, R = S = 2; deltas δ1 with P = Q = 2, K = 2. In each iteration, there is a multiply operation between an N×K matrix and a K×(C·R·S) matrix; the operands and the result are shown shaded, and a marker in the figure stands for deconvolve. The results of the matrix multiplications are accumulated to obtain δ0.]

About | Kernels [ Convolution ] | neon | Summary

Page 24

Access patterns for matrix lowering

• Convolution kernels: update

Backprop Step 2 – Weight Updates

[Diagram with N = 3, C = 3, H = W = 3 (output of the previous layer); deltas δ1 with P = Q = 2, K = 2; weight updates with C = 3, K = 2, R = S = 2. In each iteration, there is a multiply operation between a K×N matrix and an N×(C·R·S) matrix; note that the delta matrix is sliced and the result transposed before the multiplication. The results of the matrix multiplications are accumulated to obtain the weight updates.]

About | Kernels [ Convolution ] | neon | Summary
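The update step lowers the same way. Here is an illustrative NumPy reconstruction (stride 1, no padding, single image, names hypothetical): the lowered input columns are multiplied by the output deltas to accumulate the gradient for every filter tap in one GEMM.

```python
import numpy as np

def conv_update_lowered(x, delta):
    """Weight-update via matrix lowering, stride 1, no padding.
    x: input (C, H, W); delta: output gradient (K, P, Q);
    returns filter gradient (K, C, R, S)."""
    C, H, W = x.shape
    K, P, Q = delta.shape
    R, S = H - P + 1, W - Q + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # One GEMM: (K, P*Q) @ (P*Q, C*R*S) -> gradient for every filter tap
    dw = delta.reshape(K, -1) @ cols.T
    return dw.reshape(K, C, R, S)

x = np.random.rand(3, 5, 5).astype(np.float32)
delta = np.random.rand(2, 3, 3).astype(np.float32)
print(conv_update_lowered(x, delta).shape)  # → (2, 3, 3, 3)
```

With a mini-batch, the per-image products are simply accumulated, which matches the slide's "results of the matrix multiplications are accumulated" description.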


Page 26

Deep learning with low precision works

About | Kernels [ Limited Precision ] | neon | Summary

Improving the speed of neural networks on CPUs

Vincent Vanhoucke, Andrew Senior, Mark Z. Mao — Google, Inc.

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

1 Introduction

The recent resurgence of interest in neural networks owes a certain debt to the availability of affordable, powerful GPUs which routinely speed up common operations such as large matrix computations by factors from 5× to 50× [1-3]. These enabled researchers to tackle much larger, more difficult machine learning tasks using neural networks, auto-encoders or deep belief networks [4-6]. Due to a variety of factors, including cost, component reliability and programming complexity, GPUs are still however the exception rather than the norm in computing clusters. The question then becomes whether to invest in GPU resources, or whether traditional CPUs can be made to perform fast enough that, using distributed computing, they will yield similar or superior scalability and performance. The purpose of this paper is not to settle this debate, but rather to introduce to neural network researchers some tools which can significantly improve the performance of neural networks on Intel and AMD CPUs in accessible form. Some of these might not be novel to researchers well versed in high-performance computing, but they lay the foundation for improvements going beyond what one might obtain using existing optimized BLAS packages. We will show in particular how one can outperform optimized BLAS packages by a factor of 3 using fixed point arithmetic and SSSE3 / SSE4 instructions.

Page 27

Under review as a conference paper at ICLR 2015

LOW PRECISION ARITHMETIC FOR DEEP LEARNING

Matthieu Courbariaux & Jean-Pierre David — Department of Electrical Engineering, Ecole Polytechnique de Montreal, Montreal, QC H3T 1J4, Canada

Yoshua Bengio* — Department of Computer Science and Operations Research, Universite de Montreal, Montreal, QC H3T 1J4, Canada

ABSTRACT

We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

1 INTRODUCTION

Deep learning is very often limited by memory and computational power. Lots of previous works address the best exploitation of general-purpose hardware, typically CPU clusters (Dean et al., 2012) and GPUs (Coates et al., 2009; Krizhevsky et al., 2012a). Faster implementations usually lead to state of the art results (Dean et al., 2012; Krizhevsky et al., 2012a; Sutskever et al., 2014).

Actually, such approaches always consist in adapting the algorithm to best exploit state of the art hardware. Nevertheless, some dedicated deep learning hardware is appearing as well. FPGA implementations claim a better power efficiency than general-purpose hardware (Farabet et al., 2011; Kim et al., 2009). The corresponding ASIC implementations are even more efficient (Pham et al., 2012). In contrast with general-purpose hardware, dedicated hardware such as ASIC and FPGA makes it possible to build the hardware from the algorithm. In this context, it is important to know what the minimum acceptable precision is.

Actually, minimizing the size of the arithmetic operators and the size of the memories would lead to architectures with more operators and memories working in parallel. It would also drastically reduce the power consumption. For instance, using single precision (32 bits) instead of double precision (64 bits) for a floating point multiplier reduces its area by four on modern FPGAs (Govindu et al., 2004; Underwood, 2004).

In this paper, we simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results

*Yoshua Bengio is a CIFAR Senior Fellow.

arXiv:1412.7024v1 [cs.LG] 22 Dec 2014

Page 28

Deep Learning with Limited Numerical Precision

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan — IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Pritish Narayanan — IBM Almaden Research Center, San Jose, CA 95120

Abstract

Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

1. Introduction

To a large extent, the success of deep learning techniques is contingent upon the underlying hardware platform's ability to perform fast, supervised training of complex networks using large quantities of labeled data. Such a capability enables rapid evaluation of different network architectures and a thorough search over the space of model hyperparameters. It should therefore come as no surprise that recent years have seen a resurgence of interest in deploying large-scale computing infrastructure designed specifically for training deep neural networks. Some notable efforts in this direction include distributed computing infrastructure using thousands of CPU cores (Dean et al., 2012; Chilimbi et al., 2014), or high-end graphics processors (GPUs) (Krizhevsky & Hinton, 2009), or a combination of CPUs and GPUs scaled up to multiple nodes (Coates et al., 2013; Wu et al., 2015).

At the same time, the natural error resiliency of neural network architectures and learning algorithms is well-documented, setting them apart from more traditional workloads that typically require precise computations and number representations with high dynamic range. It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary (Bottou & Bousquet, 2007). Moreover, the addition of noise during training has been shown to improve the neural network's performance (Murray & Edwards, 1994; Bishop, 1995; Audhkhasi et al., 2013). With the exception of employing the asynchronous version of the stochastic gradient descent algorithm (Recht et al., 2011) to reduce network traffic, the state-of-the-art large-scale deep learning systems fail to adequately capitalize on the error-resiliency of their workloads. These systems are built by assembling general-purpose computing hardware designed to cater to the needs of more traditional workloads, incurring high and often unnecessary overhead in the required computational resources.

The work presented in this paper owes its inception to the thinking that it may be possible to leverage algorithm-level noise-tolerance to relax certain constraints on the underlying hardware, leading to a hardware-software co-optimized system that achieves significant improvement in computational performance and energy efficiency. Allowing the low-level hardware components to perform approximate, possibly non-deterministic computations and exposing these hardware-generated errors up to the algorithm level of the computing stack forms a key ingredient in developing such systems. Additionally, the low-level hardware changes need to be introduced in a manner that preserves the programming model so that the benefits can be readily absorbed at the application-level without incurring significant software redevelopment costs.

arXiv:1502.02551v1 [cs.LG] 9 Feb 2015


Page 35

neon: nervana python deep learning library

• User-friendly, extensible, abstracts parallelism

• Support for many deep learning models

• Interface to nervana cloud

• Supports multiple backends: { nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon) }

• Multiple limited precision options

• Optimized for Maxwell at assembler level

About | Kernels | neon | Summary


Page 40

neon: easy model configuration

About | Kernels | neon | Summary

• Dataset

• Weight initialization

• Learning rule

• Model layers and cost
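A configuration covering those four pieces might look like the following sketch. This is a hypothetical Python structure for illustration only — the keys, layer names, and values are assumptions, not neon's actual API or schema.

```python
# Hypothetical model configuration covering the four pieces on the slide:
# dataset, weight initialization, learning rule, and model layers/cost.
config = {
    "dataset": {"name": "mnist", "batch_size": 128},
    "weight_init": {"type": "uniform", "low": -0.1, "high": 0.1},
    "learning_rule": {"type": "gradient_descent_momentum",
                      "learning_rate": 0.01, "momentum": 0.9},
    "model": {
        "layers": [
            {"type": "conv", "nofm": 16, "fshape": (5, 5), "activation": "rectlin"},
            {"type": "pool", "op": "max", "fshape": (2, 2)},
            {"type": "fc", "nout": 10, "activation": "logistic"},
        ],
        "cost": "cross_entropy",
    },
}

# The library's job is then to map this declarative description onto
# whichever backend (GPU, CPU, ...) is selected.
assert set(config) == {"dataset", "weight_init", "learning_rule", "model"}
```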


Page 47

neon experiments in fp16/32

• Use 16-bit floating point (fp16) as the memory format

• Multiply-and-adds use fp32

• Kernel support for: GEMM; conv {f,b}prop and update; stochastic rounding; max pooling; dropout / maxout; statistics collection

• Python element-wise operations auto-compiled into kernels

• fp16 accumulations done carefully to minimize errors

• Working with collaborators (Baidu, Bengio lab) to improve

About | Kernels | neon | Summary
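The fp16-storage / fp32-accumulate scheme in the first two bullets can be sketched in NumPy. This illustrates the numerics, not neon's kernels: operands live in memory as float16, products are accumulated in float32, and only the final result is rounded back to float16.

```python
import numpy as np

def matmul_fp16_storage(a16, b16):
    """Inputs and output are stored as fp16; multiply-adds run in fp32."""
    acc = a16.astype(np.float32) @ b16.astype(np.float32)  # fp32 accumulate
    return acc.astype(np.float16)                          # fp16 memory format

rng = np.random.default_rng(0)
a16 = rng.standard_normal((64, 256)).astype(np.float16)
b16 = rng.standard_normal((256, 64)).astype(np.float16)

out = matmul_fp16_storage(a16, b16)
# Against a float64 reference, the fp32-accumulated result agrees to
# within fp16 rounding, so the fp16 storage costs essentially no accuracy.
ref = (a16.astype(np.float64) @ b16.astype(np.float64)).astype(np.float16)
print(np.allclose(out.astype(np.float64), ref.astype(np.float64),
                  rtol=1e-2, atol=0.1))  # → True
```

The payoff is the other direction of the trade: halving the bytes moved per operand halves the memory bandwidth and storage, which is where the speedups on the following slides come from.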


Page 50

fp16/32 accuracy

• No accuracy loss going from fp32 to fp16

[Histograms: error (%) distribution over 25 runs, counts across the 29–35% range, for fp32, fp16, and fp16 with stochastic rounding ("fp16 sto")]

About | Kernels | neon | Summary
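Stochastic rounding, used for the "fp16 sto" variant above (and studied in the Gupta et al. paper quoted earlier), can be sketched as follows. This is a minimal Python illustration of the scheme, not neon's kernel, and the grid step is an arbitrary choice: a value is rounded up with probability proportional to its distance from the lower representable neighbor, so the rounding error is zero in expectation.

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -10, rng=np.random.default_rng(0)):
    """Round x to multiples of `step`, up with probability (x - floor)/step."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    frac = scaled - floor                      # distance to lower neighbor, in [0, 1)
    up = rng.random(size=scaled.shape) < frac  # round up with probability frac
    return (floor + up) * step

# Unbiased in expectation: the mean of many stochastic roundings of the same
# value converges to that value, unlike deterministic round-to-nearest,
# which would send every copy of this x to 0.
x = 0.3 * 2.0 ** -10   # sits 30% of the way between two grid points
samples = stochastic_round(np.full(100000, x))
print(f"mean bias: {samples.mean() - x:.2e}")  # close to zero
```

This is why small gradient updates survive low-precision training: round-to-nearest silently drops them, while stochastic rounding applies them the right fraction of the time.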

Page 51

Speed benchmarks¹: fp16 vs others

[Bar chart: time per layer (ms, 0–600) for 5 convolutional layers, forward and backward pass, comparing neon fp16, neon, Cuda-convnet2, cudanet, Torch7, and cuDNN; lower times are better. Benchmarks on GTX980. *2nd, 3rd layer don't fit on a 4GB card.]

¹ Soumith Chintala, github.com/soumith/convnet-benchmarks

About | Kernels | neon | Summary

Page 52: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass05 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks

Page 53: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass0

100

200

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks

Page 54: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass0

100

200

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

50

100

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks
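Per-layer timings like those in the chart come from a simple wall-clock harness: warm up, run the layer repeatedly, and report the average in milliseconds. Below is a minimal CPU/NumPy sketch of that methodology, with a GEMM standing in for a convolutional layer; it is not neon's or convnet-benchmarks' actual harness, and `time_layer` is a made-up helper name.

```python
import time
import numpy as np

def time_layer(fn, warmup=2, iters=10):
    """Average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):          # warm caches / allocator before timing
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

# Stand-in for one conv layer: an equivalent GEMM
a = np.random.rand(512, 1024).astype(np.float32)
b = np.random.rand(1024, 512).astype(np.float32)

ms = time_layer(lambda: a @ b)
print(f"time per layer: {ms:.3f} ms")
```

On a GPU the same pattern applies, with the extra requirement of synchronizing the device before reading the clock, since kernel launches are asynchronous.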

Page 55:

18

Benchmarks1 show 2x performance

Raw numbers (averaged over 10 runs):

Network      fprop (ms / gflops)    bprop (ms / gflops)    total (ms / gflops)
Alexnet       43.650 / 4188.573      94.315 / 3877.055     137.965 / 3975.615
Overfeat     172.005 / 4169.400     355.809 / 4031.144     527.815 / 4076.199
VGG (N=64)   234.050 / 4161.347     529.052 / 3681.920     763.102 / 3828.965

Maximum practical peak is 4700 gflops.

More than double speed2 with half memory storage / bandwidth.

[Figure: time (s, 0–30) vs speed (TFLOPS, 0–5) for Alexnet fp16 and Alexnet Cuda-Convnet2]

1 Using conventions here: Soumith Chintala, github.com/soumith/convnet-benchmarks
2 Numbers are relative to Titan Black (Kepler architecture)
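The gflops figures follow directly from flop count divided by time. As a worked check using the Alexnet row above, one can back out the implied flop counts of the forward and backward passes and confirm that the reported total is self-consistent:

```python
# Numbers from the slide (Alexnet, averaged over 10 runs)
fprop_ms, fprop_gflops = 43.650, 4188.573
bprop_ms, bprop_gflops = 94.315, 3877.055
total_ms, total_gflops = 137.965, 3975.615

# gflops = (flops / 1e9) / seconds, so the implied flop counts are:
fprop_flops = fprop_gflops * 1e9 * fprop_ms * 1e-3   # ~183 GFLOP
bprop_flops = bprop_gflops * 1e9 * bprop_ms * 1e-3   # ~366 GFLOP

# The total throughput should equal combined flops over combined time
derived_total_gflops = (fprop_flops + bprop_flops) / 1e9 / (total_ms * 1e-3)
print(f"derived total: {derived_total_gflops:.1f} gflops")  # ≈ 3975.6
```

The derived value matches the reported 3975.615 gflops, about 85% of the stated 4700 gflops practical peak.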

Page 59:

19

Summary

• neon: user-friendly Python library

• maxas: powerful tool for optimizing deep learning

• Fast performance, full utilization of the GPU

• Limited precision allows for larger models

• Toolbox for exploring numerical representations

Page 66:

20

GTC 2015

• Contact us at [email protected]

• We are hiring: cloud engineers, machine learning engineers, GPU experts, software engineers

• Sign up to try neon, our deep learning library.

• We can help solve your problem.