Transcript of "OPTIMIZED GPU KERNELS FOR DEEP LEARNING", Amir Khosrowshahi, GTC, 17 Mar 2015

Page 1

OPTIMIZED GPU KERNELS FOR DEEP LEARNING

Amir Khosrowshahi

GTC 17 Mar 2015

Page 2

Outline

• About nervana

• Optimizing deep learning at assembler level

• Limited precision for deep learning

• neon benchmarks

Page 3

About nervana

• A platform for machine intelligence

• enable deep learning at scale

• optimized from algorithms to silicon

About | Kernels | neon | Summary



Page 5

Verticals

Medical | Finance | Pharma | Oil & Gas | Agriculture

• Deep learning supplanting traditional approaches everywhere

• Small improvements have large impact

• Customers require a clear roadmap that scales with growing needs

About | Kernels | neon | Summary


Page 7

nervana platform for deep learning

[Diagram: Data → nervana framework (explore, train, deploy) → Solutions, running on the nervana cloud over GPUs, CPUs, and the nervana engine]

About | Kernels | neon | Summary


Page 10

maxas: a Maxwell Assembler (Scott Gray)

• Full control of:

• register allocation

• instruction ordering

• control codes

• barriers, stall counts

• Built-in scheduler (optional)

• Meta-programming

See GitHub repo for docs and examples

About | Kernels [ maxas ] | neon | Summary

Page 11

ptxas struggles with Instruction Level Parallelism

[Chart: distribution of the number of instructions between an LDS and its dependent FFMA operands — count vs. (FFMA line # − LDS line #), comparing ptxas-compiled code with cuBLAS; short distances (left) are bad, long distances (right) are good]

About | Kernels [ maxas ] | neon | Summary

courtesy Scott Gray
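The metric behind that chart can be sketched in a few lines of Python. This is an illustrative reconstruction, not maxas code: the listing and the operand-parsing are hypothetical, simplified SASS-like text, and the function just measures how many instructions separate each shared-memory load (LDS) from the first FFMA that consumes its destination register. Larger distances give the hardware more room to hide LDS latency.

```python
import re

def lds_to_ffma_distances(instructions):
    """For each LDS, count instructions until the first dependent FFMA."""
    distances = []
    for i, inst in enumerate(instructions):
        m = re.match(r"LDS\s+(R\d+)", inst)
        if not m:
            continue
        dest = m.group(1)
        for j in range(i + 1, len(instructions)):
            if instructions[j].startswith("FFMA"):
                # source operands are everything after the destination
                srcs = [t.strip() for t in instructions[j][4:].split(",")[1:]]
                if dest in srcs:
                    distances.append(j - i)
                    break
    return distances

# Hypothetical listing: the first load's result is consumed four
# instructions later (good), the second one immediately (bad).
listing = [
    "LDS R0, [R10]",
    "LDS R1, [R11]",
    "FFMA R2, R1, R3, R2",   # depends on R1, distance 1
    "FFMA R4, R5, R6, R4",
    "FFMA R7, R0, R8, R7",   # depends on R0, distance 4
]
print(lds_to_ffma_distances(listing))  # → [4, 1]
```

A hand-scheduled kernel pushes this distribution to the right, which is exactly what the slide's cuBLAS-vs-ptxas comparison shows.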

Page 12

Easy register allocation through maxas

About | Kernels [ maxas ] | neon | Summary

Register banking for outer products: c = a bᵀ

[Diagram: the c tile accumulates the outer product of a column fragment a and a row fragment b, with registers banked to avoid conflicts]
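The outer-product formulation on this slide can be sketched in NumPy. This is an illustration of the math, not the maxas kernel: each step of the k-loop takes one column fragment of A and one row fragment of B and accumulates their rank-1 outer product into the C tile, which is what the register tile in a GEMM kernel holds.

```python
import numpy as np

def gemm_outer_product(A, B):
    """C = A @ B computed as a sum of rank-1 (outer-product) updates,
    mirroring how a GEMM kernel accumulates its register tile."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        # one column of A times one row of B: a rank-1 update of the tile
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.allclose(gemm_outer_product(A, B), A @ B)
```

Because every element of the a and b fragments is reused across a whole row and column of C, banking the registers so a-operands and b-operands never collide is what keeps the FFMA pipeline fed.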


Page 19

Example GEMM code in maxas

About | Kernels [ maxas ] | neon | Summary

[Annotated kernel listing highlighting: control codes, dual-issued instructions, fused fp32 multiply-adds (FFMA), loads from shared memory (LDS), setting a barrier, and barrier synchronization]

Page 20

Convolution kernels for deep learning

Input * Filters = Output

[Diagram: an input volume of C channels with H × W spatial dims is convolved with K filters of size C × R × S each, producing an output of K channels with P × Q spatial dims]

C: number of input channels · H × W: input spatial dims · R × S: filter spatial dims · K: number of filters · P × Q: output spatial dims · N: mini-batch dim (not shown)

About | Kernels [ Convolution ] | neon | Summary
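With these dimensions, lowering convolution to matrix multiply — the approach the next slides walk through — turns the forward pass into a single GEMM. Below is a minimal NumPy sketch of the idea, assuming stride 1 and no padding; the function name and layout are illustrative, not neon's API.

```python
import numpy as np

def conv_fprop_lowered(x, w):
    """Forward convolution via matrix lowering, stride 1, no padding.
    x: input (C, H, W); w: filters (K, C, R, S); returns output (K, P, Q)."""
    C, H, W = x.shape
    K, C2, R, S = w.shape
    assert C == C2
    P, Q = H - R + 1, W - S + 1
    # Lower each receptive field into one column of shape (C*R*S,)
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # One GEMM: (K, C*R*S) @ (C*R*S, P*Q) -> (K, P*Q)
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, P, Q)

x = np.random.rand(3, 5, 5).astype(np.float32)
w = np.random.rand(2, 3, 3, 3).astype(np.float32)
print(conv_fprop_lowered(x, w).shape)  # → (2, 3, 3)
```

The point of doing the lowering inside the kernel (rather than materializing `cols` in memory) is that each input element is read many times, so computing the access pattern on the fly saves bandwidth.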


Page 22

Access patterns for matrix lowering

• Convolution kernels: fprop

About | Kernels [ Convolution ] | neon | Summary

Page 23

Access patterns for matrix lowering

• Convolution kernels: bprop

Backprop Step 1

[Diagram with N = 3, C = 3, H = W = 3; filters C = 3, K = 2, R = S = 2; deltas δ1 with P = Q = 2, K = 2. In each iteration, there is a multiply operation between an N×K matrix and a K×(C·R·S) matrix; the operands and the result are shown shaded, and a marker in the figure stands for deconvolve. The results of the matrix multiplications are accumulated to obtain δ0.]

About | Kernels [ Convolution ] | neon | Summary

Page 24

Access patterns for matrix lowering

• Convolution kernels: update

Backprop Step 2 – Weight Updates

[Diagram with N = 3, C = 3, H = W = 3 (output of the previous layer); deltas δ1 with P = Q = 2, K = 2; weight updates with C = 3, K = 2, R = S = 2. In each iteration, there is a multiply operation between a K×N matrix and an N×(C·R·S) matrix; note that the delta matrix is sliced and the result transposed before the multiplication. The results of the matrix multiplications are accumulated to obtain the weight updates.]

About | Kernels [ Convolution ] | neon | Summary
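The update step lowers the same way. Here is an illustrative NumPy reconstruction (stride 1, no padding, single image, names hypothetical): the lowered input columns are multiplied by the output deltas to accumulate the gradient for every filter tap in one GEMM.

```python
import numpy as np

def conv_update_lowered(x, delta):
    """Weight-update via matrix lowering, stride 1, no padding.
    x: input (C, H, W); delta: output gradient (K, P, Q);
    returns filter gradient (K, C, R, S)."""
    C, H, W = x.shape
    K, P, Q = delta.shape
    R, S = H - P + 1, W - Q + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # One GEMM: (K, P*Q) @ (P*Q, C*R*S) -> gradient for every filter tap
    dw = delta.reshape(K, -1) @ cols.T
    return dw.reshape(K, C, R, S)

x = np.random.rand(3, 5, 5).astype(np.float32)
delta = np.random.rand(2, 3, 3).astype(np.float32)
print(conv_update_lowered(x, delta).shape)  # → (2, 3, 3, 3)
```

With a mini-batch, the per-image products are simply accumulated, which matches the slide's "results of the matrix multiplications are accumulated" description.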


Page 26

Deep learning with low precision works

About | Kernels [ Limited Precision ] | neon | Summary

Improving the speed of neural networks on CPUs

Vincent Vanhoucke, Andrew Senior, Mark Z. Mao — Google, Inc.

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

1 Introduction

The recent resurgence of interest in neural networks owes a certain debt to the availability of affordable, powerful GPUs which routinely speed up common operations such as large matrix computations by factors from 5× to 50× [1-3]. These enabled researchers to tackle much larger, more difficult machine learning tasks using neural networks, auto-encoders or deep belief networks [4-6]. Due to a variety of factors, including cost, component reliability and programming complexity, GPUs are still however the exception rather than the norm in computing clusters. The question then becomes whether to invest in GPU resources, or whether traditional CPUs can be made to perform fast enough that, using distributed computing, they will yield similar or superior scalability and performance. The purpose of this paper is not to settle this debate, but rather to introduce to neural network researchers some tools which can significantly improve the performance of neural networks on Intel and AMD CPUs in accessible form. Some of these might not be novel to researchers well versed in high-performance computing, but they lay the foundation for improvements going beyond what one might obtain using existing optimized BLAS packages. We will show in particular how one can outperform optimized BLAS packages by a factor of 3 using fixed point arithmetic and SSSE3 / SSE4 instructions.

Page 27

Under review as a conference paper at ICLR 2015

LOW PRECISION ARITHMETIC FOR DEEP LEARNING

Matthieu Courbariaux & Jean-Pierre David — Department of Electrical Engineering, Ecole Polytechnique de Montreal, Montreal, QC H3T 1J4, Canada

Yoshua Bengio* — Department of Computer Science and Operations Research, Universite de Montreal, Montreal, QC H3T 1J4, Canada

ABSTRACT

We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

1 INTRODUCTION

Deep learning is very often limited by memory and computational power. Lots of previous works address the best exploitation of general-purpose hardware, typically CPU clusters (Dean et al., 2012) and GPUs (Coates et al., 2009; Krizhevsky et al., 2012a). Faster implementations usually lead to state of the art results (Dean et al., 2012; Krizhevsky et al., 2012a; Sutskever et al., 2014).

Actually, such approaches always consist in adapting the algorithm to best exploit state of the art hardware. Nevertheless, some dedicated deep learning hardware is appearing as well. FPGA implementations claim a better power efficiency than general-purpose hardware (Farabet et al., 2011; Kim et al., 2009). The corresponding ASIC implementations are even more efficient (Pham et al., 2012). In contrast with general-purpose hardware, dedicated hardware such as ASIC and FPGA makes it possible to build the hardware from the algorithm. In this context, it is important to know what the minimum acceptable precision is.

Actually, minimizing the size of the arithmetic operators and the size of the memories would lead to architectures with more operators and memories working in parallel. It would also drastically reduce the power consumption. For instance, using single precision (32 bits) instead of double precision (64 bits) for a floating point multiplier reduces its area by four on modern FPGAs (Govindu et al., 2004; Underwood, 2004).

In this paper, we simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results

*Yoshua Bengio is a CIFAR Senior Fellow.

arXiv:1412.7024v1 [cs.LG] 22 Dec 2014

Page 28

Deep Learning with Limited Numerical Precision

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan — IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Pritish Narayanan — IBM Almaden Research Center, San Jose, CA 95120

Abstract

Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

1. Introduction

To a large extent, the success of deep learning techniques is contingent upon the underlying hardware platform's ability to perform fast, supervised training of complex networks using large quantities of labeled data. Such a capability enables rapid evaluation of different network architectures and a thorough search over the space of model hyperparameters. It should therefore come as no surprise that recent years have seen a resurgence of interest in deploying large-scale computing infrastructure designed specifically for training deep neural networks. Some notable efforts in this direction include distributed computing infrastructure using thousands of CPU cores (Dean et al., 2012; Chilimbi et al., 2014), or high-end graphics processors (GPUs) (Krizhevsky & Hinton, 2009), or a combination of CPUs and GPUs scaled up to multiple nodes (Coates et al., 2013; Wu et al., 2015).

At the same time, the natural error resiliency of neural network architectures and learning algorithms is well-documented, setting them apart from more traditional workloads that typically require precise computations and number representations with high dynamic range. It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary (Bottou & Bousquet, 2007). Moreover, the addition of noise during training has been shown to improve the neural network's performance (Murray & Edwards, 1994; Bishop, 1995; Audhkhasi et al., 2013). With the exception of employing the asynchronous version of the stochastic gradient descent algorithm (Recht et al., 2011) to reduce network traffic, the state-of-the-art large-scale deep learning systems fail to adequately capitalize on the error-resiliency of their workloads. These systems are built by assembling general-purpose computing hardware designed to cater to the needs of more traditional workloads, incurring high and often unnecessary overhead in the required computational resources.

The work presented in this paper owes its inception to the thinking that it may be possible to leverage algorithm-level noise-tolerance to relax certain constraints on the underlying hardware, leading to a hardware-software co-optimized system that achieves significant improvement in computational performance and energy efficiency. Allowing the low-level hardware components to perform approximate, possibly non-deterministic computations and exposing these hardware-generated errors up to the algorithm level of the computing stack forms a key ingredient in developing such systems. Additionally, the low-level hardware changes need to be introduced in a manner that preserves the programming model so that the benefits can be readily absorbed at the application-level without incurring significant software redevelopment costs.

arXiv:1502.02551v1 [cs.LG] 9 Feb 2015


Page 35

neon: nervana python deep learning library

• User-friendly, extensible, abstracts parallelism

• Support for many deep learning models

• Interface to nervana cloud

• Supports multiple backends: { nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon) }

• Multiple limited precision options

• Optimized for Maxwell at assembler level

About | Kernels | neon | Summary


Page 40

neon: easy model configuration

About | Kernels | neon | Summary

• Dataset

• Weight initialization

• Learning rule

• Model layers and cost
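A configuration covering those four pieces might look like the following sketch. This is a hypothetical Python structure for illustration only — the keys, layer names, and values are assumptions, not neon's actual API or schema.

```python
# Hypothetical model configuration covering the four pieces on the slide:
# dataset, weight initialization, learning rule, and model layers/cost.
config = {
    "dataset": {"name": "mnist", "batch_size": 128},
    "weight_init": {"type": "uniform", "low": -0.1, "high": 0.1},
    "learning_rule": {"type": "gradient_descent_momentum",
                      "learning_rate": 0.01, "momentum": 0.9},
    "model": {
        "layers": [
            {"type": "conv", "nofm": 16, "fshape": (5, 5), "activation": "rectlin"},
            {"type": "pool", "op": "max", "fshape": (2, 2)},
            {"type": "fc", "nout": 10, "activation": "logistic"},
        ],
        "cost": "cross_entropy",
    },
}

# The library's job is then to map this declarative description onto
# whichever backend (GPU, CPU, ...) is selected.
assert set(config) == {"dataset", "weight_init", "learning_rule", "model"}
```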


Page 47

neon experiments in fp16/32

• Use 16-bit floating point (fp16) as the memory format

• Multiply-and-adds use fp32

• Kernel support for: GEMM; conv {f,b}prop and update; stochastic rounding; max pooling; dropout / maxout; statistics collection

• Python element-wise operations auto-compiled into kernels

• fp16 accumulations done carefully to minimize errors

• Working with collaborators (Baidu, Bengio lab) to improve

About | Kernels | neon | Summary
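The fp16-storage / fp32-accumulate scheme in the first two bullets can be sketched in NumPy. This illustrates the numerics, not neon's kernels: operands live in memory as float16, products are accumulated in float32, and only the final result is rounded back to float16.

```python
import numpy as np

def matmul_fp16_storage(a16, b16):
    """Inputs and output are stored as fp16; multiply-adds run in fp32."""
    acc = a16.astype(np.float32) @ b16.astype(np.float32)  # fp32 accumulate
    return acc.astype(np.float16)                          # fp16 memory format

rng = np.random.default_rng(0)
a16 = rng.standard_normal((64, 256)).astype(np.float16)
b16 = rng.standard_normal((256, 64)).astype(np.float16)

out = matmul_fp16_storage(a16, b16)
# Against a float64 reference, the fp32-accumulated result agrees to
# within fp16 rounding, so the fp16 storage costs essentially no accuracy.
ref = (a16.astype(np.float64) @ b16.astype(np.float64)).astype(np.float16)
print(np.allclose(out.astype(np.float64), ref.astype(np.float64),
                  rtol=1e-2, atol=0.1))  # → True
```

The payoff is the other direction of the trade: halving the bytes moved per operand halves the memory bandwidth and storage, which is where the speedups on the following slides come from.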


Page 50

fp16/32 accuracy

• No accuracy loss going from fp32 to fp16

[Histograms: error (%) distribution over 25 runs, counts across the 29–35% range, for fp32, fp16, and fp16 with stochastic rounding ("fp16 sto")]

About | Kernels | neon | Summary
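Stochastic rounding, used for the "fp16 sto" variant above (and studied in the Gupta et al. paper quoted earlier), can be sketched as follows. This is a minimal Python illustration of the scheme, not neon's kernel, and the grid step is an arbitrary choice: a value is rounded up with probability proportional to its distance from the lower representable neighbor, so the rounding error is zero in expectation.

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -10, rng=np.random.default_rng(0)):
    """Round x to multiples of `step`, up with probability (x - floor)/step."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    frac = scaled - floor                      # distance to lower neighbor, in [0, 1)
    up = rng.random(size=scaled.shape) < frac  # round up with probability frac
    return (floor + up) * step

# Unbiased in expectation: the mean of many stochastic roundings of the same
# value converges to that value, unlike deterministic round-to-nearest,
# which would send every copy of this x to 0.
x = 0.3 * 2.0 ** -10   # sits 30% of the way between two grid points
samples = stochastic_round(np.full(100000, x))
print(f"mean bias: {samples.mean() - x:.2e}")  # close to zero
```

This is why small gradient updates survive low-precision training: round-to-nearest silently drops them, while stochastic rounding applies them the right fraction of the time.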

Page 51

Speed benchmarks¹: fp16 vs others

[Bar chart: time per layer (ms, 0–600) for 5 convolutional layers, forward and backward pass, comparing neon fp16, neon, Cuda-convnet2, cudanet, Torch7, and cuDNN; lower times are better. Benchmarks on GTX980. *2nd, 3rd layer don't fit on a 4GB card.]

¹ Soumith Chintala, github.com/soumith/convnet-benchmarks

About | Kernels | neon | Summary

Page 52: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass05 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks

Page 53: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass0

100

200

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks

Page 54: OPTIMIZED GPU KERNELS FOR DEEP LEARNING · OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

X

Speed benchmarks1: fp16 vs fp32

About | Kernels | neon | Summary

100

200

300

400

500

600

Tim

e pe

r lay

er (m

s)

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 layers forward pass, 5 backward pass0

100

200

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

50

100

Tim

e pe

r lay

er (m

s)

5 layers forward pass, 5 backward pass0

neon fp16Cuda-

convnet2neon

cudanet Torch7 cuDNN

5 convolutional layers, forward and backward pass Lower times are better. Benchmarks on GTX980

*

*some layers do not fit on a 4GB card

1 Soumith Chintala, github.com/soumith/convnet-benchmarks
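Per-layer timings like those in the chart come from a simple wall-clock harness: warm up, run the layer repeatedly, and report the average in milliseconds. Below is a minimal CPU/NumPy sketch of that methodology, with a GEMM standing in for a convolutional layer; it is not neon's or convnet-benchmarks' actual harness, and `time_layer` is a made-up helper name.

```python
import time
import numpy as np

def time_layer(fn, warmup=2, iters=10):
    """Average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):          # warm caches / allocator before timing
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

# Stand-in for one conv layer: an equivalent GEMM
a = np.random.rand(512, 1024).astype(np.float32)
b = np.random.rand(1024, 512).astype(np.float32)

ms = time_layer(lambda: a @ b)
print(f"time per layer: {ms:.3f} ms")
```

On a GPU the same pattern applies, with the extra requirement of synchronizing the device before reading the clock, since kernel launches are asynchronous.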

Page 55:

18

Benchmarks1 show 2x performance

Raw numbers (averaged over 10 runs):

Network      fprop (ms / gflops)    bprop (ms / gflops)    total (ms / gflops)
Alexnet       43.650 / 4188.573      94.315 / 3877.055     137.965 / 3975.615
Overfeat     172.005 / 4169.400     355.809 / 4031.144     527.815 / 4076.199
VGG (N=64)   234.050 / 4161.347     529.052 / 3681.920     763.102 / 3828.965

Maximum practical peak is 4700 gflops.

More than double speed2 with half memory storage / bandwidth.

[Figure: time (s, 0–30) vs speed (TFLOPS, 0–5) for Alexnet fp16 and Alexnet Cuda-Convnet2]

1 Using conventions here: Soumith Chintala, github.com/soumith/convnet-benchmarks
2 Numbers are relative to Titan Black (Kepler architecture)
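The gflops figures follow directly from flop count divided by time. As a worked check using the Alexnet row above, one can back out the implied flop counts of the forward and backward passes and confirm that the reported total is self-consistent:

```python
# Numbers from the slide (Alexnet, averaged over 10 runs)
fprop_ms, fprop_gflops = 43.650, 4188.573
bprop_ms, bprop_gflops = 94.315, 3877.055
total_ms, total_gflops = 137.965, 3975.615

# gflops = (flops / 1e9) / seconds, so the implied flop counts are:
fprop_flops = fprop_gflops * 1e9 * fprop_ms * 1e-3   # ~183 GFLOP
bprop_flops = bprop_gflops * 1e9 * bprop_ms * 1e-3   # ~366 GFLOP

# The total throughput should equal combined flops over combined time
derived_total_gflops = (fprop_flops + bprop_flops) / 1e9 / (total_ms * 1e-3)
print(f"derived total: {derived_total_gflops:.1f} gflops")  # ≈ 3975.6
```

The derived value matches the reported 3975.615 gflops, about 85% of the stated 4700 gflops practical peak.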

Page 59:

19

Summary

• neon: user-friendly Python library

• maxas: powerful tool for optimizing deep learning

• Fast performance, full utilization of the GPU

• Limited precision allows for larger models

• Toolbox for exploring numerical representations

Page 66:

20

GTC 2015

• Contact us at [email protected]

• We are hiring: cloud engineers, machine learning engineers, GPU experts, software engineers

• Sign up to try neon, our deep learning library.

• We can help solve your problem.