CIFAR Summer School 2016

CIFAR SUMMER SCHOOL 2016: MACHINE LEARNING WITH TORCH + AUTOGRAD

Transcript of CIFAR Summer School 2016

Page 1: cifar summer school 2016

MACHINE LEARNING WITH TORCH + AUTOGRAD

Page 2: cifar summer school 2016

@awiltsch

ALEX WILTSCHKO, RESEARCH ENGINEER, TWITTER

Page 3: cifar summer school 2016

MATERIAL DEVELOPED WITH

SOUMITH CHINTALA, HUGO LAROCHELLE, RYAN ADAMS, LUKE ALONSO, CLEMENT FARABET

Page 4: cifar summer school 2016

FIRST HALF: TORCH BASICS & OVERVIEW OF TRAINING NEURAL NETS

SECOND HALF: AUTOMATIC DIFFERENTIATION AND TORCH-AUTOGRAD

Page 5: cifar summer school 2016

An array-programming library for Lua that looks a lot like NumPy and MATLAB

Page 6: cifar summer school 2016

WHAT IS TORCH?

• Interactive scientific computing framework in Lua

Page 7: cifar summer school 2016

WHAT IS TORCH?

• 150+ tensor functions
• Linear algebra
• Convolutions
• Tensor manipulation: narrow, index, mask, etc.
• Logical operators
• Fully documented: https://github.com/torch/torch7/tree/master/doc
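As a hedged illustration (not code from the slides), a few of these tensor functions look like this in Lua; the sizes and values are arbitrary:

```lua
-- Illustrative sketch of a handful of core tensor functions.
local torch = require 'torch'

local A = torch.randn(4, 6)           -- 4x6 matrix of Gaussian noise
local B = torch.randn(6, 3)

local C = torch.mm(A, B)              -- linear algebra: matrix-matrix product (4x3)
local cols = A:narrow(2, 2, 3)        -- narrow: a view of columns 2..4, no copy
local rows = A:index(1, torch.LongTensor{1, 3})  -- index: gather rows 1 and 3
local mask = A:gt(0)                  -- logical operator: elementwise A > 0
local pos = A[mask]                   -- masked select of the positive entries

print(C:size(), rows:size(), pos:nElement())
```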

Page 8: cifar summer school 2016

WHAT IS TORCH?

• Similar to MATLAB / Python+NumPy

Pages 9-12: cifar summer school 2016

WHAT IS TORCH?

Lots of functions that can operate on tensors: all the basics, slicing, BLAS, LAPACK, cephes, rand

Special functions via torch-cephes: http://deepmind.github.io/torch-cephes/
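A hedged sketch (not from the slides) of what slicing, LAPACK routines, and random-number generation look like; the matrices here are arbitrary:

```lua
-- Illustrative sketch of slicing, BLAS/LAPACK routines and rand.
local torch = require 'torch'

torch.manualSeed(0)
local A = torch.rand(5, 5)                 -- uniform random matrix
local b = torch.rand(5, 1)

local block = A[{{1, 3}, {2, 4}}]          -- slicing: rows 1..3, columns 2..4 (a view)
A[{1, 1}] = 42                             -- element assignment with the same syntax

local x = torch.gesv(b, A)                 -- LAPACK: solve A * x = b
local u, s, v = torch.svd(A)               -- LAPACK: singular value decomposition
local y = torch.mv(A, b:view(5))           -- BLAS: matrix-vector product

print(block:size(), s:size(), y:size())
```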

Page 13: cifar summer school 2016

WHAT IS TORCH?

• Inline help
• Good docs online: http://torch.ch/docs/

Page 14: cifar summer school 2016

WHAT IS TORCH?

• Little language overhead compared to Python / MATLAB
• JIT compilation via LuaJIT
• Fearlessly write for-loops (code snippet from a core package shown on the slide)
• Plain Lua is ~10k LOC of C, a small language
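To make the "fearlessly write for-loops" point concrete, here is a hedged sketch (not the slide's snippet) of an explicit element loop that LuaJIT can compile to fast machine code:

```lua
-- Elementwise clamp written as a plain Lua loop rather than a vectorized call.
local torch = require 'torch'

local x = torch.randn(1000000)

local function clampInPlace(t, lo, hi)
   for i = 1, t:size(1) do
      local v = t[i]
      if v < lo then t[i] = lo elseif v > hi then t[i] = hi end
   end
   return t
end

clampInPlace(x, -1, 1)
print(x:min(), x:max())
```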

Page 15: cifar summer school 2016

LUA IS DESIGNED TO INTEROPERATE WITH C

• The "FFI" allows easy integration with C code
• Been copied by many languages (e.g. cffi in Python)
• No Cython/SWIG required to integrate C code
• Lua originally designed to be embedded!
  • World of Warcraft
  • Adobe Lightroom
  • Redis
  • nginx
• Lua originally chosen for embedded machine learning
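A minimal sketch of the LuaJIT FFI mentioned above: declare C signatures and call them directly, with no Cython/SWIG-style binding layer (illustrative, not from the slides):

```lua
local ffi = require 'ffi'

-- Declare the C functions we want to call.
ffi.cdef[[
double cos(double x);
double sqrt(double x);
]]

-- ffi.C resolves symbols from the C standard library.
print(ffi.C.cos(0))    -- 1
print(ffi.C.sqrt(2))   -- 1.4142...
```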

Page 16: cifar summer school 2016

WHAT IS TORCH?

• Easy integration into and from C
• Example: using cuDNN functions

Page 17: cifar summer school 2016

WHAT IS TORCH?

• Strong GPU support

Pages 18-23: cifar summer school 2016

COMMUNITY

Code for cutting-edge models shows up for Torch very quickly

Page 24: cifar summer school 2016

TORCH - WHERE DOES IT FIT? How big is its ecosystem?

• Smaller than Python for general data science
• Strong for deep learning
• Switching from Python to Lua can be smooth

Page 25: cifar summer school 2016

TORCH - WHERE DOES IT FIT? Is it for research or production? It can be for both, but it is mostly used for research.

There is no silver bullet

Slide credit: Yangqing Jia

Industry: stability, scale & speed, data integration, relatively fixed
Research: flexible, fast iteration, debuggable, relatively bare-bone

Frameworks placed along this spectrum on the slide: Caffe, Torch, Theano, TensorFlow, D4J, Neon, etc.

Page 26: cifar summer school 2016

CORE PHILOSOPHY

• Interactive computing
  • No compilation time
• Imperative programming
  • Write code like you always did, not computation graphs in a "mini-language" or DSL
• Minimal abstraction
  • Thinking linearly
• Maximal flexibility
  • No constraints on interfaces or classes

Page 27: cifar summer school 2016


TENSORS AND STORAGES

• Tensor = n-dimensional array

• Row-major in memory

Page 28: cifar summer school 2016


TENSORS AND STORAGES

• Tensor = n-dimensional array

• Row-major in memory

size: 4 x 6 stride: 6 x 1
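A short hedged sketch of how the size and stride on this slide show up in code (values illustrative):

```lua
local torch = require 'torch'

local t = torch.randn(4, 6)   -- row-major: each row is contiguous in storage
print(t:size())               -- 4 x 6
print(t:stride())             -- 6, 1: stepping one row skips 6 storage elements
```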

Page 29: cifar summer school 2016


TENSORS AND STORAGES

• Tensor = n-dimensional array

• 1-indexed
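A tiny illustration of 1-indexing (not from the slides):

```lua
local torch = require 'torch'

local t = torch.Tensor{10, 20, 30}
print(t[1])            -- 10: tensors are 1-indexed, like Lua tables
print(t[t:size(1)])    -- 30: the last element
```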

Page 30: cifar summer school 2016

TENSORS AND STORAGES

• Tensor = n-dimensional array
• Tensor: size, stride, storage, storageOffset

Size: 6, Stride: 1, Offset: 13
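A hedged sketch of the view on this slide: selecting a row of a 4x6 tensor yields size 6, stride 1, and storage offset 13 into the same storage (the concrete tensor is illustrative):

```lua
local torch = require 'torch'

local t = torch.range(1, 24):resize(4, 6)   -- 4x6; its storage holds 1..24
local row = t:select(1, 3)                  -- third row, no copy

print(row:size(1), row:stride(1), row:storageOffset())   -- 6  1  13
row:fill(0)                                 -- writes through to t's storage
print(t[3][1])                              -- 0
```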

Page 31: cifar summer school 2016

TENSORS AND STORAGES

• Tensor = n-dimensional array
• Tensor: size, stride, storage, storageOffset

Size: 4, Stride: 6, Offset: 3
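And the column case from this slide, again as a hedged sketch: a column of the same 4x6 tensor is a strided view with size 4, stride 6, offset 3:

```lua
local torch = require 'torch'

local t = torch.range(1, 24):resize(4, 6)
local col = t:select(2, 3)                  -- third column, still no copy

print(col:size(1), col:stride(1), col:storageOffset())   -- 4  6  3
```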

Pages 32-34: cifar summer school 2016

TENSORS AND STORAGES

Underlying storage is shared

Page 35: cifar summer school 2016

TENSORS AND STORAGES

• GPU support for all operations:
  • require 'cutorch'
  • torch.CudaTensor = torch.FloatTensor on GPU
• Fully multi-GPU compatible
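A hedged sketch of the GPU workflow (requires a CUDA device; sizes are arbitrary):

```lua
local torch = require 'torch'
require 'cutorch'

local a = torch.randn(1000, 1000):cuda()    -- move a tensor to the GPU
local b = torch.CudaTensor(1000, 1000):fill(1)

local c = torch.mm(a, b)                    -- same API as on the CPU
print(torch.type(c))                        -- torch.CudaTensor

cutorch.setDevice(1)                        -- pick among multiple GPUs (1-indexed)
```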

Page 36: cifar summer school 2016

TRAINING CYCLE: Moving parts

Page 37: cifar summer school 2016

TRAINING CYCLE

Diagram: threads, nn, nn, optim

Page 38: cifar summer school 2016

THE NN PACKAGE

Diagram: threads, nn, nn, optim

Page 39: cifar summer school 2016

THE NN PACKAGE

• nn: neural networks made easy
• Building blocks of differentiable modules

Page 40: cifar summer school 2016


THE NN PACKAGE

Compose networks like Lego blocks
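A hedged sketch of the Lego-block style (not the slide's exact code); the layer sizes are arbitrary:

```lua
require 'nn'

local mlp = nn.Sequential()
mlp:add(nn.Linear(784, 128))    -- affine layer
mlp:add(nn.ReLU())              -- nonlinearity
mlp:add(nn.Linear(128, 10))
mlp:add(nn.LogSoftMax())

local criterion = nn.ClassNLLCriterion()

local x, y = torch.randn(784), 3           -- input and target class
local out = mlp:forward(x)
local loss = criterion:forward(out, y)

mlp:zeroGradParameters()
mlp:backward(x, criterion:backward(out, y))
print(loss)
```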

Page 41: cifar summer school 2016


THE NN PACKAGE

Page 42: cifar summer school 2016

THE NN PACKAGE

CUDA backend via the cunn package: require 'cunn'
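A hedged sketch of moving an nn model to the GPU with cunn (requires CUDA; sizes arbitrary):

```lua
require 'nn'
require 'cunn'

local mlp = nn.Sequential()
   :add(nn.Linear(784, 128))
   :add(nn.ReLU())
   :add(nn.Linear(128, 10))

mlp:cuda()                          -- convert parameters to CudaTensors

local x = torch.randn(784):cuda()   -- inputs must live on the GPU too
print(torch.type(mlp:forward(x)))   -- torch.CudaTensor
```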

Page 43: cifar summer school 2016


THE NNGRAPH PACKAGE

Graph composition using chaining
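A hedged sketch of nngraph's chaining style: nodes are created by calling modules on other nodes, then wrapped in a gModule (illustrative, not the slide's code):

```lua
require 'nn'
require 'nngraph'

local x = nn.Identity()()                       -- input node
local h = nn.Tanh()(nn.Linear(100, 50)(x))      -- chain: Linear then Tanh
local y = nn.LogSoftMax()(nn.Linear(50, 10)(h))

local net = nn.gModule({x}, {y})                -- graph with one input, one output

print(net:forward(torch.randn(100)):size())     -- 10
```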

Page 44: cifar summer school 2016

ADVANCED NEURAL NETWORKS

• nngraph: easy construction of complicated neural networks

Page 45: cifar summer school 2016

TORCH-AUTOGRAD (by Twitter)

• Write imperative programs
• Backprop defined for every operation in the language

Page 46: cifar summer school 2016

THE OPTIM PACKAGE

Diagram: threads, nn, nn, optim

Page 47: cifar summer school 2016


THE OPTIM PACKAGE

Page 48: cifar summer school 2016

THE OPTIM PACKAGE

A purely functional view of the world

Page 49: cifar summer school 2016

THE OPTIM PACKAGE

Collecting the parameters of your neural net

• Substitute each module's weights and biases with one large tensor, making the weights and biases point to parts of this tensor
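A hedged sketch of the optim workflow implied here: flatten the parameters, write a closure that returns (loss, gradient), and let optim.sgd update the flat tensor in place (sizes and hyperparameters are arbitrary):

```lua
require 'nn'
require 'optim'

local mlp = nn.Sequential()
   :add(nn.Linear(10, 20)):add(nn.Tanh()):add(nn.Linear(20, 1))
local criterion = nn.MSECriterion()

-- One flat tensor for all weights and biases, and one for their gradients.
local params, gradParams = mlp:getParameters()

local x, target = torch.randn(10), torch.randn(1)

local function feval(p)
   if p ~= params then params:copy(p) end
   gradParams:zero()
   local out = mlp:forward(x)
   local loss = criterion:forward(out, target)
   mlp:backward(x, criterion:backward(out, target))
   return loss, gradParams
end

for i = 1, 100 do
   optim.sgd(feval, params, {learningRate = 0.1})
end
```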

Page 50: cifar summer school 2016

TORCH AUTOGRAD

Industrial-strength, extremely flexible implementation of automatic differentiation, for all your crazy ideas

Page 51: cifar summer school 2016

TORCH AUTOGRAD

Industrial-strength, extremely flexible implementation of automatic differentiation, for all your crazy ideas

Inspired by the original Python autograd from Ryan Adams' HIPS group: github.com/hips/autograd

Props to:

— Dougal Maclaurin

— David Duvenaud

— Matt Johnson

Page 52: cifar summer school 2016

WE WORK ON TOP OF STABLE ABSTRACTIONS

We should take these for granted, to stay sane!

Arrays / linear algebra / common subroutines: BLAS (est. 1957), LINPACK (est. 1979, now on GitHub!), LAPACK (est. 1984)

Page 53: cifar summer school 2016

MACHINE LEARNING HAS OTHER ABSTRACTIONS

All gradient-based optimization (that includes neural nets) relies on Automatic Differentiation (AD)

"Mechanically calculates derivatives of functions expressed as computer programs, at machine precision, and with complexity guarantees." (Barak Pearlmutter)

Not finite differences — generally bad numeric stability. We still use it as "gradcheck", though.

Not symbolic differentiation — no complexity guarantee. Symbolic derivatives of heavily nested functions (e.g. all neural nets) can quickly blow up in expression size.

These assume all the other lower-level abstractions in scientific computing.
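To make the "gradcheck" remark concrete, here is a hedged sketch (not from the slides) of a finite-difference check against a known analytic gradient; f(x) = sum(x^2), whose gradient is 2x:

```lua
local torch = require 'torch'

local f = function(x) return torch.sum(torch.cmul(x, x)) end

local x = torch.randn(5)
local gAnalytic = x * 2

-- Central differences: (f(x+eps) - f(x-eps)) / (2*eps), one coordinate at a time.
local eps, gFD = 1e-6, torch.zeros(5)
for i = 1, 5 do
   local xp, xm = x:clone(), x:clone()
   xp[i] = xp[i] + eps
   xm[i] = xm[i] - eps
   gFD[i] = (f(xp) - f(xm)) / (2 * eps)
end

print((gAnalytic - gFD):abs():max())   -- small, but limited by floating-point cancellation
```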

Page 54: cifar summer school 2016

AUTOMATIC DIFFERENTIATION IS THE ABSTRACTION FOR GRADIENT-BASED ML

All gradient-based optimization (that includes neural nets) relies on Automatic Differentiation (AD)

• Rediscovered several times (Widrow and Lehr, 1990)
• Described and implemented for FORTRAN by Speelpenning in 1980 (although the forward-mode variant, which is less useful for ML, was described in 1964 by Wengert)
• Popularized in connectionist ML as "backpropagation" (Rumelhart et al., 1986)
• In use in nuclear science, computational fluid dynamics and atmospheric sciences (in fact, their AD tools are more sophisticated than ours!)

Page 55: cifar summer school 2016

AUTOMATIC DIFFERENTIATION IS THE ABSTRACTION FOR GRADIENT-BASED ML

All gradient-based optimization (that includes neural nets) relies on Reverse-Mode Automatic Differentiation (AD)

Page 56: cifar summer school 2016

AUTOMATIC DIFFERENTIATION IS THE ABSTRACTION FOR GRADIENT-BASED ML

• Two main modes:
  • Forward mode
  • Reverse mode (backprop)

Different applications of the chain rule
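As a brief worked illustration (added here, not on the slide) using the running example of the next pages, d = a·sin(b): the two modes apply the same chain rule in opposite directions.

```latex
% Forward mode: push tangents from the inputs toward the output.
\dot d = \frac{\partial d}{\partial a}\,\dot a + \frac{\partial d}{\partial b}\,\dot b
       = \sin(b)\,\dot a + a\cos(b)\,\dot b
% Seeding \dot a = 1, \dot b = 0 gives \partial d / \partial a = \sin(b).

% Reverse mode: pull adjoints from the output back toward the inputs.
\bar a = \bar d\,\frac{\partial d}{\partial a} = \bar d\,\sin(b), \qquad
\bar b = \bar d\,\frac{\partial d}{\partial b} = \bar d\,a\cos(b), \qquad \bar d = 1
```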

Pages 57-61: cifar summer school 2016

FORWARD MODE (SYMBOLIC VIEW)

Left to right: the chain rule is expanded starting from the inputs and moving toward the output

Pages 62-80: cifar summer school 2016

FORWARD MODE (PROGRAM VIEW)

Left-to-right evaluation of partial derivatives (not so great for optimization)

From Baydin 2016

We can write the evaluation of a program as a sequence of operations, called a "trace" or a "Wengert list":

a = 3
b = 2
c = 1
d = a * math.sin(b) = 2.728
return 2.728

Forward mode augments this trace with the derivative of every intermediate value with respect to the input a, evaluated alongside the primal values:

a = 3
dada = 1
b = 2
dbda = 0
c = 1
dcda = 0
d = a * math.sin(b) = 2.728
ddda = math.sin(b) = 0.909
return 0.909
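A hedged sketch (not from the slides) of the same idea as "dual numbers" in Lua: each value carries (value, derivative with respect to a), and every operation propagates both; it reproduces the trace above.

```lua
local Dual = {}
Dual.__index = Dual
local function dual(v, d) return setmetatable({v = v, d = d}, Dual) end

function Dual.__mul(x, y)             -- product rule
   return dual(x.v * y.v, x.d * y.v + x.v * y.d)
end

local function dsin(x)                -- chain rule through sin
   return dual(math.sin(x.v), math.cos(x.v) * x.d)
end

local a = dual(3, 1)                  -- seed: da/da = 1
local b = dual(2, 0)                  -- db/da = 0
local d = a * dsin(b)

print(d.v, d.d)                       -- 2.728..., 0.909... = sin(2)
```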

Pages 81-85: cifar summer school 2016

REVERSE MODE (SYMBOLIC VIEW)

Right to left: the chain rule is expanded starting from the output and moving back toward the inputs

Pages 86-100: cifar summer school 2016

REVERSE MODE (PROGRAM VIEW)

Right-to-left evaluation of partial derivatives (the right thing to do for optimization)

From Baydin 2016

The primal trace is evaluated first:

a = 3
b = 2
c = 1
d = a * math.sin(b) = 2.728
return 2.728

Then the partial derivatives are accumulated in reverse, starting from the output:

a = 3
b = 2
c = 1
d = a * math.sin(b) = 2.728
dddd = 1
ddda = dd * math.sin(b) = 0.909
return 0.909, 2.728

Pages 101-102: cifar summer school 2016

This is the API ->

A trainable neural network in torch-autograd

Any numeric function can go here

These two fn's are split only for clarity

This is how the parameters are updated
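The slide's code itself did not survive extraction; the following is a hedged reconstruction of what such a torch-autograd training loop typically looks like. The function names, sizes and learning rate are illustrative assumptions, not the slide's exact code.

```lua
local torch = require 'torch'
local grad = require 'autograd'

-- Any numeric function can go here; split into two functions only for clarity.
local function predict(params, input)
   local h = torch.tanh(input * params.W1 + params.b1)
   return h * params.W2 + params.b2
end

local function loss(params, input, target)
   local residual = predict(params, input) - target
   return torch.sum(torch.cmul(residual, residual))
end

-- This is the API: grad(loss) returns a function computing dloss/dparams (and the loss).
local dloss = grad(loss)

local params = {
   W1 = torch.randn(10, 20) * 0.1, b1 = torch.zeros(1, 20),
   W2 = torch.randn(20, 1) * 0.1,  b2 = torch.zeros(1, 1),
}

local input, target = torch.randn(1, 10), torch.randn(1, 1)

for i = 1, 100 do
   local grads, l = dloss(params, input, target)
   -- This is how the parameters are updated (plain SGD).
   for k, v in pairs(params) do
      v:add(-0.01, grads[k])
   end
end
```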

Pages 103-108: cifar summer school 2016

WHAT'S ACTUALLY HAPPENING?

As Torch code is run, we build up a compute graph

Graph nodes on the slide: W, input, b, target feeding into mult, add, sub, sq, sum

Page 109: cifar summer school 2016

WE TRACK COMPUTATION VIA OPERATOR OVERLOADING

A linked list of computation forms a "tape" of the computation
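A hedged toy sketch of the idea (not torch-autograd's actual implementation): wrapped values record every operation onto a tape, and the tape is later walked in reverse to accumulate gradients.

```lua
local tape = {}

local Var = {}
Var.__index = Var
local function var(value) return setmetatable({value = value, grad = 0}, Var) end

function Var.__mul(x, y)
   local out = var(x.value * y.value)
   -- Record how to push the output's gradient back to the inputs.
   table.insert(tape, function()
      x.grad = x.grad + out.grad * y.value
      y.grad = y.grad + out.grad * x.value
   end)
   return out
end

local a, b = var(3), var(4)
local c = a * b                        -- the overloaded op appends to the tape

c.grad = 1                             -- seed dc/dc = 1
for i = #tape, 1, -1 do tape[i]() end  -- walk the tape in reverse

print(a.grad, b.grad)                  -- 4  3
```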

Page 110: cifar summer school 2016

CALCULATING THE GRADIENT

When it comes time to evaluate partial derivatives, we just have to look up the partial derivatives from a table, in reverse order on the tape

Page 111: cifar summer school 2016

WHAT'S ACTUALLY HAPPENING?

When it comes time to evaluate partial derivatives, we just have to look up the partial derivatives from a table

Whole-Model / Layer-Based / Full Autodiff: the same compute graph (mult, add, sub, sq, sum) at different granularities

We can then calculate the derivative of the loss w.r.t. inputs via the chain rule!

Page 112: cifar summer school 2016

AUTOGRAD EXAMPLES

Autograd gives you derivatives of numeric code, without a special mini-language
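The slide's example code did not survive extraction. As a hedged sketch, and assuming torch-autograd accepts plain Lua numbers as inputs (as the scalar examples on these slides suggest), the simplest usage looks roughly like:

```lua
local grad = require 'autograd'

local f = function(x) return x * x + 3 * x end

local df = grad(f)     -- df(x) returns df/dx (and the value of f)
local dfdx = df(2)
print(dfdx)            -- 7, since d/dx (x^2 + 3x) = 2x + 3
```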

Page 113: cifar summer school 2016

AUTOGRAD EXAMPLES

Control flow, like if-statements, is handled seamlessly
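A hedged sketch of ordinary Lua control flow inside a differentiated function (not the slide's code); here the branch and loop are driven by extra, non-differentiated arguments:

```lua
local torch = require 'torch'
local grad = require 'autograd'

local f = function(x, nSteps, squash)
   local h = x
   for i = 1, nSteps do
      if squash then
         h = torch.tanh(h)    -- one branch: repeated tanh
      else
         h = h * 0.5          -- other branch: repeated halving
      end
   end
   return torch.sum(h)
end

local df = grad(f)            -- gradient with respect to the first argument, x
local x = torch.randn(4)

local g1 = df(x, 3, true)     -- three tanh "layers"
local g2 = df(x, 3, false)    -- three halvings: every gradient entry is 0.125
print(g1, g2)
```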

Page 114: cifar summer school 2016

AUTOGRAD EXAMPLES

Scalars are good for demonstration, but autograd is most often used with tensor types
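A hedged tensor-typed sketch (not the slide's code): the gradient of a least-squares loss with respect to a table of parameters; names and sizes are illustrative:

```lua
local torch = require 'torch'
local grad = require 'autograd'

local loss = function(params, X, y)
   local residual = X * params.w - y
   return torch.sum(torch.cmul(residual, residual))
end

local dloss = grad(loss)

local params = { w = torch.randn(5) }
local X, y = torch.randn(20, 5), torch.randn(20)

local grads = dloss(params, X, y)
print(grads.w)      -- same shape as params.w, equals 2 * X^T * (X*w - y)
```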

Page 115: cifar summer school 2016

AUTOGRAD EXAMPLES

Autograd shines if you have dynamic compute graphs

Page 116: cifar summer school 2016

AUTOGRAD EXAMPLES

Recursion is no problem. Write numeric code as you ordinarily would; autograd handles the gradients
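A hedged sketch of recursion inside a differentiated function (illustrative, not the slide's code):

```lua
local torch = require 'torch'
local grad = require 'autograd'

-- Apply tanh n times by recursion, then sum the result.
local function repeatTanh(x, n)
   if n == 0 then return x end
   return repeatTanh(torch.tanh(x), n - 1)
end

local f = function(x) return torch.sum(repeatTanh(x, 3)) end
local df = grad(f)

print(df(torch.randn(4)))
```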

Pages 117-119: cifar summer school 2016

AUTOGRAD EXAMPLES

Need new or tweaked partial derivatives? Not a problem.

Page 120: cifar summer school 2016

SO WHAT DIFFERENTIATES NEURAL NET LIBRARIES? The granularity at which they implement autodiff ...

The same compute graph (mult, add, sub, sq, sum) carved up at three granularities (Whole-Model, Layer-Based, Full Autodiff), with the example libraries from the slide: scikit-learn, Torch NN, cuda-convnet, Lasagne, Autograd, Theano, TensorFlow

Page 121: cifar summer school 2016

SO WHAT DIFFERENTIATES NEURAL NET LIBRARIES? ... which is set by the partial derivatives they define

(Same figure as the previous slide)

Page 122: cifar summer school 2016

Why can’t we mix these styles?

We want no limits on the models we can write

Page 123: cifar summer school 2016

NEURAL NET THREE WAYS

The most granular — using individual Torch functions

Page 124: cifar summer school 2016

NEURAL NET THREE WAYS

Composing pre-existing NN layers. If we need layers that have been highly optimized, this is good

Page 125: cifar summer school 2016

NEURAL NET THREE WAYS

We can also compose entire networks together (e.g. image captioning, GANs)

Page 126: cifar summer school 2016

IMPACT AT TWITTER: Prototyping without fear

• We try crazier, potentially high-payoff ideas more often, because autograd makes it essentially free to do so (we can write "regular" numeric code and automagically pass gradients through it)

• We use weird losses in production: a large classification model uses a loss computed over a tree of class taxonomies

• Models trained with autograd are running on large amounts of media at Twitter

• Often "fast enough", no penalty at test time

• "Optimized mode" is nearly a compiler, but still a work in progress

Page 127: cifar summer school 2016

OTHER AUTODIFF IDEAS

Making their way from atmospheric science (and others) to machine learning

• Checkpointing — don't save all of the intermediate values; recompute them when you need them (memory savings, potentially a speedup if compute is faster than load/store, possibly good with pointwise functions like ReLU). MXNet was, I think, the first to implement this generally for neural nets.

• Mixing forward and reverse mode — called "cross-country elimination". No need to evaluate partial derivatives in only one direction! For diamond- or hourglass-shaped compute graphs, this will be more efficient than either method alone.

• Stencils — image processing (convolutions) and element-wise ufuncs can be phrased as stencil operations. More efficient, general-purpose implementations of differentiable stencils are needed (computer graphics does this: Guenter 2007, extended by DeVito et al., 2016).

• Source-to-source — all neural net autodiff packages use either ahead-of-time compute graph construction or operator overloading. The original method for autodiff (in FORTRAN, in the 80s) was source transformation, which I believe is still the gold standard for performance. The challenge (besides wrestling with the host language) is control flow.

• Higher-order gradients — hessian = grad(grad(f)). Not many efficient implementations. Fully closed versions exist in e.g. autograd, DiffSharp, Hype.

Page 128: cifar summer school 2016

YOU SHOULD BE USING IT

It's easy to try

Pages 129-130: cifar summer school 2016

YOU SHOULD BE USING IT

It's easy to try

Anaconda is the de-facto distribution for scientific Python. It works with Lua & Luarocks now.

https://github.com/alexbw/conda-lua-recipes

Page 131: cifar summer school 2016


PRACTICAL SESSION

We'll work through (all in an iTorch notebook)

— Torch basics

— Running code on the GPU

— Training a CNN on CIFAR-10

— Using autograd to train neural networks

We have an autograd Slack team: http://autograd.herokuapp.com/

Join #summerschool channel

Page 132: cifar summer school 2016


QUESTIONS?

Happy to help at the practical session

Find me at:

@awiltsch

[email protected]

github.com/alexbw