
* Original slides borrowed from Andrej Karpathy and Li Fei-Fei, Stanford cs231n (comp150dl)

Lecture 4: Backpropagation and Neural Networks (part 1)

Tuesday, January 31, 2017

Announcements!

- If you are adversely affected by the immigration ban, please talk to me about accommodations.
- Send in paper choices by tonight.
- You should be able to run a Jupyter server on the Tufts network machines now:
  (deep-venv)> pip install --upgrade jupyter
- hw1 deadline is in two days, Thurs Feb 2: Don't forget to read the course notes.
- Redo the calculation of dL/dW for the hinge loss.

Python/Numpy of the Day

- y_pred = scores.argmax(axis=1)
- inds = np.random.choice(X.shape[0], batch_size)
  - randomly selects N numbers in a range, useful for subsampling
- [:, np.newaxis]
  - reshapes an array of shape (N,) to shape (N, 1)
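A minimal sketch of these idioms together (the array shapes below are made up just for illustration):

    import numpy as np

    X = np.random.randn(50, 3073)            # 50 examples, 3073 features (illustrative)
    scores = np.random.randn(50, 10)         # 10 class scores per example

    # argmax over the class axis gives the predicted label for each example
    y_pred = scores.argmax(axis=1)           # shape (50,)

    # randomly choose batch_size row indices, handy for minibatch subsampling
    batch_size = 16
    inds = np.random.choice(X.shape[0], batch_size)
    X_batch = X[inds]                        # shape (16, 3073)

    # turn an (N,) vector into an (N, 1) column so it broadcasts against a matrix
    col = y_pred[:, np.newaxis]              # shape (50, 1)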


Where we are...

- scores function
- SVM loss
- full loss = data loss + regularization
- want: the gradient of the loss with respect to the weights

Optimization

(image credits to Alec Radford)

Gradient Descent

- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation against the numerical gradient (a gradient check).
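As an illustration of such a check, here is a minimal numerical-vs-analytic comparison on a toy function (the function, step size, and error formula are assumptions for the example, not part of the slides):

    import numpy as np

    def f(w):
        return np.sum(w ** 2)        # toy loss; its analytic gradient is 2w

    def analytic_grad(w):
        return 2 * w

    def numerical_grad(f, w, h=1e-5):
        # centered finite differences, one coordinate at a time
        grad = np.zeros_like(w)
        for i in range(w.size):
            old = w.flat[i]
            w.flat[i] = old + h; fp = f(w)
            w.flat[i] = old - h; fm = f(w)
            w.flat[i] = old
            grad.flat[i] = (fp - fm) / (2 * h)
        return grad

    w = np.random.randn(5)
    ga, gn = analytic_grad(w), numerical_grad(f, w)
    rel_error = np.abs(ga - gn) / (np.abs(ga) + np.abs(gn) + 1e-12)
    print(rel_error)                 # should be ~1e-8 or smaller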

Hinge Loss Gradient wrt Weights W

- We want the Jacobian matrix of all gradients: the partial derivatives of all output dimensions with respect to all input dimensions.

http://cs231n.github.io/optimization-1/#analytic

Following the cited notes, with margin size Δ (usually 1.0) and indicator function 1[.], the gradient of L_i has two cases:

- for the row of dW that corresponds to the GT (ground-truth) class of that training instance:
  ∇_{w_{y_i}} L_i = -( Σ_{j != y_i} 1[ w_j·x_i - w_{y_i}·x_i + Δ > 0 ] ) x_i
- for all other rows j != y_i:
  ∇_{w_j} L_i = 1[ w_j·x_i - w_{y_i}·x_i + Δ > 0 ] x_i
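Translating those two cases directly into code (a naive per-example sketch; X: (N, D), y: (N,), W: (D, C) are assumed shapes, data-loss term only):

    import numpy as np

    def svm_dW_naive(W, X, y, delta=1.0):
        N, C = X.shape[0], W.shape[1]
        dW = np.zeros_like(W)
        for i in range(N):
            scores = X[i].dot(W)                     # (C,)
            for j in range(C):
                if j == y[i]:
                    continue
                if scores[j] - scores[y[i]] + delta > 0:
                    dW[:, j] += X[i]                 # other columns: +x_i when the margin is positive
                    dW[:, y[i]] -= X[i]              # ground-truth column: minus the sum of those x_i
        return dW / N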


Softmax Loss Gradient wrt Score S

eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

* note change of subscripts from last slide

Skipping some steps for space, please see original notes.
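The end result of that derivation is the familiar "probabilities minus one" form, dL_i/ds_k = p_k - 1[k = y_i] with p_k = e^{s_k} / Σ_j e^{s_j}. In code, for a single example:

    import numpy as np

    def softmax_grad_wrt_scores(s, y):
        # s: (C,) scores for one example, y: integer ground-truth class
        p = np.exp(s - s.max())          # shift scores for numerical stability
        p /= p.sum()                     # softmax probabilities
        ds = p.copy()
        ds[y] -= 1.0                     # dL/ds_k = p_k - 1[k == y]
        return ds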


Computational Graph

[figure: x and W feed a * node producing the scores s; the scores go through the hinge loss to give the data loss L_i, a regularization term R(W) is computed from W, and a + node sums the two into the total loss L]

Convolutional Network (AlexNet)

[figure: a much larger computational graph mapping the input image and the weights to the loss]

[slides 13-24 step through a worked example on the computational graph for f(x, y, z) = (x + y) z with e.g. x = -2, y = 5, z = -4. Want: the gradients df/dx, df/dy, df/dz. With the intermediate q = x + y = 3 and f = q z = -12, the chain rule gives df/dz = q = 3, df/dq = z = -4, and df/dx = df/dy = df/dq * 1 = -4]
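A short sketch of that forward/backward computation in code, using the values from the slides:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y              # q = 3
    f = q * z              # f = -12

    # backward pass: apply the chain rule in reverse order
    df_dz = q              # d(q*z)/dz = q  -> 3
    df_dq = z              # d(q*z)/dq = z  -> -4
    df_dx = 1.0 * df_dq    # dq/dx = 1, so chain rule gives -4
    df_dy = 1.0 * df_dq    # dq/dy = 1, so chain rule gives -4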

[slides 25-30: a single gate f inside the circuit; in the forward pass it computes its output (activations) from its inputs, and in the backward pass it takes the gradient arriving at its output, multiplies it by its "local gradient" (the derivative of its output with respect to each input), and sends the resulting gradients back to its inputs]

Another example: f(w, x) = 1 / (1 + e^-(w0*x0 + w1*x1 + w2))

[slides 31-44 backpropagate through this circuit gate by gate; at every gate the recipe is [local gradient] x [gradient from above], for instance:]

- the *(-1) gate: (-1) * (-0.20) = 0.20
- the + gate passes the gradient unchanged to both inputs: [1] x [0.2] = 0.2 (both inputs!)
- the multiply gates route the gradient to each input scaled by the other input: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2

sigmoid function: σ(x) = 1 / (1 + e^-x)

sigmoid gate: its derivative simplifies to dσ/dx = (1 - σ(x)) σ(x), so the entire gate can be backpropagated through in one step: (0.73) * (1 - 0.73) = 0.2
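A compact sketch of that backward pass through the sigmoid neuron (input values follow the cs231n example on the slides):

    import math

    w = [2.0, -3.0, -3.0]                 # w0, w1, w2 (bias)
    x = [-1.0, -2.0]                      # x0, x1

    # forward pass
    dot = w[0] * x[0] + w[1] * x[1] + w[2]
    f = 1.0 / (1.0 + math.exp(-dot))      # sigmoid output, ~0.73

    # backward pass through the single sigmoid gate
    ddot = (1.0 - f) * f                  # local gradient of the sigmoid, ~0.2
    dx = [w[0] * ddot, w[1] * ddot]       # backprop into x: [0.4, -0.6]
    dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # backprop into w: [-0.2, -0.4, 0.2]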

Patterns in backward flow

- add gate: gradient distributor
- max gate: gradient router
- mul gate: gradient… "switcher"?

Gradients add at branches: when a value feeds into multiple parts of the circuit (+), the gradients flowing back along each branch are summed.

Implementation: forward/backward API

Graph (or Net) object. (Rough pseudo code)
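The slide's pseudocode is roughly as follows (a sketch; the class and method names are illustrative, not a real library API):

    class Net(object):
        def __init__(self, gates):
            # assume the gates are stored in topologically sorted order
            self.gates = gates

        def forward(self, inputs):
            out = inputs
            for gate in self.gates:            # left to right through the graph
                out = gate.forward(out)
            return out                         # the loss

        def backward(self):
            dout = 1.0                         # dL/dL = 1
            for gate in reversed(self.gates):  # chain rule, right to left
                dout = gate.backward(dout)
            return dout                        # gradients on the inputs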


Implementation: forward/backward API

(x, y, z are scalars)

[figure: a single * gate with inputs x, y and output z; the slide implements this gate as a class with a forward() that computes z = x * y and a backward() that returns the gradients on x and y]
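A minimal sketch of that multiply gate in the spirit of the slide (member names are assumptions):

    class MultiplyGate(object):
        def forward(self, x, y):
            z = x * y
            self.x, self.y = x, y    # cache the inputs; needed for the local gradients
            return z

        def backward(self, dz):
            # [local gradient] x [gradient from above]
            dx = self.y * dz
            dy = self.x * dz
            return [dx, dy]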

Example: Torch Layers

A Torch layer is essentially the same thing as a gate: each layer in the nn package implements a forward() and a backward().

Example: Torch MulConstant

The layer's source has three parts: initialization, forward(), and backward().

Example: Caffe Layers

Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid derivative by top_diff, the gradient arriving from above (chain rule).

Gradients for vectorized code

(x, y, z are now vectors)

For a gate f with vector inputs and outputs, the "local gradient" is now a Jacobian matrix: the derivative of each element of z with respect to each element of x. The gradients flowing along the graph are vectors, and the backward pass multiplies the upstream gradient by this Jacobian.

Vectorized operations

f(x) = max(0, x) (elementwise), applied to a 4096-d input vector to produce a 4096-d output vector.

Q: what is the size of the Jacobian matrix?

A: [4096 x 4096!]

Q2: what does it look like? Since max(0, x) is elementwise, each output depends only on the corresponding input, so the Jacobian is diagonal (1 where the input is positive, 0 elsewhere) and is never formed explicitly.
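In code the diagonal Jacobian is never built; the backward pass is just an elementwise mask (array names here are illustrative):

    import numpy as np

    x = np.random.randn(4096)
    out = np.maximum(0, x)           # forward: elementwise ReLU

    dout = np.random.randn(4096)     # upstream gradient dL/d(out)
    dx = dout * (x > 0)              # backward: multiply by the Jacobian's diagonal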


Vectorized operations

f(x) = max(0, x) (elementwise). In practice we process an entire minibatch (e.g. 100 examples) at one time: 100 4096-d input vectors become 100 4096-d output vectors, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Assignment: Writing SVM/Softmax

Stage your forward/backward computation! E.g. for the SVM, compute the margins once in the forward pass and reuse them in the backward pass.
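A sketch of what staging looks like in code: the forward pass saves the margins in a cache and the backward pass reuses them (function and variable names are illustrative):

    import numpy as np

    def svm_forward(W, X, y, delta=1.0):
        N = X.shape[0]
        scores = X.dot(W)
        correct = scores[np.arange(N), y][:, np.newaxis]
        margins = np.maximum(0, scores - correct + delta)
        margins[np.arange(N), y] = 0
        loss = margins.sum() / N
        cache = (X, y, margins)          # stage the intermediate for the backward pass
        return loss, cache

    def svm_backward(cache):
        X, y, margins = cache
        N = X.shape[0]
        mask = (margins > 0).astype(float)
        mask[np.arange(N), y] = -mask.sum(axis=1)
        return X.T.dot(mask) / N         # dW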


Summary so far

- neural nets will be very large: no hope of writing down the gradient formula by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement forward() / backward()
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

Neural Network so far:

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072) -> h (100, via W1) -> s (10, via W2)

or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

Full implementation of training a 2-layer Neural Network needs ~11 lines:

from @iamtrask, http://iamtrask.github.io/2015/07/12/basic-python-network/ (the code is a forward pass followed by a backward pass that backprops the derivative and updates the weights)
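The referenced snippet is along these lines (a reconstruction from the linked post; the toy data and iteration count come from that example):

    import numpy as np

    # toy dataset: 4 examples, 3 inputs, 1 binary output
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.array([[0, 1, 1, 0]]).T

    syn0 = 2 * np.random.random((3, 4)) - 1     # first-layer weights
    syn1 = 2 * np.random.random((4, 1)) - 1     # second-layer weights

    for j in range(60000):
        # forward pass (sigmoid activations)
        l1 = 1 / (1 + np.exp(-np.dot(X, syn0)))
        l2 = 1 / (1 + np.exp(-np.dot(l1, syn1)))
        # backward pass: backprop of the derivative through each sigmoid
        l2_delta = (y - l2) * (l2 * (1 - l2))
        l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))
        # update the weights
        syn1 += l1.T.dot(l2_delta)
        syn0 += X.T.dot(l1_delta)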


Assignment: Writing 2-layer Net

Stage your forward/backward computation!


[figure: the mathematical model of a neuron, with a sigmoid activation function]


Be very careful with your Brain analogies:

Biological Neurons:
- Many different types
- Dendrites can perform complex non-linear computations
- Synapses are not a single weight but a complex non-linear dynamical system
- Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Activation Functions

- Sigmoid: 1 / (1 + e^-x)
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU

Neural Networks: Architectures

"Fully-connected" layers: a "2-layer Neural Net" is also called a "1-hidden-layer Neural Net"; a "3-layer Neural Net" is also called a "2-hidden-layer Neural Net".

Example Feed-forward computation of a Neural Network

We can efficiently evaluate an entire layer of neurons at once with a single matrix multiply.
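A sketch of that computation for a small 3-layer net (the sizes and variable names are illustrative, mirroring the style of the code on the slide):

    import numpy as np

    f = lambda x: 1.0 / (1.0 + np.exp(-x))     # activation function (sigmoid here)

    # assumed layer sizes for illustration: 3 -> 4 -> 4 -> 1
    W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
    W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
    W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

    x = np.random.randn(3, 1)                  # one input example (a column vector)
    h1 = f(np.dot(W1, x) + b1)                 # whole first hidden layer in one matrix multiply
    h2 = f(np.dot(W2, h1) + b2)                # second hidden layer
    out = np.dot(W3, h2) + b3                  # output scores (no activation on the output)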


Setting the number of layers and their sizes

more neurons = more capacity

(you can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Do not use the size of the neural network as a regularizer. Use stronger regularization instead (e.g. a larger L2 weight penalty).

Summary

- we arrange neurons into fully-connected layers
- the abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- neural networks are not really neural
- neural networks: bigger = better (but might have to regularize more strongly)

Next Lecture:

More than you ever wanted to know about Neural Networks and how to train them.