PowerPoint Presentation… · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2...
Perceptron:
This is convolution!
Shared weights
Filter = ‘local’ perceptron. Also called a kernel.
Yann LeCun’s MNIST CNN architecture
DEMO
http://scs.ryerson.ca/~aharley/vis/conv/
Thanks to Adam Harley for making this.
More here: http://scs.ryerson.ca/~aharley/vis
Think-Pair-Share
Input size: 96 x 96 x 3
Kernel size: 5 x 5 x 3
Stride: 1
Max pooling layer: 4 x 4
Output feature map size?
a) 5 x 5
b) 22 x 22
c) 23 x 23
d) 24 x 24
e) 25 x 25
Input size: 96 x 96 x 3
Kernel size: 3 x 3 x 3
Stride: 3
Max pooling layer: 8 x 8
Output feature map size?
a) 2 x 2
b) 3 x 3
c) 4 x 4
d) 5 x 5
e) 12 x 12
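One way to check answers to questions like these is to compute the sizes directly. This is a small sketch that assumes no padding and a pooling stride equal to the pooling window; neither is stated on the slide, and the helper name `out_size` is ours.

```python
def out_size(n, k, stride=1):
    """Output width of a 'valid' (no padding) sliding window of width k over n samples."""
    return (n - k) // stride + 1

# Question 1: 96 x 96 x 3 input, 5 x 5 x 3 kernel, stride 1, then 4 x 4 max pooling.
conv1 = out_size(96, 5, stride=1)
pool1 = out_size(conv1, 4, stride=4)   # assumed: pooling windows do not overlap

# Question 2: 96 x 96 x 3 input, 3 x 3 x 3 kernel, stride 3, then 8 x 8 max pooling.
conv2 = out_size(96, 3, stride=3)
pool2 = out_size(conv2, 8, stride=8)

print(conv1, pool1)  # 92 23
print(conv2, pool2)  # 32 4
```

Under these assumptions the feature maps are 23 x 23 and 4 x 4, i.e. option c) in both cases. The kernel depth (3) matches the input depth, so it does not affect the spatial size.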
Our connectomics diagram
Input: 75x75x4
Conv 1: 3x3x4, 64 filters
Max pooling: 2x2 per filter
Conv 2: 3x3x64, 48 filters
Max pooling: 2x2 per filter
Conv 3: 3x3x48, 48 filters
Max pooling: 2x2 per filter
Conv 4: 3x3x48, 48 filters
Max pooling: 2x2 per filter
Auto-generated from network declaration by nolearn (for Lasagne / Theano)
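Note how each kernel's depth equals the number of filters in the layer before it (4 input channels, then 64, 48, 48). A rough sketch of the resulting weight counts follows; the one-bias-per-filter term is our assumption, since the diagram does not mention biases, and `conv_params` is a hypothetical helper.

```python
# (name, kernel height, kernel width, input channels, number of filters), read off the diagram
layers = [
    ("Conv 1", 3, 3, 4, 64),
    ("Conv 2", 3, 3, 64, 48),
    ("Conv 3", 3, 3, 48, 48),
    ("Conv 4", 3, 3, 48, 48),
]

def conv_params(kh, kw, c_in, n_filters, bias=True):
    """Learnable values in one conv layer; the per-filter bias is an assumption."""
    return kh * kw * c_in * n_filters + (n_filters if bias else 0)

counts = {name: conv_params(kh, kw, c, f) for name, kh, kw, c, f in layers}
for name, n in counts.items():
    print(name, n)
```

Max pooling layers add no learnable parameters, so they do not appear in the count.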
Reading architecture diagrams
Layers
- Kernel sizes
- Strides
- # channels
- # kernels
- Max pooling
AlexNet diagram (simplified) [Krizhevsky et al. 2012]
Input size: 227 x 227 x 3
Conv 1: 11 x 11 x 3, stride 4, 96 filters
Max pooling: 3x3, stride 2
Conv 2: 5 x 5 x 96, stride 1, 256 filters
Max pooling: 3x3, stride 2
Conv 3: 3 x 3 x 256, stride 1, 384 filters
Conv 4: 3 x 3 x 192, stride 1, 384 filters
Conv 5: 3 x 3 x 192, stride 1, 256 filters
Wait, why isn’t it called a correlation neural network?
It could be.
Deep learning libraries typically implement cross-correlation.
Correlation relates to convolution via a 180° rotation of the kernel. When we learn kernels, we could just as easily learn them flipped.
The commutative and associative properties that the flip gives convolution end up not being important to our application, so we just ignore them.
[p.323, Goodfellow]
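A small 1D sketch of the flip relationship, in plain Python (the 1D reversal stands in for the 180° rotation of a 2D kernel; the function names and data are ours):

```python
def correlate(signal, kernel):
    """'Valid' cross-correlation: slide the kernel over the signal as-is."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def convolve(signal, kernel):
    """'Valid' convolution: same sliding sum, but the kernel is indexed backwards."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1, 2, 4, 7, 11, 16]
kernel = [1, 0, -2]
flipped = kernel[::-1]   # 1D analogue of rotating the kernel by 180 degrees

# Convolving with a kernel is the same as correlating with the flipped kernel.
assert convolve(signal, kernel) == correlate(signal, flipped)
# For an asymmetric kernel the two operations differ.
assert convolve(signal, kernel) != correlate(signal, kernel)
```

Since the kernel values are learned, a network trained with correlation can simply end up with the flipped version of the kernel it would have learned under true convolution.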
What does it mean to convolve over greater-than-first-layer hidden units?
Yann LeCun’s MNIST CNN architecture
Multi-layer perceptron (MLP)
…is a ‘fully connected’ neural network with non-linear activation functions.
‘Feed-forward’ neural network
Nielsen
Does anyone pass along the weighted sum without an activation function?
No – this is linear chaining: without a non-linearity, stacked layers collapse into a single linear map.
[Figure: input vector → output vector]
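To see why linear chaining buys nothing, here is a minimal sketch with two made-up weight matrices (`W1`, `W2` and the input `x` are illustrative values, not from the slides): two linear layers give exactly the same output as one layer with weights `W2 W1`.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0, t) for t in v]

W1 = [[1, 2], [0, -1], [3, 1]]   # 2 inputs -> 3 hidden units
W2 = [[2, 0, 1], [-1, 1, 0]]     # 3 hidden units -> 2 outputs
x = [4, -3]

two_layers = matvec(W2, matvec(W1, x))   # no activation in between
one_layer = matvec(matmul(W2, W1), x)    # a single equivalent layer
assert two_layers == one_layer

# With a non-linearity in between, the collapse no longer holds.
with_relu = matvec(W2, relu(matvec(W1, x)))
print(two_layers, with_relu)  # [5, 5] [9, 3]
```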
Are there other activation functions?
Yes, many.
As long as:
- Activation function s(z) is well-defined as z → -∞ and z → ∞
- These limits are different
Then we can make a step! [Think visual proof]
It can be shown that it is universal for function approximation.
Activation functions: Rectified Linear Unit
• ReLU
Cyh24 - http://prog3.com/sbdm/blog/cyh_24
Rectified Linear Unit
Ranzato
What is the relationship between SVMs and perceptrons?
SVMs attempt to learn the support vectors which maximize the margin between classes.
A perceptron does not: it accepts any hyperplane that separates the classes, so both of these perceptron classifiers are equivalent.
‘Perceptron of optimal stability’ is used in SVM:
Perceptron + optimal stability + kernel trick = foundations of SVM
Why is pooling useful again?
What kinds of pooling operations might we consider?
By pooling responses at different locations, we gain robustness to the exact spatial location of image features.
Useful for classification, when I don’t care about _where_ I ‘see’ a feature!
Convolutional layer output
Pooling layer output
Pooling is similar to downsampling
…but on feature maps, not the input!
…except sometimes we don’t want to blur, as other functions might be better for classification.
Wikipedia
Max pooling
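A minimal sketch of 2x2 max pooling in plain Python, assuming non-overlapping windows (stride equal to the window size); the function name and the feature map values are illustrative.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping size x size max pooling.

    Rows/columns at the edge that do not fill a whole window are dropped.
    """
    rows = len(fmap) // size
    cols = len(fmap[0]) // size
    return [[max(fmap[r * size + dr][c * size + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(cols)]
            for r in range(rows)]

fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(fmap)
print(pooled)  # [[6, 8], [3, 4]]
```

Each 2x2 block of the feature map is replaced by its largest response, so the output has half the width and height. Swapping `max` for a mean would give average pooling instead.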
OK, so what about invariances?
What about translation, scale, rotation?
Convolution is translation equivariant (‘shift-equivariant’) – we could shift the image and the kernel would give us a corresponding (‘equal’) shift in the feature map.
But! If we rotated or scaled the input, the same kernel would give a different response.
Pooling lets us aggregate (avg) or pick from (max) responses, but the kernels themselves must be trained and so learn to activate on scaled or rotated instances of the object.
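The shift-equivariance claim can be checked on a toy 1D example. This sketch uses a made-up kernel and signal; `correlate` is the same sliding-window sum that deep learning libraries call convolution.

```python
def correlate(signal, kernel):
    """'Valid' cross-correlation of a 1D signal with a 1D kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [-1, 2, -1]
signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]   # same pattern, moved two samples to the right

out = correlate(signal, kernel)
out_shifted = correlate(shifted, kernel)

# Equivariance: the response moves by the same two samples.
assert out_shifted[2:] == out[:-2]

# A global max pool discards position, so its output is invariant to the shift.
assert max(out) == max(out_shifted)
```

Nothing comparable holds if the pattern is stretched or rotated instead of shifted: the same kernel then produces a different response, which is why the kernels themselves have to learn those variations.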
Fig 9.9, Goodfellow et al. [the book]
If we max pooled over depth (# kernels)…
Three different kernels trained to fire on different rotations of ‘5’.
I’ve heard about many more terms of jargon!
Skip connections
Residual connections
Batch normalization
…we’ll get to these in a little while.
Training Neural Networks
Learning the weight matrices W
Gradient descent
[Plot: f(x) against x, with the iterates $a_1$, $a_2$, $a_3$, …, $a_{\text{stop}}$ marked]
General approach
• Pick random starting point $a_1$.
• Compute gradient at point (analytically or by finite differences): $\nabla f(a_1)$
• Move along parameter space in direction of negative gradient: $a_2 = a_1 - \gamma \nabla f(a_1)$, where $\gamma$ = amount to move = learning rate
• Move again from the new point: $a_3 = a_2 - \gamma \nabla f(a_2)$
• Stop when we don’t move any more. $a_{\text{stop}}$: $a_n - a_{n-1} = -\gamma \nabla f(a_{n-1}) = 0$
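The general approach fits in a few lines. This is a sketch only: the example function f(x) = (x - 3)^2, the fixed starting point (chosen instead of a random one so the run is reproducible), and the tolerance used for "not moving any more" are our assumptions.

```python
def gradient_descent(grad, a, lr=0.1, tol=1e-8, max_steps=10_000):
    """Repeat a <- a - lr * grad(a) until the step is smaller than tol."""
    for _ in range(max_steps):
        step = lr * grad(a)
        if abs(step) < tol:   # "we don't move any more", up to a tolerance
            break
        a = a - step
    return a

# Hypothetical example: f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
f_grad = lambda x: 2.0 * (x - 3.0)

a_stop = gradient_descent(f_grad, a=-4.0)
print(a_stop)   # close to the minimum at x = 3
```

In exact arithmetic the step only reaches zero at a stationary point, so practical code stops once the step (or the gradient) falls below a small threshold.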
Gradient descent
Optimizer for functions.
Guaranteed to find the optimum for convex functions, given a suitable learning rate.
• Non-convex = find local optimum.
• Most vision problems aren’t convex.
Works for multi-variate functions.
• Need to compute the partial derivative with respect to every variable (the gradient; for a vector-valued function, the matrix of partial derivatives is the “Jacobian”)
Why would I use this over Least Squares?
If my function is convex, why can’t I just use linear least squares?
You can, yes.
Analytic solution = normal equations:
$x = (A^T A)^{-1} A^T b$
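As a tiny worked instance of the normal equations, here is a line fit in plain Python. The four data points are made up (they lie exactly on y = 2t + 1), and the explicit 2x2 inverse is only practical because there are two unknowns.

```python
# Hypothetical data: points lying exactly on y = 2 t + 1.
ts = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Rows of A are [t, 1], so A x = b with x = [slope, intercept] and b = ys.
A = [[t, 1.0] for t in ts]

# A^T A (2 x 2) and A^T b (2-vector).
ata = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
atb = [sum(r[i] * y for r, y in zip(A, ys)) for i in range(2)]

# Solve (A^T A) x = A^T b using the closed-form inverse of a 2 x 2 matrix.
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
slope = (ata[1][1] * atb[0] - ata[0][1] * atb[1]) / det
intercept = (ata[0][0] * atb[1] - ata[1][0] * atb[0]) / det
print(slope, intercept)  # 2.0 1.0
```

Note that `A` has one row per data point, which is where the memory cost comes from as the dataset grows.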
But now imagine that I have 1,000,000 data points.
Matrices are _huge_.
Even for convex functions, gradient descent allows me to iteratively approach the solution without requiring very large matrices.
We’ll see how.
Train NN with Gradient Descent
• $(x_i, y_i)$ = n training examples
• $f(\mathbf{x})$ = feed-forward neural network
• $L(\mathbf{x}, y; \boldsymbol{\theta})$ = some loss function
Loss function measures how ‘good’ our network is at classifying the training examples wrt. the parameters of the model (the perceptron weights).
[Plot: loss function (NN evaluated on training data) against model parameters (perceptron weights), with the iterates $a_1$, $a_2$, $a_3$, …, $a_{\text{stop}}$ marked]
What is an appropriate loss?
What if we define an output threshold on detection?
• Classification: compare training class to output class
• Zero-one loss $L$ (per class): $L(\hat{y}, y) = 0$ if $\hat{y} = y$, and $1$ otherwise, where $y$ = true label and $\hat{y}$ = predicted label
Is it good?
• Nope – it’s a step function.
• I need to compute the gradient of the loss.
• This loss is not differentiable, and ‘flips’ easily.
Classification as probability
Special function on last layer - ‘Softmax’ σ(): “Squashes” a C-dimensional vector O of arbitrary real values to a C-dimensional vector σ(O) of real values in the range (0, 1) that add up to 1:
$\sigma(O)_i = \frac{e^{O_i}}{\sum_{j=1}^{C} e^{O_j}}$
Turns the output into a probability distribution on classes.
Softmax example:
O = [2.0, 0.7, 0.2, -0.3, -0.6, -2.5] (output from perceptron layer: ‘distance from class boundary’)
σ(O) = [0.616, 0.168, 0.102, 0.062, 0.046, 0.007]
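The example can be reproduced with a few lines of Python. Subtracting the maximum before exponentiating is a common numerical-stability trick that we add here; it does not change the result.

```python
import math

def softmax(o):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    m = max(o)                             # shift for numerical stability
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

O = [2.0, 0.7, 0.2, -0.3, -0.6, -2.5]
probs = softmax(O)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest input gets the largest probability, and the ordering of the inputs is preserved in the outputs.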
Cross-entropy loss function
Negative log-likelihood
$L(\mathbf{x}, y; \boldsymbol{\theta}) = -\sum_j y_j \log p(c_j \mid \mathbf{x})$
• Measures difference between predicted and training probability distributions (see Project 4 for more details)
• Is it a good loss?
  • Differentiable
  • Cost decreases as probability increases
[Plot: $L$ against $p(c_j \mid \mathbf{x})$]
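A short sketch of the loss for a one-hot target, with three made-up predicted distributions to show that the cost falls as the probability of the true class rises:

```python
import math

def cross_entropy(y, p):
    """-sum_j y_j log p_j for a target distribution y and a predicted distribution p."""
    # Terms with y_j == 0 contribute nothing, so they are skipped.
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p) if yj > 0)

y = [0, 1, 0]                   # one-hot target: the true class is the second one
confident = [0.05, 0.90, 0.05]  # hypothetical network outputs after softmax
unsure = [0.30, 0.40, 0.30]
wrong = [0.90, 0.05, 0.05]

losses = [cross_entropy(y, p) for p in (confident, unsure, wrong)]
print(losses)
```

With a one-hot target the sum reduces to the negative log of the probability assigned to the true class, which is why this is also called the negative log-likelihood.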
Learning consists of minimizing the loss wrt. parameters over the whole training set.
Training
Derivative of loss wrt. softmax
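The slide's own derivation is not in this transcript. The standard result for cross-entropy on top of softmax with a one-hot target is that the derivative of the loss with respect to each softmax input is $p_k - y_k$. The sketch below checks that against finite differences; the input values are made up.

```python
import math

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy of softmax(o) against the one-hot target y."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, softmax(o)))

o = [2.0, 0.7, 0.2]   # hypothetical outputs of the last perceptron layer
y = [0, 1, 0]         # one-hot true label

# Analytic gradient: predicted probability minus target, per class.
analytic = [p - t for p, t in zip(softmax(o), y)]

# Numerical gradient by central finite differences.
eps = 1e-6
numeric = []
for k in range(len(o)):
    up = o[:k] + [o[k] + eps] + o[k + 1:]
    down = o[:k] + [o[k] - eps] + o[k + 1:]
    numeric.append((loss(up, y) - loss(down, y)) / (2 * eps))

assert all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric))
```

The gradient entries sum to zero, and the entry for the true class is negative, so a descent step raises that class's score.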
<- Chain rule from calculus
But the ReLU is not differentiable at 0!
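Right. Fudge! In practice we just pick a value for the derivative at 0.
- ‘0’ is the best place for this to occur, because we don’t care about the result (it is no activation).
- ‘Dead’ perceptrons: a unit whose input stays negative has zero gradient and stops learning.
- ReLU has unbounded positive response:
  - Potential faster convergence / overstep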
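A minimal sketch of ReLU and the derivative used in backpropagation. Returning 0 at exactly z = 0 is a convention we pick here; other choices between 0 and 1 would serve as well.

```python
def relu(z):
    """Rectified linear unit: pass positive inputs through, clamp the rest to 0."""
    return max(0.0, z)

def relu_grad(z):
    """Slope of ReLU; the value at exactly z == 0 is a convention (0 here)."""
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_grad(z))
```

The zero slope for negative inputs is what produces ‘dead’ units: if a unit's input is negative for every training example, no gradient flows back through it and its weights stop changing.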
Optimization demo
• http://www.emergentmind.com/neural-network
• Thank you Matt Mazur
![Page 81: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/81.jpg)
![Page 82: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/82.jpg)
![Page 83: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/83.jpg)
Wow
so misclassified
false positives
no good filtr
what class
cool kernel
![Page 84: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/84.jpg)
Stochastic Gradient Descent
• Dataset can be too large to strictly apply gradient descent wrt. all data points.
• Instead, randomly sample a data point, perform gradient descent per point, and iterate.
  • True gradient is approximated only
  • Picking a subset of points: “mini-batch”
Randomly initialize starting 𝑊 and pick learning rate 𝛾
While not at minimum:
• Shuffle training set
• For each data point i=1…n (maybe as mini-batch):
  • Gradient descent step
One full pass over the training set is an “epoch”.
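The loop above might look like the following for a toy least-squares problem. This is a sketch under stated assumptions: the model (linear regression), the loss (mean squared error), the fixed epoch budget standing in for "while not at minimum", and all names and default values are illustrative choices, not taken from the slides.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=8, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=d)           # randomly initialize W
    for _ in range(epochs):                      # one iteration = one epoch
        order = rng.permutation(n)               # shuffle the training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch
            err = X[idx] @ W - y[idx]
            # Gradient of the mean squared error on this mini-batch only,
            # an approximation of the gradient over the full dataset.
            grad = 2.0 * X[idx].T @ err / len(idx)
            W -= lr * grad                       # gradient descent step
    return W

# Hypothetical noiseless data generated from known weights.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 2))
true_W = np.array([2.0, -3.0])
y = X @ true_W

W_hat = sgd_linear_regression(X, y)
print(W_hat)
```

Setting `batch_size=1` gives the per-point variant described on the slide; larger values trade gradient noise for more computation per step.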
![Page 85: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/85.jpg)
Stochastic Gradient Descent
Loss will not always decrease (locally), because each training data point is chosen at random.
Still converges over time.
Wikipedia
![Page 86: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/86.jpg)
Gradient descent oscillations
Wikipedia
![Page 87: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/87.jpg)
Gradient descent oscillations
Slow to converge to the (local) optimum
Wikipedia
![Page 88: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/88.jpg)
Momentum
• Replace the gradient in the update with a weighted sum of the previous gradient and the current gradient.
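• Without momentum:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \gamma \frac{\partial L}{\partial \boldsymbol{\theta}}$$
• With momentum (new $\alpha$ parameter):
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \gamma \left( \alpha \left[\frac{\partial L}{\partial \boldsymbol{\theta}}\right]_{t-1} + \left[\frac{\partial L}{\partial \boldsymbol{\theta}}\right]_{t} \right)$$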
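A minimal sketch of the momentum update as written on this slide, applied to a hypothetical "narrow valley" quadratic. The function name, the test loss, and the values of the learning rate and 𝛼 are assumptions chosen for illustration. Note that many libraries implement momentum with a running velocity that accumulates all past gradients; the version below follows the slide literally and uses only the previous gradient.

```python
import numpy as np

def descend_with_momentum(grad_fn, theta0, lr=0.1, alpha=0.5, steps=200):
    theta = np.asarray(theta0, dtype=float)
    prev_grad = np.zeros_like(theta)   # no previous gradient at t = 0
    for _ in range(steps):
        grad = grad_fn(theta)
        # theta_{t+1} = theta_t - lr * (alpha * grad_{t-1} + grad_t)
        theta = theta - lr * (alpha * prev_grad + grad)
        prev_grad = grad
    return theta

# Hypothetical loss L(theta) = 0.5 * (10 * theta_0**2 + theta_1**2):
# steep in one direction, shallow in the other.
curvature = np.array([10.0, 1.0])

def grad_fn(theta):
    return curvature * theta

theta_final = descend_with_momentum(grad_fn, [1.0, 1.0])
print(theta_final)
```

Setting `alpha=0.0` recovers the plain gradient descent update from the first equation.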
![Page 89: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/89.jpg)
But James…
…I thought we were going to treat machine learning like a black box? I like black boxes.
Deep learning is:
- a black box
[Diagram labels: Training data, Classifier]
![Page 90: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/90.jpg)
Deep learning is:
- a black box
- also a black art.
http://www.isrtv.com/
![Page 91: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/91.jpg)
Many approaches and hyperparameters:
Activation functions, learning rate, mini-batch size, momentum…
Often these need tweaking, and you need to know what they do to change them intelligently.
![Page 92: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/92.jpg)
Nailing hyperparameters + trade-offs
![Page 93: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/93.jpg)
Lowering the learning rate = smaller steps in SGD
- Less ‘ping pong’
- Takes longer to get to the optimum
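The trade-off can be seen on a one-dimensional toy loss. This is only an illustration: the loss L(θ) = θ², the starting point, and the two learning rates are hypothetical values picked to make the behaviour visible.

```python
def descend(lr, theta=1.0, steps=20):
    # Plain gradient descent on L(theta) = theta**2, whose gradient is 2 * theta.
    path = [theta]
    for _ in range(steps):
        theta -= lr * 2.0 * theta
        path.append(theta)
    return path

small = descend(0.05)  # each step multiplies theta by 0.9: smooth but slow
large = descend(0.9)   # each step multiplies theta by -0.8: jumps across the optimum

print(small[:4])
print(large[:4])
```

With the small rate every iterate stays on the same side of the optimum at 0; with the large rate the sign flips on every step, which is the ‘ping pong’ on the slide, although in this particular toy case the large rate still ends up closer to 0 after 20 steps.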
Wikipedia
![Page 94: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/94.jpg)
Flat regions in energy landscape
![Page 95: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/95.jpg)
![Page 96: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/96.jpg)
![Page 97: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/97.jpg)
![Page 98: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/98.jpg)
Problem of fitting
• Too many parameters = overfitting
• Not enough parameters = underfitting
• More data = less chance to overfit
• How do we know what is required?
![Page 99: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/99.jpg)
Regularization
• Attempt to guide solution to not overfit
• But still give freedom with many parameters
![Page 100: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/100.jpg)
Data fitting problem
[Nielsen]
![Page 101: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/101.jpg)
Which is better? Which is better a priori?
1st order polynomial
9th order polynomial
[Nielsen]
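The comparison can be reproduced on synthetic data. The sketch below assumes a hypothetical ground truth (a straight line plus Gaussian noise) rather than the data in the figure, and uses NumPy's `polyfit` and `polyval`; the seed, noise level, and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(x):
    # Hypothetical ground truth: a straight line plus Gaussian noise.
    return 2.0 * x + rng.normal(scale=0.2, size=x.size)

x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_line(x_train)
x_test = np.linspace(0.05, 0.95, 10)   # fresh points from the same process
y_test = noisy_line(x_test)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Store (training error, held-out error) for each polynomial order.
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

With 10 training points, the 9th order polynomial can pass through every one of them, so its training error is never worse than the line's. The held-out error is the number that speaks to which model is better a priori.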
![Page 102: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/102.jpg)
Regularization
• Attempt to guide solution to not overfit
• But still give freedom with many parameters
• Idea: Penalize the use of parameters to prefer small weights.
![Page 103: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/103.jpg)
Regularization:
• Idea: add a cost to having high weights
• λ = regularization parameter
[Nielsen]
![Page 104: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/104.jpg)
Both can describe the data…
• …but one is simpler.
• Occam’s razor: “Among competing hypotheses, the one with the fewest assumptions should be selected”
For us:
Large weights cause large changes in behaviour in response to small changes in the input.
Simpler models (or smaller changes) are more robust to noise.
![Page 105: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/105.jpg)
Regularization
• Idea: add a cost to having high weights
• λ = regularization parameter
Normal cross-entropy loss (binary classes)
Regularization term
[Nielsen]
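The two labelled terms can be sketched as code. The equation itself did not survive in this transcript, so the form below is an assumption: binary cross-entropy averaged over n examples plus an L2 penalty scaled by λ/(2n), which is the convention I recall from the cited source. The function and argument names are hypothetical.

```python
import numpy as np

def regularized_cross_entropy(a, y, weights, lam):
    """a: predicted probabilities, y: 0/1 labels, weights: list of weight arrays."""
    n = y.size
    a = np.clip(a, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    # Normal cross-entropy loss (binary classes).
    data_loss = -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
    # Regularization term: cost grows with the squared size of the weights.
    penalty = lam / (2.0 * n) * sum(np.sum(w ** 2) for w in weights)
    return data_loss + penalty

a = np.array([0.9, 0.2])
y = np.array([1.0, 0.0])
weights = [np.array([1.0, -2.0])]

plain = regularized_cross_entropy(a, y, weights, lam=0.0)
regularized = regularized_cross_entropy(a, y, weights, lam=0.1)
print(plain, regularized)
```

Raising `lam` makes large weights more expensive relative to fitting the data; `lam=0` recovers the unregularized loss.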
![Page 106: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/106.jpg)
Regularization: Dropout
Our networks typically start with random weights.
Every time we train = slightly different outcome.
• Why random weights?
• If weights are all equal, the response across filters will be equivalent.
• The network doesn’t train: every filter receives the same update, so the filters never become different.
[Nielsen]
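A tiny hand-written network makes the equal-weights problem concrete. Everything here is a hypothetical setup for illustration: one hidden ReLU layer with three units, a squared-error loss, and all weights set to 0.5.

```python
import numpy as np

def hidden_weight_grads(W1, w2, x, y):
    """Gradient of 0.5 * (out - y)**2 w.r.t. W1, where out = w2 . relu(W1 @ x)."""
    z = W1 @ x
    h = np.maximum(z, 0.0)
    out = w2 @ h
    dh = (out - y) * w2          # backpropagate through the output layer
    dz = dh * (z > 0)            # backpropagate through the ReLU
    return np.outer(dz, x)       # one row of gradients per hidden unit

x = np.array([1.0, 2.0])
y = 1.0
W1_equal = np.full((3, 2), 0.5)  # all hidden weights equal
w2_equal = np.full(3, 0.5)       # all output weights equal

grads = hidden_weight_grads(W1_equal, w2_equal, x, y)
print(grads)
```

Every row of `grads` is identical, so after a gradient step the three hidden units are still copies of each other. Random initialization breaks this symmetry.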
![Page 107: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/107.jpg)
Regularization
Our networks typically start with random weights.
Every time we train = slightly different outcome.
• Why not train 5 different networks with random starts and vote on their outcome?
  • Works fine!
• Helps generalization because error due to overfitting is averaged; reduces variance.
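The voting step could be as simple as the sketch below. It assumes each network has already produced integer class labels; the prediction table is made-up data and `majority_vote` is a hypothetical helper, not part of any library.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of integer class labels."""
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # counts[c, s] = number of models that voted for class c on sample s.
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)   # ties go to the lowest class label

# Hypothetical outputs of 5 independently initialized networks on 4 samples.
preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
print(majority_vote(preds))  # [0 1 1 0]
```

Each network makes one isolated mistake here, and the vote removes all of them, which is the variance reduction the slide refers to.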
![Page 108: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/108.jpg)
Regularization: Dropout
[Nielsen]
![Page 109: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/109.jpg)
Regularization: Dropout
At each mini-batch:
- Randomly select a subset of neurons.
- Ignore them.
At test time: halve the outgoing weights to compensate for having trained with only half of the neurons active.
Effect:
- Neurons become less dependent on the output of connected neurons.
- Forces network to learn more robust features that are useful to more subsets of neurons.
- Like averaging over many different trained networks with different random initializations.
- Except cheaper to train.
[Nielsen]
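A minimal sketch of the scheme described on this slide, applied to a vector of activations. The function names and the `keep_prob` parameter are hypothetical. Scaling the activations by `keep_prob` at test time is equivalent to scaling the outgoing weights by the same factor. Many libraries instead rescale during training ("inverted dropout") so that nothing changes at test time; that variant is not shown.

```python
import numpy as np

def dropout_train(h, keep_prob, rng):
    # Randomly ignore neurons: each activation survives with probability keep_prob.
    mask = rng.random(h.shape) < keep_prob
    return h * mask

def dropout_test(h, keep_prob):
    # Keep every neuron, but scale down to match the expected training-time signal.
    return h * keep_prob

rng = np.random.default_rng(0)
h = np.ones(8)                      # hypothetical hidden activations
print(dropout_train(h, 0.5, rng))   # a fresh random mask for each mini-batch
print(dropout_test(h, 0.5))         # every entry is 0.5
```

With `keep_prob=0.5` this matches the slide: half the neurons are dropped per mini-batch on average, and the test-time signal is halved.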
![Page 110: PowerPoint Presentation€¦ · Our connectomics diagram Conv 1 3x3x4 64 filters Max pooling 2x2 per filter Conv 2 3x3x64 48 filters Max pooling 2x2 per filter Auto-generated from](https://reader034.fdocuments.in/reader034/viewer/2022050523/5fa644ee1b61e33d6970c3c2/html5/thumbnails/110.jpg)
Many forms of ‘regularization’
• Adding more data is a kind of regularization
• Pooling is a kind of regularization
• Data augmentation is a kind of regularization