Introduction to Deep Learning - cs.toronto.edu

40
Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2014 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 1 / 35

Transcript of Introduction to Deep Learning - cs.toronto.edu

Page 1: Introduction to Deep Learning - cs.toronto.edu

Introduction to Deep Learning

A. G. Schwing & S. Fidler

University of Toronto, 2014

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 1 / 35

Page 2: Introduction to Deep Learning - cs.toronto.edu

Outline

1 Universality of Neural Networks

2 Learning Neural Networks

3 Deep Learning

4 Applications

5 References

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 2 / 35

Page 3: Introduction to Deep Learning - cs.toronto.edu

What are neural networks?

Let’s ask

• Biological

• Computational

Input #1

Input #2

Input #3

Input #4

Output

Hiddenlayer

Inputlayer

Outputlayer

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 3 / 35

Page 4: Introduction to Deep Learning - cs.toronto.edu

What are neural networks?

...Neural networks (NNs) are computational models inspired bybiological neural networks [...] and are used to estimate or

approximate functions... [Wikipedia]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 4 / 35

Page 5: Introduction to Deep Learning - cs.toronto.edu

What are neural networks? The people behindOrigins:

Traced back to threshold logic [W. McCulloch and W. Pitts, 1943]Perceptron [F. Rosenblatt, 1958]

More recently:

G. Hinton (UofT)

Y. LeCun (NYU)

A. Ng (Stanford)

Y. Bengio (University of Montreal)

J. Schmidhuber (IDSIA, Switzerland)

R. Salakhutdinov (UofT)

H. Lee (University of Michigan)

R. Fergus (NYU)

... (and many more having made significant contributions; apologies fornot being able to mention everyone and all the students behind)

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 5 / 35

Page 6: Introduction to Deep Learning - cs.toronto.edu

What are neural networks? Use cases

ClassificationPlaying video gamesCaptchaNeural Turing Machine (e.g., learn how to sort) Alex Graves

http://www.technologyreview.com/view/532156/googles-secretive-deepmind-startup-unveils-a-neural-turing-machine/

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 6 / 35

Page 7: Introduction to Deep Learning - cs.toronto.edu

What are neural networks?Example:

input xparameters w1, w2, b

x ∈ Rh1

b ∈ R

fw1 w2

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 7 / 35

Page 8: Introduction to Deep Learning - cs.toronto.edu

How to compute the function?

Forward propagation/pass, inference, prediction:Given input x and parameters w , bCompute latent variables/intermediate results in a feed-forwardmannerUntil we obtain output function f

x ∈ Rh1

b ∈ R

fw1 w2

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 8 / 35

Page 9: Introduction to Deep Learning - cs.toronto.edu

How to compute the function?

Forward propagation/pass, inference, prediction:Given input x and parameters w , bCompute latent variables/intermediate results in a feed-forwardmannerUntil we obtain output function f

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 8 / 35

Page 10: Introduction to Deep Learning - cs.toronto.edu

How to compute the function?Example: input x , parameters w1, w2, b

x ∈ Rh1

b ∈ R

fw1 w2

h1 = σ(w1 · x + b)f = w2 · h1

Sigmoid function:σ(z) = 1/(1 + exp(−z))

x = ln 2, b = ln 3, w1 = 2, w2 = 2h1 =?f =?

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 9 / 35

Page 11: Introduction to Deep Learning - cs.toronto.edu

How to compute the function?Given parameters, what is f for x = 0, x = 1, x = 2, ...

f = w2σ(w1 · x + b)

−5 0 50

0.5

1

1.5

2

x

f

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 10 / 35

Page 12: Introduction to Deep Learning - cs.toronto.edu

Let’s mess with parameters:

x ∈ Rh1

b ∈ R

fw1 w2

h1 = σ(w1 · x + b)f = w2 · h1

σ(z) = 1/(1 + exp(−z))

w1 = 1.0 b = 0

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

b = −2b = 0b = 2

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

w1 = 0

w1 = 0.5

w1 = 1.0

w1 = 100

Keep in mind the step function.

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 11 / 35

Page 13: Introduction to Deep Learning - cs.toronto.edu

How to use Neural Networks for binary classification?Feature/Measurement: xOutput: How likely is the input to be a cat?

−5 0 50

0.2

0.4

0.6

0.8

1

x

y

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 12 / 35

Page 14: Introduction to Deep Learning - cs.toronto.edu

How to use Neural Networks for binary classification?Feature/Measurement: xOutput: How likely is the input to be a cat?

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 12 / 35

Page 15: Introduction to Deep Learning - cs.toronto.edu

How to use Neural Networks for binary classification?Feature/Measurement: xOutput: How likely is the input to be a cat?

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 12 / 35

Page 16: Introduction to Deep Learning - cs.toronto.edu

How to use Neural Networks for binary classification?Feature/Measurement: xOutput: How likely is the input to be a cat?

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 12 / 35

Page 17: Introduction to Deep Learning - cs.toronto.edu

How to use Neural Networks for binary classification?Shifted feature/measurement: xOutput: How likely is the input to be a cat?

Previous features Shifted features

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

Learning/Training means finding the right parameters.

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 13 / 35

Page 18: Introduction to Deep Learning - cs.toronto.edu

So far we are able to scale and translate sigmoids.

How well can we approximate an arbitrary function?With the simple model we are obviously not going very far.

Features are good Features are noisySimple classifier More complex classifier

−5 0 50

0.2

0.4

0.6

0.8

1

x

f

−5 0 50

0.2

0.4

0.6

0.8

1

x

fHow can we generalize?

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 14 / 35

Page 19: Introduction to Deep Learning - cs.toronto.edu

Let’s use more hidden variables:

x ∈ Rh1

b1

h2

b2

f

w1 w2

w3 w4

h1 = σ(w1 · x + b1)h2 = σ(w3 · x + b2)

f = w2 · h1 + w4 · h2

Combining two step functions gives a bump.

−5 0 51

1.2

1.4

1.6

1.8

2

x

f

w1 = −100, b1 = 40, w3 = 100, b2 = 60, w2 = 1, w4 = 1

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 15 / 35

Page 20: Introduction to Deep Learning - cs.toronto.edu

So let’s simplify:

x ∈ Rh1

b1

h2

b2

f

w1 w2

w3 w4

fBump(x1, x2, h)

We simplify a pair of hidden nodes to a “bump” function:Starts at x1

Ends at x2

Has height h

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 16 / 35

Page 21: Introduction to Deep Learning - cs.toronto.edu

Now we can represent “bumps” very well. How can we generalize?

f

Bump(0.0, 0.2, h1)

Bump(0.2, 0.4, h2)

Bump(0.4, 0.6, h3)

Bump(0.6, 0.8, h4)

Bump(0.8, 1.0, h5)

0 0.5 1−0.5

0

0.5

1

1.5

x

f

TargetApproximation

More bumps gives more accurate approximation.Corresponds to a single layer network.

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 17 / 35

Page 22: Introduction to Deep Learning - cs.toronto.edu

Universality: theoretically we can approximate an arbitraryfunctionSo we can learn a really complex cat classifierWhere is the catch?

Complexity, we might need quite a few hidden unitsOverfitting, memorize the training data

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 18 / 35

Page 23: Introduction to Deep Learning - cs.toronto.edu

Universality: theoretically we can approximate an arbitraryfunctionSo we can learn a really complex cat classifierWhere is the catch?Complexity, we might need quite a few hidden unitsOverfitting, memorize the training data

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 18 / 35

Page 24: Introduction to Deep Learning - cs.toronto.edu

Generalizations are possible toinclude more input dimensionscapture more output dimensionsemploy multiple layers for more efficient representations

See ‘http://neuralnetworksanddeeplearning.com/chap4.html’ for agreat read!

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 19 / 35

Page 25: Introduction to Deep Learning - cs.toronto.edu

How do we find the parameters to obtain a good approximation? Howdo we tell a computer to do that?

Intuitive explanation:Compute approximation error at the outputPropagate error back by computing individual contributions ofparameters to error

[Fig. from H. Lee]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 20 / 35

Page 26: Introduction to Deep Learning - cs.toronto.edu

Intuitive example:

Target function: 5x2

Approximation: fw (x)Domain of interest: x ∈ [0,1]Error:

e(w) =

∫ 1

0(5x2 − fw (x))2dx

Program of interest:

minw

e(w) = minw

∫ 1

0(5x2 − fw (x))2dx

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 21 / 35

Page 27: Introduction to Deep Learning - cs.toronto.edu

Chain rule is important: w1,w2, x ∈ RAssume

e(w1,w2) = f (w2,h(w1, x))

Derivatives are:

∂e(w1,w2)

∂w2=∂f (w2,h(w1, x))

∂w2

∂e(w1,w2)

∂w1=

∂f (w2,h(w1, x))∂w1

=∂f∂h· ∂h∂w1

Chain rule

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 22 / 35

Page 28: Introduction to Deep Learning - cs.toronto.edu

Back propagation does not work well for deep networks:Diffusion of gradient signal (multiplication of many small numbers)Attractivity of many local minima (random initialization is very farfrom good points)Requires a lot of training samplesNeed for significant computational power

Solution: 2 step approachGreedy layerwise pre-trainingPerform full fine tuning at the end

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 23 / 35

Page 29: Introduction to Deep Learning - cs.toronto.edu

Why go deep?

Representation efficiency (fewercomputational units for the same function)Hierarchical representation (non-localgeneralization)Combinatorial sharing (re-use of earliercomputation)Works very well

[Fig. from H. Lee]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 24 / 35

Page 30: Introduction to Deep Learning - cs.toronto.edu

To obtain more flexibility/non-linearity we use additional functionprototypes:

SigmoidRectified linear unit (ReLU)PoolingDropoutConvolutions

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 25 / 35

Page 31: Introduction to Deep Learning - cs.toronto.edu

Convolutions

What do the numbers mean?

See Sanja’s lecture 14 for the answers...[Fig. adapted from A. Krizhevsky]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 26 / 35

Page 32: Introduction to Deep Learning - cs.toronto.edu

Max Pooling

What is happening here?[Fig. adapted from A. Krizhevsky]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 27 / 35

Page 33: Introduction to Deep Learning - cs.toronto.edu

Rectified Linear Unit (ReLU)Drop information if smaller than zeroFixes the problem of vanishing gradients to some degree

DropoutDrop information at randomKind of a regularization, enforcing redundancy

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 28 / 35

Page 34: Introduction to Deep Learning - cs.toronto.edu

A famous deep learning network called “AlexNet.”

The network won the ImageNet competition in 2012.How many parameters?Given an image, what is happening?Inference Time: about 2ms per image when processing manyimages in parallelTraining Time: forever (maybe 2-3 weeks)

[Fig. adapted from A. Krizhevsky]

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 29 / 35

Page 35: Introduction to Deep Learning - cs.toronto.edu

Demo

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 30 / 35

Page 36: Introduction to Deep Learning - cs.toronto.edu

Neural networks have been used for many applications:

Classification and Recognition in Computer VisionText Parsing in Natural Language ProcessingPlaying Video GamesStock Market PredictionCaptcha

Demos:Russ websiteAntonio Places website

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 31 / 35

Page 37: Introduction to Deep Learning - cs.toronto.edu

Classification in Computer Vision: ImageNet Challengehttp://deeplearning.cs.toronto.edu/

Since it’s the end of the semester, let’s find the beach...

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 32 / 35

Page 38: Introduction to Deep Learning - cs.toronto.edu

Classification in Computer Vision: ImageNet Challengehttp://deeplearning.cs.toronto.edu/

A place to maybe prepare for exams...

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 33 / 35

Page 39: Introduction to Deep Learning - cs.toronto.edu

Links:

Tutorials: http://deeplearning.net/tutorial/deeplearning.pdfToronto Demo by Russ and students:http://deeplearning.cs.toronto.edu/MIT Demo by Antonio and students:http://places.csail.mit.edu/demo.htmlHonglak Lee:http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdfYann LeCun:http://www.cs.nyu.edu/ yann/talks/lecun-ranzato-icml2013.pdfRichard Socher: http://lxmls.it.pt/2014/socher-lxmls.pdf

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 34 / 35

Page 40: Introduction to Deep Learning - cs.toronto.edu

Videos:

Video games: https://www.youtube.com/watch?v=mARt-xPablECaptcha: http://singularityhub.com/2013/10/29/tiny-ai-startup-vicarious-says-its-solved-captcha/https://www.youtube.com/watch?v=lge-dl2JUAM#t=27Stock exchange:http://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Applications/stocks.html

A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2014 35 / 35