Machine Learning for Vision: Random Decision Forests and Deep Neural Networks
Kari Pulli, Senior Director, NVIDIA Research

Page 1

Machine Learning for Vision:
Random Decision Forests and Deep Neural Networks

Kari Pulli, Senior Director, NVIDIA Research

Page 2

Material sources

•  A. Criminisi & J. Shotton: Decision Forests Tutorial
   http://research.microsoft.com/en-us/projects/decisionforests/
•  J. Shotton et al. (CVPR 2011): Real-Time Human Pose Recognition in Parts from a Single Depth Image
   http://research.microsoft.com/apps/pubs/?id=145347
•  Geoffrey Hinton: Neural Networks for Machine Learning
   https://www.coursera.org/course/neuralnets
•  Andrew Ng: Machine Learning
   https://www.coursera.org/course/ml
•  Rob Fergus: Deep Learning for Computer Vision
   http://media.nips.cc/Conferences/2013/Video/Tutorial1A.pdf
   https://www.youtube.com/watch?v=qgx57X0fBdA

Page 3

Two types of supervised learning

•  Each training case consists of
   –  an input vector x
   –  a target output t
•  Regression: the target output is real numbers
•  Classification: the target output is a class label

Page 4

Supervised learning: "right answers" given. (Andrew Ng)

Figure: housing price prediction. Price ($, in 1000's) vs. size in feet². Regression: predict continuous-valued output (price).

Page 5

Classification example: breast cancer (malignant vs. benign). (Andrew Ng)

Figure: data plotted against features such as tumor size and age; other possible features include clump thickness, uniformity of cell size, uniformity of cell shape, …

Page 6

Supervised learning. (Andrew Ng)

Figure: labeled data in feature space (x1, x2).

Page 7

Unsupervised learning. (Andrew Ng)

Figure: unlabeled data in feature space (x1, x2).

Page 8

Clustering: the K-means algorithm. (Andrew Ng)
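The K-means idea mentioned above can be sketched in a few lines: alternately assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a minimal illustration, not the slide's implementation; the toy data and initialization are assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: alternately assign points to the nearest centroid
    and move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # distances from every point to every centroid: shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# two well-separated toy clusters in (x1, x2)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),    # cluster around (0, 0)
               rng.normal(8.0, 1.0, (50, 2))])   # cluster around (8, 8)
centroids, labels = kmeans(X, k=2)
```

With clusters this far apart, the two recovered centroids land near (0, 0) and (8, 8).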

Pages 9–16: figure-only slides from Andrew Ng's Machine Learning course.

Page 17

Decision Forests for computer vision and medical image analysis
A. Criminisi, J. Shotton and E. Konukoglu

http://research.microsoft.com/projects/decisionforests

Page 18

Decision trees and decision forests

A forest is an ensemble of trees. The trees are all slightly different from one another.

A general tree structure: a root node (0), internal (split) nodes (1–6), and terminal (leaf) nodes (7–14).

A decision tree: each split node asks a question, e.g. "Is the top part blue?", and each branch (true/false) continues with further questions such as "Is the bottom part green?" or "Is the bottom part blue?".

Page 19

Decision tree testing (runtime)

Input data lives in a feature space. An input test point is pushed through the tree: the data is split at each node, and a prediction is made at the leaf.

Page 20

Page 21

Decision tree training (off-line)

Input: training data. How to split the data? Binary tree or ternary? How big a tree?

Page 22

Training and information gain (for categorical, non-parametric distributions)

Node training: compare the class distribution before the split with candidate splits (split 1, split 2) and choose the split with the highest information gain.

Information gain: I = H(S) − Σ_i ( |S_i| / |S| ) H(S_i), summing over the children S_i of the split.

Shannon's entropy: H(S) = − Σ_c p(c) log p(c)
Page 23

Training and information gain (for continuous, parametric densities)

Node training: as before, compare the density before the split with candidate splits (split 1, split 2) via the information gain, now computed with the differential entropy of a Gaussian:

H(S) = (1/2) log( (2πe)^d |Λ(S)| )

where Λ(S) is the covariance of the data in S and d its dimensionality.

Page 24

The weak learner model

Splitting the data at node j: a node weak learner h(v, θ_j) with node test parameters θ_j.

Examples of weak learners (feature responses shown for a 2D example):
•  Axis-aligned: threshold a single coordinate.
•  Oriented line: threshold a generic line in homogeneous coordinates.
•  Conic section: threshold a quadratic form, with a matrix representing a conic.
Page 25

The prediction model: what do we do at the leaf?

Prediction model: probabilistic. Each leaf stores a distribution over the prediction (e.g. over class labels), estimated from the training points that reached it.

Page 26

Decision forest training (off-line)

How many trees? How different should they be? How to fuse their outputs?

Page 27

Decision forest model: the randomness model

1) Bagging (randomizing the training set): each tree t is trained on a randomly sampled subset of the full training set. This also makes forest training efficient.

Page 28

Decision forest model: the randomness model

2) Randomized node optimization (RNO): out of the full set T of all possible node test parameters, each node considers only a randomly sampled subset T_j when training its node weak learner h(v, θ_j).

The randomness control parameter ρ = |T_j|: for ρ = |T| there is no randomness and maximum tree correlation; for ρ = 1, maximum randomness and minimum tree correlation. The effect of ρ: a small value gives little tree correlation; a large value gives large tree correlation.

Page 29

Decision forest model: the ensemble model

An example forest to predict continuous variables.

Page 30

Classification forest: the ensemble model

Each tree t = 1, …, T yields a posterior p_t(c | v); the forest output probability averages them:

p(c | v) = (1/T) Σ_t p_t(c | v)
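The averaging step above is a one-liner; here is a sketch with hypothetical per-tree posteriors for T = 3 trees and 2 classes (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical per-tree posteriors p_t(c | v), e.g. read off the leaf
# distribution each tree sends the test point v to.
tree_posteriors = np.array([
    [0.9, 0.1],   # tree 1
    [0.6, 0.4],   # tree 2
    [0.7, 0.3],   # tree 3
])

# Forest output: p(c | v) = (1/T) * sum_t p_t(c | v)
forest_posterior = tree_posteriors.mean(axis=0)
predicted_class = int(forest_posterior.argmax())
```

Because each tree's posterior sums to 1, the average does too, so the forest output is again a valid distribution.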

Page 31

Classification forest: effect of the weak learner model

Training different trees in the forest; testing different trees in the forest. (2 videos on this page.)

Parameters: T = 200, D = 2, weak learner = axis-aligned, leaf model = probabilistic.

Three concepts to keep in mind:
•  "Accuracy of prediction"
•  "Quality of confidence"
•  "Generalization"

Figure: training points.

Page 32

Classification forest: effect of the weak learner model

Training different trees in the forest; testing different trees in the forest. (2 videos on this page.)

Parameters: T = 200, D = 2, weak learner = linear, leaf model = probabilistic.

Figure: training points.

Page 33

Classification forest: effect of the weak learner model

Training different trees in the forest; testing different trees in the forest. (2 videos on this page.)

Parameters: T = 200, D = 2, weak learner = conic, leaf model = probabilistic.

Figure: training points.

Page 34

Classification forest: with >2 classes

Training different trees in the forest; testing different trees in the forest. (2 videos on this page.)

Parameters: T = 200, D = 3, weak learner = conic, leaf model = probabilistic.

Figure: training points.

Page 35

Page 36

Classification forest: effect of tree depth

As the max tree depth D grows, behavior moves from underfitting to overfitting. Parameters: T = 200, weak learner = conic, predictor model = probabilistic, with D = 3, 6, 15. (3 videos on this page.)

Training points: 4-class mixed.

Page 37

Real-Time Human Pose Recognition in Parts from a Single Depth Image

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake

CVPR 2011

Page 38

Figure: depth image with labeled body parts, e.g. right elbow, right hand, left shoulder, neck.

Page 39

•  No temporal information: frame-by-frame.
•  Local pose estimate of parts: each pixel and each body joint treated independently.
•  Very fast: simple depth image features, parallel decision forest classifier.

Page 40

Pipeline: capture depth image & remove background → infer body parts per pixel → cluster pixels to hypothesize body joint positions → fit model & track skeleton.

Page 41

Train invariance to: …

Record mocap: 500k frames, distilled to 100k poses. Retarget to several models. Render (depth, body parts) pairs.

Page 42

Figure: example (depth, body parts) pairs: synthetic (train & test) and real (test).

Page 43

Depth comparison features: very fast to compute.

f(I, x) = d_I(x) − d_I(x + Δ),   Δ = v / d_I(x)

where d_I(x) is the depth at image coordinate x in input depth image I, and Δ is the offset giving the feature response. Normalizing the offset v by the depth means the offset scales inversely with depth, making the features depth-invariant. Background pixels are assigned d = a large constant.
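The feature above can be sketched directly from its definition. This is an illustrative toy version (the 4×4 depth image, probe offsets, and the choice of background constant are assumptions, not the paper's values):

```python
import numpy as np

def depth_feature(depth, x, v, bg=10_000.0):
    """Depth comparison feature f(I, x) = d_I(x) - d_I(x + v / d_I(x)).

    `depth` is a 2D array of depth values indexed (row, col); pixels that
    fall off the image take the large background constant `bg`.
    """
    def d(p):
        r, c = int(round(p[0])), int(round(p[1]))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return bg
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    offset = v / d(x)          # offset shrinks as the subject gets farther away
    return d(x) - d(x + offset)

# toy 4x4 depth image: left half near (depth 2), right half far (depth 5)
img = np.full((4, 4), 5.0)
img[:, :2] = 2.0
resp = depth_feature(img, x=(1, 1), v=(0.0, 4.0))    # probe to the right
off = depth_feature(img, x=(1, 1), v=(0.0, 100.0))   # probe lands off-image
```

The first probe compares a near pixel against a far one; the second shows how an off-image probe picks up the background constant.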

Page 44

Figure: input depth, ground truth parts, and inferred parts (soft), shown for trees of depth 1 through 18.

Page 45

Figure: average per-class accuracy (roughly 30%–65%) vs. depth of trees, on synthetic test data (depths 8–20) and real test data (depths 5–20).

Page 46

Figure: inferred body parts (most likely) vs. ground truth for 1, 3, and 6 trees, and average per-class accuracy (roughly 40%–55%) vs. number of trees (1–6).

Page 47

Figure: input depth, inferred body parts, and inferred joint positions, shown in front, side, and top views; no tracking or smoothing.

Page 48

Figure: a second example with the same layout: input depth, inferred body parts, inferred joint positions (front, side, top views); no tracking or smoothing.

Page 49

4. Track skeleton: use 3D joint hypotheses, kinematic constraints, and temporal coherence to give a full skeleton, higher accuracy, invisible joints, and multi-player support.

Page 50

Feed-forward neural networks

•  These are the most common type of neural network in practice.
   –  The first layer is the input and the last layer is the output.
   –  If there is more than one hidden layer, we call them "deep" neural networks.
   –  Hidden layers learn complex features; the outputs are learned in terms of those features.

Figure: input units → hidden units → output units.

Page 51

Linear neurons

•  These are simple but computationally limited.

y = b + Σ_i x_i w_i

where y is the output, b the bias, i an index over input connections, x_i the i-th input, and w_i the weight on the i-th input.

Page 52

Sigmoid neurons

•  These give a real-valued output that is a smooth and bounded function of their total input.
   –  They have nice derivatives, which make learning easy.

z = b + Σ_i x_i w_i,   y = 1 / (1 + e^(−z))

∂z/∂w_i = x_i,   ∂z/∂x_i = w_i,   dy/dz = y (1 − y)

Figure: y rises smoothly from 0 to 1 as a function of z, with y = 0.5 at z = 0.
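The sigmoid neuron and its derivative can be sketched directly from the equations above (the example inputs and weights are arbitrary illustrations):

```python
import math

def sigmoid(z):
    """Logistic function y = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Sigmoid neuron: z = b + sum_i x_i w_i, then y = sigmoid(z)."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return sigmoid(z)

y = sigmoid(0.0)       # y = 0.5 at z = 0
dy_dz = y * (1 - y)    # the slide's derivative y(1-y), = 0.25 at z = 0
out = neuron([1.0, 2.0], [0.5, -0.25], b=0.0)   # z = 0.5 - 0.5 = 0
```

The derivative y(1 − y) peaks at 0.25 when z = 0 and vanishes in the saturated tails, which is exactly what backpropagation multiplies by.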

Page 53

Finding weights with backpropagation

•  There is a big difference between the forward and backward passes.
•  In the forward pass we use squashing functions to prevent the activity vectors from exploding.
•  The backward pass is completely linear.
   –  The forward pass determines the slope of the linear function used for backpropagating through each neuron.

Pages 54–61

Backpropagating dE/dy

Setup: unit i in the layer below has output y_i and connects through weight w_ij to unit j, which has total input z_j and output y_j:

z_j = Σ_i y_i w_ij,   y_j = 1 / (1 + e^(−z_j)),   dy_j/dz_j = y_j (1 − y_j)

The slides build up the backward pass one step at a time:

•  Find the squared error at the output: E_j = ½ (y_j − t_j)², so ∂E/∂y_j = y_j − t_j.
•  Propagate the error across the non-linearity: ∂E/∂z_j = (dy_j/dz_j) · ∂E/∂y_j = y_j (1 − y_j) · ∂E/∂y_j.
•  Propagate the error to the next activation down, across the connections: ∂E/∂y_i = Σ_j (dz_j/dy_i) · ∂E/∂z_j = Σ_j w_ij · ∂E/∂z_j.
•  Compute the error gradient w.r.t. the weights: ∂E/∂w_ij = (∂z_j/∂w_ij) · ∂E/∂z_j = y_i · ∂E/∂z_j.
•  Repeat for the layer below.

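The chain-rule steps for one sigmoid layer can be written out as a short sketch. This is an illustration under assumptions (one layer, squared error, toy weights), not the lecture's code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """One sigmoid layer: z_j = b_j + sum_i x_i w_ij, y_j = sigmoid(z_j)."""
    z = [b[j] + sum(x[i] * w[i][j] for i in range(len(x)))
         for j in range(len(b))]
    return [sigmoid(zj) for zj in z]

def backward(x, y, t, w):
    """Backward pass with squared error E = 1/2 sum_j (y_j - t_j)^2:
         dE/dy_j = y_j - t_j
         dE/dz_j = y_j (1 - y_j) dE/dy_j
         dE/dw_ij = y_i dE/dz_j     (the inputs x play the role of y_i)
         dE/dy_i = sum_j w_ij dE/dz_j
    Returns (dE/dw, dE/dx)."""
    dE_dy = [yj - tj for yj, tj in zip(y, t)]
    dE_dz = [yj * (1 - yj) * g for yj, g in zip(y, dE_dy)]
    dE_dw = [[x[i] * dE_dz[j] for j in range(len(y))] for i in range(len(x))]
    dE_dx = [sum(w[i][j] * dE_dz[j] for j in range(len(y)))
             for i in range(len(x))]
    return dE_dw, dE_dx

x = [1.0, -1.0]
w = [[0.5], [0.25]]
b = [0.0]
y = forward(x, w, b)                           # z = 0.25
dE_dw, dE_dx = backward(x, y, t=[1.0], w=w)
```

Stacking layers just repeats `backward`, feeding each layer's dE/dx in as the dE/dy of the layer below.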
Page 62

Converting error derivatives into a learning procedure

•  The backpropagation algorithm is an efficient way of computing the

error derivative dE/dw for every weight on a single training case.

•  To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:

–  Optimization issues: How do we use the error derivatives on

individual cases to discover a good set of weights?

–  Generalization issues: How do we ensure that the learned weights

work well for cases we did not see during training?

Page 63

Gradient descent algorithm (Andrew Ng)

repeat until convergence {
    W := W − α ∂E/∂W
}
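The update rule above can be demonstrated on a toy one-dimensional error function (the quadratic, step size, and iteration count are illustrative choices, not from the slide):

```python
# Minimal gradient descent on E(W) = (W - 3)^2, whose gradient is
# dE/dW = 2 (W - 3); the minimum is at W = 3.
def gradient_descent(grad, w0, alpha=0.1, steps=200):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)    # W := W - alpha * dE/dW
    return w

w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Each step multiplies the distance to the minimum by (1 − 2α), so after 200 steps w_star is essentially 3.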

Page 64

Figure: error surface E(θ0, θ1) over parameters θ0, θ1. (Andrew Ng)

Page 65

Figure: the error surface E(θ0, θ1) from a different starting point. (Andrew Ng)

Page 66

Overfitting: The downside of using powerful models

•  The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise.

–  The target values may be unreliable

–  There is sampling error: accidental regularities just because of the particular training cases that were chosen.

•  When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.

–  So it fits both kinds of regularity.

–  If the model is very flexible it can model the sampling error really

well. This is a disaster.

Page 67

Example: logistic regression (with the sigmoid function). (Andrew Ng)

Figure: three decision boundaries in the (x1, x2) plane, of increasing model complexity.

Page 68

Preventing overfitting

•  Approach 1: Get more data!
   –  Almost always the best bet if you have enough compute power to train on more data.
•  Approach 2: Use a model that has the right capacity:
   –  enough to fit the true regularities
   –  not enough to also fit spurious regularities (if they are weaker).
•  Approach 3: Average many different models.
   –  Use models with different forms, or train them on different subsets of the training data (this is called "bagging").
•  Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.

Page 69

Some ways to limit the capacity of a neural net

•  The capacity can be controlled in many ways:

–  Architecture: Limit the number of hidden layers and the number of units per layer.

–  Early stopping: Start with small weights and stop the learning

before it overfits.

–  Weight-decay: Penalize large weights using penalties or

constraints on their squared values (L2 penalty) or absolute

values (L1 penalty).

–  Noise: Add noise to the weights or the activities.

•  Typically, a combination of several of these methods is used.
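The weight-decay item above amounts to adding the penalty's gradient to each weight's update. A minimal sketch of the L2 case (the step size and decay constant are illustrative):

```python
# L2 weight decay: the penalty (lam/2) * sum_i w_i^2 contributes lam * w_i
# to each weight's gradient, so every step also shrinks the weights.
def decay_step(w, grad, alpha=0.1, lam=0.01):
    """One gradient step with an L2 penalty added to the data gradient."""
    return [wi - alpha * (gi + lam * wi) for wi, gi in zip(w, grad)]

w = [2.0, -4.0]
# with a zero data gradient, the weights simply shrink toward zero
w_new = decay_step(w, grad=[0.0, 0.0])
```

An L1 penalty would instead add lam * sign(w_i), which pushes small weights exactly to zero.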

Page 70

Small Model vs. Big Model + Regularize

Figure: fits from a small model, a big model, and a big model with regularization.

Page 71

Cross-validation for choosing meta parameters

•  Divide the total dataset into three subsets:

–  Training data is used for learning the parameters of the model.

–  Validation data is not used for learning but is used for deciding what settings of the meta parameters work best.

–  Test data is used to get a final, unbiased estimate of how well the network works.
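The three-way split above is simple to sketch; the 60/20/20 proportions here are illustrative, not prescribed by the slide:

```python
import random

def three_way_split(data, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle a dataset and divide it into training / validation / test."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * frac_train)
    n_valid = int(n * frac_valid)
    return (items[:n_train],                    # learn model parameters
            items[n_train:n_train + n_valid],   # choose meta parameters
            items[n_train + n_valid:])          # final unbiased estimate

train, valid, test = three_way_split(range(100))
```

The key discipline is that the test set is touched only once, after all meta parameters have been fixed on the validation set.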

Page 72

Convolutional neural networks (currently the dominant approach for neural networks)

•  Use many different copies of the same feature detector with different positions.
   –  Replication greatly reduces the number of free parameters to be learned.
•  Use several different feature types, each with its own map of replicated detectors.
   –  Allows each patch of image to be represented in several ways.

Figure: the similarly colored connections all have the same weight.

Page 73

LeNet

•  Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:
   –  many hidden layers
   –  many maps of replicated convolution units in each layer
   –  pooling of the outputs of nearby replicated units
   –  a wide input that can cope with several digits at once, even if they overlap.
•  This net was used for reading ~10% of the checks in North America.
•  Look at the impressive demos of LeNet at http://yann.lecun.com

Page 74

Page 75

The architecture of LeNet5 

Page 76

Pooling the outputs of replicated feature detectors

•  Get a small amount of translational invariance at each level by averaging four neighboring outputs to give a single output.

–  This reduces the number of inputs to the next layer of feature extraction.

–  Taking the maximum of the four works slightly better.

•  Problem: After several levels of pooling, we have lost information about the precise positions of things.
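Both pooling variants described above fit in a few lines. This is an illustrative sketch on a tiny feature map, not LeNet's implementation:

```python
# Pool non-overlapping 2x2 blocks of a 2D feature map: either average the
# four neighboring outputs or take their maximum (which works slightly better).
def pool2x2(fmap, mode="max"):
    out = []
    for r in range(0, len(fmap), 2):
        row = []
        for c in range(0, len(fmap[0]), 2):
            block = [fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(max(block) if mode == "max" else sum(block) / 4.0)
        out.append(row)
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled_max = pool2x2(fmap)           # 4x4 map -> 2x2 map
pooled_avg = pool2x2(fmap, "avg")
```

The 4×4 map shrinks to 2×2, which is exactly the reduction in inputs to the next layer that the slide mentions, and also why precise positions get lost after several pooling levels.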

Page 77

The architecture of LeNet5 

Page 78

Page 79

The 82 errors made by LeNet5

Notice that most of the errors are cases that people find quite easy. The human error rate is probably 20 to 30 errors, but nobody has had the patience to measure it. Test set size is 10,000.

Page 80

The brute force approach

•  LeNet uses knowledge about the invariances to design:

–  the local connectivity

–  the weight-sharing

–  the pooling.

•  This achieves about 80 errors.

•  Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:

–  For each training image, they produce many new training examples by applying many different transformations.

–  They can then train a large, deep, dumb net on a GPU without much overfitting.

•  They achieve about 35 errors.
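The augmentation idea above, producing many transformed copies of each training image, can be sketched as follows. The specific transformations here (flips and 1-pixel shifts) are stand-in illustrations, not the carefully designed ones Ciresan et al. used:

```python
import random

def augment(image, n_copies=4, seed=0):
    """Produce transformed copies of a tiny 2D image (list of lists):
    random horizontal flips and 1-pixel right shifts stand in for
    'many different transformations'."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        img = [row[:] for row in image]
        if rng.random() < 0.5:                  # random horizontal flip
            img = [row[::-1] for row in img]
        if rng.random() < 0.5:                  # random 1-pixel right shift
            img = [[0] + row[:-1] for row in img]
        out.append(img)
    return out

digit = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
extra = augment(digit)
```

Every copy keeps the original label, so the network sees the same digit under many deformations and learns the invariances from data rather than from architecture.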

Page 81

From handwritten digits to 3-D objects

•  Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing handwritten digits:
   –  a hundred times as many classes (1000 vs. 10)
   –  a hundred times as many pixels (256 x 256 color vs. 28 x 28 gray)
   –  a two-dimensional image of a three-dimensional scene
   –  cluttered scenes requiring segmentation
   –  multiple objects in each image.
•  Will the same type of convolutional neural network work?

Page 82:

The ILSVRC-2012 competition on ImageNet

•  The dataset has 1.2 million high-resolution training images.

•  The classification task:

–  Get the “correct” class in your top 5 bets. There are 1000 classes.

•  The localization task:

–  For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
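The 50%-overlap criterion is commonly computed as intersection-over-union (IoU); the sketch below assumes that measure, with boxes given as corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# A predicted box counts as correct when iou(predicted, truth) >= 0.5.
```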

Page 83:

Examples from the test set (with the network’s guesses) 

Page 84:

Error rates on the ILSVRC-2012 competition

                                                          classification   classification & localization
•  University of Tokyo                                      26.1%            53.6%
•  Oxford University Computer Vision Group                  26.9%            50.0%
•  INRIA (French national research institute in CS) +
   XRCE (Xerox Research Center Europe)                      27.0%
•  University of Amsterdam                                  29.5%
•  University of Toronto (Alex Krizhevsky)                  16.4%            34.1%

Page 85:

A neural network for ImageNet

•  Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net. Its architecture: (figure)

•  The activation functions were:

–  Rectified linear units in every hidden layer.

•  These train much faster.

•  More expressive than logistic units.

–  Competitive normalization to suppress hidden activities when nearby units have stronger activities.

•  This helps with variations in intensity.

(Figure: the ReLU activation, z = 0 for y < 0 and z = y for y ≥ 0.)
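The two ingredients can be sketched in NumPy. The normalization below is a simple divisive scheme in the spirit of (but not identical to) the local response normalization in Krizhevsky's paper; the constants `k` and `alpha` are purely illustrative.

```python
import numpy as np

def relu(y):
    """Rectified linear unit: z = max(0, y)."""
    return np.maximum(0.0, y)

def divisive_normalize(acts, k=1.0, alpha=1e-4):
    """Suppress a unit when the units competing with it (here: all
    feature maps at the same spatial position, axis 0) are strongly
    active.  Illustrative constants, not the published ones."""
    energy = k + alpha * (acts ** 2).sum(axis=0, keepdims=True)
    return acts / np.sqrt(energy)

acts = relu(np.array([[-1.0, 2.0], [3.0, -0.5]]))  # negatives clamp to 0
normed = divisive_normalize(acts)
```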

Page 86:

Tricks that significantly improve generalization

•  Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.

•  At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.

•  Use “dropout” to regularize the weights in the globally connected layers (which contain most of the parameters).

–  Dropout means that half of the hidden units in a layer are randomly removed for each training example.

–  This stops hidden units from relying too much on other hidden units. 
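Dropout as described above can be sketched as follows. This uses the "inverted" formulation, which rescales the surviving units during training so the expected activity is unchanged; the original formulation instead halves the outgoing weights at test time.

```python
import numpy as np

def dropout(h, rng, p_drop=0.5):
    """Training-time dropout: randomly zero a fraction p_drop of the
    hidden units and rescale the survivors ("inverted dropout")."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(10)
h_train = dropout(h, rng)   # roughly half the units are zeroed
h_test = h                  # no dropout at test time
```

Because each training example sees a different random half of the layer, no hidden unit can rely on any particular other unit being present.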

Page 87:

The hardware required for Alex’s net

•  He uses a very efficient implementation of convolutional nets on two NVIDIA GTX 580 Graphics Processor Units (over 1000 fast little cores).

–  GPUs are very good for matrix-matrix multiplies.

–  GPUs have very high bandwidth to memory.

–  This allows him to train the network in a week.

–  It also makes it quick to combine results from 10 patches at test time.

•  We can spread a network over many cores if we can communicate the states fast enough.

•  As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e., pre-Oct 2012) computer vision systems.

Page 88:

Auto-Encoders

Page 89:

Restricted Boltzmann Machines

•  Simple stochastic neural net

–  Only one layer of hidden units.

–  No connections between hidden units.

•  Idea:

–  The hidden layer should “auto-encode” the input.

p(h_j = 1) = 1 / (1 + exp(−(b_j + Σ_{i∈vis} v_i w_ij)))

(Figure: a layer of visible units i fully connected to a layer of hidden units j.)
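The hidden-unit probability above, in NumPy (a sketch with made-up layer sizes):

```python
import numpy as np

def sample_hidden(v, W, b_hid, rng):
    """p(h_j = 1) = sigmoid(b_j + sum_i v_i w_ij); then sample binary h."""
    p_h = 1.0 / (1.0 + np.exp(-(b_hid + v @ W)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    return h, p_h

rng = np.random.default_rng(0)
n_vis, n_hid = 784, 50                     # e.g. 28x28 pixels, 50 features
W = rng.normal(0, 0.01, (n_vis, n_hid))    # small random weights to break symmetry
b_hid = np.zeros(n_hid)
v = rng.integers(0, 2, n_vis).astype(float)
h, p_h = sample_hidden(v, W, b_hid, rng)
```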

Page 90:

Contrastive divergence to train an RBM

Δw_ij = ε ( <v_i h_j>^0 − <v_i h_j>^1 )

•  Start with a training vector on the visible units (the “data” phase, t = 0).

•  Update all the hidden units in parallel.

•  Update all the visible units in parallel to get a reconstruction (t = 1).

•  Update the hidden units again.

(Figure: the data phase measures <v_i h_j>^0; the reconstruction phase measures <v_i h_j>^1.)
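The steps above can be sketched as a single CD-1 weight update. This is a minimal illustration (using probabilities rather than samples in the negative phase, a common simplification); the learning rate and layer sizes are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, rng, eps=0.1):
    """One contrastive-divergence (CD-1) update:
    Delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1)."""
    # t = 0: hidden units driven by the data
    p_h0 = sigmoid(b_hid + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # t = 1: reconstruct the visibles, then update the hiddens again
    p_v1 = sigmoid(b_vis + h0 @ W.T)
    p_h1 = sigmoid(b_hid + p_v1 @ W)
    # positive statistics minus negative statistics
    return W + eps * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (6, 3))
v0 = np.array([1., 0., 1., 1., 0., 0.])
W = cd1_step(v0, W, np.zeros(6), np.zeros(3), rng)
```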

Page 91:

Explanation

Δw_ij = ε ( <v_i h_j>^0 − <v_i h_j>^1 )

Ideally, the hidden layer re-generates the input. If that’s not the case, the hidden layer generates something else; change the weights so that this wouldn’t happen.

(Figure: the same data/reconstruction diagram as on the previous slide.)

Page 92:

The weights of the 50 feature detectors

We start with small random weights to break symmetry

Page 93:
Page 94:
Page 95:
Page 96:
Page 97:
Page 98:
Page 99:
Page 100:
Page 101:

The final 50 x 256 weights: Each neuron grabs a different feature

Page 102:

How well can we reconstruct digit images from the binary feature activations?

(Figure: pairs of “Data” and “Reconstruction from activated binary features”.)

•  New test image from the digit class that the model was trained on.

•  Image from an unfamiliar digit class: the network tries to see every image as a 2.

Page 103:

The DBN used for modeling the joint distribution of MNIST digits and their labels

(Architecture: 28 x 28 pixel image → 500 units → 500 units → 2000 units, with 10 label units joined to the top layer.)

•  The first two hidden layers are learned without using labels.

•  The top layer connects the labels to the features in the second hidden layer.

•  The weights are then fine-tuned to be a better generative model using backpropagation.

Page 104:

Krizhevsky’s deep autoencoder

(Architecture: 1024 / 1024 / 1024 inputs → 8192 → 4096 → 2048 → 1024 → 512 → 256-bit binary code.)

•  The encoder has about 67,000,000 parameters.

•  It takes a few days on a GTX 285 GPU to train on two million images.

Page 105:

Reconstructions of 32x32 color images from 256-bit codes

Page 106:

retrieved using 256 bit codes

retrieved using Euclidean distance in pixel intensity space

Page 107:

retrieved using 256 bit codes

retrieved using Euclidean distance in pixel intensity space
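Retrieval with the 256-bit codes can be sketched as a nearest-neighbour search under Hamming distance. This is an assumed retrieval procedure for illustration; the slides do not specify the exact search used.

```python
import numpy as np

def retrieve(query_code, database_codes, k=5):
    """Return indices of the k database images whose 256-bit codes
    are closest to the query in Hamming distance."""
    dist = (database_codes != query_code).sum(axis=1)
    return np.argsort(dist)[:k]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, (1000, 256))   # 1000 images, one 256-bit code each
hits = retrieve(db[42], db)            # nearest neighbours of image 42
```

Comparing 256-bit codes is far cheaper than comparing raw pixels, which is what makes the learned codes attractive for retrieval even before considering their semantic quality.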