Lecture 9 Pattern Recognition & Learning
Transcript of Lecture 9 Pattern Recognition & Learning
Lecture 9: Pattern Recognition & Learning
© UW CSE vision faculty
(Rowley, Baluja & Kanade, 1998)
Motivation: Object Classification
Suppose you are given a dataset of images containing 2 classes of objects
Test Set of Images
Can a computer vision system learn to automatically classify these new images?
Images as Patterns
Binary handwritten characters
Greyscale images (example grid of pixel values):
62 79 23 119 120 105 4 0
10 10 9 62 12 78 34 0
10 58 197 46 46 0 0 48
176 135 5 188 191 68 0 49
2 1 1 29 26 37 0 77
0 89 144 147 187 102 62 208
255 252 0 166 123 62 0 31
166 63 127 17 1 0 99 30
Treat an image as a high-dimensional vector (e.g., by reading pixel values left to right, top to bottom, row by row)
$$I = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{N-1} \\ p_N \end{bmatrix}$$
Pixel value pi can be 0 or 1 (binary image) or 0 to 255 (greyscale)
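The lecture gives no code; as a minimal illustration (assuming numpy, which the slides do not mention), here is a short sketch that turns the 8×8 greyscale grid above into a length-64 vector by reading pixels left to right, top to bottom.

```python
import numpy as np

# The 8x8 greyscale grid shown above (pixel values 0-255)
img = np.array([
    [ 62,  79,  23, 119, 120, 105,   4,   0],
    [ 10,  10,   9,  62,  12,  78,  34,   0],
    [ 10,  58, 197,  46,  46,   0,   0,  48],
    [176, 135,   5, 188, 191,  68,   0,  49],
    [  2,   1,   1,  29,  26,  37,   0,  77],
    [  0,  89, 144, 147, 187, 102,  62, 208],
    [255, 252,   0, 166, 123,  62,   0,  31],
    [166,  63, 127,  17,   1,   0,  99,  30],
])

# Read pixel values left to right, top to bottom to get the vector I of length N = 64
I = img.flatten()
print(I.shape)  # (64,)
```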
Feature representation
• Trying to classify raw images directly may be
  - inefficient (huge number of pixels N)
  - error-prone (raw pixel values not invariant to transformations)
• Better to extract features from the image and use these for classification
• Represent each image I by a vector of features:
n is typically much smaller than N (though doesn’t have to be)
$$F_I = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_{n-1} \\ f_n \end{bmatrix}$$
Types of Features: Binary Images
• Features for binary characters (‘A’, ‘B’, ‘C’, …) could be number of strokes, number of holes, area, etc.
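As an illustration (not from the lecture), here is a hedged sketch of extracting two of these features, area and number of holes, from a binary character image; it assumes numpy and scipy.ndimage are available, and it omits stroke counting.

```python
import numpy as np
from scipy import ndimage

def binary_char_features(img):
    """Simple features for a binary character image (1 = ink, 0 = background).

    Returns (area, number_of_holes); stroke counting is omitted for brevity."""
    area = int(img.sum())                        # number of "on" pixels

    # Holes = connected background regions other than the outer background.
    # Pad with background so the outer background forms a single component.
    padded = np.pad(img, 1, constant_values=0)
    _, n_background = ndimage.label(padded == 0)
    n_holes = n_background - 1
    return area, n_holes
```

For example, a clean image of ‘B’ would give two holes, while ‘C’ would give none.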
Types of Features: Grayscale and Color
• Features for greyscale images could be oriented gradient features, multiscale oriented patches (MOPS), SIFT features, etc.
• Features for color images could be the above features applied to the R, G, B images, or to opponent images (R-G image, B-(R+G)/2 image)
Typical Pattern Recognition System
Pattern recognition or classification problem: Given a training dataset of (input image, output class) pairs, build a classifier that outputs a class for any new input image
Example: Dataset of Binary Character Images
[Table: each row lists the feature values extracted from an input image together with its class label.]
Decision Tree
[Diagram: a decision tree for character classification. The root tests #holes; later nodes test #strokes, area, and best axis direction; leaves output characters such as '-', '/', '1', '*', 'x', 'w', '0', 'A', '8', 'B'. Internal nodes make feature-based decisions; leaves give the output.]
Decision Trees
Input: Description of an object through a set of features or attributes
Output: a decision that is the predicted output value for the input
Advantages:
• Not all features need be evaluated for every input
• Feature extraction may be interleaved with classification decisions
• Can be easy to design and efficient in execution
Feature values can be discrete or continuous
Example: Decision Tree for Continuous Valued Features
Two features x1 and x2; two output classes 0 and 1
[Scatter plot of the training data in the (x1, x2) plane.]
How do we branch using feature values x1 and x2 to partition the space correctly?
Example: Decision Tree for Continuous Valued Features
[Diagram: the (x1, x2) plane partitioned by axis-parallel splits (at values 3 and 4), together with the corresponding decision tree.]
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example
• But it most likely won't generalize to new examples
Want to find more compact decision trees (to prevent overfitting and allow generalization)
Decision Tree Learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the "most significant" attribute (feature) as root of the (sub)tree and expand
Choosing an attribute/feature to split on
Idea: a good feature should reduce uncertainty
• E.g., splits the examples into subsets that are (ideally) "all positive" or "all negative"
Feature 1 is a better choice; with Feature 2, the output class probability is still at 50% in each subset.
[Diagram: positive and negative class inputs split by Feature 1 (branches A, B, C) and by Feature 2 (branches a, b, c, d).]
How do we quantify uncertainty?
Entropy measures the amount of uncertainty in a probability distribution
Entropy (or Information Content) of an answer to a question with possible answers v1, … , vn:
I(P(v1), … , P(vn)) = - Σi P(vi) log2 P(vi)
Using information theory to quantify uncertainty
Using information theory
Imagine we have p examples with Feature1 = 1 or true, and n examples with Feature1 = 0 or false.
Our best estimate of the probabilities of Feature1 = true or false is given by:
$$P(\text{true}) \approx \frac{p}{p+n}, \qquad P(\text{false}) \approx \frac{n}{p+n}$$
Hence the entropy of Feature1 is given by:
$$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
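A minimal Python sketch of this binary entropy (my own illustration, not from the lecture), using the convention 0 · log2 0 = 0:

```python
import math

def entropy(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:                 # convention: 0 * log2(0) = 0
            q = count / total
            result -= q * math.log2(q)
    return result

print(entropy(6, 6))   # 1.0 bit (the 6-vs-6 example below)
print(entropy(2, 4))   # ~0.92 bits (branch "C" in the example below)
```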
[Plot: entropy I as a function of P(Feature1 = T); it is 0 at P = 0 (all Feature1 = F) and at P = 1 (all Feature1 = T), and peaks at 1.0 when P = 0.5.]
Entropy is highest when uncertainty is greatest
Idea: a good feature should reduce uncertainty and result in “gain in information”
How much information do we gain if we disclose the value of some feature?
Answer: uncertainty before – uncertainty after
Choosing an attribute/feature to split on
Example
Before choosing any feature: Entropy = - 6/12 log(6/12) – 6/12 log(6/12)
= - log(1/2) = log(2) = 1 bit
There is “1 bit of information to be discovered”
[Diagram: the 12 examples split by Feature1 into branches A, B, C, or by Feature2 into branches a, b, c, d.]
Example
If we choose Feature2: Go along branch “a”: we have entropy = 1 bit; similarly for the others.
Information gain = 1 - 1 = 0 along any branch
If we choose Feature1: in branches “A” and “B”, entropy = 0. For “C”, entropy = -2/6 log(2/6) - 4/6 log(4/6) = 0.92
Info gain = (1 - 0) or (1 - 0.92) bits > 0 in both cases
So choosing Feature1 gains more information!
Entropy across branches
• How do we combine entropy of different branches?
• Answer: Compute average entropy
• Weight entropies according to probabilities of branches
2/12 times we enter “A”, so weight for “A” = 1/6
“B” has weight 4/12 = 1/3; “C” has weight 6/12 = 1/2
$$\text{AvgEntropy}(A) = \sum_{i=1}^{m} \frac{p_i + n_i}{p + n}\, \text{Entropy}\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$$
Here (p_i + n_i)/(p + n) is the weight for each branch and Entropy(…) is the entropy of each branch; the sum runs over the m branches of the feature (e.g., A, B, C for Feature1).
Information gain
Information Gain (IG), or reduction in entropy, from using feature A:
IG(A) = Entropy before - AvgEntropy after choosing A
1. Choose the feature/attribute with the largest IG
2. Create a (sub)tree with this feature as root
3. Recursively call the algorithm for each value of the feature
[Diagram: dataset D is split on feature F with values v1, v2, …, vk; the subset D′ = {s ∈ D | value(F) = v1} is passed down its branch, and the algorithm repeats recursively on each subset.]
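The lecture gives no code for this procedure; the following is a minimal Python sketch of the recursive information-gain algorithm above for discrete features, with a small hypothetical character dataset added purely for illustration.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """IG(feature) = entropy before - average entropy after splitting on the feature."""
    after = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        after += (len(subset) / len(labels)) * entropy_of(subset)
    return entropy_of(labels) - after

def build_tree(examples, labels, features):
    """Recursively build a decision tree, stored as nested dicts."""
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not features:                          # no features left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, labels, f))   # step 1
    tree = {"feature": best, "branches": {}}                             # step 2
    for value in set(ex[best] for ex in examples):                       # step 3
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree["branches"][value] = build_tree(
            [examples[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best])
    return tree

# Hypothetical toy dataset: classify characters from two discrete features
examples = [{"holes": 1, "strokes": 3}, {"holes": 2, "strokes": 1},
            {"holes": 0, "strokes": 1}, {"holes": 1, "strokes": 2}]
labels = ["A", "B", "1", "A"]
print(build_tree(examples, labels, ["holes", "strokes"]))
```

Each recursive call removes the chosen feature and splits the remaining examples by its values, exactly as in steps 1-3 above.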
Performance Measurement
How do we test the performance of the learned tree?
Answer: Try it on a test set of examples not used in training
Learning curve = % correct on the test set as a function of training set size
Cross-validation
Instead of holding out only 1 subset as the test set, it is better to use K-fold cross-validation:
• Divide the data into K subsets of equal size
• Train the learning algorithm K times, each time leaving out one of the subsets; compute the error on the left-out subset
• Report the average error over all subsets

Leave-1-out cross-validation:
• Train on all but 1 data point, test on that data point; repeat for each point
• Report the average error over all points
Confusion matrix
Useful for characterizing recognition performance
Quantifies the amount of “confusion” between similar classes
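As a small illustration (not from the lecture), a confusion matrix can be built by counting how often each true class is predicted as each other class; rows are true classes, columns are predicted classes.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Entry [i, j] counts how often true class i was classified as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

print(confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_classes=3))
```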
Other classification methods
These utilize the full feature vector for each input
Classification using nearest class mean
Given a new input image I, compute the distance (e.g., Euclidean distance) between its feature vector F_I and the mean of each class
Choose the closest class, if close enough (reject otherwise)
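A minimal numpy sketch of this rule (my own illustration); the reject_dist threshold is a hypothetical parameter for the “reject otherwise” case.

```python
import numpy as np

def nearest_class_mean(train_features, train_classes, f_new, reject_dist=np.inf):
    """Classify f_new by the closest class mean (Euclidean distance);
    return None (reject) if even the closest mean is farther than reject_dist."""
    f_new = np.asarray(f_new)
    classes = sorted(set(train_classes))
    means = {c: np.mean([f for f, lab in zip(train_features, train_classes) if lab == c], axis=0)
             for c in classes}
    dists = {c: np.linalg.norm(f_new - means[c]) for c in classes}
    best = min(dists, key=dists.get)
    return best if dists[best] <= reject_dist else None
```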
If the class distributions are complex…
Class 2 has two clusters. Where is its mean?
The nearest class mean method will likely fail badly in this case
● New input point: what is its class?
Nearest Neighbor Classification
• Keep all the training samples in some efficient look-up structure
• Find the nearest neighbor of the feature vector to be classified and assign the class of the neighbor
• Can be extended to K nearest neighbors
K-Nearest Neighbors
Idea:
• Look around you to see how your neighbors classify data
• Classify a new data point according to a majority vote of your K nearest neighbors
Example
Input data: 2-D points (x1, x2)
Two classes: C1 and C2. New data point: +
K = 4: Look at the 4 nearest neighbors of +. 3 are in C1, so classify + as C1
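A minimal numpy sketch of K-NN majority voting (an illustration; the 2-D points below are hypothetical, not the ones on the slide):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x_new, K=4):
    """Classify x_new by a majority vote of its K nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x_new, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with classes C1 / C2
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0], [0.9, 1.1]])
train_y = ["C1", "C1", "C1", "C2", "C2"]
print(knn_classify(train_X, train_y, np.array([0.2, 0.2]), K=4))  # -> "C1" (3 of the 4 votes)
```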
Decision Boundary using K-NN
Some points near the boundary may be misclassified
K-NN is for girlie men – what about something stronger?
http://www.ipjnet.com/schwarzenegger2/pages/arnold_01.htm
The human brain is extremely good at classifying objects in images
Can we develop classification methods by emulating the brain?
Neurons compute using spikes
[Diagram: a neuron with many inputs producing an output spike (electrical pulse).]
Output spike roughly dependent on whether sum of all inputs reaches a threshold
Neurons as “Threshold Units”
Artificial neuron:
• m binary inputs (-1 or 1) and 1 output (-1 or 1)
• Synaptic weights w_ji
• Threshold μ_i

[Diagram: inputs u_j (-1 or +1) are multiplied by weights w_1i, w_2i, w_3i, summed, and thresholded to give the output v_i (-1 or +1).]

Θ(x) = 1 if x > 0 and -1 if x ≤ 0
$$v_i = \Theta\Big(\sum_j w_{ji}\, u_j - \mu_i\Big)$$
“Perceptrons” for Classification
Fancy name for a type of layered “feed-forward” network (no loops)
Uses artificial neurons (“units”) with binary inputs and outputs
[Diagrams: a single-layer and a multilayer perceptron.]
Perceptrons and Classification
Consider a single-layer perceptron:
• The weighted sum forms a linear hyperplane
$$\sum_j w_{ji}\, u_j - \mu_i = 0$$
• Due to the threshold function, everything on one side of this hyperplane is labeled as class 1 (output = +1) and everything on the other side is labeled as class 2 (output = -1)

Any function that is linearly separable can be computed by a perceptron
Linear Separability
Example: AND is linearly separable
A linear hyperplane (a line in 2D)
[Diagram: a threshold unit with inputs u1, u2, threshold μ = 1.5, and output v; in the (u1, u2) plane a line separates the point (1,1) from the other three input points.]

u1  u2  AND
-1  -1  -1
 1  -1  -1
-1   1  -1
 1   1   1
v = 1 iff u1 + u2 – 1.5 > 0
Similarly for OR and NOT
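A tiny Python check (my own illustration) of the AND unit above, using the threshold function Θ defined earlier:

```python
def theta(x):
    """Threshold: +1 if x > 0, else -1."""
    return 1 if x > 0 else -1

def and_unit(u1, u2):
    """Single threshold unit with weights (1, 1) and threshold 1.5: v = Θ(u1 + u2 - 1.5)."""
    return theta(u1 + u2 - 1.5)

for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, and_unit(u1, u2))   # +1 only for (1, 1), matching the AND table
```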
What about the XOR function?
[Diagram: the four points (±1, ±1) in the (u1, u2) plane, labeled with the XOR outputs from the table below.]

u1  u2  XOR
-1  -1   1
 1  -1  -1
-1   1  -1
 1   1   1
Can a straight line separate the +1 outputs from the -1 outputs?
Linear Inseparability
Single-layer perceptron with threshold units fails if the classification task is not linearly separable
• Example: XOR
• No single line can separate the “yes” (+1) outputs from the “no” (-1) outputs!
Minsky and Papert’s book showing such negative results put a damper on neural networks research for over a decade!
How do we deal with linear inseparability?
Multilayer Perceptrons
Removes limitations of single-layer networks
• Can solve XOR
Example: Two-layer perceptron that computes XOR
Output is +1 if and only if x + y – 2Θ(x + y – 1.5) – 0.5 > 0
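As a quick check (my own illustration), the expression above can be evaluated for all four ±1 input combinations; with Θ as defined earlier, the network outputs +1 exactly when the two inputs differ.

```python
def theta(x):
    """Threshold: +1 if x > 0, else -1."""
    return 1 if x > 0 else -1

def xor_net(x, y):
    """Two-layer perceptron from the slide: hidden unit h = Θ(x + y - 1.5),
    output = Θ(x + y - 2*h - 0.5)."""
    h = theta(x + y - 1.5)
    return theta(x + y - 2 * h - 0.5)

for x in (-1, 1):
    for y in (-1, 1):
        print(x, y, xor_net(x, y))   # +1 exactly when x and y differ
```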
Multilayer Perceptron: What does it do?
[Diagram: a two-layer perceptron with inputs x and y, two hidden threshold units, and an output unit; its weights and thresholds are analyzed on the following slides.]
Example: Perceptrons as Constraint Satisfaction Networks
[Diagram: the line defined by the first hidden unit splits the (x, y) plane; the unit outputs +1 on one side of the line and -1 on the other.]
Example: Perceptrons as Constraint Satisfaction Networks
[Diagram: the line defined by the second hidden unit splits the (x, y) plane; the unit outputs +1 on one side of the line and -1 on the other.]
Example: Perceptrons as Constraint Satisfaction Networks
[Diagram: the output unit combines the two hidden unit outputs.]
Output region defined by combining hidden unit outputs
Example: Perceptrons as Constraint Satisfaction Networks
[Diagram: the two hidden-unit lines bound a region of the (x, y) plane.]
Output is 1 if and only if the inputs satisfy the two constraints
How do we learn the appropriate weights given only examples of (input,output)?
Idea: Change the weights to decrease the error in the output
Learning Multilayer Networks
We want networks that can learn to map inputs to outputs
• Assume outputs are real-valued between 0 and 1 (instead of only 0 and 1, or -1 and 1)
– Can threshold output to decide if class 0, class 1, or Reject
• Idea: Given data, minimize errors between network’s output and desired output by changing weights
To minimize errors, a differentiable output function is desirable (threshold function won’t do)
Sigmoidal Networks
The most commonly used differentiable function is the sigmoid function:
$$g(a) = \frac{1}{1 + e^{-\beta a}}$$
Non-linear “squashing” function: squashes the input to be between 0 and 1. The parameter β controls the slope.
[Diagram: input nodes u = (u1 u2 u3)^T connected by weights w to an output node.]
$$v = g(\mathbf{w}^T \mathbf{u}) = g\Big(\sum_i w_i u_i\Big)$$
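A minimal numpy sketch of a sigmoidal unit (my own illustration; the input and weight values are hypothetical):

```python
import numpy as np

def g(a, beta=1.0):
    """Sigmoid squashing function g(a) = 1 / (1 + exp(-beta * a)); beta controls the slope."""
    return 1.0 / (1.0 + np.exp(-beta * a))

u = np.array([0.5, -1.0, 2.0])   # hypothetical inputs u1, u2, u3
w = np.array([0.1, 0.4, 0.3])    # hypothetical weights
v = g(w @ u)                     # output v = g(w^T u), a value between 0 and 1
print(v)
```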
Gradient-Descent Learning (“Hill-Climbing”)
Given training examples (um,dm) (m = 1, …, N), define a sum of squared output errors function (also called a cost function or “energy” function)
$$E(\mathbf{w}) = \frac{1}{2}\sum_m (d^m - v^m)^2 \quad \text{where } v^m = g(\mathbf{w}^T \mathbf{u}^m)$$
Gradient-Descent Learning (“Hill-Climbing”)
Would like to change w so that E(w) is minimized
• Gradient Descent: Change w in proportion to -dE/dw (why?)
$$\mathbf{w} \rightarrow \mathbf{w} - \varepsilon\,\frac{dE}{d\mathbf{w}}$$
$$\frac{dE}{d\mathbf{w}} = -\sum_m (d^m - v^m)\,\frac{dv^m}{d\mathbf{w}} = -\sum_m (d^m - v^m)\, g'(\mathbf{w}^T\mathbf{u}^m)\,\mathbf{u}^m$$
(g′ is the derivative of the sigmoid)
“Stochastic” Gradient Descent
What if the inputs only arrive one by one?
Stochastic gradient descent approximates the sum over all inputs with an “on-line” running sum:
$$\mathbf{w} \rightarrow \mathbf{w} - \varepsilon\,\frac{dE_1}{d\mathbf{w}}, \qquad \frac{dE_1}{d\mathbf{w}} = -(d^m - v^m)\, g'(\mathbf{w}^T\mathbf{u}^m)\,\mathbf{u}^m$$
Also known as the “delta rule” or “LMS (least mean square) rule”; the delta is the error (d^m - v^m)
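A minimal numpy sketch of one on-line delta-rule update (my own illustration), following the rule above:

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))    # sigmoid (beta = 1)

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)               # derivative of the sigmoid

def delta_rule_step(w, u_m, d_m, eps=0.1):
    """One stochastic update: w <- w + eps * (d - v) * g'(w^T u) * u."""
    a = w @ u_m
    v = g(a)
    return w + eps * (d_m - v) * g_prime(a) * u_m
```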
But wait… what if we have multiple layers?
[Diagram: a two-layer network with input u = (u1 u2 … uK)^T and output v = (v1 v2 … vJ)^T; desired output = d. The delta rule can be used to adapt the hidden-to-output weights, but how do we adapt the input-to-hidden weights?]
Enter…the backpropagation algorithm
(Actually, nothing but the chain rule from calculus)
Backpropagation: Uppermost layer (delta rule)

Learning rule for hidden-output weights W:
$$E(\mathbf{W}, \mathbf{w}) = \frac{1}{2}\sum_i (d_i - v_i)^2 \quad \text{where } v_i = g\Big(\sum_j W_{ij}\, x_j\Big)$$
(u_k are the inputs and x_j are the hidden unit outputs)
$$W_{ij} \rightarrow W_{ij} - \varepsilon\,\frac{dE}{dW_{ij}} \quad \{\text{gradient descent}\}$$
$$\frac{dE}{dW_{ij}} = -(d_i - v_i)\, g'\Big(\sum_j W_{ij}\, x_j\Big)\, x_j \quad \{\text{delta rule}\}$$
Backpropagation: Inner layer (chain rule)

Learning rule for input-hidden weights w:
$$E(\mathbf{W}, \mathbf{w}) = \frac{1}{2}\sum_i (d_i - v_i)^2 \quad \text{where } v_i^m = g\Big(\sum_j W_{ij}\, x_j^m\Big), \quad x_j^m = g\Big(\sum_k w_{jk}\, u_k^m\Big)$$
$$w_{jk} \rightarrow w_{jk} - \varepsilon\,\frac{dE}{dw_{jk}}, \qquad \text{but } \frac{dE}{dw_{jk}} = \frac{dE}{dx_j}\cdot\frac{dx_j}{dw_{jk}} \quad \{\text{chain rule}\}$$
$$\frac{dE}{dw_{jk}} = -\sum_m \Big[\sum_i (d_i^m - v_i^m)\, g'\Big(\sum_j W_{ij}\, x_j^m\Big) W_{ij}\Big] \cdot \Big[g'\Big(\sum_k w_{jk}\, u_k^m\Big)\, u_k^m\Big]$$
Example: Learning to Drive
Example Network
(Pomerleau, 1992)
Example Network
Training Input u = (u1 u2 … u960) = image pixels
Get steering angle from a human driver
Get current camera image
Training Output: d = (d1 d2 … d30)
Training the network using backprop

Start with random weights W, w
Given input u, the network produces output v:
$$v_i = g\Big(\sum_j W_{ij}\, g\Big(\sum_k w_{jk}\, u_k\Big)\Big)$$
Use backprop to learn W and w that minimize the total error over all output units (labeled i):
$$E(\mathbf{W}, \mathbf{w}) = \frac{1}{2}\sum_i (d_i - v_i)^2$$
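The lecture shows no code; here is a minimal numpy sketch of training a tiny two-layer sigmoid network with the forward pass and backprop rules above. The layer sizes, random training data, and learning rate are all hypothetical choices for illustration.

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))        # sigmoid

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)                   # derivative of the sigmoid

rng = np.random.default_rng(0)

K, J, I = 4, 3, 2                          # hypothetical input / hidden / output sizes
U = rng.random((20, K))                    # 20 training inputs u^m
D = rng.random((20, I))                    # desired outputs d^m (illustrative targets)

w = rng.normal(scale=0.1, size=(J, K))     # input -> hidden weights w_jk
W = rng.normal(scale=0.1, size=(I, J))     # hidden -> output weights W_ij
eps = 0.5

for epoch in range(1000):
    # Forward pass: x_j = g(sum_k w_jk u_k), v_i = g(sum_j W_ij x_j)
    A = U @ w.T
    X = g(A)
    B = X @ W.T
    V = g(B)

    # Backward pass for E = 1/2 * sum_i (d_i - v_i)^2
    delta_out = -(D - V) * g_prime(B)              # delta rule at the uppermost layer
    dE_dW = delta_out.T @ X
    delta_hidden = (delta_out @ W) * g_prime(A)    # chain rule through the hidden layer
    dE_dw = delta_hidden.T @ U

    W -= eps * dE_dW / len(U)                      # gradient descent steps
    w -= eps * dE_dw / len(U)

print("final error:", 0.5 * np.sum((D - g(g(U @ w.T) @ W.T)) ** 2))
```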
Learning to Drive using Backprop
One of the learned “road features” wi
ALVINN (Autonomous Land Vehicle in a Neural Network)
(Pomerleau, 1992)
Trained using human driver + camera images
After learning:
• Drove up to 70 mph on highway
• Up to 22 miles without intervention
• Drove cross-country largely autonomously
Another Example: Face Detection
(Rowley, Baluja & Kanade, 1998)
Output between -1 (no face) and +1 (face present)
Next Time: More Pattern Recognition & Learning
Things to do:
• Work on Project 2
• Vote on Project 1 Artifacts
• Read Chap. 4
Have a good weekend!