Neural Networks — transcript of aarti/Class/10601/slides/Nnets_11_3_2011.pdf (2011-11-15)
Neural Networks
Aarti Singh
Machine Learning 10-601 Nov 3, 2011
Slides Courtesy: Tom Mitchell
Logistic Regression
Assumes the following functional form for P(Y|X):

$P(Y=1 \mid X) = \dfrac{1}{1 + \exp\left(-(w_0 + \sum_i w_i X_i)\right)}$

Logistic function (or Sigmoid): $\mathrm{logit}(z) = \dfrac{1}{1 + e^{-z}}$

The logistic function applied to a linear function of the data.

[Figure: logit(z) plotted against z — an S-shaped curve rising from 0 to 1]
Features can be discrete or continuous!
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):

$P(Y=1 \mid X) = \dfrac{1}{1 + \exp\left(-(w_0 + \sum_i w_i X_i)\right)}$

Decision boundary: predict Y = 1 iff $P(Y=1 \mid X) \geq P(Y=0 \mid X)$, i.e. iff $w_0 + \sum_i w_i X_i \geq 0$ (a linear decision boundary).
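To make the linearity concrete, here is a minimal Python sketch (not from the slides; the weights and helper names are illustrative) showing that thresholding $P(Y=1|X)$ at 0.5 is the same as checking the sign of the linear score:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative bias w0, weights w, and test point x.
w0, w = -1.0, np.array([2.0, -3.0])
x = np.array([1.5, 0.5])

score = w0 + w @ x   # linear function of the data
p = sigmoid(score)   # P(Y=1 | X=x)

# p > 0.5 exactly when the linear score is positive, so the decision
# boundary w0 + w.x = 0 is a hyperplane.
assert (p > 0.5) == (score > 0)
print(p, score)
```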
Training Logistic Regression
How do we learn the parameters w0, w1, …, wd?

Training data: $\{(x^j, y^j)\}_{j=1}^{n}$

Maximum (Conditional) Likelihood Estimates:

$\hat{w}_{MCLE} = \arg\max_w \prod_{j=1}^{n} P(y^j \mid x^j, w)$

Discriminative philosophy – don't waste effort learning P(X); focus on P(Y|X) – that's all that matters for classification!
Optimizing a convex function
• Max Conditional log-likelihood = Min Negative Conditional log-likelihood
• The negative conditional log-likelihood is a convex function, so gradient descent finds the global optimum.

Gradient: $\nabla_w E(w) = \left[\frac{\partial E}{\partial w_0}, \ldots, \frac{\partial E}{\partial w_d}\right]$

Update rule, with learning rate $\eta > 0$ (see the sketch below): $w^{(t+1)} = w^{(t)} - \eta\, \nabla_w E(w^{(t)})$
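A minimal sketch of this training loop, assuming the standard gradient of the conditional log-likelihood for logistic regression; `train_logistic` and the toy data are illustrative, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, n_iters=1000):
    """Minimize the negative conditional log-likelihood by gradient descent.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    Returns bias w0 and weight vector w.
    """
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)   # P(Y=1 | x) for each example
        err = y - p               # residuals drive the update
        # Ascend the log-likelihood = descend its negative.
        w0 += eta * err.sum() / n
        w  += eta * (X.T @ err) / n
    return w0, w

# Toy data: class 1 when x1 + x2 > 1 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 1).astype(float)
print(train_logistic(X, y))
```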
Logistic function as a Graph
Sigmoid Unit
[Figure: sigmoid unit — inputs $x_1, \ldots, x_d$ with weights $w_1, \ldots, w_d$ and bias $w_0$; net input $net = w_0 + \sum_{i=1}^{d} w_i x_i$; output $o = \frac{1}{1+e^{-net}}$]
Neural Networks to learn f: X → Y
• f can be a non-linear function
• X (vector of) continuous and/or discrete variables
• Y (vector of) continuous and/or discrete variables
• Neural networks – represent f by a network of logistic/sigmoid units:
[Figure: network of sigmoid units — Input layer X, Hidden layer H, Output layer Y; each node is a sigmoid unit]
Neural network trained to distinguish vowel sounds using 2 formants (features).

Two layers of logistic units produce a highly non-linear decision surface.

[Figure: network with input layer, hidden layer, and output layer, and the resulting decision regions over the two formants]
Neural Network trained to drive a car!
Weights of each pixel for one hidden unit
Weights to output units from the hidden unit
Forward Propagation for prediction
Prediction – given a neural network (hidden units and weights), use it to predict the label of a test point.

Forward Propagation – start from the input layer; for each subsequent layer, compute the output of each sigmoid unit.

Sigmoid unit: $o = \sigma\left(w_0 + \sum_i w_i x_i\right)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$.

1-hidden-layer, 1-output NN: $o = \sigma\left(w_0 + \sum_h w_h\, o_h\right)$, where each hidden output is $o_h = \sigma\left(w_{0h} + \sum_i w_{ih}\, x_i\right)$.
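A sketch of forward propagation for this 1-hidden-layer, 1-output network; the shapes and names (`forward`, `W_in`, `w_out`) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_in, b_in, w_out, b_out):
    """One hidden layer of sigmoid units feeding one sigmoid output.

    x: (d,) input; W_in: (d, m) input-to-hidden weights; b_in: (m,) biases;
    w_out: (m,) hidden-to-output weights; b_out: scalar bias.
    """
    o_h = sigmoid(b_in + x @ W_in)     # hidden-unit outputs o_h
    o = sigmoid(b_out + o_h @ w_out)   # network output
    return o_h, o

# Illustrative shapes: 3 inputs, 2 hidden units, random weights.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
o_h, o = forward(x, rng.normal(size=(3, 2)), rng.normal(size=2),
                 rng.normal(size=2), 0.0)
print(o_h, o)
```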
Training Neural Networks

The sigmoid is a differentiable function of its inputs and weights. [Figure: sigmoid unit, as before]
M(C)LE Training for Neural Networks

• Consider the regression problem f: X → Y, for scalar Y:
$y = f(x) + \epsilon$, with iid noise $\epsilon \sim N(0, \sigma_\epsilon^2)$ and f deterministic.

• Let's maximize the conditional data likelihood:

$W \leftarrow \arg\max_W \ln \prod_l P(y^l \mid x^l, W) = \arg\min_W \sum_l \left(y^l - \hat{f}(x^l; W)\right)^2$

where $\hat{f}$ is the learned neural network.

Train weights of all units to minimize the sum of squared errors of the predicted network outputs.
MAP Training for Neural Networks

• Consider the regression problem f: X → Y, for scalar Y:
$y = f(x) + \epsilon$, with noise $\epsilon \sim N(0, \sigma_\epsilon^2)$ and f deterministic.

• Put a Gaussian prior on the weights, $P(W) = N(0, \sigma^2 I)$, so the log-prior contributes a penalty $c \sum_i w_i^2$ to the minimization:

$W \leftarrow \arg\max_W \ln \left[ P(W) \prod_l P(y^l \mid x^l, W) \right]$

Train weights of all units to minimize the sum of squared errors of the predicted network outputs plus the weight magnitudes.
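In code, the MAP objective just adds a quadratic penalty to the squared error; a minimal sketch, with the penalty strength `c` standing in for the constant above (all names illustrative):

```python
import numpy as np

def map_loss(y, y_hat, weights, c=0.01):
    """MAP training objective: sum of squared errors plus a Gaussian-prior
    penalty c * sum of squared weight magnitudes (weight decay)."""
    sse = np.sum((y - y_hat) ** 2)
    penalty = c * sum(np.sum(W ** 2) for W in weights)
    # The penalty adds 2*c*W to each weight's gradient, shrinking
    # weights toward zero unless the data pushes back.
    return sse + penalty
```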
E – Mean Squared Error: $E[w] = \frac{1}{2} \sum_l \left(y^l - \hat{f}(x^l; w)\right)^2$

For neural networks, E[w] is no longer convex in w, so gradient descent may converge to a local minimum.
Error Gradient for a Sigmoid Unit

[Figure: sigmoid unit — net input $net = w_0 + \sum_i w_i x_i$, output $o = \sigma(net)$]

With squared error $E = \frac{1}{2}\sum_l (y^l - o^l)^2$, the chain rule gives

$\frac{\partial E}{\partial w_i} = \sum_l (y^l - o^l)\left(-\frac{\partial o^l}{\partial w_i}\right) = -\sum_l (y^l - o^l)\,\frac{\partial o^l}{\partial net^l}\,\frac{\partial net^l}{\partial w_i}$

Using $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)\,(1-\sigma(z))$ and $\frac{\partial net^l}{\partial w_i} = x_i^l$:

$\frac{\partial E}{\partial w_i} = -\sum_l (y^l - o^l)\, o^l (1 - o^l)\, x_i^l$
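A sketch of this gradient for a single sigmoid unit over a dataset, using $\sigma'(z) = \sigma(z)(1-\sigma(z))$; the names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit_gradient(X, y, w0, w):
    """Gradient of E = 1/2 * sum_l (y_l - o_l)^2 for one sigmoid unit.

    X: (n, d) inputs, y: (n,) targets. Returns (dE/dw0, dE/dw).
    """
    o = sigmoid(w0 + X @ w)         # unit output, one per example
    delta = (y - o) * o * (1 - o)   # (y - o) * sigma'(net)
    # dE/dw_i = -sum_l delta_l * x_il ; dE/dw0 = -sum_l delta_l
    return -delta.sum(), -(X.T @ delta)
```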
Backpropagation (MLE)

For each output unit k: $\delta_k = o_k (1 - o_k)(y_k - o_k)$

For each hidden unit h: $\delta_h = o_h (1 - o_h) \sum_k w_{hk}\, \delta_k$

Update every weight: $w_{ij} \leftarrow w_{ij} + \eta\, \delta_j\, o_i$

using forward propagation to obtain the unit outputs, where
• $y_k$ = target output (label)
• $o_{k/h}$ = unit output (obtained by forward propagation)
• $w_{ij}$ = weight from i to j
• Note: if i is an input variable, $o_i = x_i$
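These update rules, written out for a 1-hidden-layer, 1-output network (a minimal illustration with biases omitted for brevity, not the course's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W_in, w_out, eta=0.5):
    """One stochastic update for a 1-hidden-layer, 1-output sigmoid net.

    W_in: (d, m) input-to-hidden weights; w_out: (m,) hidden-to-output.
    Arrays are updated in place.
    """
    # Forward propagation.
    o_h = sigmoid(x @ W_in)        # hidden-unit outputs
    o = sigmoid(o_h @ w_out)       # network output
    # Output unit: delta = o(1-o)(y-o).
    delta_out = o * (1 - o) * (y - o)
    # Hidden unit h: delta = o_h(1-o_h) * w_h * delta_out.
    delta_h = o_h * (1 - o_h) * w_out * delta_out
    # Update rule: w_ij <- w_ij + eta * delta_j * o_i.
    w_out += eta * delta_out * o_h
    W_in += eta * np.outer(x, delta_h)
    return o
```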
Using all training data D

The batch gradient sums the per-example gradients over the whole training set,

$\frac{\partial E_D}{\partial w_{ij}} = \sum_{l \in D} \frac{\partial E_l}{\partial w_{ij}}$

and each gradient-descent step uses this summed gradient.
Objective/Error no longer convex in weights
Dealing with Overfitting

Our learning algorithm involves a parameter n = number of gradient descent iterations.

How do we choose n to optimize future error, e.g. the n that minimizes the error rate of the neural net over future data? (Note: a similar issue arises for logistic regression, decision trees, …)
Dealing with Overfitting

Our learning algorithm involves a parameter n = number of gradient descent iterations. How do we choose n to optimize future error?
• Separate the available data into training and validation sets
• Use the training set to perform gradient descent
• n ← the number of iterations that optimizes validation-set error (as in the sketch below)
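A sketch of this early-stopping recipe as a higher-order function; `step` and `valid_error` are placeholders for the gradient-descent step and validation-error computation defined by the model at hand:

```python
def choose_n_iterations(step, valid_error, max_iters=1000):
    """step() performs one gradient-descent iteration on the training set;
    valid_error() returns the current error on the validation set.
    Returns the iteration count n with the lowest validation error."""
    best_n, best_err = 0, float("inf")
    for n in range(1, max_iters + 1):
        step()
        err = valid_error()
        if err < best_err:           # keep the n that generalizes best
            best_n, best_err = n, err
    return best_n
```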
K-fold Cross-validation

Idea: train multiple times, leaving out a disjoint subset of data each time for testing. Average the test-set accuracies (see the runnable sketch below).
________________________________________________
Partition data into K disjoint subsets
For k = 1 to K:
    testData = kth subset
    h ← classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test-set accuracies

* might withhold some of this data to choose the number of gradient descent steps
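The same procedure as a runnable Python sketch, assuming a generic `train` callable that returns a classifier with a `predict` method (names are illustrative):

```python
import numpy as np

def k_fold_accuracy(X, y, train, K=10, seed=0):
    """train(X, y) -> classifier h with h.predict(X) -> predicted labels.
    Returns the mean of the K held-out-fold accuracies."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)        # K disjoint subsets
    accs = []
    for k in range(K):
        test = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train(X[tr], y[tr])           # train on all data except fold k
        accs.append(np.mean(h.predict(X[test]) == y[test]))
    return np.mean(accs)
```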
Leave-one-out Cross-validation

This is just K-fold cross-validation leaving out one example each iteration, i.e. K = number of examples.
________________________________________________
Partition data into K disjoint subsets, each containing one example
For k = 1 to K:
    testData = kth subset
    h ← classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test-set accuracies

* might withhold some of this data to choose the number of gradient descent steps
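Since leave-one-out is the K = N special case, it can reuse the K-fold sketch above:

```python
# Leave-one-out: every fold holds out exactly one example.
loo_accuracy = lambda X, y, train: k_fold_accuracy(X, y, train, K=len(X))
```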
Dealing with Overfitting

• Cross-validation
• Regularization – small weights imply the NN is linear (low VC dimension)
• Control the number of hidden units – low complexity

[Figure: sigmoid unit — logistic output of $\sum_i w_i x_i$]
![Page 26: Neural’Networks’aarti/Class/10601/slides/Nnets_11_3_2011.pdf · 2011-11-15 · Neural’Networks’ Aarti Singh Machine Learning 10-601 Nov 3, 2011 Slides Courtesy: Tom Mitchell](https://reader033.fdocuments.in/reader033/viewer/2022050114/5f4bc4e2c73ffb63852475c1/html5/thumbnails/26.jpg)
![Page 27: Neural’Networks’aarti/Class/10601/slides/Nnets_11_3_2011.pdf · 2011-11-15 · Neural’Networks’ Aarti Singh Machine Learning 10-601 Nov 3, 2011 Slides Courtesy: Tom Mitchell](https://reader033.fdocuments.in/reader033/viewer/2022050114/5f4bc4e2c73ffb63852475c1/html5/thumbnails/27.jpg)
![Page 28: Neural’Networks’aarti/Class/10601/slides/Nnets_11_3_2011.pdf · 2011-11-15 · Neural’Networks’ Aarti Singh Machine Learning 10-601 Nov 3, 2011 Slides Courtesy: Tom Mitchell](https://reader033.fdocuments.in/reader033/viewer/2022050114/5f4bc4e2c73ffb63852475c1/html5/thumbnails/28.jpg)
![Page 29: Neural’Networks’aarti/Class/10601/slides/Nnets_11_3_2011.pdf · 2011-11-15 · Neural’Networks’ Aarti Singh Machine Learning 10-601 Nov 3, 2011 Slides Courtesy: Tom Mitchell](https://reader033.fdocuments.in/reader033/viewer/2022050114/5f4bc4e2c73ffb63852475c1/html5/thumbnails/29.jpg)
[Figure: learned weights — bias w0 and connections to the output units: left, straight, right, up]
Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions, e.g., Can(Canary,Fly)
Humans act as though they have a hierarchical memory organization.

1. Victims of semantic dementia progressively lose knowledge of objects, but they lose specific details first and general properties later, suggesting a hierarchical memory organization.

[Figure: taxonomy tree — Thing → {Living, NonLiving}; Living → {Animal, Plant}; Animal → {Bird, Fish}; Bird → Canary]

2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.

* Some debate remains on this.

Question: What learning mechanism could produce this emergent hierarchy?
Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]
Training Networks on Time Series

• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns
Training Networks on Time Series

• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns
• Idea: use a hidden layer in the network to capture the state history (a recurrent network)
Training Networks on Time Series

How can we train a recurrent net?
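One standard answer, going slightly beyond what the slide states explicitly: unroll the recurrence for T steps and apply ordinary forward/backward propagation (backpropagation through time). A sketch of just the unrolled forward pass, with all names illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(xs, W_x, W_h, w_out):
    """Unrolled forward pass of a simple recurrent sigmoid network.

    xs: sequence of (d,) input vectors; the hidden state h carries history.
    W_x: (d, m) input-to-hidden, W_h: (m, m) hidden-to-hidden,
    w_out: (m,) hidden-to-output weights.
    """
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in xs:
        h = sigmoid(x @ W_x + h @ W_h)    # state summarizes the past
        outputs.append(sigmoid(h @ w_out))
    return outputs
```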
Artificial Neural Networks: Summary

• Actively used to model distributed computation in the brain
• Highly non-linear regression/classification
• Vector-valued inputs and outputs
• Potentially millions of parameters to estimate – overfitting
• Hidden layers learn intermediate representations – how many to use?
• Prediction – forward propagation
• Gradient descent (back-propagation), local minima problems
• Mostly obsolete – kernel tricks are more popular, but coming back in a new form as deep belief networks (probabilistic interpretation)