Jeff Howbert Introduction to Machine Learning Winter 2012 1 Classification Neural Networks 1.
-
Upload
gregory-randall -
Category
Documents
-
view
251 -
download
4
Transcript of Jeff Howbert Introduction to Machine Learning Winter 2012 1 Classification Neural Networks 1.
Jeff Howbert Introduction to Machine Learning Winter 2012 1
Classification
Neural Networks 1
Jeff Howbert Introduction to Machine Learning Winter 2012 2
Neural networks
Topics– Perceptrons
structure training expressiveness
– Multilayer networks possible structures
– activation functions training with gradient descent and backpropagation expressiveness
Jeff Howbert Introduction to Machine Learning Winter 2012 3
Connectionist models
Consider humans:– Neuron switching time ~ 0.001 second– Number of neurons ~ 1010
– Connections per neuron ~ 104-5
– Scene recognition time ~ 0.1 second– 100 inference steps doesn’t seem like enough
Massively parallel computation
Jeff Howbert Introduction to Machine Learning Winter 2012 4
Neural networks
Properties:– Many neuron-like threshold switching units– Many weighted interconnections among units– Highly parallel, distributed process– Emphasis on tuning weights automatically
Jeff Howbert Introduction to Machine Learning Winter 2012 5
Neural network application
ALVINN: An Autonomous Land Vehicle In a Neural Network
(Carnegie Mellon University Robotics Institute, 1989-1997)
ALVINN is a perception system which learns to control the NAVLAB vehicles by watching a person drive. ALVINN's architecture consists of a single hidden layer back-propagation network. The input layer of the network is a 30x32 unit two dimensional "retina" which receives input from the vehicles video camera. Each input unit is fully connected to a layer of five hidden units which are in turn fully connected to a layer of 30 output units. The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road.
Jeff Howbert Introduction to Machine Learning Winter 2012 6
Neural network application
ALVINN drives 70 mph on highways!
Jeff Howbert Introduction to Machine Learning Winter 2012 7
X1
X2
X3
Y
t
Output node
Input nodes
Perceptron structure
Model is an assembly of nodes connected by weighted links
Output node sums up its input values according to the weights of their links
Output node sum then compared against some threshold t
)( txwsignyj
jj
or)( txwIyj
jj
Jeff Howbert Introduction to Machine Learning Winter 2012 8
X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0
X1
X2
X3
Y
Black box
Output
Input
Example: modeling a Boolean function
Output Y is 1 if at least two of the three inputs are equal to 1.
Jeff Howbert Introduction to Machine Learning Winter 2012 9
X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0
X1
X2
X3
Y
0.3
0.3
0.3 t = 0.4
Output node
Input nodes
Perceptron model
otherwise0
trueis if1)( where
)4.03.03.03.0( 321
zzI
xxxIy
Jeff Howbert Introduction to Machine Learning Winter 2012 10
Example: decision surface for Boolean function on preceding slides
Perceptron decision boundary
Perceptron decision boundaries are linear(hyperplanes in higher dimensions)
Jeff Howbert Introduction to Machine Learning Winter 2012 11
Can model any function where positive and negative examples are linearly separable
– Examples: Boolean AND, OR, NAND, NOR
Cannot (fully) model functions which are not linearly separable.
– Example: Boolean XOR
Expressiveness of perceptrons
Jeff Howbert Introduction to Machine Learning Winter 2012 12
1. Initialize weights with random values.
2. Do
a. Apply perceptron to each training example.
b. If example is misclassified, modify weights.
3. Until all examples are correctly classified, or process has converged.
Perceptron training process
Jeff Howbert Introduction to Machine Learning Winter 2012 13
Two rules for modifying weights during training:– Perceptron training rule
train on thresholded outputs driven by binary differences between correct and
predicted outputs modify weights with incremental updates
– Delta rule train on unthresholded outputs driven by continuous differences between correct
and predicted outputs modify weights via gradient descent
Perceptron training process
Jeff Howbert Introduction to Machine Learning Winter 2012 14
1. Initialize weights with random values.
2. Do
a. Apply perceptron to each training sample i.
b. If sample i is misclassified, modify all weights j.
3. Until all samples are correctly classified.
Perceptron training rule
constant) small (a rate learning is
1)or (0output perceptron ed thresholdis ˆ
1)or (0 samplefor output (correct) target is
where
)ˆ(
η
y
iy
xyyww
i
i
ijiijj
Jeff Howbert Introduction to Machine Learning Winter 2012 15
a. If sample i is misclassified, modify all weights j.
Examples:
Perceptron training rule
amount largeby increased negative large, ;1ˆ
amount largeby decreased negative large, ;1ˆ
amount smallby increased positive small, ;1ˆ
update no ˆ
jijii
jijii
jijii
ii
wxyy
wxyy
wxyy
yy
constant) small (a rate learning is
1)or (0output perceptron ed thresholdis ˆ
1)or (0 samplefor output (correct) target is
where
)ˆ(
η
y
iy
xyyww
i
i
ijiijj
Jeff Howbert Introduction to Machine Learning Winter 2012 16
1
0
1
1
-0.1
0.5
0.6 t = 0.4
0.5
Example of processing one sample
Perceptron training rule
1
0
1
0
-0.2
0.5
0.5 t = 0.4
0.3
1.0)ˆ(
0.0)ˆ(
1.0)ˆ(
1ˆ
1.0
3
2
1
iii
iii
iii
ii
xyy
xyy
xyy
yy
Jeff Howbert Introduction to Machine Learning Winter 2012 17
Based on squared error function for weight vector:
Note that error is difference between correct output and unthresholded sum of inputs, a continuous quantity (rather than binary difference between correct output and thresholded output).
Weights are modified by descending gradient of error function.
Delta training rule
i
iii
ii yyyE 22 )(2
1)ˆ(
2
1)( xww
Jeff Howbert Introduction to Machine Learning Winter 2012 18
Squared error function for weight vector w
Jeff Howbert Introduction to Machine Learning Winter 2012 19
Gradient of error function
jj
d
w
Ew
E
w
E
w
E
w
EE
: weightindividualfor rule Training
)(
:for rule Training
,,,)(
:Gradient
10
ww
w
w
Jeff Howbert Introduction to Machine Learning Winter 2012 20
iijii
j
iiji
ii
iiji
ii
iii
j
iii
jj
xyyw
E
yw
yy
yyw
yy
yyw
yyww
E
))(ˆ(
)()ˆ(
)ˆ()ˆ(22
1
)ˆ(2
1
)ˆ(2
1
2
2
xw
Gradient of squared error function
Jeff Howbert Introduction to Machine Learning Winter 2012 21
constant) small (a rate learning is
value)s(continuououtput perceptron dedunthreshol is ˆ
1)or (0 samplefor output (correct) target is
where
)ˆ(
η
y
iy
xyyww
i
i
ijiijj
1. Initialize weights with random values.
2. Do
a. Apply perceptron to each training sample i.
b. If sample i is misclassified, modify all weights j.
3. Until all samples are correctly classified, or process converges.
Delta training rule
Jeff Howbert Introduction to Machine Learning Winter 2012 22
Gradient descent: batch vs. incremental
Incremental mode (illustrated on preceding slides)
– Compute error and weight updates for a single sample.
– Apply updates to weights before processing next sample.
Batch mode– Compute errors and weight updates for a block of
samples (maybe all samples).
– Apply all updates simultaneously to weights.
Jeff Howbert Introduction to Machine Learning Winter 2012 23
Perceptron training rule vs. delta rule
Perceptron training rule guaranteed to correctly classify all training samples if:
– Samples are linearly separable.
– Learning rate is sufficiently small.
Delta rule uses gradient descent. Guaranteed to converge to hypothesis with minimum squared error if:
– Learning rate is sufficiently small.
Even when:
– Training data contains noise.
– Training data not linearly separable.
Jeff Howbert Introduction to Machine Learning Winter 2012 24
Equivalence of perceptron and linear models
X1
X2
X3
Y
t
Output node
Input nodes
)( txwIyj
jj