Perceptron Linear Classifiers
Perceptron
Dr. Ashutosh Gupta
Outline
- Single Layer Discrete Perceptron Networks
  - Perceptron learning rule
  - Perceptron training
  - Learning algorithm
  - Properties of Perceptron
- What a perceptron does?
  - Regression
  - Classification
- Hyperplane based Classification
  - Decision Boundary
  - Linear classification via hyperplane
- Perceptron Algorithm
  - How perceptron update works?
  - Perceptron Convergence Theorem
  - Example: a simple problem
Outline (contd.)
- Linear Machines and Minimum Distance Classification
- Single layer continuous perceptron networks for linearly separable classification
  - Delta rule
  - Perceptron vs. Delta rule
- XOR problem
- Generalization and Early Stopping
  - Overfitting
  - Training time
- Limitations of perceptrons
  - Linear inseparability
- Multi Layer perceptron
- Example: Perceptrons as Constraint Satisfaction Networks
Discrete Perceptron: Linear Threshold Unit (LTU)

[Figure: an LTU with inputs x1, ..., xn, weights w1, ..., wn, a threshold weight w0 on the constant input x0 = 1, and output o.]

- takes a vector of real-valued inputs (x1, ..., xn) weighted with (w1, ..., wn)
- calculates the linear combination of these inputs: Σ_{i=0..n} wi xi
- w0 denotes a threshold value; x0 is always 1
- outputs 1 if the result is greater than 0, otherwise −1:

  o(x) = 1 if Σ_{i=0..n} wi xi > 0, −1 otherwise
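A minimal sketch of this unit in Python (the augmented-vector convention with x0 = 1 is the one described above; the function name is illustrative):

```python
import numpy as np

def ltu(w, x):
    """Discrete perceptron output: +1 if w.x > 0, else -1.
    w and x are augmented vectors: x[0] == 1 and w[0] is the threshold weight w0."""
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([-0.5, 1.0, 1.0])   # [w0, w1, w2]
x = np.array([1.0, 0.0, 1.0])    # [x0 = 1, x1, x2]
print(ltu(w, x))                 # 1, since -0.5 + 0 + 1 = 0.5 > 0
```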
Representational Power
- many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR
- a perceptron represents a hyperplane decision surface in the n-dimensional space of instances
- some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable
Perceptron Learning Rule
- problem: determine a weight vector w that causes the perceptron to produce the correct output for each training example
- perceptron training rule (weight adjustment):

  Δwi = c [di − sgn(wᵗx)] xi = c [di − oi] xi
  wi = wi + Δwi

- di is the desired response, oi is the perceptron output, c is a small constant (e.g. 0.1) called the learning rate
Perceptron Learning Rule: algorithm
1. Initialize w to random weights
2. Repeat, until each training example is classified correctly:
   (a) apply the perceptron training rule to each training example
- If the output is correct (di = oi), the weights wi are not changed
- If the output is incorrect (di ≠ oi), the weights wi are changed such that the output of the perceptron for the new weights is closer to di
The algorithm converges to the correct classification if the training data is linearly separable and c is sufficiently small.
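A runnable sketch of this loop, assuming bipolar targets in {−1, +1} and inputs augmented with x0 = 1 (function name and initialization range are illustrative):

```python
import numpy as np

def train_perceptron(X, d, c=0.1, max_epochs=100, seed=0):
    """Discrete perceptron training: w += c * (d_i - o_i) * x_i on each example.
    X: (N, n+1) augmented inputs (first column all ones); d: targets in {-1, +1}."""
    w = np.random.default_rng(seed).uniform(-1, 1, X.shape[1])  # 1. random weights
    for _ in range(max_epochs):                                 # 2. repeat ...
        errors = 0
        for x, di in zip(X, d):
            oi = 1 if w @ x > 0 else -1
            if oi != di:                     # weights change only when d_i != o_i
                w += c * (di - oi) * x
                errors += 1
        if errors == 0:                      # ... until all examples are correct
            break
    return w
```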
Supervised Learning
- Training and test data sets
- Training set: input & target
Perceptron Training

  Output = 1 if Σ_{i=0..n} wi xi > t, 0 otherwise

A linear threshold is used. W = weight value, t = threshold value.

Simple network

[Figure: a two-input network with inputs X and Y, weights W = 1.5 and W = 1, and threshold t = 0.0, computing the output above.]
Training Perceptrons

[Figure: a two-input perceptron with inputs x and y and a constant input 1, each connected through an unknown weight (W = ?), threshold t = 0.0.]

For AND:

  A B | Output
  0 0 | 0
  0 1 | 0
  1 0 | 0
  1 1 | 1

What are the weight values? Initialize with random weight values.
Learning algorithm
- Epoch: one presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
- Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = −1.
Learning algorithm
- Target Value, T: when training a network we not only present it with the input but also with the value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the training value will be 1.
- Output, O: the output value from the neuron.
- Ij: the inputs being presented to the neuron.
- Wj: the weight from input neuron Ij to the output neuron.
- LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation, typically 0.1.
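Putting these definitions together, a small sketch that learns AND (the 0/1 threshold output and the update Wj += LR × Error × Ij follow the definitions above; encoding the threshold as a weight on a constant input is an assumption):

```python
import numpy as np

def train_and(lr=0.1, t=0.0, max_epochs=100):
    """Learn AND with a 0/1 threshold unit: W_j += LR * Error * I_j, Error = T - O."""
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # constant 1, then A, B
    T = np.array([0, 0, 0, 1])                                  # AND targets
    W = np.random.default_rng(1).uniform(-1, 1, 3)
    for epoch in range(max_epochs):
        total_error = 0
        for I, target in zip(X, T):          # one epoch = all four input sets
            O = 1 if W @ I > t else 0
            error = target - O               # e.g. target 0, output 1 -> Error = -1
            W = W + lr * error * I
            total_error += abs(error)
        if total_error == 0:                 # every pattern classified correctly
            print(f"converged after {epoch + 1} epochs: W = {W}")
            return W
    return W

train_and()
```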
Properties of Perceptrons
- Separability: some parameters get the training set perfectly correct
- Convergence: if the training data is separable, the perceptron will eventually converge (binary case)
- Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability
What a Perceptron Does?
- Regression: y = wx + w0
- Classification: y = 1(wx + w0 > 0)

[Figure: the regression fit and the classification threshold.]
Hyperplane
- Separates a D-dimensional space into two half-spaces
- Defined by an outward pointing normal vector w; w is orthogonal to any vector lying on the hyperplane
- Assumption: the hyperplane passes through the origin. If not, have a bias term b; we will then need both w and b to define it
- b > 0 means moving the hyperplane parallel along w (b < 0 means in the opposite direction)
Decision boundaries
- In simple cases, divide feature space by drawing a hyperplane across it: known as a decision boundary.
- Discriminant function: returns different values on opposite sides (a straight line in two dimensions).
- Problems which can be classified this way are linearly separable.

[Figure: two classes separated by a straight-line discriminant function.]
Linear Classification via Hyperplanes
- Linear classifiers represent the decision boundary by a hyperplane w.

[Figure: decision regions R1 and R2 on either side of the decision boundary (surface).]

- For binary classification, w is assumed to point towards the positive class.
- Classification rule:

  wᵗx + b > 0 ⇒ y = +1;  wᵗx + b < 0 ⇒ y = −1

- Question: what about the points x for which wᵗx + b = 0?
- Goal: to learn the hyperplane (w, b) using the training data, i.e. to find the hyperplane equation wᵗx + b = 0 (the decision boundary).
Concept of Margins
- The geometric margin γn of an example xn is its distance from the hyperplane.
- The geometric margin may be positive (if yn = +1) or negative (if yn = −1).
- The margin of a set {x1, . . . , xN} is the minimum absolute geometric margin.
- The functional margin of a training example is yn(wᵗxn + b): positive if the prediction is correct, negative if it is incorrect.
- The absolute value of the functional margin is the confidence in the predicted label (or the "misconfidence" if the prediction is wrong); large margin ⇒ high confidence.
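A short sketch of both quantities; the geometric-margin formula (wᵗx + b)/||w|| is a standard fact assumed here rather than derived on the slide:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """y * (w.x + b): positive iff the prediction on (x, y) is correct."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -1.0
x = np.array([2.0, 2.0])
print(functional_margin(w, b, x, +1))   # 3.0: correct and confident
print(geometric_margin(w, b, x))        # ~2.12: distance from the boundary
```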
The Perceptron Algorithm
- One of the earliest algorithms for linear classification (Rosenblatt, 1958)
- Based on finding a separating hyperplane of the data
- Guaranteed to find a separating hyperplane if the data is linearly separable
- If the data is not linearly separable: make it linearly separable, or use a combination of multiple perceptrons (neural networks)
The Perceptron Algorithm
- Cycles through the training data, processing training examples one at a time (an online algorithm)
- Starts with some initialization for (w, b) (e.g., w = [0, . . . , 0]; b = 0)
- An iterative, mistake-driven learning algorithm for updating (w, b):
  - don't update if w correctly predicts the label of the current training example
  - update w when it mispredicts the label of the current training example, e.g., the true label is +1 but sign(wᵗx + b) = −1 (or vice-versa)
- Repeat until convergence
- Batch vs. online learning algorithms:
  - batch algorithms operate on the entire training data
  - online algorithms can process one example at a time; usually more efficient (computationally, memory-footprint-wise) than batch
  - often batch problems can be solved using online learning!
The Perceptron Algorithm: Formally
- Given: sequence of N training examples {(x1, y1), . . . , (xN, yN)}
- Initialize: w = [0, . . . , 0], b = 0
- Repeat until convergence:
  - for n = 1, . . . , N: if sign(wᵗxn + b) ≠ yn (i.e., a mistake is made):
    w = w + yn xn
    b = b + yn
- Stopping condition: stop when either
  - all training examples are classified correctly (may overfit, so less common in practice)
  - a fixed number of iterations is completed, or some convergence criteria are met
  - one pass over the data is completed (each example seen once), e.g., examples arriving in a streaming fashion that can't be stored in memory (more passes just not possible)
- Note: sign(wᵗxn + b) ≠ yn is equivalent to yn(wᵗxn + b) < 0
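The algorithm above as a direct sketch (the `<= 0` test treats a point exactly on the hyperplane as a mistake, matching sign(0) ≠ yn):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Mistake-driven perceptron. X: (N, d) real inputs; y: labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * (xn @ w + b) <= 0:   # mistake: sign(w.x + b) != y_n
                w += yn * xn             # w = w + y_n x_n
                b += yn                  # b = b + y_n
                mistakes += 1
        if mistakes == 0:                # all training examples classified correctly
            break
    return w, b
```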
Why Perceptron Updates Work?
- Let's look at a misclassified positive example (yn = +1)
- Perceptron (wrongly) thinks wᵗ_old xn + b_old < 0
- Updates would be:
  w_new = w_old + yn xn = w_old + xn (since yn = +1)
  b_new = b_old + yn = b_old + 1
- Then:
  wᵗ_new xn + b_new = (w_old + xn)ᵗ xn + b_old + 1 = (wᵗ_old xn + b_old) + xnᵗxn + 1
- Thus wᵗ_new xn + b_new is less negative than wᵗ_old xn + b_old (since xnᵗxn ≥ 0)
- So we are making ourselves more correct on this example!
Why Perceptron Updates Work (Pictorially)?
Why Perceptron Updates Work?
- Let's look at a misclassified negative example (yn = −1)
- Perceptron (wrongly) thinks wᵗ_old xn + b_old > 0
- Updates would be:
  w_new = w_old + yn xn = w_old − xn (since yn = −1)
  b_new = b_old + yn = b_old − 1
- Then:
  wᵗ_new xn + b_new = (w_old − xn)ᵗ xn + b_old − 1 = (wᵗ_old xn + b_old) − xnᵗxn − 1
- Thus wᵗ_new xn + b_new is less positive than wᵗ_old xn + b_old
- So we are making ourselves more correct on this example!
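A quick numeric check of both derivations (the vectors are made-up illustrations):

```python
import numpy as np

def score(w, b, x):
    return np.dot(w, x) + b

x = np.array([2.0, 1.0])                 # x.x = 5

# Misclassified positive example: score is negative, label is +1
w, b = np.array([-1.0, 0.5]), 0.0
print(score(w, b, x))                    # -1.5 (wrong side)
w_new, b_new = w + x, b + 1              # update with y_n = +1
print(score(w_new, b_new, x))            # -1.5 + 5 + 1 = 4.5 (less negative)

# Misclassified negative example: score is positive, label is -1
w, b = np.array([1.0, 0.5]), 0.0
print(score(w, b, x))                    # 2.5 (wrong side)
w_new, b_new = w - x, b - 1              # update with y_n = -1
print(score(w_new, b_new, x))            # 2.5 - 5 - 1 = -3.5 (less positive)
```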
Why Perceptron Updates Work (Pictorially)?
Perceptron convergence theorem
- The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates.
- The number of updates depends on the data set, and also on the step size parameter.
- If the data is not linearly separable, there will be oscillation (which can be detected automatically).
Example: a simple problem
Four points, linearly separable:

  x1 = (−1, 1/2), d1 = −1
  x2 = (−1, 1),   d2 = −1
  x3 = (1/2, 1),  d3 = +1
  x4 = (1, 1/2),  d4 = +1

[Figure: the four points plotted in the plane.]
Initial weight conditions. Perceptron learning rule (as before):

  Δwi = c [di − sgn(wᵗx)] xi = c [di − oi] xi
  wi = wi + Δwi

where di is the desired response, oi is the perceptron output, and c is a small constant (the learning rate).

- Learning constant: c = 0.1
- Training set: the vectors x1, x2, x3, x4 are augmented with x0 = 1
- Initial (augmented) weights: w1 = 0, w2 = 1, w0 = 1, i.e. w(0) = [0 1 1]

[Figure: the four points and the initial weight vector W(0) = (0, 1) drawn in the plane; the LTU computes o(x) = 1 if Σ_{i=0..n} wi xi > 0, −1 otherwise.]
Initial discriminant function: w1x1 + w2x2 + w0 = 0 => x2 + 1 = 0 => x2 = −1

Step 1 (present x1, d1 = −1): modified discriminant function

  0.2x1 + 0.9x2 + 0.8 = 0

Point x1 = (−1, 1/2) is still misclassified!

Step 2 (present x2, d2 = −1): modified discriminant function

  0.4x1 + 0.7x2 + 0.6 = 0

Point x1 = (−1, 1/2) is still misclassified!

[Figures: the decision line after each modification, alongside the initial discriminant function.]
Step 3 (present x3, d3 = +1): 0.4x1 + 0.7x2 + 0.6 = 0 (no change in the discriminant function: x3 is already correctly classified)

Step 4 (present x4, d4 = +1): 0.4x1 + 0.7x2 + 0.6 = 0 (no change in the discriminant function: x4 is already correctly classified)
Step 5 (second epoch: present x1 again): modified discriminant function

  0.6x1 + 0.6x2 + 0.4 = 0

Point x1 = (−1, 1/2) is still misclassified!

Step 6 (present x2 again): final discriminant function

  0.8x1 + 0.4x2 + 0.2 = 0

In the 6th step all points are correctly classified: correct linear classification is achieved.

[Figure: the final decision line separating the two classes.]
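The whole run can be replayed in a few lines; a sketch assuming the weight ordering (w1, w2, w0) and the cyclic presentation of x1, . . . , x4 used above:

```python
import numpy as np

# Augmented training vectors (x1, x2, x0 = 1) and desired responses
X = np.array([[-1, 0.5, 1], [-1, 1, 1], [0.5, 1, 1], [1, 0.5, 1]])
d = np.array([-1, -1, 1, 1])
w = np.array([0.0, 1.0, 1.0])                 # initial weights: x2 + 1 = 0
c = 0.1

for step in range(6):
    x, di = X[step % 4], d[step % 4]          # present x1, x2, x3, x4 cyclically
    oi = 1 if w @ x > 0 else -1
    w = w + c * (di - oi) * x                 # no change when di == oi (steps 3, 4)
    print(f"step {step + 1}: {w[0]:.1f}x1 + {w[1]:.1f}x2 + {w[2]:.1f} = 0")

print(np.all(np.sign(X @ w) == d))            # True: all four points classified
```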
The Best Hyperplane Separator?
- Perceptron finds one of the many possible hyperplanes separating the data, if one exists
- Of the many possible choices, which one is the best?
- Intuitively, we want the hyperplane having the maximum margin
- Large margin leads to good generalization on the test data
Linear Machines and Minimum Distance Classification
Consider two clusters of patterns, each cluster belonging to one known category (class). The center points (centers of gravity) of the clusters of classes 1 and 2 are the vectors x1 and x2, respectively. The decision hyperplane contains the midpoint of the line segment connecting prototype points P1 and P2, and is normal to the vector x1 − x2, which is directed toward P1.

Decision hyperplane equation (the locus of points equidistant from x1 and x2):

  (x1 − x2)ᵗ x + 1/2 (x2ᵗx2 − x1ᵗx1) = 0    (3.9)

Hyperplane equation in terms of w and x for n-dimensional space:

  w1x1 + w2x2 + · · · + wnxn + wn+1 = 0    (3.10)

Comparing (3.9) and (3.10), the weighting coefficients w1, w2, . . . , wn are the components of x1 − x2, and wn+1 = 1/2 (x2ᵗx2 − x1ᵗx1).
Let us assume that a minimum-distance classification is required to classify patterns into one of R categories. Each of the R classes is represented by prototype points P1, P2, . . . , PR, being vectors x1, x2, . . . , xR, respectively.

The Euclidean distance between input pattern x and the prototype pattern vector xi is expressed by the norm of the vector x − xi:

  ||x − xi|| = [(x − xi)ᵗ(x − xi)]^(1/2)    (3.12)

A minimum-distance classifier computes the distance from a pattern x of unknown classification to each prototype; the category number of the closest (smallest-distance) prototype is assigned to the unknown pattern.

Calculating the squared distances from Equation (3.12) yields

  ||x − xi||² = xᵗx − 2xiᵗx + xiᵗxi    (3.13)

The term xᵗx is independent of i, so only the R terms 2xiᵗx − xiᵗxi, for i = 1, . . . , R, in (3.13) need to be computed, to determine for which xi this term takes the largest of all R values. Choosing the largest of the terms xiᵗx − 0.5xiᵗxi is equivalent to choosing the smallest of the distances ||x − xi||. This property is used to equate this term with a discriminant function gi(x) [remember: for two classes the hyperplane normal was w = x1 − x2]:

  gi(x) = xiᵗx − 0.5 xiᵗxi, for i = 1, 2, . . . , R    (3.14)

Also, gi(x) = wiᵗx + wi,n+1, for i = 1, 2, . . . , R    (3.15)

Comparing (3.14) and (3.15), we get

  wi = xi, wi,n+1 = −0.5 xiᵗxi, for i = 1, 2, . . . , R    (3.16)
Minimum-distance classifiers are linear classifiers, i.e. linear machines. Since minimum-distance classifiers assign category membership based on the closest match, this is also called correlation classification.

[Figure: block diagram of a linear machine employing the linear discriminant functions of Equation (3.15).]

The decision surface Sij for the contiguous decision regions Ri, Rj is a hyperplane given by the equation

  Sij : gi(x) − gj(x) = 0    (3.17)
Example: the assumed prototype points are as shown in the figure; their coordinates are

  x1 = [10 2]ᵗ, x2 = [2 −5]ᵗ, x3 = [−5 5]ᵗ

It is assumed that each prototype point index corresponds to its class number.

Using formula (3.16) for n = 2, R = 3, the weight vectors are

  w1 = [10 2 −52]ᵗ, w2 = [2 −5 −14.5]ᵗ, w3 = [−5 5 −25]ᵗ

The corresponding linear discriminant functions are:

  g1(x) = 10x1 + 2x2 − 52
  g2(x) = 2x1 − 5x2 − 14.5
  g3(x) = −5x1 + 5x2 − 25
The resulting classifier is shown in Figure 3.8(d). There are three decision lines S12, S13, and S23, which can be calculated using condition (3.17) and the discriminant functions (3.20b) as:

  S12 : 8x1 + 7x2 − 37.5 = 0
  S13 : 15x1 − 3x2 − 27 = 0
  S23 : 7x1 − 10x2 + 10.5 = 0
For input pattern xᵗ = [6 1]:

  g1(x) = 60 + 2 − 52 = 10
  g2(x) = 12 − 5 − 14.5 = −7.5
  g3(x) = −30 + 5 − 25 = −50

Maximum is g1(x), so pattern x belongs to class 1.
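A compact sketch of this linear machine, using the prototype coordinates above:

```python
import numpy as np

prototypes = np.array([[10, 2], [2, -5], [-5, 5]])    # x1, x2, x3

def discriminants(x):
    """g_i(x) = x_i.x - 0.5 * x_i.x_i (Eq. 3.14): the largest g_i marks the closest prototype."""
    return prototypes @ x - 0.5 * np.sum(prototypes**2, axis=1)

x = np.array([6, 1])
print(discriminants(x))                  # [ 10.   -7.5 -50. ]
print(np.argmax(discriminants(x)) + 1)   # class 1
```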
Single layer continuous perceptron networks for linearly separable classification
- The TLU element with weights will be replaced by the continuous perceptron. Two objectives:
  - gain finer control over the training procedure, and work with differentiable characteristics of the threshold element, enabling computation of the error gradient
  - the weight modification problem can then be better solved by the gradient, or steepest descent, procedure
Single layer continuous perceptron networks
- The procedure of descent is simple. Starting from an arbitrarily chosen weight vector w, the gradient ∇E(w) of the current error function is computed.
- The next value of w is obtained by moving in the direction of the negative gradient along the multidimensional error surface; the direction of the negative gradient is the one of steepest descent.
- The algorithm can be summarized as:

  w^(k+1) = w^(k) − η ∇E(w^(k))    (3.40)

  where η is a positive constant called the learning constant and the superscript (k) denotes the step number.
- The expression for the classification error to be minimized is:

  E = 1/2 (d − o)²    (3.41)

- The error function has a single minimum at w = wf, which can be reached by negative-gradient descent starting at the initial weight vector w^0.
The error minimization algorithm (3.40) requires computation of the gradient of the error (3.41):

  ∇E(w) = ∇ { 1/2 [d − f(wᵗx)]² } = −(d − o) f′(net) x,  where net = wᵗx    (3.42)

The (n + 1)-dimensional gradient vector (3.43) is defined as

  ∇E(w) = [∂E/∂w1, ∂E/∂w2, . . . , ∂E/∂wn+1]ᵗ    (3.43)

Using (3.42) in (3.40), the weight update Δw = −η∇E(w) becomes

  Δw = η (d − o) f′(net) x

which is the training rule of the continuous perceptron (the delta rule).
Delta rule for the single layer continuous perceptron
- The learning rule performs a search within the solution's vector space towards a global minimum.
- The error surface itself is a hyperparaboloid, but is seldom as smooth as depicted in the figure. In most problems, the solution space is quite irregular, with numerous pits and hills which may cause the network to settle in a local minimum (not the best overall solution).
- Epochs are repeated until a stopping criterion is reached (error magnitude, number of iterations, change of weights, etc.).
Let us express f′(net) in terms of the continuous perceptron output. Using the bipolar continuous activation function f(net) of the form

  f(net) = 2 / (1 + exp(−net)) − 1

we have f′(net) = 1/2 (1 − o²), where o = f(net). The complete delta training rule for the bipolar continuous activation function then results from (3.40) as:

  Δw = 1/2 η (d − o)(1 − o²) x
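A training sketch built on this rule; the data set (the four points from the earlier example) and the learning constant are illustrative choices:

```python
import numpy as np

def f(net):
    """Bipolar continuous activation, mapping R into (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def delta_rule(X, d, eta=0.5, epochs=200):
    """Continuous perceptron training: w += 0.5 * eta * (d - o) * (1 - o^2) * x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, dk in zip(X, d):
            o = f(w @ x)
            w += 0.5 * eta * (dk - o) * (1.0 - o**2) * x
    return w

X = np.array([[-1, 0.5, 1], [-1, 1, 1], [0.5, 1, 1], [1, 0.5, 1]])  # augmented inputs
d = np.array([-1.0, -1.0, 1.0, 1.0])
print(np.sign(X @ delta_rule(X, d)))     # [-1. -1.  1.  1.]: correct side for all
```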
Perceptron vs. Delta Rule
- perceptron training rule:
  - uses a thresholded unit
  - converges after a finite number of iterations
  - output hypothesis classifies training data perfectly
  - linear separability necessary
- delta rule:
  - uses an unthresholded linear unit
  - converges asymptotically toward a minimum-error hypothesis
  - termination is not guaranteed
  - linear separability not necessary
XOR
A single-layer perceptron cannot solve the XOR problem!
Different Non-Linearly-Separable Problems

  Structure     | Types of Decision Regions
  ------------- | ----------------------------------------------
  Single-Layer  | Half plane bounded by hyperplane
  Two-Layer     | Convex open or closed regions
  Three-Layer   | Arbitrary (complexity limited by no. of nodes)

(The original slide also illustrates, for each structure, the exclusive-OR problem, classes with meshed regions, and the most general region shapes; those three columns are diagrams of A and B regions.)
Delta Rule
- perceptron rule fails if data is not linearly separable
- delta rule converges toward a best-fit approximation
- uses gradient descent to search the hypothesis space
- the discrete perceptron cannot be used here, because it is not differentiable; hence, an unthresholded linear unit is appropriate
- error measure (summed over the training examples d):

  E(w) = 1/2 Σ_d (td − od)²

- to understand gradient descent, it is helpful to visualize the entire hypothesis space, with all possible weight vectors and their associated E values
Error Surface
- the axes w0, w1 represent possible values for the two weights of a simple linear unit
- for a linear unit, the error surface must be parabolic with a single global minimum

[Figure: parabolic error surface over the (w0, w1) plane.]
Generalization and Early Stopping
- By proper training, a neural network may produce reasonable output for inputs not seen during training: generalization
- Generalization is particularly useful for the analysis of noisy data (e.g. time-series)
- Overtraining will not improve the ability of a neural network to produce good output. On the contrary, it will try to take the noise as real data and lose its generality.
Generalization and Early Stopping

[Figure: overfitting vs. generalization; error curves for the learning data set and the validation data set against the number of iterations in optimization, with the early-stopping area marked where the validation error begins to rise.]
Overfitting
- With sufficient nodes, a network can classify any training set exactly
- But it may have poor generalisation ability
- Cross-validation with some held-out patterns: typically 30% of the training patterns
- Validation set error is checked each epoch
- Stop training if validation error goes up
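A minimal early-stopping skeleton (`train_one_epoch` and `validation_error` are illustrative placeholders for the user's own training and evaluation routines):

```python
def early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=1):
    """Keep the weights with the lowest validation error; stop once it rises.
    train_one_epoch() does one pass over the training patterns and returns weights;
    validation_error(w) evaluates w on the held-out (e.g. 30%) validation patterns."""
    best_err, best_w, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        w = train_one_epoch()
        err = validation_error(w)        # validation error checked each epoch
        if err < best_err:
            best_err, best_w, bad_epochs = err, w, 0
        else:
            bad_epochs += 1              # validation error went up
            if bad_epochs > patience:
                break                    # early stopping
    return best_w
```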
Training time
How many epochs of training? Stop if the error fails to improve (has reached a minimum)
Stop if the rate of improvement drops below a certain level
Stop if the error reaches an acceptable level
Stop when a certain number of epochs have passed
Limitations of Simple Neural Networks
The Limitations of Perceptrons (Minsky and Papert, 1969)
- Able to form only linear discriminant functions, i.e. classes which can be divided by a line or hyperplane
- Most functions are more complex, i.e. they are non-linear or not linearly separable
- This crippled research in neural net theory for 15 years...
Linear inseparability
- A single-layer perceptron with threshold units fails if the problem is not linearly separable
- Example: XOR
- Minsky and Papert's book showing these negative results was very influential

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane.]
Solution in 1980s: Multi-layer perceptrons
- Removes many limitations of single-layer networks
- Can solve XOR
- Exercise: draw a two-layer perceptron that computes the XOR function
  - 2 binary inputs X and Y (0 or 1), 1 binary output (0 or 1)
  - one hidden layer
  - find the appropriate weights and thresholds
EXAMPLE: Logical XOR function

  X Y | Z
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 0

[Figure: a multilayer neural network with an input layer (X, Y), a hidden layer of neurons, and an output layer; the four XOR points plotted in the plane.]

Two neurons are needed! Their combined results can produce good classification.
Solution in 1980s: Multi-layer perceptrons
Two examples of two-layer perceptrons that compute XOR (truth table above). Each hidden node realizes one of the separating lines; each comparison in parentheses evaluates to 1 if true and 0 if false (checked in the sketch below).
- Left-side network: output is 1 if and only if

  (x + y − 0.5 > 0) − 2(x + y − 1.5 > 0) − 0.5 > 0

- Right-side network: output is 1 if and only if

  x + y − 2(x + y − 1.5 > 0) − 0.5 > 0
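Both networks can be checked directly, writing each indicator comparison as an int(...) cast:

```python
def left_net(x, y):
    h1 = int(x + y - 0.5 > 0)           # hidden node realizing the line x + y = 0.5
    h2 = int(x + y - 1.5 > 0)           # hidden node realizing the line x + y = 1.5
    return int(h1 - 2 * h2 - 0.5 > 0)

def right_net(x, y):
    h = int(x + y - 1.5 > 0)            # single hidden node; inputs also feed the output
    return int(x + y - 2 * h - 0.5 > 0)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, left_net(x, y), right_net(x, y))   # both match XOR: 0, 1, 1, 0
```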
EXAMPLE

[Figure: two interlocking (meshed) classes A and B that no single hyperplane can separate.]

More complex multilayer networks are needed to solve more difficult problems.
Multilayer Perceptron

[Figure: input nodes feed one or more layers of hidden units (hidden layers), which feed the output neurons; each unit applies the non-linear squashing function g(a).]

The most common output function is the sigmoid:

  g(a) = 1 / (1 + e^(−a))
Multilayer Perceptron (MLP)

[Figure: input signals (external stimuli) enter the input layer; adjustable weights connect it through the hidden layers to the output layer, which produces the output values.]
Types of Layers
- The input layer: introduces input values into the network. No activation function or other processing.
- The hidden layer(s): perform classification of features. Two hidden layers are sufficient to solve any problem; "features" imply more layers may be better.
- The output layer: functionally just like the hidden layers. Outputs are passed on to the world outside the neural network.
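A forward-pass sketch matching this layer structure (layer sizes and random weights are arbitrary illustrations):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))      # the squashing function g(a)

def mlp_forward(x, weights, biases):
    """The input layer passes x through unchanged; each hidden/output layer
    applies its weights and bias followed by the sigmoid."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [2, 3, 1]                        # input nodes, one hidden layer, output neuron
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
print(mlp_forward(np.array([0.5, -1.0]), weights, biases))
```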
Example: Perceptrons as Constraint Satisfaction Networks

[Figure slides, heavily garbled in this transcript: a two-layer network with inputs x and y is analyzed as a constraint satisfaction problem. For each input pattern, the required output imposes an inequality on a unit's weights and threshold (of the form w1x + w2y − θ > 0 or < 0), and the diagrams fill in the unknown weights so that all of the constraints hold simultaneously.]