Linear Discriminant Functions
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition, Dr. George Bebis
Generative vs Discriminant Approach
• Generative approaches find the discriminant function by first estimating the probability distribution of the patterns belonging to each class.
• Discriminant approaches find the discriminant function explicitly, without assuming a probability distribution.
Generative Approach – Example (two categories)
• More common to use a single discriminant function (dichotomizer) instead of two:
• Examples:
$$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$$
$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
Discriminant Approach
• Specify parametric form of the discriminant function, for example, a linear discriminant:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
Discriminant Approach (cont’d)
• Find the “best” decision boundary (i.e., estimate w and w0) using a set of training examples xk.
Discriminant Approach (cont’d)
• The solution is found by minimizing a criterion function (e.g., “training error” or “empirical risk”):
• Learning algorithms can be applied to find the solution.
$$J(\mathbf{w}, w_0) = \frac{1}{n} \sum_{k=1}^{n} \left[ z_k - g(\mathbf{x}_k) \right]^2$$
where $z_k$ is the correct class and $g(\mathbf{x}_k)$ is the predicted class.
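As an illustration, here is a minimal sketch of this criterion in Python; the function names, the ±1 label encoding, and the toy data are assumptions made for the example, not part of the slides.

```python
import numpy as np

def g(X, w, w0):
    """Linear discriminant g(x) = w^t x + w0, evaluated for every row of X."""
    return X @ w + w0

def empirical_risk(X, z, w, w0):
    """Training error J(w, w0) = (1/n) * sum_k [z_k - g(x_k)]^2."""
    return np.mean((z - g(X, w, w0)) ** 2)

# toy example: two 2-D samples per class, labels z in {+1, -1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
z = np.array([1, 1, -1, -1])
print(empirical_risk(X, z, w=np.array([1.0, 1.0]), w0=0.0))  # squared-error criterion on the toy data
```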
Linear Discriminant Functions: two-category case
• A linear discriminant function has the following form:
• The decision boundary g(x) = 0 is a hyperplane; its orientation is determined by w and its location by w0.
– w is the normal to the hyperplane.
– If w0 = 0, the hyperplane passes through the origin.
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
Geometric Interpretation of g(x)
• g(x) provides an algebraic measure of the distance of x from the hyperplane.
• x can be expressed as:
$$\mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|}$$
where $\mathbf{x}_p$ is the projection of x onto the hyperplane, r is the signed distance of x from the hyperplane, and w/||w|| gives the direction of r.
Geometric Interpretation of g(x) (cont’d)
• Substitute x in g(x):
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0 = \mathbf{w}^t \mathbf{x}_p + w_0 + r \frac{\mathbf{w}^t \mathbf{w}}{\|\mathbf{w}\|} = r \|\mathbf{w}\|$$
since $\mathbf{w}^t \mathbf{x}_p + w_0 = 0$ and $\mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2$.
Geometric Interpretation of g(x) (cont’d)
• Therefore, the distance of x from the hyperplane is given by:
$$r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$$
• Setting x = 0 gives the distance of the origin from the hyperplane:
$$r = \frac{w_0}{\|\mathbf{w}\|}$$
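A minimal sketch of these formulas; the weight values are hypothetical and chosen only for illustration.

```python
import numpy as np

def classify_with_distance(x, w, w0):
    """Evaluate g(x) = w^t x + w0; its sign gives the class, g(x)/||w|| the signed distance."""
    g = w @ x + w0
    label = "omega_1" if g > 0 else "omega_2"   # decide omega_1 if g(x) > 0
    r = g / np.linalg.norm(w)                   # signed distance of x from the hyperplane
    return g, label, r

w = np.array([3.0, 4.0])    # hypothetical normal vector, ||w|| = 5
w0 = -5.0
print(classify_with_distance(np.array([2.0, 2.0]), w, w0))   # (9.0, 'omega_1', 1.8)
```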
Linear Discriminant Functions: multi-category case
• There are several ways to devise multi-category classifiers using linear discriminant functions:
(1) One against the rest
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d)
(2) One against another (i.e., c(c-1)/2 pairs of classes)
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d)
• To avoid the problem of ambiguous regions:
– Define c linear discriminant functions gi(x), i = 1, ..., c.
– Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
• The resulting classifier is called a linear machine
(see Chapter 2)
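A minimal sketch of a linear machine's decision rule; the weight matrix and biases are hypothetical values for c = 3 classes.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) = w_i^t x + w_{i0} is largest."""
    scores = W @ x + w0          # one discriminant value per class
    return int(np.argmax(scores))

W = np.array([[ 1.0,  0.0],      # w_1
              [ 0.0,  1.0],      # w_2
              [-1.0, -1.0]])     # w_3
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))   # 0, i.e. decide omega_1
```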
Linear Discriminant Functions: multi-category case (cont’d)
• A linear machine divides the feature space into c convex decision regions.
– If x is in region Ri, then gi(x) is the largest.
• Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.
Linear Discriminant Functions: multi-category case (cont’d)
• The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij given by:
$$g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad \text{or} \quad g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0 \quad \text{or} \quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$$
• (wi - wj) is normal to Hij, and the signed distance from x to Hij is:
$$r = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$$
Higher Order Discriminant Functions
• Can produce more complicated decision boundaries than linear discriminant functions.
Generalized discriminants
• Defined through special functions yi(x) called φ functions:
$$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
– α is a $\hat{d}$-dimensional weight vector.
– The φ functions yi(x) map a point from the d-dimensional x-space to a point in the $\hat{d}$-dimensional y-space (usually $\hat{d} \gg d$):
$$\mathbf{y} = \begin{pmatrix} y_1(\mathbf{x}) \\ y_2(\mathbf{x}) \\ \vdots \\ y_{\hat{d}}(\mathbf{x}) \end{pmatrix}$$
Generalized discriminants (cont’d)
• The resulting discriminant function is linear in y-space.
• Separates points in the transformed space by a hyperplane passing through the origin.
$$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
Example
• φ functions: y1(x) = 1, y2(x) = x, y3(x) = x² (so d = 1, $\hat{d}$ = 3):
$$g(x) = -1 + x + 2x^2, \qquad \boldsymbol{\alpha} = \begin{pmatrix} -1 \\ 1 \\ 2 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}, \qquad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
• g(x) > 0 if x < -1 or x > 0.5
• The corresponding decision regions R1, R2 in the x-space are not simply connected!
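A minimal sketch of this example; the function names are assumptions, the coefficients are those of the discriminant above.

```python
import numpy as np

def phi(x):
    """Map scalar x from 1-D x-space to 3-D y-space: y = (1, x, x^2)^t."""
    return np.array([1.0, x, x**2])

alpha = np.array([-1.0, 1.0, 2.0])   # weight vector of g(x) = -1 + x + 2x^2

def g(x):
    """g(x) = alpha^t y."""
    return alpha @ phi(x)

for x in (-2.0, 0.0, 1.0):
    print(x, g(x), "omega_1" if g(x) > 0 else "omega_2")
# g(x) > 0 (decide omega_1) exactly when x < -1 or x > 0.5
```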
Example (cont’d)
• The mapping y = (1, x, x²)ᵗ takes the line in x-space onto a parabola in y-space.
• The plane αᵗy = 0 divides the y-space into two decision regions $\hat{R}_1$ and $\hat{R}_2$.
Learning: two-category, linearly separable case
• Given a linear discriminant function
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
the goal is to "learn" the parameters w and w0 from a set of n labeled samples xi, where each xi has a class label ω1 or ω2.
Augmented feature/parameter space
• Simplify notation by augmenting the feature and parameter vectors (with x0 = 1):
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \boldsymbol{\alpha}^t \mathbf{y}$$
$$\mathbf{w} = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} \rightarrow \boldsymbol{\alpha} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \rightarrow \mathbf{y} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_d \end{pmatrix}$$
• dimensionality: d → (d+1)
Classification in augmented space
• Discriminant: g(x) = αᵗy
• Classification rule: if αᵗyi > 0, assign yi to ω1; else if αᵗyi < 0, assign yi to ω2.
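A minimal sketch of classification in the augmented space; the names and weight values are assumptions for the example.

```python
import numpy as np

def augment(x):
    """Augmented feature vector y = (1, x_1, ..., x_d)^t, i.e. x_0 = 1."""
    return np.concatenate(([1.0], x))

w = np.array([2.0, -1.0])            # original weight vector
w0 = 0.5
alpha = np.concatenate(([w0], w))    # augmented weight vector alpha = (w_0, w_1, ..., w_d)^t

x = np.array([1.0, 3.0])
y = augment(x)
print(alpha @ y)                                    # equals w @ x + w0 = -0.5
print("omega_1" if alpha @ y > 0 else "omega_2")    # decide omega_2
```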
Learning in augmented space: two-category, linearly separable case
• Given a linear discriminant function g(x) = αᵗy, the goal is to learn the weights (parameters) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.
Learning in augmented space: effect of training examples
• Every training sample yi places a constraint on the weight vector α.
– αᵗy = 0 defines a hyperplane in parameter space having y as a normal vector.
– Given n examples, the solution α must lie in the intersection of n half-spaces.
[Figure: parameter space (α1, α2) showing the constraint hyperplanes αᵗyi = 0 and αᵗyj = 0 for a sample from each class]
Learning in augmented space: effect of training examples (cont’d)
• Visualize solution in the parameter or feature space.
[Figure: the solution region shown both in feature space (y1, y2) and in parameter space (α1, α2)]
Uniqueness of Solution
• Solution vector α is usually not unique; we can impose
certain constraints to enforce uniqueness:
“Find unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane”
Iterative Optimization
• Define an error function J(α) (i.e., misclassifications) that is minimized if α is a solution vector.
• Minimize J(α) iteratively:
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k$$
where pk is the search direction and η(k) is the learning rate.
• How should we define pk?
Choosing pk using Gradient Descent
$$\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))$$
so the update rule becomes α(k+1) = α(k) - η(k) ∇J(α(k)), where η(k) is the learning rate.
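A minimal gradient descent sketch under these definitions; the criterion J and its gradient are placeholders chosen for illustration.

```python
import numpy as np

def gradient_descent(grad_J, alpha0, eta=0.1, n_iters=100):
    """Iterate alpha(k+1) = alpha(k) - eta * grad J(alpha(k)) with a constant learning rate."""
    alpha = alpha0.astype(float).copy()
    for _ in range(n_iters):
        alpha -= eta * grad_J(alpha)
    return alpha

# placeholder quadratic criterion J(alpha) = ||alpha - alpha_star||^2, minimized at alpha_star
alpha_star = np.array([1.0, 2.0])
grad_J = lambda a: 2.0 * (a - alpha_star)
print(gradient_descent(grad_J, alpha0=np.zeros(2)))   # approaches [1, 2]
```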
Gradient Descent (cont’d)
[Figure: contours of J(α) in the (α1, α2) solution space; gradient descent moves from α(0) through α(k) toward the minimum]
Gradient Descent (cont’d)
• What is the effect of the learning rate?
– If η is too small, descent is slow but converges to the solution; if η is too large, it is fast but overshoots the solution.
[Figure: gradient descent trajectories on contours of J(α) for η = 0.37 and η = 0.39]
Gradient Descent (cont’d)
• How do we choose the learning rate η(k)?
• Using a second-order Taylor series approximation of J(α) around α(k), the optimum learning rate is
$$\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H} \nabla J}$$
where H is the Hessian matrix (2nd derivatives) of J(α) evaluated at α(k).
• If J(α) is quadratic, then H is constant, which implies that the learning rate is constant.
Choosing pk using Newton’s Method
$$\mathbf{p}_k = -\mathbf{H}^{-1} \nabla J(\boldsymbol{\alpha}(k))$$
• Requires inverting H at each step.
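A minimal Newton's-method sketch; the quadratic criterion, its gradient, and Hessian are placeholders used only to show the update.

```python
import numpy as np

def newton_step(alpha, grad_J, hess_J):
    """One Newton update: alpha <- alpha - H^{-1} grad J(alpha) (solve instead of explicit inverse)."""
    return alpha - np.linalg.solve(hess_J(alpha), grad_J(alpha))

# quadratic criterion J(alpha) = (alpha - alpha_star)^t A (alpha - alpha_star), minimum at alpha_star
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
alpha_star = np.array([1.0, 2.0])
grad_J = lambda a: 2.0 * A @ (a - alpha_star)
hess_J = lambda a: 2.0 * A
print(newton_step(np.zeros(2), grad_J, hess_J))   # [1. 2.]: reaches the minimum in one step
```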
Newton’s method (cont’d)
• If J(α) is quadratic, Newton's method converges in one step!
[Figure: contours of J(α); Newton's method jumps directly to the minimum]
Gradient descent vs Newton’s method
“Normalized” Problem
• If yi belongs to ω2, replace yi by -yi.
• Find α such that: αᵗyi > 0 for all i.
• Instead of seeking a hyperplane that separates patterns from different categories, we now seek a hyperplane that puts the normalized patterns on the same (positive) side.
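A minimal sketch of this normalization; the ±1 label encoding is an assumption for the example.

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the augmented samples of omega_2 so a solution must satisfy alpha^t y_i > 0 for all i."""
    signs = np.where(labels == 1, 1.0, -1.0)   # +1 for omega_1, -1 for omega_2
    return Y * signs[:, None]

Y = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -0.5]])   # augmented samples
labels = np.array([1, -1])                            # omega_1, omega_2
print(normalize_samples(Y, labels))                   # second row is negated
```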
Perceptron rule
• Use gradient descent with the Perceptron criterion:
$$J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})$$
where Y(α) is the set of samples misclassified by α.
• If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.
• Find α such that: αᵗyi > 0.
Perceptron rule (cont’d)
• The gradient of Jp(α) is:
• The perceptron update rule is obtained using gradient descent:
$$\nabla J_p = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})$$
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k) \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} \mathbf{y}$$
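A minimal batch Perceptron sketch implementing this update on normalized, augmented samples; the toy data and names are assumptions.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iters=1000):
    """alpha(k+1) = alpha(k) + eta * sum of currently misclassified (normalized) samples."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        misclassified = Y[Y @ alpha <= 0]    # samples with alpha^t y <= 0
        if len(misclassified) == 0:          # every sample on the positive side: solution found
            break
        alpha += eta * misclassified.sum(axis=0)
    return alpha

# toy linearly separable, normalized, augmented samples y = (1, x1, x2)^t
Y = np.array([[ 1.0, 2.0, 1.0],    # omega_1 sample
              [ 1.0, 1.5, 2.0],    # omega_1 sample
              [-1.0, 1.0, 0.5],    # omega_2 sample, already negated
              [-1.0, 0.5, 1.0]])   # omega_2 sample, already negated
print(batch_perceptron(Y))         # a solution vector with alpha^t y_i > 0 for all i
```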
Perceptron rule (cont’d)
[Algorithm: Batch Perceptron; at each iteration, η(k) times the sum of the misclassified examples Y(α) is added to α]
Perceptron rule (cont’d)
• Move the hyperplane so that training samples are on its positive side.
Perceptron rule (cont’d)
Perceptron Convergence Theorem: If the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.
• Fixed-increment, single-sample version (η(k) = 1, one example at a time):
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \mathbf{y}^k$$
where yᵏ is an example misclassified by α(k).
Perceptron rule (cont’d)
• Example: order in which misclassified examples are presented: y2, y3, y1, y3.
• The "batch" algorithm leads to a smoother trajectory in solution space.
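A minimal sketch of the single-sample, fixed-increment variant (η(k) = 1), cycling through the normalized samples one at a time; the data layout matches the batch sketch above.

```python
import numpy as np

def single_sample_perceptron(Y, max_epochs=100):
    """Fixed-increment rule: whenever alpha^t y <= 0, update alpha <- alpha + y."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        updated = False
        for y in Y:                   # consider one example at a time
            if alpha @ y <= 0:        # y is misclassified by the current alpha
                alpha = alpha + y
                updated = True
        if not updated:               # a full pass with no updates: solution found
            break
    return alpha
```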