Linear Discriminant Functions
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition, Dr. George Bebis
Generative vs Discriminant Approach
• Generative approaches find the discriminant function by first estimating the probability distribution of the patterns belonging to each class.
• Discriminant approaches find the discriminant function explicitly, without assuming a probability distribution.
Generative Approach – Example (two categories)
• More common to use a single discriminant function (dichotomizer) instead of two:
• Examples:
$$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$$
$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
Discriminant Approach
• Specify parametric form of the discriminant function, for example, a linear discriminant:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
Discriminant Approach (cont’d)
• Find the “best” decision boundary (i.e., estimate w and w0) using a set of training examples xk.
Discriminant Approach (cont’d)
• The solution is found by minimizing a criterion function (e.g., “training error” or “empirical risk”):
• Learning algorithms can be applied to find the solution.
$$J(\mathbf{w}, w_0) = \frac{1}{n} \sum_{k=1}^{n} \left[ z_k - g(\mathbf{x}_k) \right]^2$$
where $z_k$ is the correct class and $g(\mathbf{x}_k)$ is the predicted class.
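As an illustration, here is a minimal sketch of this criterion in Python; the function names, the ±1 label encoding, and the toy data are assumptions made for the example, not part of the slides.

```python
import numpy as np

def g(X, w, w0):
    """Linear discriminant g(x) = w^t x + w0, evaluated for every row of X."""
    return X @ w + w0

def empirical_risk(X, z, w, w0):
    """Training error J(w, w0) = (1/n) * sum_k [z_k - g(x_k)]^2."""
    return np.mean((z - g(X, w, w0)) ** 2)

# toy example: two 2-D samples per class, labels z in {+1, -1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
z = np.array([1, 1, -1, -1])
print(empirical_risk(X, z, w=np.array([1.0, 1.0]), w0=0.0))  # squared-error criterion on the toy data
```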
Linear Discriminant Functions: two-category case
• A linear discriminant function has the following form:
• The decision boundary g(x) = 0 is a hyperplane; its orientation is determined by w and its location by w0.
– w is the normal to the hyperplane.
– If w0 = 0, the hyperplane passes through the origin.
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
Geometric Interpretation of g(x)
• g(x) provides an algebraic measure of the distance of x from the hyperplane.
• x can be expressed as:
$$\mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|}$$
where $\mathbf{x}_p$ is the projection of x onto the hyperplane, r is the signed distance of x from the hyperplane, and w/||w|| gives the direction of r.
Geometric Interpretation of g(x) (cont’d)
• Substitute x in g(x):
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0 = \mathbf{w}^t \mathbf{x}_p + w_0 + r \frac{\mathbf{w}^t \mathbf{w}}{\|\mathbf{w}\|} = r \|\mathbf{w}\|$$
since $\mathbf{w}^t \mathbf{x}_p + w_0 = 0$ and $\mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2$.
Geometric Interpretation of g(x) (cont’d)
• Therefore, the distance of x from the hyperplane is given by:
$$r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$$
• Setting x = 0 gives the distance of the origin from the hyperplane:
$$r = \frac{w_0}{\|\mathbf{w}\|}$$
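A minimal sketch of these formulas; the weight values are hypothetical and chosen only for illustration.

```python
import numpy as np

def classify_with_distance(x, w, w0):
    """Evaluate g(x) = w^t x + w0; its sign gives the class, g(x)/||w|| the signed distance."""
    g = w @ x + w0
    label = "omega_1" if g > 0 else "omega_2"   # decide omega_1 if g(x) > 0
    r = g / np.linalg.norm(w)                   # signed distance of x from the hyperplane
    return g, label, r

w = np.array([3.0, 4.0])    # hypothetical normal vector, ||w|| = 5
w0 = -5.0
print(classify_with_distance(np.array([2.0, 2.0]), w, w0))   # (9.0, 'omega_1', 1.8)
```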
Linear Discriminant Functions: multi-category case
• There are several ways to devise multi-category classifiers using linear discriminant functions:
(1) One against the rest
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d)
(2) One against another (i.e., c(c-1)/2 pairs of classes)
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d)
• To avoid the problem of ambiguous regions:
– Define c linear discriminant functions gi(x), i = 1, ..., c.
– Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
• The resulting classifier is called a linear machine
(see Chapter 2)
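A minimal sketch of a linear machine's decision rule; the weight matrix and biases are hypothetical values for c = 3 classes.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) = w_i^t x + w_{i0} is largest."""
    scores = W @ x + w0          # one discriminant value per class
    return int(np.argmax(scores))

W = np.array([[ 1.0,  0.0],      # w_1
              [ 0.0,  1.0],      # w_2
              [-1.0, -1.0]])     # w_3
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))   # 0, i.e. decide omega_1
```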
Linear Discriminant Functions: multi-category case (cont’d)
• A linear machine divides the feature space into c convex decision regions.
– If x is in region Ri, then gi(x) is the largest.
• Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.
Linear Discriminant Functions: multi-category case (cont’d)
• The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij given by:
$$g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad \text{or} \quad g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0 \quad \text{or} \quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$$
• (wi - wj) is normal to Hij, and the signed distance from x to Hij is:
$$r = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$$
Higher Order Discriminant Functions
• Can produce more complicated decision boundaries than linear discriminant functions.
Generalized discriminants
• Defined through special functions yi(x) called φ functions:
$$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
– α is a $\hat{d}$-dimensional weight vector.
– The φ functions yi(x) map a point from the d-dimensional x-space to a point in the $\hat{d}$-dimensional y-space (usually $\hat{d} \gg d$):
$$\mathbf{y} = \begin{pmatrix} y_1(\mathbf{x}) \\ y_2(\mathbf{x}) \\ \vdots \\ y_{\hat{d}}(\mathbf{x}) \end{pmatrix}$$
Generalized discriminants (cont’d)
• The resulting discriminant function is linear in y-space.
• Separates points in the transformed space by a hyperplane passing through the origin.
$$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
Example
• φ functions: y1(x) = 1, y2(x) = x, y3(x) = x² (so d = 1, $\hat{d}$ = 3):
$$g(x) = -1 + x + 2x^2, \qquad \boldsymbol{\alpha} = \begin{pmatrix} -1 \\ 1 \\ 2 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}, \qquad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$$
• g(x) > 0 if x < -1 or x > 0.5
• The corresponding decision regions R1, R2 in the x-space are not simply connected!
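A minimal sketch of this example; the function names are assumptions, the coefficients are those of the discriminant above.

```python
import numpy as np

def phi(x):
    """Map scalar x from 1-D x-space to 3-D y-space: y = (1, x, x^2)^t."""
    return np.array([1.0, x, x**2])

alpha = np.array([-1.0, 1.0, 2.0])   # weight vector of g(x) = -1 + x + 2x^2

def g(x):
    """g(x) = alpha^t y."""
    return alpha @ phi(x)

for x in (-2.0, 0.0, 1.0):
    print(x, g(x), "omega_1" if g(x) > 0 else "omega_2")
# g(x) > 0 (decide omega_1) exactly when x < -1 or x > 0.5
```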
Example (cont’d)
• The mapping y = (1, x, x²)ᵗ takes the line in x-space onto a parabola in y-space.
• The plane αᵗy = 0 divides the y-space into two decision regions $\hat{R}_1$ and $\hat{R}_2$.
Learning: two-category, linearly separable case
• Given a linear discriminant function
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$$
the goal is to "learn" the parameters w and w0 from a set of n labeled samples xi, where each xi has a class label ω1 or ω2.
Augmented feature/parameter space
• Simplify notation by augmenting the feature and parameter vectors (with x0 = 1):
$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \boldsymbol{\alpha}^t \mathbf{y}$$
$$\mathbf{w} = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} \rightarrow \boldsymbol{\alpha} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \rightarrow \mathbf{y} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_d \end{pmatrix}$$
• dimensionality: d → (d+1)
Classification in augmented space
• Discriminant: g(x) = αᵗy
• Classification rule: if αᵗyi > 0, assign yi to ω1; else if αᵗyi < 0, assign yi to ω2.
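A minimal sketch of classification in the augmented space; the names and weight values are assumptions for the example.

```python
import numpy as np

def augment(x):
    """Augmented feature vector y = (1, x_1, ..., x_d)^t, i.e. x_0 = 1."""
    return np.concatenate(([1.0], x))

w = np.array([2.0, -1.0])            # original weight vector
w0 = 0.5
alpha = np.concatenate(([w0], w))    # augmented weight vector alpha = (w_0, w_1, ..., w_d)^t

x = np.array([1.0, 3.0])
y = augment(x)
print(alpha @ y)                                    # equals w @ x + w0 = -0.5
print("omega_1" if alpha @ y > 0 else "omega_2")    # decide omega_2
```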
Learning in augmented space: two-category, linearly separable case
• Given a linear discriminant function g(x) = αᵗy, the goal is to learn the weights (parameters) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.
Learning in augmented space: effect of training examples
• Every training sample yi places a constraint on the weight vector α.
– αᵗy = 0 defines a hyperplane in parameter space having y as a normal vector.
– Given n examples, the solution α must lie in the intersection of n half-spaces.
[Figure: parameter space (α1, α2) showing the constraint hyperplanes αᵗyi = 0 and αᵗyj = 0 for a sample from each class]
Learning in augmented space: effect of training examples (cont’d)
• Visualize solution in the parameter or feature space.
[Figure: the solution region shown both in feature space (y1, y2) and in parameter space (α1, α2)]
Uniqueness of Solution
• Solution vector α is usually not unique; we can impose
certain constraints to enforce uniqueness:
“Find unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane”
Iterative Optimization
• Define an error function J(α) (i.e., misclassifications) that is minimized if α is a solution vector.
• Minimize J(α) iteratively:
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k$$
where pk is the search direction and η(k) is the learning rate.
• How should we define pk?
Choosing pk using Gradient Descent
$$\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))$$
so the update rule becomes α(k+1) = α(k) - η(k) ∇J(α(k)), where η(k) is the learning rate.
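A minimal gradient descent sketch under these definitions; the criterion J and its gradient are placeholders chosen for illustration.

```python
import numpy as np

def gradient_descent(grad_J, alpha0, eta=0.1, n_iters=100):
    """Iterate alpha(k+1) = alpha(k) - eta * grad J(alpha(k)) with a constant learning rate."""
    alpha = alpha0.astype(float).copy()
    for _ in range(n_iters):
        alpha -= eta * grad_J(alpha)
    return alpha

# placeholder quadratic criterion J(alpha) = ||alpha - alpha_star||^2, minimized at alpha_star
alpha_star = np.array([1.0, 2.0])
grad_J = lambda a: 2.0 * (a - alpha_star)
print(gradient_descent(grad_J, alpha0=np.zeros(2)))   # approaches [1, 2]
```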
Gradient Descent (cont’d)
[Figure: contours of J(α) in the (α1, α2) solution space; gradient descent moves from α(0) through α(k) toward the minimum]
Gradient Descent (cont’d)
• What is the effect of the learning rate?
– If η is too small, descent is slow but converges to the solution; if η is too large, it is fast but overshoots the solution.
[Figure: gradient descent trajectories on contours of J(α) for η = 0.37 and η = 0.39]
Gradient Descent (cont’d)
• How do we choose the learning rate η(k)?
• Using a second-order Taylor series approximation of J(α) around α(k), the optimum learning rate is
$$\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H} \nabla J}$$
where H is the Hessian matrix (2nd derivatives) of J(α) evaluated at α(k).
• If J(α) is quadratic, then H is constant, which implies that the learning rate is constant.
Choosing pk using Newton’s Method
$$\mathbf{p}_k = -\mathbf{H}^{-1} \nabla J(\boldsymbol{\alpha}(k))$$
• Requires inverting H at each step.
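A minimal Newton's-method sketch; the quadratic criterion, its gradient, and Hessian are placeholders used only to show the update.

```python
import numpy as np

def newton_step(alpha, grad_J, hess_J):
    """One Newton update: alpha <- alpha - H^{-1} grad J(alpha) (solve instead of explicit inverse)."""
    return alpha - np.linalg.solve(hess_J(alpha), grad_J(alpha))

# quadratic criterion J(alpha) = (alpha - alpha_star)^t A (alpha - alpha_star), minimum at alpha_star
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
alpha_star = np.array([1.0, 2.0])
grad_J = lambda a: 2.0 * A @ (a - alpha_star)
hess_J = lambda a: 2.0 * A
print(newton_step(np.zeros(2), grad_J, hess_J))   # [1. 2.]: reaches the minimum in one step
```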
Newton’s method (cont’d)
• If J(α) is quadratic, Newton's method converges in one step!
[Figure: contours of J(α); Newton's method jumps directly to the minimum]
Gradient descent vs Newton’s method
“Normalized” Problem
• If yi belongs to ω2, replace yi by -yi.
• Find α such that: αᵗyi > 0 for all i.
• Instead of seeking a hyperplane that separates patterns from different categories, we now seek a hyperplane that puts the normalized patterns on the same (positive) side.
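A minimal sketch of this normalization; the ±1 label encoding is an assumption for the example.

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the augmented samples of omega_2 so a solution must satisfy alpha^t y_i > 0 for all i."""
    signs = np.where(labels == 1, 1.0, -1.0)   # +1 for omega_1, -1 for omega_2
    return Y * signs[:, None]

Y = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -0.5]])   # augmented samples
labels = np.array([1, -1])                            # omega_1, omega_2
print(normalize_samples(Y, labels))                   # second row is negated
```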
Perceptron rule
• Use gradient descent with the Perceptron criterion:
$$J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})$$
where Y(α) is the set of samples misclassified by α.
• If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.
• Find α such that: αᵗyi > 0.
Perceptron rule (cont’d)
• The gradient of Jp(α) is:
• The perceptron update rule is obtained using gradient descent:
$$\nabla J_p = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})$$
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k) \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} \mathbf{y}$$
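A minimal batch Perceptron sketch implementing this update on normalized, augmented samples; the toy data and names are assumptions.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iters=1000):
    """alpha(k+1) = alpha(k) + eta * sum of currently misclassified (normalized) samples."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        misclassified = Y[Y @ alpha <= 0]    # samples with alpha^t y <= 0
        if len(misclassified) == 0:          # every sample on the positive side: solution found
            break
        alpha += eta * misclassified.sum(axis=0)
    return alpha

# toy linearly separable, normalized, augmented samples y = (1, x1, x2)^t
Y = np.array([[ 1.0, 2.0, 1.0],    # omega_1 sample
              [ 1.0, 1.5, 2.0],    # omega_1 sample
              [-1.0, 1.0, 0.5],    # omega_2 sample, already negated
              [-1.0, 0.5, 1.0]])   # omega_2 sample, already negated
print(batch_perceptron(Y))         # a solution vector with alpha^t y_i > 0 for all i
```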
Perceptron rule (cont’d)
[Algorithm: Batch Perceptron; at each iteration, η(k) times the sum of the misclassified examples Y(α) is added to α]
Perceptron rule (cont’d)
• Move the hyperplane so that training samples are on its positive side.
Perceptron rule (cont’d)
Perceptron Convergence Theorem: If the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.
• Fixed-increment, single-sample version (η(k) = 1, one example at a time):
$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \mathbf{y}^k$$
where yᵏ is an example misclassified by α(k).
Perceptron rule (cont’d)
• Example: order in which misclassified examples are presented: y2, y3, y1, y3.
• The "batch" algorithm leads to a smoother trajectory in solution space.
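A minimal sketch of the single-sample, fixed-increment variant (η(k) = 1), cycling through the normalized samples one at a time; the data layout matches the batch sketch above.

```python
import numpy as np

def single_sample_perceptron(Y, max_epochs=100):
    """Fixed-increment rule: whenever alpha^t y <= 0, update alpha <- alpha + y."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        updated = False
        for y in Y:                   # consider one example at a time
            if alpha @ y <= 0:        # y is misclassified by the current alpha
                alpha = alpha + y
                updated = True
        if not updated:               # a full pass with no updates: solution found
            break
    return alpha
```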