Linear Discrimination Functions
Corso di Apprendimento Automatico (Machine Learning course), Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 30, 2008
Outline
Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression
Linear Discriminant Functions I
A linear discriminant function can be written as

    g(x) = w_1 x_1 + ⋯ + w_d x_d + w_0 = w^t x + w_0

where w is the weight vector and w_0 the bias (or threshold) weight.

A two-class linear classifier implements the following decision rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
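A minimal sketch of this decision rule in Python (the function names and use of NumPy are illustrative, not from the slides):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return float(np.dot(w, x) + w0)

def decide(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    return "omega_1" if g(x, w, w0) > 0 else "omega_2"
```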
Linear Discriminant Functions II
The equation g(x) = 0 defines the decision surface thatseparates points assigned to ω1 from points assigned to ω2.
When g(x) is linear, this decision surface is a hyperplane (H).
Linear Discriminant Functions III

H divides the feature space into two half-spaces: R1 for ω1 and R2 for ω2.
If x_1 and x_2 are both on the decision surface, then

    w^t x_1 + w_0 = w^t x_2 + w_0  ⇒  w^t (x_1 − x_2) = 0

so w is normal to any vector lying in the hyperplane.
Linear Discriminant Functions IV
If we express x as

    x = x_p + r · w/‖w‖

where x_p is the normal projection of x onto H, and r is the algebraic distance from x to the hyperplane,
then since g(x_p) = 0, we have g(x) = w^t x + w_0 = r‖w‖, i.e.

    r = g(x)/‖w‖

r is a signed distance: r > 0 if x falls in R1, r < 0 if x falls in R2.

The distance from the origin to the hyperplane is w_0/‖w‖.
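The signed-distance formula above can be sketched directly (a hypothetical helper, assuming NumPy):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0:
    r > 0 on the omega_1 side, r < 0 on the omega_2 side."""
    return float((np.dot(w, x) + w0) / np.linalg.norm(w))
```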
Multicategory Case I
Two approaches extend the LDF approach to the multicategory case:

ωi / not ωi: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ωi from those not assigned to ωi.
ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes.

Both approaches can leave regions in which the classification is undefined.
Pairwise Classification
Idea: build a model for each pair of classes, using only training data from those two classes.
Problem: one has to solve k(k − 1)/2 classification problems for a k-class problem.
This turns out not to be a problem in many cases, because the training sets become small:

Assume the data are evenly distributed, i.e. 2n/k instances per learning problem for n instances in total.
Suppose the learning algorithm is linear in n.
Then the runtime of pairwise classification is proportional to

    k(k − 1)/2 × 2n/k = (k − 1)n
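The runtime arithmetic can be checked with a tiny sketch (illustrative only):

```python
def pairwise_cost(k, n):
    """Total learner work for pairwise classification, assuming a learner
    linear in the number of instances: k(k-1)/2 problems of 2n/k instances."""
    num_problems = k * (k - 1) // 2
    instances_per_problem = 2 * n / k
    return num_problems * instances_per_problem  # equals (k - 1) * n
```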
Linear Machine I
Define c linear discriminant functions:

    g_i(x) = w_i^t x + w_{i0},   i = 1, …, c

Linear machine classifier: x ∈ ωi if g_i(x) > g_j(x) for all j ≠ i. In case of equal scores, the classification is undefined.

A linear machine divides the feature space into c decision regions, with g_i(x) the largest discriminant if x is in R_i.

If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_ij defined by:

    g_i(x) = g_j(x), i.e. (w_i − w_j)^t x + (w_{i0} − w_{j0}) = 0
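A linear machine reduces to an argmax over the c discriminants; a minimal sketch (rows of W are the w_i, assuming NumPy):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest discriminant
    g_i(x) = W[i] . x + w0[i]; ties are left to argmax's convention."""
    scores = W @ x + w0
    return int(np.argmax(scores))
```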
Linear Machine II

It follows that w_i − w_j is normal to H_ij. The signed distance from x to H_ij is:

    (g_i(x) − g_j(x)) / ‖w_i − w_j‖

There are c(c − 1)/2 pairs of (convex) regions. Not all regions are contiguous, and the total number of segments in the surfaces is often less than c(c − 1)/2.

[Figure: decision regions for 3- and 5-class problems]
Generalized LDF I
The LDF can be written

    g(x) = w_0 + ∑_{i=1}^d w_i x_i

By adding d(d + 1)/2 terms involving the products of pairs of components of x, we obtain the quadratic discriminant function:

    g(x) = w_0 + ∑_{i=1}^d w_i x_i + ∑_{i=1}^d ∑_{j=1}^d w_{ij} x_i x_j

The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface. By continuing to add terms such as w_{ijk} x_i x_j x_k we obtain the class of polynomial discriminant functions.
Generalized LDF II

The generalized LDF is defined as

    g(x) = ∑_{i=1}^{d̂} a_i y_i(x) = a^t y

where a is a d̂-dimensional weight vector and the y_i(x) are arbitrary functions of x.

The resulting discriminant function is not linear in x, but it is linear in y. The functions y_i(x) map points in the d-dimensional x-space to points in the d̂-dimensional y-space.
Generalized LDF III

Example: let the quadratic discriminant function be g(x) = a_1 + a_2 x + a_3 x². The 3-dimensional vector is then y = (1, x, x²)^t.
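The example mapping y = (1, x, x²) can be sketched as (names are illustrative):

```python
import numpy as np

def quadratic_features(x):
    """Map a scalar x to y = (1, x, x^2), so g is linear in y."""
    return np.array([1.0, x, x ** 2])

def g(x, a):
    """Generalized LDF g(x) = a^t y(x)."""
    return float(np.dot(a, quadratic_features(x)))
```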
2-class Linearly-Separable Case I
g(x) = ∑_{i=0}^d w_i x_i = a^t y

where x_0 = 1 and
y^t = [1 x] = [1 x_1 ⋯ x_d] is an augmented feature vector and
a^t = [w_0 w] = [w_0 w_1 ⋯ w_d] is an augmented weight vector.

The hyperplane decision surface Ĥ defined by a^t y = 0 passes through the origin in y-space.

The distance from any point y to Ĥ is given by |a^t y|/‖a‖ = |g(x)|/‖a‖.

Because ‖a‖ = √(w_0² + ‖w‖²) ≥ ‖w‖, this distance is at most the distance from x to H.
2-class Linearly-Separable Case II

Problem: find a = [w_0 w]^t.

Suppose that we have a set of n examples {y_1, …, y_n}, each labeled ω1 or ω2.

Look for a weight vector a that classifies all the examples correctly:

    a^t y_i > 0 if y_i is labeled ω1, and a^t y_i < 0 if y_i is labeled ω2

If such an a exists, the examples are linearly separable.
2-class Linearly-Separable Case III

Solutions

Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector a such that a^t y_i > 0 for all the examples.
Such an a is a.k.a. a separating vector or solution vector.
Each example y_i places a constraint on the possible location of a solution vector: a^t y_i = 0 defines a hyperplane through the origin having y_i as a normal vector, and the solution vector (if it exists) must lie on the positive side of every such hyperplane.
Solution region = intersection of the n half-spaces.
2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (a solution that is more likely to classify new examples correctly): seek a unit-length weight vector that maximizes the minimum distance from the examples to the separating plane.
2-class Linearly-Separable Case V
Alternatively, seek the minimum-length weight vector satisfying a^t y_i ≥ b for all i, with margin b > 0. The solution region shrinks by margins b/‖y_i‖.
Gradient Descent I
Define a criterion function J(a) that is minimized if a is a solution vector (a^t y_i ≥ 0, ∀i = 1, …, n).

Start with some arbitrary vector a(1).
Compute the gradient vector ∇J(a(1)).
The next value a(2) is obtained by moving some distance from a(1) in the direction of steepest descent, i.e. along the negative of the gradient.

In general, a(k + 1) is obtained from a(k) using

    a(k + 1) ← a(k) − η(k) ∇J(a(k))

where η(k) is the learning rate.
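The generic iteration can be sketched as follows (constant η for simplicity; names are illustrative):

```python
import numpy as np

def gradient_descent(grad, a0, eta=0.1, steps=100):
    """Iterate a(k+1) = a(k) - eta * grad_J(a(k)) for a fixed number of steps."""
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a = a - eta * grad(a)
    return a
```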
Gradient Descent & Delta Rule I
To understand, consider a simpler linear machine (a.k.a. unit), where

    o = w_0 + w_1 x_1 + ⋯ + w_n x_n

Let's learn the w_i's that minimize the squared error

    E[w] ≡ ½ ∑_{d∈D} (t_d − o_d)²

where D is the set of training examples ⟨x, t⟩ and t is the target output value.
Gradient Descent & Delta Rule II

Gradient:

    ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ⋯, ∂E/∂w_n]

Training rule:

    Δw = −η ∇E[w]

i.e.,

    Δw_i = −η ∂E/∂w_i

Note that η is constant.
Gradient Descent & Delta Rule III
∂E/∂w_i = ∂/∂w_i ½ ∑_d (t_d − o_d)²
        = ½ ∑_d ∂/∂w_i (t_d − o_d)²
        = ½ ∑_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = ∑_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = ∑_d (t_d − o_d)(−x_{id})
Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT(D, η)
D: training set; η: learning rate (e.g. 0.5)

Initialize each w_i to some small random value
until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨x, t⟩ ∈ D do
        Input the instance x to the unit and compute the output o
        for each w_i do
            Δw_i ← Δw_i + η (t − o) x_i
    for each weight w_i do
        w_i ← w_i + Δw_i
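A compact vectorized sketch of this batch procedure (the delta rule), assuming NumPy and a leading all-ones column in X for w_0:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x.
    Each epoch accumulates eta * (t - o) * x over all of D, then updates w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                     # outputs for every example
        w = w + eta * X.T @ (t - o)   # summed delta-rule update
    return w
```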
Incremental (Stochastic) GRADIENT DESCENT I
An approximation of the standard GRADIENT-DESCENT.

Batch GRADIENT-DESCENT: do until satisfied
    1. Compute the gradient ∇E_D[w]
    2. w ← w − η ∇E_D[w]

Incremental GRADIENT-DESCENT: do until satisfied
    for each training example d in D
        1. Compute the gradient ∇E_d[w]
        2. w ← w − η ∇E_d[w]
Incremental (Stochastic) GRADIENT DESCENT II
E_D[w] ≡ ½ ∑_{d∈D} (t_d − o_d)²

E_d[w] ≡ ½ (t_d − o_d)²

Training rule (delta rule):

    Δw_i ← η (t − o) x_i

Similar to the perceptron training rule, yet unthresholded; convergence is only asymptotically guaranteed, but linear separability is no longer needed!
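The incremental variant updates after every example; a minimal sketch under the same assumptions as before (X carries a leading 1 for the bias):

```python
import numpy as np

def train_incremental(X, t, eta=0.05, epochs=1000):
    """Incremental (stochastic) delta rule: w += eta * (t_d - o_d) * x_d
    after each training example, instead of once per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = float(np.dot(w, x_d))
            w = w + eta * (t_d - o_d) * x_d
    return w
```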
Standard vs. Stochastic GRADIENT-DESCENT
Incremental GD can approximate batch GD arbitrarily closely if η is made small enough.

In batch GD the error is summed over all examples before updating; in stochastic GD the weights are updated upon each example.
Batch GD is more costly per update step but can employ a larger η.
Stochastic GD may avoid falling into local minima, because it uses ∇E_d instead of ∇E_D.
Newton’s Algorithm
J(a) ≃ J(a(k)) + ∇J^t (a − a(k)) + ½ (a − a(k))^t H (a − a(k))

where H = [∂²J/∂a_i ∂a_j] is the Hessian matrix.

Choose a(k + 1) to minimize this second-order expansion:

    a(k + 1) ← a(k) − H⁻¹ ∇J(a(k))

Greater improvement per step than gradient descent, but not applicable when H is singular. Time complexity: O(d³) per step.
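One Newton update can be sketched as follows, solving the linear system rather than forming H⁻¹ explicitly (names are illustrative):

```python
import numpy as np

def newton_step(a, grad, hess):
    """One Newton update a <- a - H^{-1} grad_J(a), via np.linalg.solve."""
    return a - np.linalg.solve(hess(a), grad(a))
```

For a quadratic criterion a single step lands exactly at the minimum, which is what makes the greater per-step improvement plausible.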
Perceptron I
Assumption: the data are linearly separable.

Hyperplane: ∑_{i=0}^d w_i x_i = 0, assuming that there is a constant attribute x_0 = 1 (bias).

Algorithm for learning the separating hyperplane: the perceptron learning rule.

Classifier: if ∑_{i=0}^d w_i x_i > 0 then predict ω1 (or +1), otherwise predict ω2 (or −1).
Perceptron II
Thresholded output:

    o(x_1, …, x_d) = +1 if w_0 + w_1 x_1 + ⋯ + w_d x_d > 0, −1 otherwise

Simpler vector notation:

    o(x) = sgn(w^t x) = +1 if w^t x > 0, −1 otherwise

Space of the hypotheses: {w | w ∈ R^{d+1}}
Decision Surface of a Perceptron
Can represent some useful functions. What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions are not representable, e.g. those that are not linearly separable (XOR). Therefore, we'll want networks of these units...
Perceptron Training Rule I
Perceptron criterion function:

    J(a) = ∑_{y∈Y(a)} (−a^t y)

where Y(a) is the set of examples misclassified by a.

If no samples are misclassified, Y(a) is empty and J(a) = 0 (i.e. a is a solution vector).
J(a) ≥ 0, since a^t y_i ≤ 0 if y_i is misclassified.
Geometrically, J(a) is proportional to the sum of the distances from the misclassified samples to the decision boundary.

Since ∇J = ∑_{y∈Y(a)} (−y), the update rule becomes

    a(k + 1) ← a(k) + η(k) ∑_{y∈Y_k(a)} y

where Y_k(a) is the set of examples misclassified by a(k).
Perceptron Training Rule II
w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i

with:
t = c(x), the target value
o, the perceptron output
η, a small constant (e.g. 0.1), the learning rate
Perceptron Training Rule III

Perceptron Learning Rule

Set all weights w_i to zero
do
    for each instance x in the training data
        if x is classified incorrectly by the perceptron
            if x belongs to ω1, add it to w
            else subtract it from w
until all instances in the training data are classified correctly
return w

One can prove that this converges if the training data are linearly separable and η is sufficiently small.
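The rule above, with labels encoded as t ∈ {+1, −1} so that "add or subtract x" becomes w ← w + t·x, can be sketched as:

```python
import numpy as np

def perceptron(X, t, max_epochs=100):
    """Perceptron learning rule on augmented vectors (x0 = 1), eta = 1.
    t holds +1 for omega_1 and -1 for omega_2."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # prepend the bias input
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t_i in zip(Xa, t):
            if t_i * np.dot(w, x) <= 0:  # misclassified (or on the boundary)
                w = w + t_i * x          # add it to / subtract it from w
                errors += 1
        if errors == 0:                  # all instances classified correctly
            break
    return w
```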
Perceptron Training Rule IV
[Figure: example run with η = 1; sequence of misclassified samples: y_2, y_3, y_1, y_3]
Perceptron Training Rule V
Why does this work? Consider the situation where an instance x pertaining to the first class has been added to the weight vector; the output becomes:

    (w_0 + x_0) x_0 + (w_1 + x_1) x_1 + (w_2 + x_2) x_2 + ⋯ + (w_d + x_d) x_d

This means the output for x has increased by:

    x_0 x_0 + x_1 x_1 + x_2 x_2 + ⋯ + x_d x_d

which is always positive; thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).
Fixed-Increment Single-Sample Perceptron
Perceptron({y^(k)}_{k=1..n}): weight vector
input: {y^(k)}_{k=1..n} training examples
begin
    initialize a, k ← 0
    do
        k ← (k + 1) mod n
        if y^(k) is misclassified by the model based on a
        then a ← a + y^(k)
    until all examples are properly classified
    return a
end
Comments
The perceptron algorithm adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate η can be chosen arbitrarily; it only impacts the norm of the final w (and the corresponding magnitude of w_0).
The final weight vector w is a linear combination of training points.
Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane.
Assumes binary data (i.e. attribute values are either zero or one).
Difference: multiplicative updates instead of additive updates; weights are multiplied by a user-specified parameter α > 1 (or its inverse).
Another difference: a user-specified threshold parameter θ.
Predict the first class if w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_k x_k > θ.
The Algorithm I
WINNOW

while some instances are misclassified
    for each instance x in the training data
        classify x using the current weights
        if the predicted class is incorrect
            if x belongs to the first class
                for each x_i that is 1, multiply w_i by α
                (if x_i is 0, leave w_i unchanged)
            otherwise
                for each x_i that is 1, divide w_i by α
                (if x_i is 0, leave w_i unchanged)
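A minimal WINNOW sketch for binary attributes (for simplicity this version omits the bias weight w_0 and defaults to θ = d/2; all names are illustrative):

```python
import numpy as np

def winnow(X, t, alpha=2.0, theta=None, max_epochs=100):
    """Multiplicative updates: on a mistake, weights of attributes that are 1
    are multiplied (first class) or divided (other class) by alpha."""
    n, d = X.shape
    theta = d / 2.0 if theta is None else theta
    w = np.ones(d)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t_i in zip(X, t):
            pred = 1 if np.dot(w, x) > theta else 0
            if pred != t_i:
                mistakes += 1
                # alpha ** x is alpha where x_i = 1 and 1 where x_i = 0
                w = w * alpha ** x if t_i == 1 else w / alpha ** x
        if mistakes == 0:
            break
    return w
```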
The Algorithm II
WINNOW is very effective at homing in on the relevant features (it is attribute efficient).
Can also be used in an on-line setting in which newinstances arrive continuously (like the perceptronalgorithm)
Balanced WINNOW I
WINNOW doesn't allow negative weights, and this can be a drawback in some applications.
BALANCED WINNOW maintains two weight vectors, one for each class: w⁺ and w⁻.
An instance is classified as belonging to the first class (of the two) if:

    (w⁺_0 − w⁻_0) + (w⁺_1 − w⁻_1) x_1 + (w⁺_2 − w⁻_2) x_2 + ⋯ + (w⁺_k − w⁻_k) x_k > θ
Balanced WINNOW II
BALANCED WINNOW

while some instances are misclassified
    for each instance x in the training data
        classify x using the current weights
        if the predicted class is incorrect
            if x belongs to the first class
                for each x_i that is 1, multiply w⁺_i by α and divide w⁻_i by α
                (if x_i is 0, leave w⁺_i and w⁻_i unchanged)
            otherwise
                for each x_i that is 1, multiply w⁻_i by α and divide w⁺_i by α
                (if x_i is 0, leave w⁺_i and w⁻_i unchanged)
Nonseparable Case
The perceptron is an error-correcting procedure: it converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training samples should be used; sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set, so the corrections may never cease.
Learning rate
If we choose η(k) → 0 as k → ∞, then performance can be acceptable on nonseparable problems while preserving the ability to find a solution on separable problems. The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set nonseparable.
Too fast: the procedure may converge prematurely with sub-optimal results.

η(k) can be considered a function of recent performance, decreasing as performance improves, e.g. η(k) ← η/k.
Minimum Squared Error Approach I
Minimum Squared Error (MSE): trades the ability to obtain a separating vector for good performance on both separable and nonseparable problems.
Previously, we sought a weight vector a making all of the inner products a^t y_i ≥ 0.
In the MSE procedure, one instead tries to make a^t y_i = b_i, where the b_i are some arbitrarily specified positive constants.

Using matrix notation: Y a = b.
If Y were nonsingular, then a = Y⁻¹ b; unfortunately Y is not a square matrix, usually having more rows than columns.
When there are more equations than unknowns, a is overdetermined, and ordinarily no exact solution exists.
Minimum Squared Error Approach II
We can seek a weight vector a that minimizes some function of the error vector e = Y a − b. Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

    J(a) = ‖Y a − b‖² = ∑_{i=1}^n (a^t y_i − b_i)²

whose gradient is

    ∇J = 2 ∑_{i=1}^n (a^t y_i − b_i) y_i = 2 Y^t (Y a − b)

Setting the gradient equal to zero, the following necessary condition holds: Y^t Y a = Y^t b.
Minimum Squared Error Approach III
Y^t Y is a square matrix which is often nonsingular. In that case, solving for a:

    a = (Y^t Y)⁻¹ Y^t b = Y⁺ b

where Y⁺ = (Y^t Y)⁻¹ Y^t is the pseudoinverse of Y.

Y⁺ can also be written as lim_{ε→0} (Y^t Y + εI)⁻¹ Y^t, and it can be shown that this limit always exists; hence

    a = Y⁺ b

is the MSE solution to the problem Y a = b.
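With NumPy, the MSE solution is a one-liner via the pseudoinverse (a sketch; np.linalg.pinv uses an SVD rather than the normal-equations formula, but computes the same Y⁺):

```python
import numpy as np

def mse_weights(Y, b):
    """MSE solution a = Y^+ b of the overdetermined system Y a = b."""
    return np.linalg.pinv(Y) @ b
```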
Widrow-Hoff Procedure a.k.a. LMS
The criterion function J(a) = ‖Y a − b‖² can also be minimized by a gradient descent procedure.

Advantages:
avoids the problems that arise when Y^t Y is singular;
avoids the need for working with large matrices.

Since ∇J = 2 Y^t (Y a − b), a simple update rule would be

    a(1) arbitrary
    a(k + 1) = a(k) − η(k) Y^t (Y a(k) − b)

or, if we consider the samples sequentially,

    a(1) arbitrary
    a(k + 1) = a(k) + η(k) [b_k − a(k)^t y^(k)] y^(k)
Widrow-Hoff or LMS Agorithm
LMS({y_i}_{i=1..n})
input {y_i}_{i=1..n}: training examples
begin
    initialize a, b, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        a ← a + η(k) (b_k − a^t y^(k)) y^(k)
    until ‖η(k) (b_k − a^t y^(k)) y^(k)‖ < θ
    return a
end
Linear Regression
Standard technique for numeric prediction. The outcome is a linear combination of the attributes:

    x = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d

The weights are calculated from the training data by standard math (least-squares) algorithms.

Predicted value for the first training instance x^(1):

    w_0 + w_1 x_1^(1) + w_2 x_2^(1) + ⋯ + w_d x_d^(1) = ∑_{j=0}^d w_j x_j^(1)

assuming extended vectors with x_0 = 1.
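A sketch of the weight calculation using ordinary least squares (np.linalg.lstsq stands in for the "standard math algorithms"):

```python
import numpy as np

def linear_regression(X, t):
    """Least-squares weights for t ~ w0 + w1 x1 + ... + wd xd."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # extended vectors, x0 = 1
    w, *_ = np.linalg.lstsq(Xa, t, rcond=None)
    return w
```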
Probabilistic Classification
Any regression technique can be used for classification.
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't.
Prediction: predict the class corresponding to the model with the largest output value (membership value).

Problem: membership values are not in the [0, 1] range, so they aren't proper probability estimates.
Logistic Regression I
Logit transformation: builds a linear model for a transformed target variable.

Assume we have two classes. Logistic regression replaces the target

    Pr(1 | x)

by the target

    log( Pr(1 | x) / (1 − Pr(1 | x)) )

This transformation maps [0, 1] to (−∞, +∞).
Example: Logistic Regression Model
Resulting model: Pr(1 | x) = 1 / (1 + e^{−(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d)})

[Figure: model with w_0 = 0.5 and w_1 = 1]

Parameters are induced from data using maximum likelihood.
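The resulting model is easy to evaluate; e.g. with the slide's w_0 = 0.5 and w_1 = 1 the output is exactly 0.5 where the linear part is zero (x_1 = −0.5):

```python
import numpy as np

def pr_one(x, w0, w):
    """Logistic model Pr(1 | x) = 1 / (1 + exp(-(w0 + w . x)))."""
    return float(1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x)))))
```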
Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters. One can use logarithms of probabilities and maximize the log-likelihood of the model:

    ∑_{i=1}^n (1 − x^(i)) log(1 − Pr(1 | x^(i))) + x^(i) log Pr(1 | x^(i))

where the class values x^(i) are either 0 or 1.

The weights w_i need to be chosen to maximize the log-likelihood; a relatively simple method for doing so is iteratively re-weighted least squares.
Summary
Perceptron training rule: guaranteed to succeed if
the training examples are linearly separable, and
the learning rate η is sufficiently small.

Linear unit training rule: uses gradient descent;
guaranteed to converge to the hypothesis with minimum squared error,
given a sufficiently small learning rate η,
even when the training data contain noise,
even when the training data are not separable by H.
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann