Linear Discrimination Functions
Corso di Apprendimento Automatico (Machine Learning course), Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 30, 2008
Outline
Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression
Linear Discriminant Functions I
A linear discriminant function can be written as

    g(x) = w_1 x_1 + ⋯ + w_d x_d + w_0 = w^t x + w_0

where w is the weight vector and w_0 the bias (or threshold) weight.

A two-class linear classifier implements the following decision rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
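A minimal sketch of this decision rule in Python (the function names and use of NumPy are illustrative, not from the slides):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return float(np.dot(w, x) + w0)

def decide(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    return "omega_1" if g(x, w, w0) > 0 else "omega_2"
```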
Linear Discriminant Functions II
The equation g(x) = 0 defines the decision surface thatseparates points assigned to ω1 from points assigned to ω2.
When g(x) is linear, this decision surface is a hyperplane (H).
Linear Discriminant Functions III

H divides the feature space into two half-spaces: R1 for ω1 and R2 for ω2.
If x_1 and x_2 are both on the decision surface, then

    w^t x_1 + w_0 = w^t x_2 + w_0  ⇒  w^t (x_1 − x_2) = 0

so w is normal to any vector lying in the hyperplane.
Linear Discriminant Functions IV
If we express x as

    x = x_p + r · w/‖w‖

where x_p is the normal projection of x onto H, and r is the algebraic distance from x to the hyperplane,
then since g(x_p) = 0, we have g(x) = w^t x + w_0 = r‖w‖, i.e.

    r = g(x)/‖w‖

r is a signed distance: r > 0 if x falls in R1, r < 0 if x falls in R2.

The distance from the origin to the hyperplane is w_0/‖w‖.
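The signed-distance formula above can be sketched directly (a hypothetical helper, assuming NumPy):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0:
    r > 0 on the omega_1 side, r < 0 on the omega_2 side."""
    return float((np.dot(w, x) + w0) / np.linalg.norm(w))
```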
Multicategory Case I
Two approaches extend the LDF approach to the multicategory case:

ωi / not ωi: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ωi from those not assigned to ωi.
ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes.

Both approaches can leave regions in which the classification is undefined.
Pairwise Classification
Idea: build a model for each pair of classes, using only training data from those two classes.
Problem: one has to solve k(k − 1)/2 classification problems for a k-class problem.
This turns out not to be a problem in many cases, because the training sets become small:

Assume the data are evenly distributed, i.e. 2n/k instances per learning problem for n instances in total.
Suppose the learning algorithm is linear in n.
Then the runtime of pairwise classification is proportional to

    k(k − 1)/2 × 2n/k = (k − 1)n
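The runtime arithmetic can be checked with a tiny sketch (illustrative only):

```python
def pairwise_cost(k, n):
    """Total learner work for pairwise classification, assuming a learner
    linear in the number of instances: k(k-1)/2 problems of 2n/k instances."""
    num_problems = k * (k - 1) // 2
    instances_per_problem = 2 * n / k
    return num_problems * instances_per_problem  # equals (k - 1) * n
```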
Linear Machine I
Define c linear discriminant functions:

    g_i(x) = w_i^t x + w_{i0},   i = 1, …, c

Linear machine classifier: x ∈ ωi if g_i(x) > g_j(x) for all j ≠ i. In case of equal scores, the classification is undefined.

A linear machine divides the feature space into c decision regions, with g_i(x) the largest discriminant if x is in R_i.

If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_ij defined by:

    g_i(x) = g_j(x), i.e. (w_i − w_j)^t x + (w_{i0} − w_{j0}) = 0
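A linear machine reduces to an argmax over the c discriminants; a minimal sketch (rows of W are the w_i, assuming NumPy):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest discriminant
    g_i(x) = W[i] . x + w0[i]; ties are left to argmax's convention."""
    scores = W @ x + w0
    return int(np.argmax(scores))
```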
Linear Machine II

It follows that w_i − w_j is normal to H_ij. The signed distance from x to H_ij is:

    (g_i(x) − g_j(x)) / ‖w_i − w_j‖

There are c(c − 1)/2 pairs of (convex) regions. Not all regions are contiguous, and the total number of segments in the surfaces is often less than c(c − 1)/2.

[Figure: decision regions for 3- and 5-class problems]
Generalized LDF I
The LDF can be written

    g(x) = w_0 + ∑_{i=1}^d w_i x_i

By adding d(d + 1)/2 terms involving the products of pairs of components of x, we obtain the quadratic discriminant function:

    g(x) = w_0 + ∑_{i=1}^d w_i x_i + ∑_{i=1}^d ∑_{j=1}^d w_{ij} x_i x_j

The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface. By continuing to add terms such as w_{ijk} x_i x_j x_k we obtain the class of polynomial discriminant functions.
Generalized LDF II

The generalized LDF is defined as

    g(x) = ∑_{i=1}^{d̂} a_i y_i(x) = a^t y

where a is a d̂-dimensional weight vector and the y_i(x) are arbitrary functions of x.

The resulting discriminant function is not linear in x, but it is linear in y. The functions y_i(x) map points in the d-dimensional x-space to points in the d̂-dimensional y-space.
Generalized LDF III

Example: let the quadratic discriminant function be g(x) = a_1 + a_2 x + a_3 x². The 3-dimensional vector is then y = (1, x, x²)^t.
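The example mapping y = (1, x, x²) can be sketched as (names are illustrative):

```python
import numpy as np

def quadratic_features(x):
    """Map a scalar x to y = (1, x, x^2), so g is linear in y."""
    return np.array([1.0, x, x ** 2])

def g(x, a):
    """Generalized LDF g(x) = a^t y(x)."""
    return float(np.dot(a, quadratic_features(x)))
```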
2-class Linearly-Separable Case I
g(x) = ∑_{i=0}^d w_i x_i = a^t y

where x_0 = 1 and
y^t = [1 x] = [1 x_1 ⋯ x_d] is an augmented feature vector and
a^t = [w_0 w] = [w_0 w_1 ⋯ w_d] is an augmented weight vector.

The hyperplane decision surface Ĥ defined by a^t y = 0 passes through the origin in y-space.

The distance from any point y to Ĥ is given by |a^t y|/‖a‖ = |g(x)|/‖a‖.

Because ‖a‖ = √(w_0² + ‖w‖²) ≥ ‖w‖, this distance is at most the distance from x to H.
2-class Linearly-Separable Case II

Problem: find a = [w_0 w]^t.

Suppose that we have a set of n examples {y_1, …, y_n}, each labeled ω1 or ω2.

Look for a weight vector a that classifies all the examples correctly:

    a^t y_i > 0 if y_i is labeled ω1, and a^t y_i < 0 if y_i is labeled ω2

If such an a exists, the examples are linearly separable.
2-class Linearly-Separable Case III

Solutions

Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector a such that a^t y_i > 0 for all the examples.
Such an a is a.k.a. a separating vector or solution vector.
Each example y_i places a constraint on the possible location of a solution vector: a^t y_i = 0 defines a hyperplane through the origin having y_i as a normal vector, and the solution vector (if it exists) must lie on the positive side of every such hyperplane.
Solution region = intersection of the n half-spaces.
2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (a solution that is more likely to classify new examples correctly): seek a unit-length weight vector that maximizes the minimum distance from the examples to the separating plane.
2-class Linearly-Separable Case V
Alternatively, seek the minimum-length weight vector satisfying a^t y_i ≥ b for all i, with margin b > 0. The solution region shrinks by margins b/‖y_i‖.
Gradient Descent I
Define a criterion function J(a) that is minimized if a is a solution vector (a^t y_i ≥ 0, ∀i = 1, …, n).

Start with some arbitrary vector a(1).
Compute the gradient vector ∇J(a(1)).
The next value a(2) is obtained by moving some distance from a(1) in the direction of steepest descent, i.e. along the negative of the gradient.

In general, a(k + 1) is obtained from a(k) using

    a(k + 1) ← a(k) − η(k) ∇J(a(k))

where η(k) is the learning rate.
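The generic iteration can be sketched as follows (constant η for simplicity; names are illustrative):

```python
import numpy as np

def gradient_descent(grad, a0, eta=0.1, steps=100):
    """Iterate a(k+1) = a(k) - eta * grad_J(a(k)) for a fixed number of steps."""
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a = a - eta * grad(a)
    return a
```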
Gradient Descent & Delta Rule I
To understand, consider a simpler linear machine (a.k.a. unit), where

    o = w_0 + w_1 x_1 + ⋯ + w_n x_n

Let's learn the w_i's that minimize the squared error

    E[w] ≡ ½ ∑_{d∈D} (t_d − o_d)²

where D is the set of training examples ⟨x, t⟩ and t is the target output value.
Gradient Descent & Delta Rule II

Gradient:

    ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ⋯, ∂E/∂w_n]

Training rule:

    Δw = −η ∇E[w]

i.e.,

    Δw_i = −η ∂E/∂w_i

Note that η is constant.
Gradient Descent & Delta Rule III
∂E/∂w_i = ∂/∂w_i ½ ∑_d (t_d − o_d)²
        = ½ ∑_d ∂/∂w_i (t_d − o_d)²
        = ½ ∑_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = ∑_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = ∑_d (t_d − o_d)(−x_{id})
Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT(D, η)
D: training set; η: learning rate (e.g. 0.5)

Initialize each w_i to some small random value
until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨x, t⟩ ∈ D do
        Input the instance x to the unit and compute the output o
        for each w_i do
            Δw_i ← Δw_i + η (t − o) x_i
    for each weight w_i do
        w_i ← w_i + Δw_i
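A compact vectorized sketch of this batch procedure (the delta rule), assuming NumPy and a leading all-ones column in X for w_0:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x.
    Each epoch accumulates eta * (t - o) * x over all of D, then updates w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                     # outputs for every example
        w = w + eta * X.T @ (t - o)   # summed delta-rule update
    return w
```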
Incremental (Stochastic) GRADIENT DESCENT I
An approximation of the standard GRADIENT-DESCENT.

Batch GRADIENT-DESCENT: do until satisfied
    1. Compute the gradient ∇E_D[w]
    2. w ← w − η ∇E_D[w]

Incremental GRADIENT-DESCENT: do until satisfied
    for each training example d in D
        1. Compute the gradient ∇E_d[w]
        2. w ← w − η ∇E_d[w]
Incremental (Stochastic) GRADIENT DESCENT II
E_D[w] ≡ ½ ∑_{d∈D} (t_d − o_d)²

E_d[w] ≡ ½ (t_d − o_d)²

Training rule (delta rule):

    Δw_i ← η (t − o) x_i

Similar to the perceptron training rule, yet unthresholded; convergence is only asymptotically guaranteed, but linear separability is no longer needed!
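The incremental variant updates after every example; a minimal sketch under the same assumptions as before (X carries a leading 1 for the bias):

```python
import numpy as np

def train_incremental(X, t, eta=0.05, epochs=1000):
    """Incremental (stochastic) delta rule: w += eta * (t_d - o_d) * x_d
    after each training example, instead of once per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = float(np.dot(w, x_d))
            w = w + eta * (t_d - o_d) * x_d
    return w
```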
Standard vs. Stochastic GRADIENT-DESCENT
Incremental GD can approximate batch GD arbitrarily closely if η is made small enough.

In batch GD the error is summed over all examples before updating; in stochastic GD the weights are updated upon each example.
Batch GD is more costly per update step but can employ a larger η.
Stochastic GD may avoid falling into local minima, because it uses ∇E_d instead of ∇E_D.
Newton’s Algorithm
J(a) ≃ J(a(k)) + ∇J^t (a − a(k)) + ½ (a − a(k))^t H (a − a(k))

where H = [∂²J/∂a_i ∂a_j] is the Hessian matrix.

Choose a(k + 1) to minimize this second-order expansion:

    a(k + 1) ← a(k) − H⁻¹ ∇J(a(k))

Greater improvement per step than gradient descent, but not applicable when H is singular. Time complexity: O(d³) per step.
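One Newton update can be sketched as follows, solving the linear system rather than forming H⁻¹ explicitly (names are illustrative):

```python
import numpy as np

def newton_step(a, grad, hess):
    """One Newton update a <- a - H^{-1} grad_J(a), via np.linalg.solve."""
    return a - np.linalg.solve(hess(a), grad(a))
```

For a quadratic criterion a single step lands exactly at the minimum, which is what makes the greater per-step improvement plausible.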
Perceptron I
Assumption: the data are linearly separable.

Hyperplane: ∑_{i=0}^d w_i x_i = 0, assuming that there is a constant attribute x_0 = 1 (bias).

Algorithm for learning the separating hyperplane: the perceptron learning rule.

Classifier: if ∑_{i=0}^d w_i x_i > 0 then predict ω1 (or +1), otherwise predict ω2 (or −1).
Perceptron II
Thresholded output:

    o(x_1, …, x_d) = +1 if w_0 + w_1 x_1 + ⋯ + w_d x_d > 0, −1 otherwise

Simpler vector notation:

    o(x) = sgn(w^t x) = +1 if w^t x > 0, −1 otherwise

Space of the hypotheses: {w | w ∈ R^{d+1}}
Decision Surface of a Perceptron
Can represent some useful functions. What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions are not representable, e.g. those that are not linearly separable (XOR). Therefore, we'll want networks of these units...
Perceptron Training Rule I
Perceptron criterion function:

    J(a) = ∑_{y∈Y(a)} (−a^t y)

where Y(a) is the set of examples misclassified by a.

If no samples are misclassified, Y(a) is empty and J(a) = 0 (i.e. a is a solution vector).
J(a) ≥ 0, since a^t y_i ≤ 0 if y_i is misclassified.
Geometrically, J(a) is proportional to the sum of the distances from the misclassified samples to the decision boundary.

Since ∇J = ∑_{y∈Y(a)} (−y), the update rule becomes

    a(k + 1) ← a(k) + η(k) ∑_{y∈Y_k(a)} y

where Y_k(a) is the set of examples misclassified by a(k).
Perceptron Training Rule II
w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i

with:
t = c(x), the target value
o, the perceptron output
η, a small constant (e.g. 0.1), the learning rate
Perceptron Training Rule III

Perceptron Learning Rule

Set all weights w_i to zero
do
    for each instance x in the training data
        if x is classified incorrectly by the perceptron
            if x belongs to ω1, add it to w
            else subtract it from w
until all instances in the training data are classified correctly
return w

One can prove that this converges if the training data are linearly separable and η is sufficiently small.
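The rule above, with labels encoded as t ∈ {+1, −1} so that "add or subtract x" becomes w ← w + t·x, can be sketched as:

```python
import numpy as np

def perceptron(X, t, max_epochs=100):
    """Perceptron learning rule on augmented vectors (x0 = 1), eta = 1.
    t holds +1 for omega_1 and -1 for omega_2."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # prepend the bias input
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t_i in zip(Xa, t):
            if t_i * np.dot(w, x) <= 0:  # misclassified (or on the boundary)
                w = w + t_i * x          # add it to / subtract it from w
                errors += 1
        if errors == 0:                  # all instances classified correctly
            break
    return w
```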
Perceptron Training Rule IV
[Figure: example run with η = 1; sequence of misclassified samples: y_2, y_3, y_1, y_3]
Perceptron Training Rule V
Why does this work? Consider the situation where an instance x pertaining to the first class has been added to the weight vector; the output becomes:

    (w_0 + x_0) x_0 + (w_1 + x_1) x_1 + (w_2 + x_2) x_2 + ⋯ + (w_d + x_d) x_d

This means the output for x has increased by:

    x_0 x_0 + x_1 x_1 + x_2 x_2 + ⋯ + x_d x_d

which is always positive; thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).
Fixed-Increment Single-Sample Perceptron
Perceptron({y^(k)}_{k=1..n}): weight vector
input: {y^(k)}_{k=1..n} training examples
begin
    initialize a, k ← 0
    do
        k ← (k + 1) mod n
        if y^(k) is misclassified by the model based on a
        then a ← a + y^(k)
    until all examples are properly classified
    return a
end
Comments
The perceptron algorithm adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate η can be chosen arbitrarily; it only impacts the norm of the final w (and the corresponding magnitude of w_0).
The final weight vector w is a linear combination of training points.
Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane.
Assumes binary data (i.e. attribute values are either zero or one).
Difference: multiplicative updates instead of additive updates; weights are multiplied by a user-specified parameter α > 1 (or its inverse).
Another difference: a user-specified threshold parameter θ.
Predict the first class if w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_k x_k > θ.
The Algorithm I
WINNOW

while some instances are misclassified
    for each instance x in the training data
        classify x using the current weights
        if the predicted class is incorrect
            if x belongs to the first class
                for each x_i that is 1, multiply w_i by α
                (if x_i is 0, leave w_i unchanged)
            otherwise
                for each x_i that is 1, divide w_i by α
                (if x_i is 0, leave w_i unchanged)
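A minimal WINNOW sketch for binary attributes (for simplicity this version omits the bias weight w_0 and defaults to θ = d/2; all names are illustrative):

```python
import numpy as np

def winnow(X, t, alpha=2.0, theta=None, max_epochs=100):
    """Multiplicative updates: on a mistake, weights of attributes that are 1
    are multiplied (first class) or divided (other class) by alpha."""
    n, d = X.shape
    theta = d / 2.0 if theta is None else theta
    w = np.ones(d)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t_i in zip(X, t):
            pred = 1 if np.dot(w, x) > theta else 0
            if pred != t_i:
                mistakes += 1
                # alpha ** x is alpha where x_i = 1 and 1 where x_i = 0
                w = w * alpha ** x if t_i == 1 else w / alpha ** x
        if mistakes == 0:
            break
    return w
```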
The Algorithm II
WINNOW is very effective at homing in on the relevant features (it is attribute efficient).
Can also be used in an on-line setting in which newinstances arrive continuously (like the perceptronalgorithm)
Balanced WINNOW I
WINNOW doesn't allow negative weights, and this can be a drawback in some applications.
BALANCED WINNOW maintains two weight vectors, one for each class: w⁺ and w⁻.
An instance is classified as belonging to the first class (of the two) if:

    (w⁺_0 − w⁻_0) + (w⁺_1 − w⁻_1) x_1 + (w⁺_2 − w⁻_2) x_2 + ⋯ + (w⁺_k − w⁻_k) x_k > θ
Balanced WINNOW II
BALANCED WINNOW

while some instances are misclassified
    for each instance x in the training data
        classify x using the current weights
        if the predicted class is incorrect
            if x belongs to the first class
                for each x_i that is 1, multiply w⁺_i by α and divide w⁻_i by α
                (if x_i is 0, leave w⁺_i and w⁻_i unchanged)
            otherwise
                for each x_i that is 1, multiply w⁻_i by α and divide w⁺_i by α
                (if x_i is 0, leave w⁺_i and w⁻_i unchanged)
Nonseparable Case
The perceptron is an error-correcting procedure: it converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training samples should be used; sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set, so the corrections may never cease.
Learning rate
If we choose η(k) → 0 as k → ∞, then performance can be acceptable on nonseparable problems while preserving the ability to find a solution on separable problems. The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set nonseparable.
Too fast: the procedure may converge prematurely with sub-optimal results.

η(k) can be considered a function of recent performance, decreasing as performance improves, e.g. η(k) ← η/k.
Minimum Squared Error Approach I
Minimum Squared Error (MSE): trades the ability to obtain a separating vector for good performance on both separable and nonseparable problems.
Previously, we sought a weight vector a making all of the inner products a^t y_i ≥ 0.
In the MSE procedure, one instead tries to make a^t y_i = b_i, where the b_i are some arbitrarily specified positive constants.

Using matrix notation: Y a = b.
If Y were nonsingular, then a = Y⁻¹ b; unfortunately Y is not a square matrix, usually having more rows than columns.
When there are more equations than unknowns, a is overdetermined, and ordinarily no exact solution exists.
Minimum Squared Error Approach II
We can seek a weight vector a that minimizes some function of the error vector e = Y a − b. Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

    J(a) = ‖Y a − b‖² = ∑_{i=1}^n (a^t y_i − b_i)²

whose gradient is

    ∇J = 2 ∑_{i=1}^n (a^t y_i − b_i) y_i = 2 Y^t (Y a − b)

Setting the gradient equal to zero, the following necessary condition holds: Y^t Y a = Y^t b.
Minimum Squared Error Approach III
Y^t Y is a square matrix which is often nonsingular. In that case, solving for a:

    a = (Y^t Y)⁻¹ Y^t b = Y⁺ b

where Y⁺ = (Y^t Y)⁻¹ Y^t is the pseudoinverse of Y.

Y⁺ can also be written as lim_{ε→0} (Y^t Y + εI)⁻¹ Y^t, and it can be shown that this limit always exists; hence

    a = Y⁺ b

is the MSE solution to the problem Y a = b.
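With NumPy, the MSE solution is a one-liner via the pseudoinverse (a sketch; np.linalg.pinv uses an SVD rather than the normal-equations formula, but computes the same Y⁺):

```python
import numpy as np

def mse_weights(Y, b):
    """MSE solution a = Y^+ b of the overdetermined system Y a = b."""
    return np.linalg.pinv(Y) @ b
```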
Widrow-Hoff Procedure a.k.a. LMS
The criterion function J(a) = ‖Y a − b‖² can also be minimized by a gradient descent procedure.

Advantages:
avoids the problems that arise when Y^t Y is singular;
avoids the need for working with large matrices.

Since ∇J = 2 Y^t (Y a − b), a simple update rule would be

    a(1) arbitrary
    a(k + 1) = a(k) − η(k) Y^t (Y a(k) − b)

or, if we consider the samples sequentially,

    a(1) arbitrary
    a(k + 1) = a(k) + η(k) [b_k − a(k)^t y^(k)] y^(k)
Widrow-Hoff or LMS Agorithm
LMS({y_i}_{i=1..n})
input {y_i}_{i=1..n}: training examples
begin
    initialize a, b, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        a ← a + η(k) (b_k − a^t y^(k)) y^(k)
    until ‖η(k) (b_k − a^t y^(k)) y^(k)‖ < θ
    return a
end
Linear Regression
Standard technique for numeric prediction. The outcome is a linear combination of the attributes:

    x = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d

The weights are calculated from the training data by standard math (least-squares) algorithms.

Predicted value for the first training instance x^(1):

    w_0 + w_1 x_1^(1) + w_2 x_2^(1) + ⋯ + w_d x_d^(1) = ∑_{j=0}^d w_j x_j^(1)

assuming extended vectors with x_0 = 1.
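A sketch of the weight calculation using ordinary least squares (np.linalg.lstsq stands in for the "standard math algorithms"):

```python
import numpy as np

def linear_regression(X, t):
    """Least-squares weights for t ~ w0 + w1 x1 + ... + wd xd."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # extended vectors, x0 = 1
    w, *_ = np.linalg.lstsq(Xa, t, rcond=None)
    return w
```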
Probabilistic Classification
Any regression technique can be used for classification.
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't.
Prediction: predict the class corresponding to the model with the largest output value (membership value).

Problem: membership values are not in the [0, 1] range, so they aren't proper probability estimates.
Logistic Regression I
Logit transformation: builds a linear model for a transformed target variable.

Assume we have two classes. Logistic regression replaces the target

    Pr(1 | x)

by the target

    log( Pr(1 | x) / (1 − Pr(1 | x)) )

This transformation maps [0, 1] to (−∞, +∞).
Example: Logistic Regression Model
Resulting model: Pr(1 | x) = 1 / (1 + e^{−(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d)})

[Figure: model with w_0 = 0.5 and w_1 = 1]

Parameters are induced from data using maximum likelihood.
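The resulting model is easy to evaluate; e.g. with the slide's w_0 = 0.5 and w_1 = 1 the output is exactly 0.5 where the linear part is zero (x_1 = −0.5):

```python
import numpy as np

def pr_one(x, w0, w):
    """Logistic model Pr(1 | x) = 1 / (1 + exp(-(w0 + w . x)))."""
    return float(1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x)))))
```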
Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters. One can use logarithms of probabilities and maximize the log-likelihood of the model:

    ∑_{i=1}^n (1 − x^(i)) log(1 − Pr(1 | x^(i))) + x^(i) log Pr(1 | x^(i))

where the class values x^(i) are either 0 or 1.

The weights w_i need to be chosen to maximize the log-likelihood; a relatively simple method for doing so is iteratively re-weighted least squares.
Summary
Perceptron training rule: guaranteed to succeed if
the training examples are linearly separable, and
the learning rate η is sufficiently small.

Linear unit training rule: uses gradient descent;
guaranteed to converge to the hypothesis with minimum squared error,
given a sufficiently small learning rate η,
even when the training data contain noise,
even when the training data are not separable by H.
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann