CSE 291 - Linear Models for Classification
Henrik I. Christensen
Computer Science and Engineering
University of California, San Diego
http://www.hichristensen.net
H. I. Christensen (UCSD) CSE 291- PR 1 / 47
Outline
1 Introduction
2 Linear Recognition
3 Linear Discriminant Functions
4 LSQ for Classification
5 Fisher’s Discriminant Method
6 Perceptrons
7 Summary
Introduction
Last time: Bayesian Decision Theory
Today: linear classification of data
Basic pattern recognition
Separation of data: buy/sell
Segmentation of line data, ...
Homework
We will use three data-sets
Recognition of handwritten characters
Face recognition
Recognition of trees in a forest
Recognition of handwritten characters
43 people contributed data
30 people in the training set and 13 in the test set
Each data point is a 32×32 bitmap
Raw and pre-processed data available from
http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits
Recognition of handwritten characters
Wine Recommendation
Characterization of wine based on 13 features
alcohol, acid, ash, ...
Relatively simple dataset - small enough for easy experimentation
Widely used for “simple” benchmarking
Available from https://archive.ics.uci.edu/ml/datasets/Wine
Detection of palms
Detection of palms in drone images
Study of images from Belize
Based on recognition of specific trees we can determine the state of the environment.
Simple Example - Bolts or Needles
Classification
Given
An input vector: x
A set of classes: ci ∈ C, i = 1, . . . , k
Mapping m : X → C
Separation of space into decision regions
Boundaries termed decision boundaries/surfaces
Basis Formulation
It is a 1-of-K coding problem
Target vector: t = (0, . . . , 1, . . . , 0)
Consideration of 3 different approaches:
1 Optimization of a discriminant function
2 Bayesian formulation: p(ci | x)
3 Learning & decision fusion
Discriminant Functions
Objective: input vector x assigned to a class ci
Simple formulation:
y(x) = w^T x + w0
w is termed a weight vector
w0 is termed a bias
Two-class example: decide c1 if y(x) ≥ 0, otherwise c2
Basic Design
Two points on the decision surface, xa and xb:
y(xa) = y(xb) = 0 ⇒ w^T (xa − xb) = 0
w is perpendicular to the decision surface
Distance of the surface from the origin:
w^T x / ||w|| = −w0 / ||w||
Define augmented vectors w̃ = (w0, w) and x̃ = (1, x) so that:
y(x) = w̃^T x̃
Linear discriminant function
[Figure: geometry of the linear discriminant in two dimensions. w is normal to the decision surface y = 0; the signed distance of a point x from the surface is y(x)/||w||; the surface lies at distance −w0/||w|| from the origin; the regions y > 0 and y < 0 correspond to R1 and R2.]
Multi Class Discrimination
Generation of multiple decision functions:
yk(x) = wk^T x + wk0
Decision strategy:
j = arg max_{i∈1..k} yi(x)
Multi-Class Decision Regions
[Figure: decision regions Ri, Rj, Rk of a multi-class linear discriminant, with points xA and xB and a point x on the line between them.]
Example - Bolts or Needles
Minimum distance classification
Suppose we have computed the mean value for each of the classes:
mneedle = [0.86, 2.34]^T and mbolt = [5.74, 5.85]^T
We can then compute the distance to each class mean:
dj(x) = ||x − mj||
arg min_i di(x) gives the best fit
Decision functions can be derived
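As a sketch, the minimum-distance rule with the bolts/needles means above can be written directly in Python (NumPy assumed; the query points are made up for illustration):

```python
import numpy as np

# Minimum-distance classifier: assign x to the class whose mean
# is nearest in Euclidean distance. Means taken from the slide.
means = {
    "needle": np.array([0.86, 2.34]),
    "bolt":   np.array([5.74, 5.85]),
}

def classify(x):
    """Return the label of the class mean closest to x."""
    x = np.asarray(x, dtype=float)
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

print(classify([1.0, 2.0]))  # near the needle mean
print(classify([6.0, 6.0]))  # near the bolt mean
```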
Bolts / Needle Decision Functions
Needle: dneedle(x) = 0.86 x1 + 2.34 x2 − 3.10
Bolt: dbolt(x) = 5.74 x1 + 5.85 x2 − 33.59
Decision boundary: di(x) − dj(x) = 0
dneedle/bolt(x) = −4.88 x1 − 3.51 x2 + 30.49
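The intercepts follow from expanding ||x − mj||² and dropping the class-independent ||x||² term, which gives the equivalent linear function dj(x) = mj^T x − ½ mj^T mj. A quick numeric check (NumPy; the values agree with the slide up to its rounding):

```python
import numpy as np

# d_j(x) = m_j^T x - 0.5 * ||m_j||^2: obtained by expanding ||x - m_j||^2
# and dropping the ||x||^2 term, which is the same for every class.
m_needle = np.array([0.86, 2.34])
m_bolt = np.array([5.74, 5.85])

w0_needle = -0.5 * m_needle @ m_needle  # -3.1076 (slide rounds to -3.10)
w0_bolt = -0.5 * m_bolt @ m_bolt        # -33.585 (slide: -33.59)

# Boundary d_needle(x) - d_bolt(x) = 0:
w_boundary = m_needle - m_bolt          # [-4.88, -3.51], as on the slide
w0_boundary = w0_needle - w0_bolt       # 30.477 (slide gets 30.49 from the
                                        # already-rounded intercepts)
```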
Example decision surface
Maximum Likelihood & Least Squares
Assume observations from a deterministic function contaminated by Gaussian noise:
t = y(x, w) + ε, where p(ε|β) = N(ε | 0, β^−1)
The problem at hand is then
p(t | x, w, β) = N(t | y(x, w), β^−1)
From a series of observations we have the likelihood
p(t | X, w, β) = ∏_{i=1}^N N(ti | w^T φ(xi), β^−1)
Maximum Likelihood & Least Squares (2)
This results in
ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β ED(w)
where
ED(w) = (1/2) Σ_{i=1}^N {ti − w^T φ(xi)}²
is the sum of squared errors
Maximum Likelihood & Least Squares (3)
Computing the extremum yields:
wML = (Φ^T Φ)^−1 Φ^T t
where
Φ = [ φ0(x1) φ1(x1) · · · φM−1(x1)
      φ0(x2) φ1(x2) · · · φM−1(x2)
        ⋮       ⋮     ⋱      ⋮
      φ0(xN) φ1(xN) · · · φM−1(xN) ]
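A minimal sketch of the normal-equations solution, using NumPy, synthetic data, and an illustrative basis φ0(x) = 1, φ1(x) = x:

```python
import numpy as np

# w_ML = (Phi^T Phi)^{-1} Phi^T t on a toy 1-D problem.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.01, size=x.shape)  # noisy line

Phi = np.vander(x, N=2, increasing=True)  # design matrix, columns [1, x]
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# w_ml is close to the generating weights [1.0, 2.0]
```

In practice `np.linalg.lstsq(Phi, t, rcond=None)` is preferred over forming Φ^T Φ explicitly, since it is numerically better conditioned.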
Line Estimation
Least-squares minimization:
Line equation: y = ax + b
Error in fit: Σ_i (yi − a xi − b)²
Solution via the normal equations (bars denote sample means):
\overline{xy} = a \overline{x²} + b \overline{x}
\overline{y} = a \overline{x} + b
Why might this be a non-robust solution?
LSQ on Lasers
Line model: ri cos(φi − θ) = ρ
Error model: di = ri cos(φi − θ) − ρ
Optimize: arg min_{(ρ,θ)} Σ_i (ri cos(φi − θ) − ρ)²
Error model derived in Deriche et al. (1992)
Well suited for “clean-up” of Hough lines
Total Least Squares
Line equation: ax + by + c = 0
Error in fit: Σ_i (a xi + b yi + c)² where a² + b² = 1
Solution: the eigenvector equation (bars denote sample means)
( \overline{x²} − \overline{x}\,\overline{x}   \overline{xy} − \overline{x}\,\overline{y} ; \overline{xy} − \overline{x}\,\overline{y}   \overline{y²} − \overline{y}\,\overline{y} ) (a ; b) = μ (a ; b)
where μ is a scale factor (an eigenvalue of the data covariance matrix)
c = −a \overline{x} − b \overline{y}
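The total-least-squares fit can be sketched as an eigenvector computation on the 2×2 data covariance matrix; the eigenvector with the smallest eigenvalue minimizes the perpendicular error (NumPy, synthetic data):

```python
import numpy as np

# TLS line fit ax + by + c = 0 with a^2 + b^2 = 1: (a, b) is the
# eigenvector of the data covariance with the SMALLEST eigenvalue,
# and c = -a*mean(x) - b*mean(y).
def tls_line(x, y):
    xm, ym = x.mean(), y.mean()
    sxx = np.mean(x * x) - xm * xm
    sxy = np.mean(x * y) - xm * ym
    syy = np.mean(y * y) - ym * ym
    cov = np.array([[sxx, sxy], [sxy, syy]])
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    a, b = eigvecs[:, 0]               # normal direction of the line
    return a, b, -a * xm - b * ym

x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0                      # points on the line 2x - y + 1 = 0
a, b, c = tls_line(x, y)
residuals = a * x + b * y + c          # perpendicular offsets, here ~0
```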
Line Representations
The line representation is crucial
Often a redundant model is adopted
Line parameters vs end-points
Important for fusion of segments
End-points are less stable
Sequential Adaptation
In some cases one-at-a-time estimation is more suitable
Also known as gradient descent:
w^(τ+1) = w^(τ) − η ∇En
        = w^(τ) + η (tn − w^(τ)T φ(xn)) φ(xn)
Known as least-mean-squares (LMS). An issue is how to choose η.
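A sketch of the LMS loop on synthetic data (NumPy; the targets are noiseless here, so the weights converge to the generating values):

```python
import numpy as np

# Sequential (LMS) update: w <- w + eta * (t_n - w^T phi_n) * phi_n,
# one training sample at a time.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(500, 3))     # feature vectors phi(x_n)
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true                    # noiseless targets

w = np.zeros(3)
eta = 0.05  # step size: too large diverges, too small converges slowly
for phi_n, t_n in zip(Phi, t):
    w += eta * (t_n - w @ phi_n) * phi_n
# w is now close to w_true
```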
Regularized Least Squares
As seen in lecture 2, some control of the parameters might be useful.
Consider the error function:
ED(w) + λ EW(w)
which generates
(1/2) Σ_{i=1}^N {ti − w^T φ(xi)}² + (λ/2) w^T w
which is minimized by
w = (λI + Φ^T Φ)^−1 Φ^T t
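The regularized solution can be sketched as follows (NumPy, synthetic data; λ = 0 recovers plain least squares, larger λ shrinks the weights):

```python
import numpy as np

# Ridge / regularized LSQ: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
rng = np.random.default_rng(2)
Phi = rng.normal(size=(100, 5))
t = Phi @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + rng.normal(0.0, 0.1, size=100)

def ridge(Phi, t, lam):
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

w_ols = ridge(Phi, t, 0.0)    # plain least squares
w_reg = ridge(Phi, t, 10.0)   # shrunk toward zero: smaller norm
```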
Least Squares for Classification
We could use LSQ, as for regression, to compute an approximation to the classification vector C. Consider:
yk(x) = wk^T x + wk0
Stacking the classes and augmenting x with the bias term, rewrite to
y(x) = W^T x
Assume we have a target matrix T
Least Squares for Classification
The error is then:
ED(W) = (1/2) Tr{ (XW − T)^T (XW − T) }
The solution is then
W = (X^T X)^−1 X^T T
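A sketch with two synthetic Gaussian classes and 1-of-K targets (NumPy; the class locations are made up for illustration):

```python
import numpy as np

# Least squares for classification: W = (X^T X)^{-1} X^T T with one-hot
# targets T; decide arg max_k over y(x) = W^T x (x augmented with a 1).
rng = np.random.default_rng(3)
n = 100
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(n, 2)),   # class 0
               rng.normal([3.0, 3.0], 0.5, size=(n, 2))])  # class 1
Xa = np.hstack([np.ones((2 * n, 1)), X])                   # bias column

T = np.zeros((2 * n, 2))                                   # one-hot targets
T[:n, 0] = 1.0
T[n:, 1] = 1.0

W = np.linalg.solve(Xa.T @ Xa, Xa.T @ T)
pred = np.argmax(Xa @ W, axis=1)
labels = np.repeat([0, 1], n)
accuracy = np.mean(pred == labels)   # near 1.0 for well-separated classes
```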
LSQ and Outliers
[Figure: two scatter plots illustrating the sensitivity of the least-squares decision boundary to outliers.]
Fisher’s linear discriminant
Selection of a projection that maximizes the distance between classes
Assume for a start
y = w^T x
Compute the class means m1 and m2:
m1 = (1/N1) Σ_{i∈C1} xi,  m2 = (1/N2) Σ_{j∈C2} xj
Distance between the projected class means:
m2 − m1 = w^T (m2 − m1)
where mk = w^T mk
The suboptimal solution
[Figure: samples from two classes projected onto the line joining the class means; the projected classes overlap considerably.]
The Fisher criterion
Consider the expression
J(w) = (w^T SB w) / (w^T SW w)
where SB is the between-class covariance and SW is the within-class covariance, i.e.
SB = (m1 − m2)(m1 − m2)^T
and
SW = Σ_{i∈C1} (xi − m1)(xi − m1)^T + Σ_{i∈C2} (xi − m2)(xi − m2)^T
Optimized when
(w^T SB w) SW w = (w^T SW w) SB w
or
w ∝ SW^−1 (m2 − m1)
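The closed-form direction can be sketched as follows (NumPy, synthetic anisotropic classes; note how SW^−1 re-weights the mean difference toward the low-variance axis):

```python
import numpy as np

# Fisher direction w ∝ S_W^{-1} (m2 - m1): the mean difference is
# weighted by the inverse within-class scatter, so directions in which
# the classes are tight count more than high-variance directions.
rng = np.random.default_rng(4)
C1 = rng.normal([0.0, 0.0], [1.0, 0.2], size=(200, 2))
C2 = rng.normal([2.0, 1.0], [1.0, 0.2], size=(200, 2))

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)
S_W = (C1 - m1).T @ (C1 - m1) + (C2 - m2).T @ (C2 - m2)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)
# w leans toward the low-variance second axis, unlike the raw
# mean-difference direction (m2 - m1), which is dominated by axis 1.
```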
The Fisher result
[Figure: the same two classes projected onto the Fisher direction; the projected classes are well separated.]
Generalization to K > 2 classes
Define a stacked weight matrix W so that
y = W^T x
The within-class covariance generalizes to
SW = Σ_{k=1}^K Sk
The between-class covariance is
SB = Σ_{k=1}^K Nk (mk − m)(mk − m)^T
It can be shown that J(W) is maximized by the leading eigenvectors of
S = SW^−1 SB
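A sketch of the multi-class computation (NumPy; three synthetic Gaussian classes in 3-D, so at most K − 1 = 2 eigenvalues of SW^−1 SB are nonzero):

```python
import numpy as np

# Multi-class Fisher: discriminant directions are the leading
# eigenvectors of S_W^{-1} S_B; S_B has rank at most K-1.
rng = np.random.default_rng(6)
class_means = [np.array([0.0, 0.0, 0.0]),
               np.array([4.0, 0.0, 0.0]),
               np.array([0.0, 4.0, 0.0])]
classes = [rng.normal(m, 1.0, size=(100, 3)) for m in class_means]

m = np.vstack(classes).mean(axis=0)   # overall mean
S_W = sum((C - C.mean(0)).T @ (C - C.mean(0)) for C in classes)
S_B = sum(len(C) * np.outer(C.mean(0) - m, C.mean(0) - m) for C in classes)

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[: len(classes) - 1]].real  # K-1 = 2 directions
# The remaining eigenvalue is (numerically) zero.
```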
Perceptron Algorithm
Developed by Rosenblatt (1962)
Formed an important basis for neural networks
Use a non-linear transformation φ(x)
Construct a decision function
y(x) = f(wTφ(x)
)where
f (a) =
{+1, a ≥ 0−1, a < 0
H. I. Christensen (UCSD) CSE 291- PR 43 / 47
The perceptron criterion
Normally we want
w^T φ(xn) tn > 0
Given the target-vector definition, consider the error over misclassified samples
EP(w) = − Σ_{n∈M} w^T φn tn
where M represents all the mis-classified samples
We can use gradient descent:
w^(τ+1) = w^(τ) − η ∇EP(w) = w^(τ) + η φn tn
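The update rule can be sketched as a training loop (NumPy; linearly separable synthetic classes, for which the perceptron is guaranteed to terminate):

```python
import numpy as np

# Perceptron learning: whenever t_n * w^T phi_n <= 0 (misclassified),
# update w <- w + eta * phi_n * t_n; repeat until no mistakes remain.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(50, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(50, 2))])
Phi = np.hstack([np.ones((100, 1)), X])   # phi(x) = (1, x1, x2)
t = np.repeat([1, -1], 50)

w = np.zeros(3)
eta = 1.0
for _ in range(100):                      # epochs; separable data converges
    mistakes = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:        # misclassified (or on boundary)
            w += eta * phi_n * t_n
            mistakes += 1
    if mistakes == 0:
        break
# w now separates the training data
```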
Perceptron learning example
[Figure: four snapshots of perceptron learning on two classes in [−1, 1]²; the decision boundary is adjusted after each misclassified sample until the classes are separated.]
Summary
Basics of discrimination / classification
Obviously not all problems are linear
Optimization of the distance/overlap between classes
Minimizing the probability of misclassification
Basic formulation as an optimization problem
How to optimize the between-cluster distance? Weight by the covariance
Basic recursive formulation
Could we make it more robust?
Thursday: Sub-space methods
Deriche, R., Vaillant, R., & Faugeras, O. (1992). From Noisy Edge Points to 3D Reconstruction of a Scene: A Robust Approach and Its Uncertainty Analysis. Series in Machine Perception and Artificial Intelligence, Vol. 2, pp. 71–79. World Scientific.