Linear Classifiers

Transcript of "Linear Classifiers"
Dept. of Computer Science & Engineering,
Shanghai Jiao Tong University
23/4/22 Linear Classifiers 2
Outline
• Linear Regression
• Linear and Quadratic Discriminant Functions
• Reduced Rank Linear Discriminant Analysis
• Logistic Regression
• Separating Hyperplanes
Linear Regression
• The response classes are coded by indicator variables: K classes are represented by K indicators Y_k, k = 1, 2, ..., K, each indicating one class. The N training instances of the indicator vector form an N x K indicator response matrix Y.
• For the data set X there is a map X : x -> (Y_1(x), ..., Y_K(x)).
• According to the linear regression model:
    f(x)^T = (1, x^T) \hat{B}
    \hat{Y} = X (X^T X)^{-1} X^T Y
Linear Regression
• Given x, the classification is:
    \hat{G}(x) = \arg\max_{k} \hat{f}_k(x)
• In another form, a target t_k is constructed for each class, where t_k is the k-th column of the K x K identity matrix. The fit minimizes the sum-of-squared-norm criterion
    \min_{B} \sum_{i=1}^{N} \left\| y_i - [(1, x_i^T) B]^T \right\|^2
and the classification is:
    \hat{G}(x) = \arg\min_{k} \| \hat{f}(x) - t_k \|^2
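The indicator-matrix approach above can be sketched in NumPy as follows; the function names are my own, and this is a minimal illustration rather than the slides' exact code:

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Fit linear regression to a 0/1 indicator response matrix.

    X: (N, p) data matrix; g: (N,) integer class labels in {0, ..., K-1}.
    Returns the (p+1, K) coefficient matrix B-hat (intercept row first)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0               # indicator response matrix Y
    X1 = np.hstack([np.ones((N, 1)), X])   # prepend the intercept column
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B_hat

def predict(B_hat, X):
    """Classify each row by the largest fitted value f_k(x)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X1 @ B_hat, axis=1)
```

With well-separated classes this reproduces the argmax rule above; the masking problem appears when three or more collinear classes are fit this way.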
Problems of linear regression
• The data come from three classes in \mathbb{R}^2 and are easily separated by linear decision boundaries.
• The left plot shows the boundaries found by linear regression of the indicator response variables.
• The middle class is completely masked.
• The rug plot at the bottom indicates the positions and class membership of each observation. The three curves are the fitted regressions to the three-class indicator variables.
Problems of linear regression
• The left plot shows the boundaries found by linear discriminant analysis, and the right shows the fitted regressions to the three-class indicator variables.
Linear Discriminant Analysis
• According to the Bayes-optimal classification mentioned in Chapter 2, the posterior probabilities \Pr(G = k \mid X = x) are needed. Assume:
    f_k(x): the conditional density of X in class G = k;
    \pi_k: the prior probability of class k, with \sum_{k=1}^{K} \pi_k = 1.
• Bayes' theorem gives us the discriminant:
    \Pr(G = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_{l=1}^{K} f_l(x) \pi_l}
Linear Discriminant Analysis
• Multivariate Gaussian density:
    f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)
• Comparing two classes k and l, assume a common covariance \Sigma_k = \Sigma for all k:
    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}
        = \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)
Linear Discriminant Analysis
• The linear log-odds function above implies that the decision boundary between classes k and l is linear in x; in p dimensions, a hyperplane.
• Linear discriminant function:
    \delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
• So we estimate \hat{\mu}_k, \hat{\Sigma}, and \hat{\pi}_k.
Parameter Estimation
    \hat{\pi}_k = N_k / N, where N_k is the number of class-k observations;
    \hat{\mu}_k = \sum_{g_i = k} x_i / N_k;
    \hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K).
LDA Rule
    \delta_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k
Rule: G(x) = \arg\max_k \delta_k(x)
Decision boundary: \{ x : \delta_k(x) = \delta_l(x) \}
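Putting the estimation formulas and the LDA rule together, a minimal NumPy sketch (function names are my own) might look like this:

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in estimates (pi_k, mu_k, pooled Sigma) as on the slides."""
    N, p = X.shape
    pis = np.array([(g == k).mean() for k in range(K)])
    mus = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - mus[k]
        Sigma += D.T @ D
    Sigma /= (N - K)                       # pooled within-class covariance
    return pis, mus, Sigma

def lda_predict(X, pis, mus, Sigma):
    """argmax_k of the linear discriminant functions delta_k(x)."""
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - 0.5 mu_k^T Sinv mu_k + log pi_k
    deltas = X @ Sinv @ mus.T - 0.5 * np.sum(mus @ Sinv * mus, axis=1) + np.log(pis)
    return np.argmax(deltas, axis=1)
```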
Linear Discriminant Analysis
• Three Gaussian distributions with the same covariance and different means. The Bayes decision boundaries are shown on the left (solid lines). On the right are the LDA boundaries fitted on a sample of 30 points drawn from each Gaussian.
Quadratic Discriminant Analysis
• When the covariances are different:
    \delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
• This is the quadratic discriminant function.
• The decision boundary is described by a quadratic equation:
    \{ x : \delta_k(x) = \delta_l(x) \}
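The quadratic discriminant functions above can be evaluated directly; the following is a small sketch of my own (per-class covariances passed in as a list):

```python
import numpy as np

def qda_deltas(X, pis, mus, Sigmas):
    """Quadratic discriminant functions delta_k(x) for each row of X.

    pis: (K,) priors; mus: (K, p) means; Sigmas: list of K (p, p) covariances."""
    N = X.shape[0]
    K = len(pis)
    deltas = np.empty((N, K))
    for k in range(K):
        Sinv = np.linalg.inv(Sigmas[k])
        _, logdet = np.linalg.slogdet(Sigmas[k])    # log|Sigma_k|
        D = X - mus[k]
        maha = np.sum(D @ Sinv * D, axis=1)         # (x-mu_k)^T Sinv (x-mu_k)
        deltas[:, k] = -0.5 * logdet - 0.5 * maha + np.log(pis[k])
    return deltas
```

Classification is again argmax over k, but the boundaries between classes are now quadratic in x.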
LDA & QDA
• Boundaries on a three-class problem found by linear discriminant analysis in the original two-dimensional space X_1, X_2 (left) and in the five-dimensional space X_1, X_2, X_1 X_2, X_1^2, X_2^2 (right).
LDA & QDA
• Boundaries on the three-class problem found by LDA in the five-dimensional space above (left) and by quadratic discriminant analysis (right).
Regularized Discriminant Analysis
• Shrink the separate covariances of QDA toward a common covariance as in LDA.
Regularized QDA:
    \hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}, \quad \alpha \in [0, 1]
• \hat{\Sigma} itself may also be shrunk toward the scalar covariance.
Regularized LDA:
    \hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \hat{\sigma}^2 I, \quad \gamma \in [0, 1]
• Together: \hat{\Sigma}(\alpha, \gamma)
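The two shrinkage steps compose as follows; this is a minimal sketch, with the scalar covariance taken as the average eigenvalue (one common choice, assumed here rather than stated on the slides):

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma, alpha, gamma):
    """Sigma_k(alpha, gamma): shrink a class covariance toward the pooled
    covariance (alpha), which is itself shrunk toward a scalar (gamma)."""
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p                       # scalar covariance estimate
    Sigma_g = gamma * Sigma + (1 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_k + (1 - alpha) * Sigma_g
```

Setting alpha = 1 recovers QDA's separate covariance, alpha = 0, gamma = 1 recovers LDA's pooled covariance.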
Regularized Discriminant Analysis
• Could also use:
    \hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \operatorname{diag}(\hat{\Sigma})
• In recent microarray work, the diagonal form is used with a "shrunken centroid" \bar{u}_{kj}:
    \delta_k(x) = -\tfrac{1}{2} \sum_{j=1}^{p} \frac{(x_j - \bar{u}_{kj})^2}{s_j^2} + \log \pi_k
• Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of \alpha \in [0, 1]. The optimum for the test data occurs around \alpha \approx 0.9, close to quadratic discriminant analysis.
Computations for LDA
• The eigendecomposition for each class:
    \hat{\Sigma}_k = U_k D_k U_k^T,
where U_k is p \times p orthonormal and D_k is a diagonal matrix of positive eigenvalues d_{kl}.
• So the ingredients for \delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k are:
    (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) = [U_k^T (x - \hat{\mu}_k)]^T D_k^{-1} [U_k^T (x - \hat{\mu}_k)]
    \log |\hat{\Sigma}_k| = \sum_l \log d_{kl}
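These ingredients translate directly into code; a small sketch of my own, checking the eigendecomposition route against the direct formula:

```python
import numpy as np

def delta_via_eig(x, mu, Sigma, pi):
    """Evaluate the quadratic discriminant using Sigma = U D U^T."""
    d, U = np.linalg.eigh(Sigma)        # eigenvalues d, orthonormal U
    z = U.T @ (x - mu)                  # rotate (x - mu) into the eigenbasis
    maha = np.sum(z**2 / d)             # [U^T(x-mu)]^T D^{-1} [U^T(x-mu)]
    logdet = np.sum(np.log(d))          # log|Sigma| = sum_l log d_l
    return -0.5 * logdet - 0.5 * maha + np.log(pi)
```

The eigenbasis route avoids forming the explicit inverse and is numerically more stable.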
Reduced Rank LDA
• Let \hat{\Sigma} = U D U^T.
• Let X^* = D^{-1/2} U^T X, i.e. sphere the data so that the common covariance estimate of X^* is the identity.
• LDA: classify to the closest centroid in the sphered space (apart from the \log \pi_k term):
    \delta_k(x^*) = -\tfrac{1}{2} \| x^* - \hat{\mu}_k^* \|^2 + \log \pi_k
• Can project the data onto the (K-1)-dimensional subspace spanned by the sphered centroids \hat{\mu}_1^*, ..., \hat{\mu}_K^*, and lose nothing!
• Can project onto an even lower-dimensional subspace using principal components of the \hat{\mu}_k^*, k = 1, ..., K.
Reduced Rank LDA
• Compute the K \times p matrix M of class centroids.
• Compute W, the within-class covariance, and M^* = M W^{-1/2}.
• Compute B^*, the covariance matrix of M^*, and its eigendecomposition B^* = V^* D_B V^{*T}.
• Z_l = v_l^T X, with v_l = W^{-1/2} v_l^*, is the l-th discriminant variable (canonical variable).
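The steps above can be sketched as follows (a minimal illustration of my own; W^{-1/2} is taken as the symmetric inverse square root):

```python
import numpy as np

def discriminant_coords(X, M, W):
    """Project data onto the discriminant (canonical) variables.

    M: (K, p) matrix of class centroids; W: (p, p) within-class covariance.
    Returns Z = X V with columns v_l = W^{-1/2} v_l^*, ordered by decreasing
    between-centroid variance."""
    dw, Uw = np.linalg.eigh(W)
    W_isqrt = Uw @ np.diag(dw**-0.5) @ Uw.T     # symmetric W^{-1/2}
    M_star = M @ W_isqrt                        # sphered centroids M*
    B_star = np.cov(M_star, rowvar=False)       # between-class covariance B*
    db, Vb = np.linalg.eigh(B_star)
    order = np.argsort(db)[::-1]                # largest eigenvalues first
    V = W_isqrt @ Vb[:, order]
    return X @ V
```

The first one or two columns of Z give the low-dimensional views used in the vowel-data plots that follow.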
• A two-dimensional plot of the vowel training data. There are eleven classes with X \in \mathbb{R}^{10}, and this is the best view in terms of an LDA model. The heavy circles are the projected mean vectors for each class.
• Projections onto different pairs of canonical variates
Reduced Rank LDA
• Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel).
• The discriminant direction minimizes this overlap for Gaussian data (right panel).
Fisher's Problem
• Find Z = a^T X such that the between-class variance is maximized relative to the within-class variance.
• Maximize the Rayleigh quotient:
    \max_a \frac{a^T B a}{a^T W a}
• Equivalently, solve a sequence of constrained problems:
    \max_a a^T B a \ \text{s.t.} \ a^T W a = 1 \ \Rightarrow \ a_1 = v_1
    \max_a a^T B a \ \text{s.t.} \ a^T W a = 1, \ a^T W v_1 = 0 \ \Rightarrow \ a_2 = v_2
    etc.
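One way to solve this sequence at once is to whiten with W^{-1/2}, which turns the Rayleigh quotient into an ordinary symmetric eigenproblem; a sketch of my own:

```python
import numpy as np

def fisher_directions(B, W):
    """Maximize a^T B a / a^T W a via the whitened eigenproblem.

    Eigenvectors of W^{-1/2} B W^{-1/2}, mapped back by W^{-1/2}, give
    directions a_l with a_l^T W a_l = 1 and a_l^T W a_m = 0 for l != m."""
    dw, Uw = np.linalg.eigh(W)
    W_isqrt = Uw @ np.diag(dw**-0.5) @ Uw.T
    vals, V = np.linalg.eigh(W_isqrt @ B @ W_isqrt)
    order = np.argsort(vals)[::-1]       # largest Rayleigh quotient first
    A = W_isqrt @ V[:, order]            # columns a_1, a_2, ...
    return vals[order], A
```

The eigenvalues are the achieved Rayleigh quotients, and the W-orthogonality of successive columns reproduces the side constraints a^T W v_1 = 0, etc.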
• Training and test error rates for the vowel data, as a function of the dimension of the discriminant subspace.
• In this case the best rate is for dimension 2.
South African Heart Disease Data
Logistic Regression
• Model:
    \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x
    \log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x
    \vdots
    \log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x
• Equivalently:
    \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1, \ldots, K-1
    \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}
• Log-likelihood:
    \ell(\theta) = \sum_{i=1}^{n} \log p_{g_i}(x_i; \theta)
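The model probabilities and the log-likelihood can be sketched as follows (function names are my own; the last class is the reference class with implicit zero coefficients):

```python
import numpy as np

def class_probs(X, betas):
    """Pr(G = k | X = x) under the multinomial logit model.

    betas: (K-1, p+1) array of rows [beta_k0, beta_k^T]."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    logits = np.hstack([X1 @ betas.T, np.zeros((X.shape[0], 1))])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(X, g, betas):
    """l(theta) = sum_i log p_{g_i}(x_i; theta)."""
    P = class_probs(X, betas)
    return np.sum(np.log(P[np.arange(len(g)), g]))
```

Fitting maximizes this log-likelihood, typically by Newton-Raphson (iteratively reweighted least squares).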
Logistic Regression / LDA
• For LDA:
    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \log \frac{\pi_k}{\pi_K} - \tfrac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
        = \alpha_{k0} + \alpha_k^T x,
the same linear form as logistic regression.
• LR maximizes the conditional likelihood of G given X:
    \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}
• LDA maximizes the full (joint) likelihood:
    \Pr(X, G = k) = \Pr(X) \Pr(G = k \mid X), \quad \text{with} \quad \Pr(X) = \sum_{k=1}^{K} \pi_k \phi(X; \mu_k, \Sigma),
where the marginal density \Pr(X) is a Gaussian mixture.
Rosenblatt's Perceptron Learning Algorithm
• Minimize
    D(\beta, \beta_0) = -\sum_{i \in M} y_i (x_i^T \beta + \beta_0), \quad y_i \in \{-1, 1\},
where M is the set of misclassified observations; D(\beta, \beta_0) is proportional to the distance of the points in M to the boundary.
• Stochastic gradient descent: the gradients are
    \partial D / \partial \beta = -\sum_{i \in M} y_i x_i, \qquad \partial D / \partial \beta_0 = -\sum_{i \in M} y_i,
so for each misclassified observation update
    \beta \leftarrow \beta + \rho y_i x_i, \qquad \beta_0 \leftarrow \beta_0 + \rho y_i.
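The update rule above gives a very short training loop; a minimal sketch of my own, with a fixed learning rate rho and an epoch cap (the algorithm is only guaranteed to terminate when the classes are separable):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """Rosenblatt updates: for each misclassified (x_i, y_i),
    beta <- beta + rho * y_i * x_i,  beta0 <- beta0 + rho * y_i."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:   # misclassified or on boundary
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                       # converged: data separated
            break
    return beta, beta0
```

Different visiting orders or random starts yield different separating hyperplanes, as the toy example on the next slide shows.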
Separating Hyperplanes
• A toy example with two classes separable by a hyperplane. The orange line is the least-squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.
Optimal Separating Hyperplanes
    \max_{\beta, \beta_0, \|\beta\| = 1} C
    subject to \; y_i (x_i^T \beta + \beta_0) \geq C, \quad i = 1, \ldots, N
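For a given hyperplane, the largest feasible C in the program above is the (signed) margin; a small helper of my own that computes it:

```python
import numpy as np

def margin(X, y, beta, beta0):
    """Largest C with y_i (x_i^T beta + beta0) >= C after normalizing beta
    to unit length; positive iff the hyperplane separates the classes."""
    beta = np.asarray(beta, dtype=float)
    scale = np.linalg.norm(beta)
    return np.min(y * (X @ beta + beta0)) / scale
```

The optimal separating hyperplane is the one maximizing this quantity over (beta, beta0).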