Lecture 13: Multiclass Classification (I)
Wenbin Lu
Department of Statistics
North Carolina State University
Fall 2019
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 43
Multiclass Classification
General setup
Bayes rule for multiclass problems
under equal costs
under unequal costs
Linear multiclass methods
Linear regression model
Linear discriminant analysis (LDA)
Visualization of LDA
Generalizing LDA
Multiple logistic regression Models
General Setup
Multiclass Problems
Class label Y ∈ {1, ...,K},K ≥ 3.
The classifier f : R^d → {1, ..., K}.
Examples: zip code classification, multiple-type cancer classification, vowel recognition.
The loss function is

L(Y, f(X)) = Σ_{k=1}^K Σ_{l=1}^K C(l, k) I(Y = l, f(X) = k),

where

C(l, k) = cost of classifying a sample in class l to class k.

In general, C(k, k) = 0 for any k = 1, ..., K. The loss can be described as a K × K matrix. For example, for K = 3,

    C = [ 0 1 1
          1 0 1
          1 1 0 ].
General Setup
Terminologies
Prior class probabilities (marginal distribution of Y ):
πk = P(Y = k), k = 1, · · · ,K
Conditional distribution of X given Y : assume that
gk is the conditional density of X given Y = k , k = 1, · · · ,K .
Marginal density of X (a mixture):
g(x) = Σ_{k=1}^K π_k g_k(x).

Posterior class probabilities (conditional dist. of Y given X):

P(Y = k | X = x) = π_k g_k(x) / g(x),   k = 1, ..., K.
Bayes Rule for Multiclass Problems
Bayes Rule
The optimal (Bayes) rule minimizes the average loss over the population:

φ_B = arg min_f E_{X,Y} L(Y, f(X)) = arg min_f E_X E_{Y|X} L(Y, f(X)),

where

E_{Y|X} L(Y, f(X)) = Σ_{k=1}^K I(f(X) = k) Σ_{l=1}^K C(l, k) P(Y = l | X).
Bayes Rule for Multiclass Problems
Bayes Rule Under Equal Costs
If C(k, l) = I(k ≠ l), then R[f] = E_X E_{Y|X} I(Y ≠ f(X)) and

E_{Y|X} L(Y, f(X)) = Σ_{k=1}^K I(f(X) = k) Σ_{l≠k} P(Y = l | X)
                   = Σ_{k=1}^K I(f(X) = k) [1 − P(Y = k | X)].

Given x, the minimizer of E_{Y|X=x} L(Y, f(X)) is f(x) = k*, where

k* = arg min_{k=1,...,K} {1 − P(Y = k | x)}.

The Bayes rule is

φ_B(x) = k*  if  P(Y = k* | x) = max_{k=1,...,K} P(Y = k | x),

which assigns x to the most probable class under P(Y = k | X = x).
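The equal-cost rule above is simply a row-wise arg max over the posterior probabilities. A minimal NumPy sketch (the posterior matrix here is illustrative, not estimated from data):

```python
import numpy as np

def bayes_rule_equal_costs(posterior):
    """Assign each point to its most probable class.

    posterior : (n, K) array whose rows are P(Y = k | X = x_i).
    Returns labels in {1, ..., K}, matching the slides' convention.
    """
    return np.argmax(posterior, axis=1) + 1

post = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.2, 0.7]])
print(bayes_rule_equal_costs(post))  # -> [1 3]
```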
Bayes Rule for Multiclass Problems
Bayes Rule Under Unequal Costs
For the general loss function, the Bayes rule is

φ_B(x) = k*  if  k* = arg min_{k=1,...,K} Σ_{l=1}^K C(l, k) P(Y = l | x).
There is not a simple analytic form for the solution.
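Although there is no closed form, the rule is easy to evaluate numerically once the posterior and cost matrix are given. A sketch, with an illustrative cost matrix C:

```python
import numpy as np

def bayes_rule_general(posterior, C):
    """Pick the class minimizing the expected cost sum_l C[l, k] P(Y = l | x).

    posterior : (n, K), rows are P(Y = l | X = x_i);
    C         : (K, K), C[l, k] = cost of calling a class-l sample class k.
    """
    expected_cost = posterior @ C        # column k holds the risk of predicting k
    return np.argmin(expected_cost, axis=1) + 1

# With 0-1 costs this reduces to the equal-cost rule (arg max of the posterior).
C01 = 1.0 - np.eye(3)
post = np.array([[0.5, 0.3, 0.2]])
print(bayes_rule_general(post, C01))  # -> [1]
```

Making one row of C large shifts the rule toward the corresponding class, which is how unequal costs change the decision boundaries.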
Bayes Rule for Multiclass Problems
Decision (Discriminating) Functions
For multiclass problems, we generally need to estimate multiple discriminant functions f_k(x), k = 1, ..., K.
Each f_k(x) is associated with class k.
f_k(x) represents the strength of evidence that a sample (x, y) belongs to class k.
The decision rule constructed from the f_k's is

f(x) = k*, where k* = arg max_{k=1,...,K} f_k(x).

The decision boundary of the rule f between classes k and l is

{x : f_k(x) = f_l(x)}.
Traditional Methods for Multiclass Problems
Traditional Methods: Divide and Conquer
Main ideas:
(i) Decompose the multiclass classification problem into multiple binary classification problems.
(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.
Common approaches: simple but effective
One-vs-rest (one-vs-all) approaches
Pairwise (one-vs-one, all-vs-all) approaches
Traditional Methods for Multiclass Problems
One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
(i) Solve K different binary problems: classify "class k" versus "the rest" for k = 1, ..., K.
(ii) Assign a test sample to the class with the largest (most positive) f_k(x), where f_k(x) is the solution of the kth problem.
Properties:
Very simple to implement; performs well in practice.
Not asymptotically optimal: the decision rule is not Fisher consistent if there is no dominating class (i.e., max_k p_k(x) < 1/2).
Read: Rifkin and Klautau (2004), "In Defense of One-vs-all Classification".
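The two steps above can be sketched in a few lines. Here least squares on ±1 targets stands in for the binary learner (an assumption for illustration; any method returning a real-valued score f_k works):

```python
import numpy as np

def ova_fit(X, y, K):
    """Fit K 'class k vs. rest' scorers; least squares is the stand-in learner."""
    Xd = np.hstack([np.ones((len(X), 1)), X])   # add an intercept column
    cols = []
    for k in range(1, K + 1):
        z = np.where(y == k, 1.0, -1.0)         # +1 for class k, -1 for the rest
        bk, *_ = np.linalg.lstsq(Xd, z, rcond=None)
        cols.append(bk)
    return np.array(cols).T                     # (d+1, K) coefficient matrix

def ova_predict(X, B):
    """Assign each point to the class with the largest (most positive) score."""
    Xd = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xd @ B, axis=1) + 1

# Three well-separated, non-collinear classes in R^2:
X = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 0.0], [9.8, 0.1], [5.0, 8.0], [5.1, 7.9]])
y = np.array([1, 1, 2, 2, 3, 3])
B = ova_fit(X, y, K=3)
print(ova_predict(X, B))
```

Note the non-collinear centroids: with three classes lined up on a line, the linear scorers would mask the middle class, exactly as discussed later for linear regression.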
Traditional Methods for Multiclass Problems
Pairwise Approach
Also known as the all-vs-all (AVA) approach.
(i) Solve K(K−1)/2 different binary problems: classify "class k" versus "class j" for all j ≠ k. Each classifier is denoted g_{kj}.
(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.
Properties:
Training is efficient, since each binary problem is small.
If K is large, there are many problems to solve; e.g., K = 10 requires training 45 binary classifiers.
Simple to implement; performs competitively in practice.
Read: Park and Fürnkranz (2007), "Efficient Pairwise Classification".
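The voting step can be sketched as follows; `predict_pair` is a placeholder for any trained binary rule g_{kj} (here a toy nearest-centroid rule on 1-D data, purely for illustration):

```python
from itertools import combinations
import numpy as np

def pairwise_vote(predict_pair, X, K):
    """Query all K(K-1)/2 binary rules; the class with the most votes wins."""
    votes = np.zeros((len(X), K), dtype=int)
    for k, j in combinations(range(1, K + 1), 2):
        labels = predict_pair(k, j, X)      # each returned label is k or j
        for i, lab in enumerate(labels):
            votes[i, lab - 1] += 1
    return np.argmax(votes, axis=1) + 1

# Toy binary rule: pick the nearer of the two class centroids (1-D inputs).
centroids = {1: 0.0, 2: 5.0, 3: 10.0}
def predict_pair(k, j, X):
    return np.where(np.abs(X - centroids[k]) <= np.abs(X - centroids[j]), k, j)

X = np.array([0.5, 4.8, 9.9])
print(pairwise_vote(predict_pair, X, K=3))  # -> [1 2 3]
```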
Linear Regression Models
Linear Classifier for Multiclass Problems
Assume the decision boundaries are linear. Common linear classification methods:
Multivariate linear regression methods
Linear log-odds (logit) models
Linear discriminant analysis (LDA)
Multiple logistic regression
Separating hyperplanes (explicitly model the boundaries as linear)
Perceptron model, SVMs
Linear Regression Models Multivariate Linear Regression
Coding for Linear Regression
For each class k, code it using an indicator variable Z_k:
Z_k = 1 if Y = k;  Z_k = 0 if Y ≠ k.
The response y is coded by the vector z = (z_1, ..., z_K): if Y = k, only the kth component of z equals 1.
The entire training sample can be coded as an n × K matrix, denoted Z.
Call the matrix Z the indicator response matrix.
Linear Regression Models Multivariate Linear Regression
Multivariate Linear Regression
Consider the multivariate model
Z = XB + E,  or  Z_k = f_k ≡ X b_k + ε_k,  k = 1, ..., K.

Z_k is the kth column of the indicator response matrix Z.
X is the n × (d + 1) input matrix.
E is the n × K matrix of errors.
B is the (d + 1) × K matrix of parameters; b_k is the kth column of B.
Linear Regression Models Multivariate Linear Regression
Ordinary Least Squares Estimates
Minimize the residual sum of squares

RSS(B) = Σ_{k=1}^K Σ_{i=1}^n [z_{ik} − β_{0k} − Σ_{j=1}^d β_{jk} x_{ij}]² = tr[(Z − XB)^T (Z − XB)].

The LS estimate has the same form as in the univariate case:

B̂ = (X^T X)^{−1} X^T Z,   Ẑ = X(X^T X)^{−1} X^T Z ≡ X B̂.
The coefficient b̂_k for Z_k is the same as the univariate OLS estimate (multiple outputs do not affect the OLS estimates).
For a point x, compute f̂(x) = [(1, x^T) B̂]^T and classify it as

f(x) = arg max_{k=1,...,K} f̂_k(x).
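The whole procedure fits in a few lines of NumPy. This is a sketch of the indicator-response fit B̂ = (XᵀX)⁻¹XᵀZ (computed via lstsq for numerical stability) together with the arg max prediction:

```python
import numpy as np

def indicator_fit(X, y, K):
    """Least-squares fit of the n x K indicator response matrix Z on (1, X)."""
    n = len(y)
    Z = np.zeros((n, K))
    Z[np.arange(n), y - 1] = 1.0                 # row i has a 1 in column y_i
    Xd = np.hstack([np.ones((n, 1)), X])         # intercept column of 1's
    B, *_ = np.linalg.lstsq(Xd, Z, rcond=None)
    return B                                     # (d+1) x K

def indicator_predict(X, B):
    Xd = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xd @ B, axis=1) + 1

X = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 0.0], [9.8, 0.1], [5.0, 8.0], [5.1, 7.9]])
y = np.array([1, 1, 2, 2, 3, 3])
B = indicator_fit(X, y, K=3)
F = np.hstack([np.ones((len(X), 1)), X]) @ B
print(np.allclose(F.sum(axis=1), 1.0))  # fitted values sum to 1 with an intercept
```

The row sums equal 1 because the rows of Z sum to 1 and the constant vector lies in the column space of (1, X), illustrating the justification on the next slide.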
Linear Regression Models Multivariate Linear Regression
About Multiple Linear Regression
Justification: each f̂_k is an estimate of the conditional expectation of Y_k, which equals the conditional probability of class k:

E(Y_k | X = x) = Pr(Y = k | X = x).

If the intercept is in the model (a column of 1's in X), then Σ_{k=1}^K f̂_k(x) = 1 for any x.

If the linear assumption is appropriate, then the f̂_k give consistent estimates of the class probabilities, k = 1, ..., K, as the sample size n goes to infinity.
Linear Regression Models Multivariate Linear Regression
Limitations of Linear Regression
There is no guarantee that all f̂_k(x) ∈ [0, 1]; some can be negative or greater than 1.
Linear structure of X ’s can be rigid.
Serious masking problem for K ≥ 3 due to rigid linear structure.
Linear Regression Models Multivariate Linear Regression
[Figure 4.2, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Left panel: "Linear Regression"; right panel: "Linear Discriminant Analysis"; axes X1 and X2. Caption: The data come from three classes in R² and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).]
Linear Regression Models Multivariate Linear Regression
[Figure 4.3, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Panel titles: "Degree = 1; Error = 0.25" and "Degree = 2; Error = 0.03". Caption: The effects of masking on linear regression in R for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the red class, y_red is 1 for the red observations and 0 for the green and blue. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]
Linear Regression Models Multivariate Linear Regression
About Masking Problems
One-dimensional example (see the previous figure):
The data are projected onto the line joining the three centroids.
There is no information in the northwest-southeast direction.
The three curves are the regression fits.
The left panel shows the linear regression fit: the middle class is never dominant!
The right panel shows the quadratic fit: the masking problem is solved!
Linear Regression Models Multivariate Linear Regression
How to Solve Masking Problems
In the previous example with K = 3:
Use quadratic regression rather than linear regression.
Linear regression error: 0.25; quadratic regression error: 0.03; Bayes error: 0.025.
In general:
If K ≥ 3 classes are lined up, we need polynomial terms up to degree K − 1 to resolve the masking problem.
In the worst case, O(d^{K−1}) terms are needed.
Although LDA is also based on linear functions, it does not suffer from masking problems.
Linear Regression Models Multivariate Linear Regression
Large K , Small d Problem
Masking often occurs for large K and small d.
Example: vowel data with K = 11 and d = 10 (shown in two dimensions via the LDA projection of the training data).
A difficult classification problem: the class overlap is considerable.
The best methods achieve about 40% test error.

Method               Training Error   Test Error
Linear regression    0.48             0.67
LDA                  0.32             0.56
QDA                  0.01             0.53
Logistic regression  0.22             0.51
Linear Regression Models Multivariate Linear Regression
[Figure, Elements of Statistical Learning, Chapter 4: "Linear Discriminant Analysis" — the vowel training data plotted in the first two LDA coordinates (Coordinate 1 vs. Coordinate 2 for the training data), with the class centroids marked.]
Linear Regression Models Linear Discriminant Analysis
Model Assumptions for Linear Discriminant Analysis
Model setup:
Let π_k be the prior probability of class k.
Let g_k(x) be the class-conditional density of X in class k.
The posterior probability is

Pr(Y = k | X = x) = π_k g_k(x) / Σ_{l=1}^K π_l g_l(x).

Model assumptions:
Each class density is multivariate Gaussian, N(μ_k, Σ_k).
Further, assume Σ_k = Σ for all k (equal covariance is required!).
Linear Regression Models Linear Discriminant Analysis
Linear Discriminant Function
The log-ratio between classes k and l is

log [Pr(Y = k | X = x) / Pr(Y = l | X = x)]
  = log(π_k/π_l) + log[g_k(x)/g_l(x)]
  = [x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k] − [x^T Σ^{−1} μ_l − (1/2) μ_l^T Σ^{−1} μ_l + log π_l].

For each class k, the discriminant function is

f_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k.

The quadratic term in x cancels because of the equal covariance matrices. The decision rule is

Ŷ(x) = arg max_{k=1,...,K} f_k(x).
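The discriminant functions transcribe directly into NumPy. A sketch, where μ_k, Σ, and π_k are assumed known or already estimated:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """f_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(p)
                     for mu, p in zip(mus, priors)])

def lda_classify(x, mus, Sigma, priors):
    return int(np.argmax(lda_discriminants(x, mus, Sigma, priors))) + 1

# With Sigma = I and equal priors this is nearest-centroid classification:
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
print(lda_classify(np.array([0.5, 0.2]), mus, np.eye(2), [1/3, 1/3, 1/3]))  # -> 1
```

Unequal priors shift the boundaries: a point equidistant from two centroids is assigned to the class with the larger π_k.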
Linear Regression Models Linear Discriminant Analysis
[Figure 4.5, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Caption: The left panel shows three Gaussian distributions with the same covariance and different means, including the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right is a sample of 30 drawn from each Gaussian distribution, with the fitted LDA decision boundaries.]
Linear Regression Models Linear Discriminant Analysis
Interpretation of LDA
log Pr(Y = k | X = x) = −(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) + log π_k + const.

The constant term does not involve x.
If the prior probabilities are equal, LDA classifies x to the class whose centroid is closest to x in squared Mahalanobis distance, d_Σ(x, μ), based on the common covariance matrix.
If d_Σ(x, μ_k) = d_Σ(x, μ_l), the priors determine the classification.

Special case: Σ_k = I for all k:

f_k(x) = −(1/2) ‖x − μ_k‖² + log π_k;

only the Euclidean distance is needed!
Linear Regression Models Linear Discriminant Analysis
Mahalanobis Distance
The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.
Def: The Mahalanobis distance of x = (x_1, ..., x_d)^T from a set of points with mean μ = (μ_1, ..., μ_d)^T and covariance matrix Σ is defined as

d_Σ(x, μ) = [(x − μ)^T Σ^{−1} (x − μ)]^{1/2}.

It differs from the Euclidean distance in that
it takes into account the correlations of the data set, and
it is scale-invariant.
It can also be defined as a dissimilarity measure between x and x′ from the same distribution with covariance matrix Σ:

d_Σ(x, x′) = [(x − x′)^T Σ^{−1} (x − x′)]^{1/2}.
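A minimal sketch of the definition, illustrating both properties just listed:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """d_Sigma(x, mu) = sqrt((x - mu)' Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

# With Sigma = I it reduces to the Euclidean distance:
print(mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2)))  # -> 5.0

# Scale invariance: rescaling a coordinate (and Sigma with it) changes nothing.
print(np.isclose(mahalanobis([1.0, 2.0], [0.0, 0.0], np.diag([1.0, 4.0])),
                 mahalanobis([1.0, 4.0], [0.0, 0.0], np.diag([1.0, 16.0]))))
```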
Linear Regression Models Linear Discriminant Analysis
Special Cases
If Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
If Σ is diagonal, diag(σ_1², ..., σ_d²), the resulting distance measure is called the normalized Euclidean distance:

d_Σ(x, x′) = [Σ_{i=1}^d (x_i − x′_i)² / σ_i²]^{1/2},

where σ_i is the standard deviation of x_i and x′_i.
Linear Regression Models Linear Discriminant Analysis
Normalized Distance
Suppose we want to decide whether x belongs to a class. Intuitively, the decision relies on the distance from x to the class center μ: the closer to μ, the more likely x belongs to the class.
To decide whether a given distance is large or small, we need to know whether the points in the class are spread over a large range or a small range. Statistically, we measure the spread by the standard deviation of the distances of the sample points to μ.
If the distance from x to μ is less than one standard deviation, it is highly probable that x belongs to the class; the further away it is, the less likely that x belongs to the class. This is the concept of the normalized distance.
Linear Regression Models Linear Discriminant Analysis
Motivation of Mahalanobis Distance
The drawback of the normalized distance is that the sample points are assumed to be distributed around μ in a spherical manner.
For non-spherical situations, for instance ellipsoidal ones, the probability that x belongs to the class depends not only on the distance from μ but also on the direction.
In directions where the ellipsoid has a short axis, x must be closer; in directions where the axis is long, x can be further from the center.
The ellipsoid that best represents the class's probability distribution can be estimated from the sample covariance matrix.
Linear Regression Models Linear Discriminant Analysis
Linear Decision Boundary of LDA
Linear discriminant functions:

f_k(x) = β_{1k}^T x + β_{0k},   f_j(x) = β_{1j}^T x + β_{0j}.

The decision boundary between classes k and j is

(β_{1k} − β_{1j})^T x + (β_{0k} − β_{0j}) = β_1^T x + β_0 = 0.

Since β_{1k} = Σ^{−1} μ_k and β_{1j} = Σ^{−1} μ_j, the decision boundary has directional vector

β_1 = Σ^{−1} (μ_k − μ_j).

β_1 is generally not in the direction of μ_k − μ_j.
The discriminant direction β_1 minimizes the class overlap for Gaussian data.
Linear Regression Models Linear Discriminant Analysis
Parameter Estimation in LDA
In practice, we estimate the parameters from the training data:
π̂_k = n_k / n, where n_k is the sample size in class k;
μ̂_k = Σ_{Y_i = k} x_i / n_k.
The sample covariance matrix for the kth class is

S_k = (1 / (n_k − 1)) Σ_{Y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T.

The (unbiased) pooled sample covariance is the weighted average

Σ̂ = Σ_{k=1}^K [(n_k − 1) / Σ_{l=1}^K (n_l − 1)] S_k = Σ_{k=1}^K Σ_{Y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (n − K).
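These plug-in estimates can be sketched in NumPy as follows (labels are assumed to be 1, ..., K):

```python
import numpy as np

def lda_estimates(X, y, K):
    """pi_hat_k = n_k/n, mu_hat_k = class means, pooled Sigma_hat with n - K."""
    n, d = X.shape
    priors, mus = [], []
    S = np.zeros((d, d))
    for k in range(1, K + 1):
        Xk = X[y == k]
        priors.append(len(Xk) / n)
        mu = Xk.mean(axis=0)
        mus.append(mu)
        R = Xk - mu
        S += R.T @ R                    # sum of within-class outer products
    return np.array(priors), np.array(mus), S / (n - K)

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 1.0], [12.0, 1.0]])
y = np.array([1, 1, 2, 2])
pi_hat, mu_hat, Sigma_hat = lda_estimates(X, y, K=2)
print(pi_hat, mu_hat[0], Sigma_hat[0, 0])
```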
Linear Regression Models Linear Discriminant Analysis
Implementation of LDA
Compute the eigen-decomposition

Σ̂ = U D U^T.

Sphere the data using Σ̂ via the transformation

X* = D^{−1/2} U^T X.

The common covariance estimate of X* is then the identity matrix.
In the transformed space, classify a point to the closest transformed centroid, adjusting for the π̂_k's.
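A sketch of the sphering step; after the transformation with Σ̂ = UDUᵀ, the sample covariance of the transformed data is (numerically) the identity:

```python
import numpy as np

def sphere(X, Sigma):
    """Return X* with rows x* = D^{-1/2} U' x, where Sigma = U D U'."""
    D, U = np.linalg.eigh(Sigma)           # eigh: Sigma is symmetric
    return (X @ U) / np.sqrt(D)            # divide column j by sqrt(D_j)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.7]])
Sigma = np.cov(X, rowvar=False)
Xs = sphere(X, Sigma)
print(np.allclose(np.cov(Xs, rowvar=False), np.eye(3)))  # -> True
```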
Linear Regression Models Reduced-Rank LDA
An Alternative View
Fisher's reduced-rank LDA:
Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.
The first discriminant coordinate solves

max_{a_1} (a_1^T B a_1) / (a_1^T W a_1),

where B is the covariance matrix of the class centroid matrix M (M is K × d) and W is the within-class covariance.
The other discriminant coordinates solve

max_{a_ℓ} (a_ℓ^T B a_ℓ) / (a_ℓ^T W a_ℓ)  subject to  a_ℓ^T W a_j = 0, j = 1, ..., ℓ − 1.

The solution can be obtained by the eigen-decomposition of W^{−1} B.
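The eigen-decomposition step, sketched for given between-class and within-class matrices B and W (W is assumed invertible):

```python
import numpy as np

def discriminant_coordinates(B, W):
    """Directions a maximizing a'Ba / a'Wa: eigenvectors of W^{-1} B,
    returned in order of decreasing eigenvalue."""
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-vals.real)
    return vals.real[order], vecs.real[:, order]

# With W = I and B = diag(3, 1), the top coordinate is the first axis.
vals, vecs = discriminant_coordinates(np.diag([3.0, 1.0]), np.eye(2))
print(vals)                 # -> [3. 1.]
print(np.abs(vecs[:, 0]))   # -> [1. 0.] (direction is defined up to sign)
```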
Linear Regression Models Reduced-Rank LDA
Dimension Reduction in LDA
LDA allows us to visualize informative low-dimensional projections of the data.
The K centroids in the d-dimensional input space span a subspace H, with dim(H) ≤ K − 1.
Project x* onto H and make the distance comparisons there.
We can ignore distances orthogonal to H, since they contribute equally to each class.
The data can be viewed in H without losing any information; when K ≪ d, this is a dimension reduction!
When K = 3, we can view the data in a two-dimensional plot.
When K > 3, we may wish to find a subspace H_L ⊂ H that is optimal for LDA in some sense.
Linear Regression Models Reduced-Rank LDA
Subspace Optimality
Fisher's optimality criterion for H_L: the projected centroids should be spread out as much as possible in terms of variance.
This amounts to finding the principal component subspaces of the centroids.
These principal components are called discriminant coordinates or canonical variates.
Pick the top-ranked discriminant coordinates; the lower the rank, the less spread out the centroids.
For visualization, we project the data onto the principal components of the K centroids.
Linear Regression Models Reduced-Rank LDA
[Figure 4.8, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4: "Linear Discriminant Analysis". Caption: Four projections onto pairs of canonical variates. Notice that as the rank of the canonical variates increases, the centroids become less spread out. In the lower right panel they appear to be superimposed, and the classes are most confused.]
Linear Regression Models Reduced-Rank LDA
[Figure, Elements of Statistical Learning, Chapter 4: "Classification in Reduced Subspace" — the vowel data and class centroids plotted against Canonical Coordinate 1 and Canonical Coordinate 2.]
Linear Regression Models Reduced-Rank LDA
Generalizing Linear Discriminant Analysis
More flexible, more complex decision boundaries (Chapter 12):
FDA: flexible discriminant analysis
allows nonlinear boundaries: recast the LDA problem as a linear regression problem and then use basis expansions to do the discrimination.
PDA: penalized discriminant analysis
selects variables: penalize the coefficients to be smooth or sparse, which is more interpretable;
sparse discriminant analysis (Clemmensen et al., 2011); R package "sparseLDA".
MDA: mixture discriminant analysis
models each class by a mixture of two or more Gaussians with different centroids, where each component Gaussian (within and between classes) shares the same covariance matrix.
Linear Regression Models Multiple Logistic Regression
Multiple Logistic Models
Model the K posterior probabilities by linear functions of x, using logit transformations:

log [Pr(Y = 1 | X = x) / Pr(Y = K | X = x)] = β_{10} + β_1^T x
log [Pr(Y = 2 | X = x) / Pr(Y = K | X = x)] = β_{20} + β_2^T x
...
log [Pr(Y = K−1 | X = x) / Pr(Y = K | X = x)] = β_{(K−1)0} + β_{K−1}^T x.

The choice of denominator in the odds ratios is arbitrary. The parameter vector is

θ = {β_{10}, β_1, ..., β_{(K−1)0}, β_{K−1}}.
Linear Regression Models Multiple Logistic Regression
Probability Estimates
We use the logit transformation to ensure that
the total probabilities sum to one, and
each estimated probability lies in [0, 1]:

p_k(x) ≡ Pr(Y = k | x) = exp(β_{k0} + β_k^T x) / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)],   k = 1, ..., K − 1,

p_K(x) ≡ Pr(Y = K | x) = 1 / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)].

Check: Σ_{k=1}^K p_k(x) = 1 for any x.
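A sketch of these formulas with class K as the reference (the β values below are illustrative, not fitted):

```python
import numpy as np

def multilogit_probs(x, beta0, beta):
    """p_k = exp(eta_k)/(1 + sum_l exp(eta_l)) for k < K; p_K = 1/(1 + sum_l exp(eta_l)).

    beta0 : (K-1,) intercepts; beta : (K-1, d) slopes; class K is the baseline.
    """
    eta = beta0 + beta @ x            # eta_k = beta_k0 + beta_k' x
    e = np.exp(eta)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)   # (p_1, ..., p_{K-1}, p_K)

# With all coefficients zero, every class gets probability 1/K:
p = multilogit_probs(np.array([1.0, -2.0]), np.zeros(2), np.zeros((2, 2)))
print(p, p.sum())
```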
Linear Regression Models Multiple Logistic Regression
Maximum Likelihood Estimate for Logistic Models
The joint conditional log-likelihood of the y_i given the x_i is

ℓ(θ) = Σ_{i=1}^n log p_{y_i}(θ; x_i).

Using the Newton-Raphson method:
solve iteratively via reweighted least squares (IRLS);
the estimates are consistent if the model is specified correctly.