Lecture 4 – Discriminant Analysis, k-Nearest Neighbors
Johan WågbergDivision of Systems and ControlDepartment of Information TechnologyUppsala University.
Email: [email protected]
The mini project has now started!
Task: Classify songs according to whether Andreas likesthem or not (based on features from the Spotify Web-API).
Training data: 750 songs labeled as like or dislike.
Test data: 200 unlabeled songs. Upload your predictionsto the leaderboard!
Examination: A written report (max 6 pages).
Instructions and data is available at the course homepagehttp://www.it.uu.se/edu/course/homepage/sml/project/
You will receive your group password per e-mail tomorrow. Deadline for submitting your finalsolution: February 21
1 / 30 [email protected]
Summary of Lecture 3 (I/VI)
The classification problem amounts to modeling the relationship between the input xand a categorical output y, i.e., the output belongs to one out of M distinct classes.
A classifier is a prediction model y(x) that maps any input x into a predicted classy ∈ {1, . . . ,M}.
Common classifier predicts each input as belonging to the most likely class accordingto the conditional probabilities
p(y = m |x) for m ∈ {1, . . . ,M}.
2 / 30 [email protected]
Summary of Lecture 3 (II/VI)
For binary (two-class) classification, y ∈ {−1, 1}, the logistic regression model is
p(y = 1 |x) ≈ g(x) = eθTx
1 + eθTx.
The model parameters θ are found by maximum likelihood by (numerically)maximizing the log-likelihood function,
log `(θ) = −n∑
i=1log
(1 + e−yiθ
Txi
).
3 / 30 [email protected]
Summary of Lecture 3 (III/VI)
Using the common classifier, we get the prediction model
y(x) =
1 if g(x) > 0.5 ⇔ θTx > 0
−1 otherwise
This attempts to minimize the total misclassification error.
More generally, we can use
y(x) ={
1 if g(x) > r,
−1 otherwise
where 0 ≤ r ≤ 1 is a user chosen threshold.
4 / 30 [email protected]
Summary of Lecture 3 (IV/VI)
Confusion matrix:Predicted conditiony = 0 y = 1 Total
True y = 0 TN FP Ncondition y = 1 FN TP P
Total N* P*
For the classifier
y(x) ={
1 if g(x) > r,
−1 otherwisethe numbers in the confusion matrix will depend on the threshold r.• Decreasing r ⇒ TN,FN decrease and FP,TP increase.• Increasing r ⇒ TN,FN increase and FP,TP decrease.
5 / 30 [email protected]
Summary of Lecture 3 (V/VI)
Some commonly used performance measures are:
• True positive rate: TPR = TPP = TP
FN + TP ∈ [0, 1]
• False positive rate: FPR = FPN = FP
FP + TN ∈ [0, 1]
• Precision: Prec = TPP* = TP
FP + TP ∈ [0, 1]
6 / 30 [email protected]
Summary of Lecture 3 (VI/VI)
0 0.2 0.4 0.6 0.8 1
False positive rate
0
0.2
0.4
0.6
0.8
1
Tru
e po
sitiv
e ra
te
ROC curve for logistic regression
• ROC: plot of TPR vs. FPR as r ranges from 0 to 1.1• Area Under Curve (AUC): condensed performance measure for the classifier,taking all possible thresholds into account.
1Possible to draw ROC curves based on other tuning parameters as well.7 / 30 [email protected]
Contents – Lecture 4
1. Summary of lecture 32. Generative models3. Linear Discriminant Analysis (LDA)4. Quadratic Discriminant Analysis (QDA)5. A nonparametric classifier – k-Nearest Neighbors (kNN)
8 / 30 [email protected]
Generative models
A discriminative model describes how the output y is generated, i.e. p(y |x).
A generative model describes how both the output y and the input x is generated viap(y,x).
Predictions can be derived using laws of probability
p(y |x) = p(y,x)p(x) = p(y)p(x | y)∫
y p(y,x) .
9 / 30 [email protected]
Generative models for classification
We have built classifiers based on models gm(x?) of the conditional probabilitiesp(y = m |x).
A generative model for classification can be expressed as
p(y = m |x) = p(y = m)p(x | y = m)∑Mm=1 p(y = m)p(x | y = m)
• p(y = m) denotes the marginal (prior) probability of class m ∈ {1, . . . ,M}.• p(x | y = m) denotes the conditional probability density of x for an observationfrom class m.
10 / 30 [email protected]
Multivariate Gaussian density
The p-dimensional Gaussian (normal) probability density function with mean vector µand covariance matrix Σ is,
N (x |µ,Σ) = 1(2π)p/2|Σ|1/2 exp
(−1
2(x− µ)TΣ−1(x− µ)),
where µ : p× 1 vector and Σ : p× p positive definite matrix.
Let x = (x1, . . . , xp)T ∼ N (µ,Σ),• µj is the mean of xj ,• Σjj is the variance of xj ,• Σij (i 6= j) is the covariance between xi and xj .
11 / 30 [email protected]
Multivariate Gaussian density
05
0.05
5
0.1
Val
ue o
f PD
F 0.15
x2
0
x1
0.2
0
-5 -5
Σ =(
1 00 1
)
05
0.1
5
Val
ue o
f PD
F
0.2
x2
0
x1
0.3
0
-5 -5
Σ =(
1 0.40.4 0.5
)12 / 30 [email protected]
Linear discriminant analysis
We need,1. The prior class probabilities πm , p(y = m), m ∈ {1, . . . ,M}.2. The conditional probability densities of the input x, fm(x) , p(x | y = m), for
each class m.This gives us the model
gm(x) = πmfm(x)∑Mm=1 πmfm(x)
13 / 30 [email protected]
Linear discriminant analysis
For the first task, a natural estimator is the proportion of training samples in the mthclass
πm = 1n
n∑i=1
I{yi = m} = nm
n
where n is the size of the training set and nm the number of training samples of classm.
14 / 30 [email protected]
Linear discriminant analysis
For the second task, a simple model is to assume that fm(x) is a multivariate normaldensity with mean vector µm and covariance matrix Σm,
fm(x) = 1(2π)p/2|Σm|1/2 exp
(−1
2(x− µm)TΣ−1m (x− µm)
).
If we further assume that all classes share the same covariance matrix,
Σ def= Σ1 = · · · = ΣM ,
the remaining parameters of the model are
µ1,µ2, . . . ,µM ,Σ.
15 / 30 [email protected]
Linear discriminant analysis
These parameters are naturally estimated as the (within class) sample means andsample covariance, respectively:
µm = 1nm
∑i:yi=m
xi, m = 1, . . . ,M,
Σ = 1n−M
M∑m=1
∑i:yi=m
(xi − µm)(xi − µm)T.
Modeling the class probabilities using the normal assumptions and these parameterestimates is referred to as Linear Discriminant Analysis (LDA).
16 / 30 [email protected]
Linear discriminant analysis, summary
The LDA classifier assigns a test input x to class m for which,
δm(x) = xTΣ−1µm −
12 µT
mΣ−1µm + log πm
is largest, where
πm = nm
nm = 1, . . . ,M,
µm = 1nm
∑i:yi=m
xi, m = 1, . . . ,M,
Σ = 1n−M
M∑m=1
∑i:yi=m
(xi − µm)(xi − µm)T.
LDA is a linear classifier (its decision boundaries are linear).
17 / 30 [email protected]
ex) LDA decision boundary
Illustration of LDA decision boundary – the level curves of two Gaussian PDFs withthe same covariance matrix intersect along a straight line, δ1(x) = δ2(x).
x1
x 2
18 / 30 [email protected]
ex) Simple spam filter
We will use LDA to construct a simple spam filter:• Output: y ∈ {spam, ham}• Input: x = 57-dimensional vector of “features” extracted from the email
• Frequencies of 48 predefined words (make, address, all, . . . )• Frequencies of 6 predefined characters (;, (, [, !, $, #)• Average length of uninterrupted sequences of capital letters• Length of longest uninterrupted sequence of capital letters• Total number of capital letters in the e-mail
• Dataset consists of 4, 601 emails classified as either spam or ham (split into 75 %training and 25 % testing)
UCI Machine Learning Repository — Spambase Dataset(https://archive.ics.uci.edu/ml/datasets/Spambase)
19 / 30 [email protected]
ex) Simple spam filter
LDA confusion matrix (test data):
y = 0 y = 1y = 0 667 23y = 1 102 358
Table: Threshold r = 0.5
y = 0 y = 1y = 0 687 3y = 1 237 223
Table: Threshold r = 0.9
20 / 30 [email protected]
ex) Simple spam filter
A linear classifier uses a linear decision boundary for separating the two classes asmuch as possible.
In Rp this is a (p− 1)-dimensional hyperplane. How can we visualize the classifier?
One way: project the (labeled) test inputs onto the normal of the decision boundary.
-0.5 0 0.5
Projection on normal
0
50
100
150
200Y=0Y=1
21 / 30 [email protected]
Gmail’s spam filter
Official Gmail blog July 9, 2015:
“. . . the spam filter now uses an artificial neural network to detect and block theespecially sneaky spam—the kind that could actually pass for wanted mail.” — SriHarsha Somanchi, Product Manager
Official Gmail blog February 6, 2019:
“With TensorFlow, we are now blocking around 100 million additional spam messagesevery day” — Neil Kumaran, Product Manager, Gmail Security
22 / 30 [email protected]
Quadratic discriminant analysis
Do we have to assume a common covariance matrix?
No! Estimating a separate covariance matrix for each class leads to an alternativemethod, Quadratic Discriminant Analysis (QDA).
Whether we should choose LDA or QDA has to do with the bias-variance trade-off,i.e. the risk of over- or underfitting.
Compared to LDA, QDA. . .• has more parameters• is more flexible (lower bias)• has higher risk of overfitting (larger variance)
23 / 30 [email protected]
ex) QDA decision boundary
Illustration of QDA decision boundary – the level curves of two Gaussian PDFs withdifferent covariance matrices intersect along a curved (quadratic) line.
x1
x 2
Thus, QDA is not a linear classifier.24 / 30 [email protected]
Parametric and nonparametric models
So far we have looked at a few parametric models,• linear regression,• logistic regression,• LDA and QDA,
all of which are parametrized by a fixed-dimensional parameter .
Non-parametric models instead allow the flexibility of the model to grow with theamount of available data.
N can be very flexible (= low bias)H can suffer from overfitting (high variance)H can be compute and memory intensive
As always, the bias-variance trade-off is key also when working with non-parametricmodels.
25 / 30 [email protected]
k-NN
The k-nearest neighbors (k-NN) classifier is a simple non-parametric method.
Given training data T = {(xi, yi)}ni=1, for a test input x?,1. Identify the set
R? = {i : xi is one of the k training data points closest to x?}
2. Classify x? according to a majority vote within the neighborhood R?, i.e. assignx? to class m for which ∑
i∈R?
I{yi = m}
is largest.
26 / 30 [email protected]
ex) k-NN on a toy model
We illustrate the k-NN classifier on a synthetic example where the optimal classifier isknown (colored regions).
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
27 / 30 [email protected]
ex) k-NN on a toy model
The decision boundaries for the k-NN classifier are shown as black lines.
k=1
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
k=10
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
k=100
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
28 / 30 [email protected]
ex) k-NN on a toy model
k=1
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
k=10
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
k=100
-6 -4 -2 0 2 4X
1
-4
-2
0
2
4
X2
Y=1Y=2Y=3
The choice of the tuning parameter k controls the model flexibility:• Small k ⇒ small bias, large variance• Large k ⇒ large bias, small variance
29 / 30 [email protected]
A few concepts to summarize lecture 4
Generative model: A model that describes both the output y and the input x.
Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input x ismultivariate Gaussian for each class, with different means but the same covariance. LDA is a linear classifier.
Quadratic discriminant analysis (QDA): Same as LDA, but where a different covariance matrix is used for eachclass. QDA is not a linear classifier.
Non-parametric model: A model where the flexibility is allowed to grow with the amount of available trainingdata.
k-NN classifier: A simple non-parametric classifier based on classifying a test input x according to the classlabels of the k training samples nearest to x.
30 / 30 [email protected]
Top Related