CS340: Machine Learning
Naive Bayes classifiers
Kevin Murphy
1
Classifiers
• A classifier is a function f that maps input feature vectors x ∈ X to output class labels y ∈ {1, . . . , C}.
• X is the feature space, e.g., X = IR^p or X = {0, 1}^p (we can mix discrete and continuous features).
• We assume the class labels are unordered (categorical) and mutually exclusive. (If we allow an input to belong to multiple classes, this is called a multi-label problem.)
• Goal: to learn f from a labeled training set of N input-output pairs, (x_i, y_i), i = 1 : N.
• We will focus our attention on probabilistic classifiers, i.e., methods that return p(y|x).
• An alternative is to learn a discriminant function f(x) = y(x) to predict the most probable label.
2
Generative vs discriminative classifiers
• Discriminative: directly learn the function that computes the class posterior p(y|x). It discriminates between different classes given the input.
• Generative: learn the class-conditional density p(x|y) for each value of y, and learn the class prior p(y); then apply Bayes rule to compute the posterior

p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x|y) p(y)}{p(x)}

where p(x) = \sum_{y'=1}^{C} p(x|y') p(y').
3
Generative classifiers
We usually use a plug-in approximation for simplicity:

p(y = c|x, D) ≈ p(y = c|x, θ, π) = \frac{p(x|y = c, θ_c) p(y = c|π)}{\sum_{c'} p(x|y = c', θ_{c'}) p(y = c'|π)}

where D is the training data, π are the parameters of the class prior p(y), and θ are the parameters of the class-conditional densities p(x|y).
4
Class prior
• Class prior: p(y = c|π) = π_c
• MLE:

π_c = \frac{\sum_{i=1}^{N} I(y_i = c)}{N} = \frac{N_c}{N}

where N_c is the number of training examples that have class label c.
• Posterior mean (using a Dirichlet prior):

π_c = \frac{N_c + 1}{N + C}
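The two estimators above can be sketched in a few lines, assuming labels are coded 0, . . . , C−1 (the function name `class_priors` is mine, not from the slides):

```python
import numpy as np

def class_priors(y, C):
    """Estimate class priors pi_c from integer labels y in {0, ..., C-1}.

    Returns (mle, post_mean), where post_mean uses a uniform
    Dirichlet(1, ..., 1) prior, matching pi_c = (N_c + 1) / (N + C).
    """
    N = len(y)
    Nc = np.bincount(y, minlength=C)   # N_c = number of examples of class c
    mle = Nc / N                       # MLE: N_c / N
    post_mean = (Nc + 1) / (N + C)     # Dirichlet-smoothed posterior mean
    return mle, post_mean
```

With few examples the smoothed estimate never assigns zero probability to an unseen class, which is the point of the prior.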
5
Gaussian class-conditional densities
Suppose x ∈ IR^2 represents the height and weight of adult Westerners (in inches and pounds respectively), and y ∈ {1, 2} represents male or female. A natural choice for the class-conditional density is a two-dimensional Gaussian

p(x|y = c, θ_c) = N(x|μ_c, Σ_c)

where the mean and covariance matrix depend on the class c.
[Figure: scatter plot of height (55–80 in) vs weight (80–280 lb); dotted red circle = female, solid blue cross = male]
6
Naive Bayes assumption
Assume features are conditionally independent given class.
p(x|y = c, θ_c) = \prod_{d=1}^{p} p(x_d|y = c, θ_c) = \prod_{d=1}^{p} N(x_d|μ_{cd}, σ_{cd})
This is equivalent to assuming that Σc is diagonal.
[Figure: four fitted 1D Gaussians — height|male, height|female, weight|male, weight|female]
7
Training
[Figure: fitted 1D Gaussians for height and weight, male and female]

Training data:

h    w    y
67   125  m
79   210  m
71   150  m
68   140  f
67   142  f
60   120  f

Fitted parameters:

     h                     w
m    μ = 71.66, σ = 3.13   μ = 175.62, σ = 32.40
f    μ = 65.07, σ = 3.19   μ = 129.69, σ = 18.67
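The training step can be sketched by fitting a per-class, per-feature Gaussian by MLE. Applied to the six rows above (these six points alone need not reproduce the slide's fitted numbers, which were presumably computed from a larger sample):

```python
import numpy as np

# Six training examples from the table: (height, weight) and labels.
X = np.array([[67, 125], [79, 210], [71, 150],
              [68, 140], [67, 142], [60, 120]], dtype=float)
y = np.array(['m', 'm', 'm', 'f', 'f', 'f'])

# Diagonal-Gaussian MLE per class: per-feature mean and std.
params = {}
for c in np.unique(y):
    Xc = X[y == c]
    params[c] = (Xc.mean(axis=0), Xc.std(axis=0))  # biased (MLE) std
```

The naive Bayes assumption shows up here as fitting each feature's 1D Gaussian independently, rather than a full covariance matrix.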
8
Testing
p(y = m|x) = \frac{p(x|y = m) p(y = m)}{p(x|y = m) p(y = m) + p(x|y = f) p(y = f)}

Let us assume p(y = m) = p(y = f) = 0.5. Then

p(y = m|x) = \frac{p(x|y = m)}{p(x|y = m) + p(x|y = f)}
= \frac{p(x_h|y = m) p(x_w|y = m)}{p(x_h|y = m) p(x_w|y = m) + p(x_h|y = f) p(x_w|y = f)}
= \frac{N(x_h; μ_{mh}, σ_{mh}) N(x_w; μ_{mw}, σ_{mw})}{N(x_h; μ_{mh}, σ_{mh}) N(x_w; μ_{mw}, σ_{mw}) + N(x_h; μ_{fh}, σ_{fh}) N(x_w; μ_{fw}, σ_{fw})}
9
Testing
[Figure: fitted 1D Gaussians for height and weight, male and female]

     h                     w
m    μ = 71.66, σ = 3.13   μ = 175.62, σ = 32.40
f    μ = 65.07, σ = 3.19   μ = 129.69, σ = 18.67

Test cases:

h    w    p(y = m|x)
72   180
60   100
68   155
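The test computation can be sketched directly from the formula, plugging in the fitted parameters from the table (the helper names `normpdf` and `p_male` are mine):

```python
import math

def normpdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Fitted parameters from the table: (mu_h, sigma_h, mu_w, sigma_w).
male   = (71.66, 3.13, 175.62, 32.40)
female = (65.07, 3.19, 129.69, 18.67)

def p_male(h, w):
    """p(y = m | x) under naive Bayes with equal class priors."""
    lm = normpdf(h, male[0], male[1]) * normpdf(w, male[2], male[3])
    lf = normpdf(h, female[0], female[1]) * normpdf(w, female[2], female[3])
    return lm / (lm + lf)
```

The first test case (72, 180) lies near the male means and the second (60, 100) near the female means, so the posteriors come out close to 1 and 0 respectively.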
10
Binary data
• Suppose we want to classify email into spam vs non-spam.
• A simple way to represent a text document (such as email) is as a bag of words.
• Let xd = 1 if word d occurs in the document and xd = 0 otherwise.
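A minimal sketch of this binary encoding, with a toy vocabulary of my own choosing:

```python
def binarize(document, vocabulary):
    """Binary bag of words: x_d = 1 iff vocabulary word d occurs in the document."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocab = ["cash", "free", "meeting", "viagra"]   # illustrative toy vocabulary
x = binarize("Free cash now, totally free", vocab)   # -> [1, 1, 0, 0]
```

Note that word counts are discarded: a word occurring five times contributes the same x_d = 1 as a word occurring once.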
[Figure: training set — binary document × word matrix (~900 documents, ~600 words)]
11
Parameter estimation
The class-conditional density becomes a product of Bernoullis:

p(\vec{x}|Y = c, θ) = \prod_{d=1}^{p} θ_{cd}^{x_d} (1 − θ_{cd})^{1−x_d}

MLE:

θ_{cd} = \frac{N_{cd}}{N_c}

where N_{cd} is the number of class-c examples in which word d occurs. Posterior mean (with a Beta(1,1) prior):

θ_{cd} = \frac{N_{cd} + 1}{N_c + 2}
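A sketch of the posterior-mean estimate (Beta(1,1) smoothing), assuming a binary document-word matrix X and integer labels (the function name is mine):

```python
import numpy as np

def fit_bernoulli_nb(X, y, C):
    """Estimate theta[c, d] = p(x_d = 1 | y = c) with a Beta(1,1) prior.

    X: (N, p) binary matrix; y: length-N labels in {0, ..., C-1}.
    Posterior mean from the slide: (N_cd + 1) / (N_c + 2).
    """
    N, p = X.shape
    theta = np.zeros((C, p))
    for c in range(C):
        Xc = X[y == c]
        Ncd = Xc.sum(axis=0)                # count of word d in class c
        theta[c] = (Ncd + 1) / (len(Xc) + 2)
    return theta
```

As with the class prior, the smoothing keeps θ_{cd} strictly inside (0, 1), so no single unseen word can zero out a class's likelihood.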
12
Fitted class conditional densities p(x = 1|y = c)
[Figure: fitted word-occurrence probabilities p(x_d = 1|y = c) over ~700 words, for class 1 (top) and class 2 (bottom)]
13
Numerical issues
When computing
P(Y = c|\vec{x}) = \frac{P(\vec{x}|Y = c) P(Y = c)}{\sum_{c'=1}^{C} P(\vec{x}|Y = c') P(Y = c')}

you will often encounter numerical underflow, since p(\vec{x}, y = c) is very small. Take logs and define

b_c \stackrel{\mathrm{def}}{=} \log[P(\vec{x}|Y = c) P(Y = c)]

so that

\log P(Y = c|\vec{x}) = b_c − \log \sum_{c'=1}^{C} e^{b_{c'}}

but e^{b_c} will underflow!
14
Log-sum-exp trick
\log(e^{−120} + e^{−121}) = \log\left(e^{−120}(e^{0} + e^{−1})\right) = \log(e^{0} + e^{−1}) − 120

In general

\log \sum_c e^{b_c} = \log\left[\left(\sum_c e^{b_c}\right) e^{−B} e^{B}\right] = \log\left[\left(\sum_c e^{b_c − B}\right) e^{B}\right] = \left[\log\left(\sum_c e^{b_c − B}\right)\right] + B

where B = max_c b_c. In Matlab, use logsumexp.m.
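The trick translates directly to code; a minimal Python version of what the slide's logsumexp.m computes:

```python
import math

def logsumexp(bs):
    """Compute log(sum_c exp(b_c)) stably by shifting by B = max_c b_c."""
    B = max(bs)
    return B + math.log(sum(math.exp(b - B) for b in bs))
```

For b_c around −1200, each exp(b_c) underflows to 0 and the naive log(sum(exp)) fails, while the shifted version returns the correct answer.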
15
Softmax function
p(y = c|\vec{x}, θ, π) = \frac{p(x|y = c) p(y = c)}{\sum_{c'} p(x|y = c') p(y = c')}
= \frac{\exp[\log p(x|y = c) + \log p(y = c)]}{\sum_{c'} \exp[\log p(x|y = c') + \log p(y = c')]}
= \frac{\exp\left[\log π_c + \sum_d x_d \log θ_{cd}\right]}{\sum_{c'} \exp\left[\log π_{c'} + \sum_d x_d \log θ_{c'd}\right]}

Now define vectors

\vec{x} = [1, x_1, . . . , x_p]
β_c = [\log π_c, \log θ_{c1}, . . . , \log θ_{cp}]

Hence

p(y = c|\vec{x}, β) = \frac{\exp[β_c^T \vec{x}]}{\sum_{c'} \exp[β_{c'}^T \vec{x}]}
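This normalized-exponential form is the softmax function; a minimal, numerically stable sketch (shifting by the max is the same log-sum-exp idea as above):

```python
import math

def softmax(scores):
    """p(y = c) = exp(s_c) / sum_c' exp(s_c'), shifted by max for stability."""
    B = max(scores)
    exps = [math.exp(s - B) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]
```

Feeding it the scores β_c^T x for each class c yields the class posterior vector.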
16
Logistic function
If y is binary
p(y = 1|x) = \frac{e^{β_1^T x}}{e^{β_1^T x} + e^{β_2^T x}} = \frac{1}{1 + e^{(β_2 − β_1)^T x}} = \frac{1}{1 + e^{−w^T x}} = σ(w^T x)

where we have defined w = β_1 − β_2 and σ(·) is the logistic or sigmoid function

σ(u) = \frac{1}{1 + e^{−u}}
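A quick sketch, checking numerically that the two-class softmax reduces to the sigmoid of the score difference:

```python
import math

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

# Two-class posterior e^{b1} / (e^{b1} + e^{b2}) equals sigmoid(b1 - b2).
b1, b2 = 2.0, -1.0
two_class = math.exp(b1) / (math.exp(b1) + math.exp(b2))
```

Only the difference of the two scores matters, which is why the binary case needs a single weight vector w rather than one per class.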
17