Bayes Decision Rule and Naïve Bayes Classifier
Machine Learning I CSE 6740, Fall 2013
Le Song
Gaussian mixture model
A density model p(X) may be multi-modal: model it as a mixture of uni-modal distributions (e.g. Gaussians)
Consider a mixture of K Gaussians
p(X) = Σ_{k=1}^K π_k 𝒩(X|μ_k, Σ_k)
where π_k is the mixing proportion and 𝒩(μ_k, Σ_k) is the k-th mixture component
Learn π_k ∈ [0, 1], μ_k, Σ_k, with Σ_{k=1}^K π_k = 1
Imagine generating your data …
For each data point x_i:
Randomly choose a mixture component z_i ∈ {1, 2, …, K}, with probability p(z_i|π) = π_{z_i}
Then sample the actual value of x_i from a Gaussian distribution
p(x_i|z_i) = 𝒩(x_i|μ_{z_i}, Σ_{z_i})
Joint distribution over x and z: p(x, z) = p(z) p(x|z) = π_z 𝒩(x|μ_z, Σ_z)
Marginal distribution p(x):
p(x) = Σ_{z=1}^K p(x, z) = Σ_{z=1}^K π_z 𝒩(x|μ_z, Σ_z)
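The two-step generative process above can be sketched in a few lines of Python; this is a hypothetical 2-component mixture in 2-D, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-D (not parameters from the lecture).
pis = np.array([0.3, 0.7])                       # mixing proportions, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

def sample_gmm(m):
    """Generate m points by the two-step process: draw z_i, then x_i | z_i."""
    z = rng.choice(len(pis), size=m, p=pis)      # z_i ~ Categorical(pi)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)
```

About 30% of the sampled points should come from the first component and cluster around (0, 0), the rest around (4, 4).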
Bayes rule
posterior = likelihood × prior / normalization constant:
P(z|x) = P(x|z) P(z) / P(x) = P(x, z) / Σ_z P(x, z)
Prior: p(z) = π_z    Likelihood: p(x|z) = 𝒩(x|μ_z, Σ_z)
Posterior: p(z|x) = π_z 𝒩(x|μ_z, Σ_z) / Σ_{z'} π_{z'} 𝒩(x|μ_{z'}, Σ_{z'})
Details of EM
We intend to learn the parameters that maximize the log-likelihood of the data
l(θ; D) = Σ_{i=1}^m log Σ_{z_i=1}^K p(x_i, z_i|θ)
This objective is nonconvex — maximizing it directly is difficult!
Expectation step (E-step): what do we take the expectation over?
f(θ) = E_{q(z_1, z_2, …, z_m)}[Σ_{i=1}^m log p(x_i, z_i|θ)]
Maximization step (M-step): how to maximize? θ^{t+1} = argmax_θ f(θ)
E-step: what is q(z_1, z_2, …, z_m)?
q(z_1, z_2, …, z_m): posterior distribution of the latent variables
q(z_1, z_2, …, z_m) = Π_{i=1}^m p(z_i|x_i, θ^t)
For each data point x_i, compute p(z_i = k|x_i) for each k
τ_k^i = p(z_i = k|x_i) = π_k 𝒩(x_i|μ_k, Σ_k) / Σ_{k'=1}^K π_{k'} 𝒩(x_i|μ_{k'}, Σ_{k'})
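The responsibilities τ_k^i are just Bayes rule applied per point; a minimal numpy sketch (the helper names `gaussian_pdf` and `responsibilities` are introduced here, not from the lecture):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    d = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # (x-mu)^T Sigma^-1 (x-mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def responsibilities(X, pis, mus, Sigmas):
    """tau[i, k] = p(z_i = k | x_i): weight each component density by its
    mixing proportion, then normalize across components (Bayes rule)."""
    weighted = np.stack([pis[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                         for k in range(len(pis))], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Illustrative call: two well-separated components, one point at each mean.
X = np.array([[0.0, 0.0], [4.0, 4.0]])
tau = responsibilities(X, np.array([0.5, 0.5]),
                       np.array([[0.0, 0.0], [4.0, 4.0]]),
                       np.array([np.eye(2), np.eye(2)]))
```

Each row of `tau` sums to 1, and a point sitting on a component mean gets nearly all of its responsibility from that component.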
E-step: compute the expectation
f(θ) = E_{q(z_1, z_2, …, z_m)}[Σ_{i=1}^m log p(x_i, z_i|θ)]
     = Σ_{i=1}^m E_{p(z_i|x_i, θ^t)}[log p(x_i, z_i|θ)]
In the case of a Gaussian mixture:
f(θ) = Σ_{i=1}^m E_{p(z_i|x_i, θ^t)}[log π_{z_i} − ½(x_i − μ_{z_i})^⊤ Σ_{z_i}^{−1}(x_i − μ_{z_i}) − ½ log|Σ_{z_i}|] + c
     = Σ_{i=1}^m Σ_{k=1}^K τ_k^i [log π_k − ½(x_i − μ_k)^⊤ Σ_k^{−1}(x_i − μ_k) − ½ log|Σ_k|] + c
M-step: maximize f(θ)
f(θ) = Σ_{i=1}^m Σ_{k=1}^K τ_k^i [log π_k − ½(x_i − μ_k)^⊤ Σ_k^{−1}(x_i − μ_k) − ½ log|Σ_k|] + c
For instance, we want to find π_k, subject to Σ_{k=1}^K π_k = 1
Form the Lagrangian
L = Σ_{i=1}^m Σ_{k=1}^K τ_k^i log π_k + other terms + λ(1 − Σ_{k=1}^K π_k)
Take the partial derivative and set it to 0
∂L/∂π_k = Σ_{i=1}^m τ_k^i / π_k − λ = 0
⇒ π_k = (1/λ) Σ_{i=1}^m τ_k^i
⇒ λ = m (sum over k and use Σ_k π_k = 1, Σ_k τ_k^i = 1)
⇒ π_k = (1/m) Σ_{i=1}^m τ_k^i
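The π_k update above, together with the analogous closed-form updates for μ_k and Σ_k (obtained the same way, by setting the corresponding partial derivatives to zero), can be sketched as one M-step function; the name `m_step` and the example data are illustrative:

```python
import numpy as np

def m_step(X, tau):
    """Closed-form M-step for a Gaussian mixture, given the E-step
    responsibilities tau[i, k] = p(z_i = k | x_i, theta_t)."""
    m, d = X.shape
    Nk = tau.sum(axis=0)                 # effective count per component
    pis = Nk / m                         # pi_k = (1/m) sum_i tau_k^i
    mus = (tau.T @ X) / Nk[:, None]      # responsibility-weighted means
    Sigmas = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mus[k]
        Sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, Sigmas

# Illustrative call with hard (0/1) responsibilities.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pis, mus, Sigmas = m_step(X, tau)
```

With hard assignments the updates reduce to per-cluster sample proportions, means, and covariances, which is a quick sanity check on the formulas.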
EM graphically
[Figure: f(θ) is a lower bound on the log-likelihood l(θ; D); the E-step makes the bound tight at the current θ^t, and the M-step moves to θ^{t+1}.]
Experiments with mixture of 3 Gaussians
Why do we need density estimation?
Learn the “shape” of the data cloud
Assess the likelihood of seeing a particular data point
Is this a typical data point? (high density value)
Is this an abnormal data point / outlier? (low density value)
Building block for more sophisticated learning algorithms
Classification, regression, graphical models …
Outlier detection
Declare data points in low-density regions as outliers
First fit a density p(x), then set a threshold t
Data points in regions with density smaller than t, i.e., p(x) ≤ t, are declared outliers
Geometric method for outlier detection
Rather than first fitting a density and then thresholding, directly find the boundary between inliers and outliers
[Figure: scatter of data points with the learned boundary separating inliers from outliers.]
Minimum enclosing ball
Find the smallest ball, parameterized by a center c and radius r, that includes all data points.
Optimization problem (a quadratically constrained quadratic program):
min_{r,c} r²
s.t. (x_i − c)^⊤(x_i − c) ≤ r², ∀i = 1, …, m
Find center c and radius r
Lagrangian of the problem:
L = r² + Σ_{i=1}^m α_i [(x_i − c)^⊤(x_i − c) − r²], α_i ≥ 0
Set the partial derivatives of L to 0
∂L/∂r = 2r − 2r Σ_{i=1}^m α_i = 0 ⇒ Σ_{i=1}^m α_i = 1
∂L/∂c = Σ_{i=1}^m α_i(−2x_i + 2c) = 0 ⇒ c = Σ_{i=1}^m α_i x_i
c and r are not yet in explicit form
Manipulate the Lagrangian
L = r² + Σ_{i=1}^m α_i [(x_i − c)^⊤(x_i − c) − r²]
  = r²(1 − Σ_{i=1}^m α_i) + Σ_{i=1}^m α_i x_i^⊤ x_i − 2c^⊤ Σ_{i=1}^m α_i x_i + (Σ_{i=1}^m α_i) c^⊤ c
  = Σ_{i=1}^m α_i x_i^⊤ x_i − Σ_{i=1}^m Σ_{j=1}^m α_i α_j x_i^⊤ x_j
(using Σ_{i=1}^m α_i = 1 and c = Σ_{i=1}^m α_i x_i)
Express the Lagrangian L in α
max_α L(α) = Σ_{i=1}^m α_i x_i^⊤ x_i − Σ_{i=1}^m Σ_{j=1}^m α_i α_j x_i^⊤ x_j
s.t. Σ_{i=1}^m α_i = 1, α_i ≥ 0
A quadratic programming problem L = b^⊤α − α^⊤Aα with linear constraints, where A_ij = x_i^⊤ x_j and b_i = x_i^⊤ x_i
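One simple way to approximately solve this dual without a general-purpose QP solver is the Badoiu–Clarkson (Frank–Wolfe) scheme: it maintains c = Σ_i α_i x_i implicitly and repeatedly shifts weight toward the point that most violates the ball constraint. This is a swapped-in approximation for illustration, not a method named in the lecture:

```python
import numpy as np

def min_enclosing_ball(X, iters=2000):
    """Approximate minimum enclosing ball via Frank-Wolfe on the dual:
    c stays a convex combination of data points; each step moves c toward
    the current farthest point with step size 1/(t+1)."""
    c = X[0].copy()
    for t in range(1, iters + 1):
        far = np.argmax(((X - c) ** 2).sum(axis=1))  # most-violated constraint
        c += (X[far] - c) / (t + 1)
    r = np.sqrt(((X - c) ** 2).sum(axis=1).max())    # radius covering all points
    return c, r

# Illustrative data: four points on the unit circle -> ball centered at origin.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
c, r = min_enclosing_ball(X)
```

The scheme gives a (1 + ε)-approximation after O(1/ε) iterations, which is often good enough for the outlier-detection use here.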
Experiment with minimum enclosing ball
Classification
Represent the data
A label is provided for each data point, e.g., y ∈ {−1, +1}
Classifier
Outline
What is theoretically the best classifier?
Bayes decision rule for minimum error.
What is the difficulty of achieving the minimum error?
What do we do in practice?
Decision-making: divide the high-dimensional space
Distributions of samples from normal (positive class) and abnormal (negative class) tissues
Class conditional distribution
In the classification case, we are given the label for each data point
Class conditional distributions: P(x|y = 1), P(x|y = −1) (how to compute?)
Class priors: P(y = 1), P(y = −1) (how to compute?)
p(x|y = 1) = 𝒩(x; μ_1, Σ_1)
p(x|y = −1) = 𝒩(x; μ_2, Σ_2)
How to come up with the decision boundary?
Given the class conditional distributions P(x|y = 1), P(x|y = −1), and the class priors P(y = 1), P(y = −1):
P(x|y = 1) = 𝒩(x; μ_1, Σ_1)
P(x|y = −1) = 𝒩(x; μ_2, Σ_2)
Use Bayes rule
posterior = likelihood × prior / normalization constant:
P(y|x) = P(x|y) P(y) / P(x) = P(x, y) / Σ_y P(x, y)
Prior: P(y)    Likelihood (class conditional distribution): p(x|y) = 𝒩(x|μ_y, Σ_y)
Posterior: P(y|x) = P(y) 𝒩(x|μ_y, Σ_y) / Σ_{y'} P(y') 𝒩(x|μ_{y'}, Σ_{y'})
Bayes Decision Rule
Learning: prior p(y) and class conditional distribution p(x|y)
The posterior probability of a test point
q_i(x) ≔ P(y = i|x) = P(x|y = i) P(y = i) / P(x)
Bayes decision rule:
If q_i(x) > q_j(x), then y = i, otherwise y = j
Alternatively:
If the ratio l(x) = P(x|y = i) / P(x|y = j) > P(y = j) / P(y = i), then y = i, otherwise y = j
Or look at the log-likelihood ratio h(x) = − ln [q_i(x) / q_j(x)]
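A minimal sketch of the Bayes decision rule on a hypothetical 1-D problem with Gaussian class conditionals; all names and parameter values are illustrative:

```python
import numpy as np

# Illustrative 1-D setup: equal priors, unit-variance class conditionals.
prior = {+1: 0.5, -1: 0.5}
mu    = {+1: 0.0, -1: 3.0}
sigma = {+1: 1.0, -1: 1.0}

def likelihood(x, y):
    """Class conditional density p(x|y) = N(x; mu_y, sigma_y^2)."""
    return (np.exp(-0.5 * ((x - mu[y]) / sigma[y]) ** 2)
            / (sigma[y] * np.sqrt(2 * np.pi)))

def bayes_classify(x):
    """Pick the label with the larger unnormalized posterior P(x|y) P(y);
    the normalizer P(x) is the same for both labels, so it can be dropped."""
    score = {y: likelihood(x, y) * prior[y] for y in (+1, -1)}
    return max(score, key=score.get)
```

With equal priors and equal variances, the decision boundary sits at the midpoint x = 1.5 between the two means, matching the likelihood-ratio form of the rule.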
Example: Gaussian class conditional distributions
Depending on the Gaussian distributions, the decision boundary can be very different
Decision boundary: h(x) = − ln [q_i(x) / q_j(x)] = 0
Is the Bayes decision rule optimal?
Optimal with respect to what criterion?
Bayes error: expected probability of a sample being incorrectly classified
Given a sample x, we compute the posterior distributions q_1(x) ≔ P(y = 1|x) and q_2(x) ≔ P(y = 2|x); the probability of error is
r(x) = min{q_1(x), q_2(x)}
Bayes error: E_{p(x)}[r(x)] = ∫ r(x) p(x) dx
Bayes error
The Bayes error:
ε = E_{p(x)}[r(x)] = ∫ r(x) p(x) dx
  = ∫ min{q_1(x), q_2(x)} p(x) dx
  = ∫ min{P(y = 1) p(x|y = 1) / p(x), P(y = 2) p(x|y = 2) / p(x)} p(x) dx
  = ∫ min{P(y = 1) p(x|y = 1), P(y = 2) p(x|y = 2)} dx
Bayes error II
ε = ∫ min{P(y = 1) p(x|y = 1), P(y = 2) p(x|y = 2)} dx
  = ∫_{L_1} P(y = 1) p(x|y = 1) dx + ∫_{L_2} P(y = 2) p(x|y = 2) dx
[Figure: the curves P(y = 1)p(x|y = 1) and P(y = 2)p(x|y = 2); L_1 and L_2 are the regions where each term is the smaller of the two.]
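In 1-D the Bayes error integral can be approximated numerically by summing the pointwise minimum of the two weighted class densities on a grid; a sketch with illustrative parameters (equal priors, unit variances, means 0 and 2):

```python
import numpy as np

xs = np.linspace(-10.0, 12.0, 20001)   # grid covering both densities
dx = xs[1] - xs[0]

def normal(x, mu):
    """Unit-variance Gaussian density N(x; mu, 1)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

g1 = 0.5 * normal(xs, 0.0)   # P(y = 1) p(x|y = 1)
g2 = 0.5 * normal(xs, 2.0)   # P(y = 2) p(x|y = 2)

# epsilon = integral of min{g1, g2} dx, approximated by a Riemann sum.
bayes_error = np.minimum(g1, g2).sum() * dx
```

For this symmetric setup the minimum switches over at the midpoint x = 1, so the result equals the Gaussian tail mass beyond one standard deviation from each mean.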
More on Bayes error
The Bayes error is a lower bound on the probability of classification error
The Bayes decision rule is the theoretically best classifier, minimizing the probability of classification error
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation
Need to compute integrals, e.g. ∫_{L_1} P(y = 1) p(x|y = 1) dx
What do people do in practice?
Use simplifying assumptions for p(x|y = 1)
Assume p(x|y = 1) is Gaussian
Assume p(x|y = 1) is fully factorized
Rather than model the density p(x|y = 1), directly go for the decision boundary h(x) = − ln [q_i(x) / q_j(x)]
Logistic regression
Support vector machines
Neural networks
Naïve Bayes Classifier
Use the Bayes decision rule for classification
P(y|x) = P(x|y) P(y) / P(x)
But assume p(x|y = 1) is fully factorized
p(x|y = 1) = Π_{i=1}^d p(x_i|y = 1)
(here x_i denotes the i-th dimension of x)
In other words, the variables corresponding to each dimension of the data are independent given the label
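A Gaussian naïve Bayes classifier built on this factorized assumption can be sketched as follows; the function names, example data, and the small variance floor `1e-9` are assumptions for illustration:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per-class prior plus an independent 1-D Gaussian per dimension:
    p(x|y) = prod_i N(x_i | mu_{y,i}, sigma_{y,i}^2)."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        model[label] = (len(Xc) / len(X),       # prior P(y)
                        Xc.mean(axis=0),        # per-dimension means
                        Xc.var(axis=0) + 1e-9)  # per-dimension variances (floored)
    return model

def predict(model, x):
    """Bayes decision rule on log P(y) + sum_i log N(x_i | mu_i, var_i)."""
    best, best_score = None, -np.inf
    for label, (prior, mu, var) in model.items():
        score = np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                             + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = label, score
    return best

# Illustrative data: two well-separated clusters with labels -1 and +1.
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
y = np.array([-1, -1, 1, 1])
model = fit_naive_bayes(X, y)
```

The factorized likelihood means training only needs d one-dimensional estimates per class instead of a full d-dimensional density, which is what makes naïve Bayes cheap.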