Naïve Bayes - Institute of Computational Linguistics
2007-4-3 Wang Houfeng, Icler, PKU 1
Naïve Bayes
Wang Houfeng
Institute of Computational Linguistics
Peking University
Outline
Introduction
• Maximum Likelihood Estimation
• Naïve Bayesian Classification
Introduction
Bayesian learning enables us to form predictions based on probabilities. It provides a framework for probabilistic reasoning. The advantages:
• It tolerates noise in the data;
• It can incorporate "prior knowledge" in constructing a hypothesis;
• Predictions are probabilistic;
• The framework provides a view of optimal decision making.
Probability Definition
• A probability is a function from an event space to a real number between 0 and 1 (inclusive), where:
  – 0 ≤ P(A) ≤ 1
    • 0 indicates impossibility
    • 1 indicates certainty
  – P(Ω) = 1, where Ω is the sample space
  – P(X) ≤ P(Y) for any X ⊆ Y
  – If A1, A2, …, An are pairwise disjoint (Ai ∩ Aj = ∅ for i ≠ j), then

    P(∪_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai)
Interpretation of probability
• Relative Frequency
  – Suppose that an experiment is performed n times and an event A occurs f times. The relative frequency of A is:

    f_n = (number of times A occurs) / n = f / n

  – If we let n get infinitely large:

    P(A) = lim_{n→∞} f_n
Interpretation of probability
• In frequentist statistics, probabilities are associated only with the data, i.e., outcomes of repeatable observations:
Probability = limiting frequency
Some Rules
• For any two events, A and B, the probability of their union, P(A ∪ B), is

  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• A Special Case: when two events A and B are mutually exclusive, P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).
Independence
• Two events A and B are independent if and only if:

  P(B|A) = P(B)
  or P(A|B) = P(A)
  or P(A ∩ B) = P(A) P(B)
Probabilistic Classification
• Input: x = [x1, x2]^T; Output: C ∈ {−1, 1}
• Prediction:

  choose C = 1 if P(C = 1 | x1, x2) > 0.5, otherwise choose C = −1

  or equivalently:

  choose C = 1 if P(C = 1 | x1, x2) > P(C = −1 | x1, x2), otherwise choose C = −1
Basics of Probabilistic Learning
• Goal: find the best hypothesis from some space H of hypotheses, given the observed data D;
• Define "best" to be: the most probable hypothesis in H;
• In order to do that, we need to assume a probability distribution over H;
• In addition, we need to know something about the relation between the observed data and the hypotheses.
Bayes Theorem
Hypothesis space: H; dataset: D. Four probabilities are introduced as follows:
• P(h): the prior probability of h ∈ H before any data is observed. It reflects background knowledge; if no information is available, a uniform distribution is chosen.
• P(D): the probability of seeing data D; it is the evidence (with no knowledge of the hypothesis).
• P(D|h): the probability of the data given h. It is called the likelihood of h with respect to D.
• P(h|D): the posterior probability of h after having seen data D, i.e., the probability that h is the target.
Bayes Theorem
Bayes theorem relates the posterior probability of a hypothesis given the data to the three probabilities mentioned before:

  P(h|D) = P(D|h) · P(h) / P(D)

where P(h|D) is the posterior probability, P(D|h) the likelihood, P(h) the prior probability, and P(D) the evidence.
Hypotheses in Bayesian
• Hypotheses h refers to processes that could have generated the data D
• Bayesian inference provides a distribution over these hypotheses, given D
• P(D|h) is the probability of D being generated by the process identified by h
• Hypotheses h are mutually exclusive: only one process could have generated D
The origin of Bayes’ rule
• For any two random variables:

  p(A, B) = p(A) p(B|A)
  p(A, B) = p(B) p(A|B)

  Equating the two factorizations:

  p(B) p(A|B) = p(A) p(B|A)

  p(A|B) = p(B|A) p(A) / p(B)
Bayes’ rule in odds form
D: data
h1, h2: models
P(h1|D): posterior probability that h1 generated the data
P(D|h1): likelihood of the data under model h1
P(h1): prior probability that h1 generated the data

  P(h1|D) / P(h2|D) = [P(h1, D) / P(D)] / [P(h2, D) / P(D)]
                    = P(h1, D) / P(h2, D)
                    = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]

where P(D|h1) / P(D|h2) is the likelihood ratio.
Comparing two hypotheses
D: HHTHT
Hypothesis h1: "fair coin"; hypothesis h2: "always heads"
P(D|h1) = 1/2^5    P(h1) = 999/1000
P(D|h2) = 0        P(h2) = 1/1000

P(h1|D) / P(h2|D) = infinity

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
Comparing two hypotheses
D: HHHHH
Hypothesis h1: "fair coin"; hypothesis h2: "always heads"
P(D|h1) = 1/2^5    P(h1) = 999/1000
P(D|h2) = 1        P(h2) = 1/1000

P(h1|D) / P(h2|D) ≈ 30

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
Comparing two hypotheses
D: HHHHHHHHHH
h1, h2: "fair coin", "always heads"
P(D|h1) = 1/2^10   P(h1) = 999/1000
P(D|h2) = 1        P(h2) = 1/1000

P(h1|D) / P(h2|D) ≈ 1

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
h1 vs. h2
• Hypotheses h1:“fair coin”; Hypotheses h2: “always heads”
• Which one is better?
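The three comparisons above can be reproduced numerically. A minimal sketch (the function name `posterior_odds` and the dataset strings are illustrative):

```python
from math import inf

def posterior_odds(data, prior1=999/1000, prior2=1/1000):
    """Posterior odds P(h1|D)/P(h2|D): h1 = fair coin, h2 = always heads."""
    like1 = 0.5 ** len(data)                             # P(D|h1)
    like2 = 1.0 if all(c == "H" for c in data) else 0.0  # P(D|h2)
    num, den = like1 * prior1, like2 * prior2
    return num / den if den > 0 else inf

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))  # inf, then ~31 (≈30), then ~0.98 (≈1)
```

The prior odds of 999:1 favor the fair coin, so it takes more than five heads in a row before the "always heads" hypothesis catches up.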
For K (>2) Classes

  P(hi|D) = P(D|hi) P(hi) / P(D)
          = P(D|hi) P(hi) / Σ_{k=1}^{K} P(D|hk) P(hk)

  where P(hi) ≥ 0 and Σ_{i=1}^{K} P(hi) = 1

  choose hi if P(hi|D) = max_k P(hk|D)
Outline
• Introduction
Maximum Likelihood Estimation
• Naïve Bayesian Classification
Maximum A Posteriori
• Now we attempt to find the most probable hypothesis h ∈ H, given the observed data.
• A method that looks for the hypothesis with maximum P(h|D) is called a maximum a posteriori method, or MAP.

  h_MAP = argmax_{h∈H} P(h|D)
        = argmax_{h∈H} P(D|h) P(h) / P(D)
        = argmax_{h∈H} P(D|h) P(h)

  where P(D) is independent of h.
Example: QA
• Query: does the patient have cancer or not?
• Two hypotheses: the patient has cancer; the patient doesn't have cancer. The lab test has two outcomes: ⊕ (positive) and ⊖ (negative);
• Prior knowledge: over the entire population of people, .008 have cancer;
• The lab test returns a correct positive result in 98% of the cases in which cancer is actually present, and a correct negative result in 97% of the cases in which cancer is actually not present:
• P(cancer) = .008, P(¬cancer) = .992
• P(⊕|cancer) = .98, P(⊖|cancer) = .02
• P(⊕|¬cancer) = .03, P(⊖|¬cancer) = .97
• So, given a new patient with a positive lab test, should we diagnose the patient as having cancer or not?
Example: QA
• Which is the MAP hypothesis?

  P(cancer) = 0.008      P(¬cancer) = 0.992
  P(⊕|cancer) = 0.98     P(⊖|cancer) = 0.02
  P(⊕|¬cancer) = 0.03    P(⊖|¬cancer) = 0.97

  P(⊕|cancer) P(cancer) = (.98)(.008) = .0078
  P(⊕|¬cancer) P(¬cancer) = (.03)(.992) = .0298

  P(cancer|⊕) = P(⊕|cancer) P(cancer) / P(⊕) = .0078 / (.0078 + .0298) ≈ .21

• So, the MAP hypothesis is: h_MAP = ¬cancer
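The arithmetic above can be checked with a short sketch; `map_posterior`, `sens` (for P(⊕|cancer)) and `spec` (for P(⊖|¬cancer)) are illustrative names:

```python
def map_posterior(prior, sens, spec):
    """P(cancer | positive test) by Bayes' rule."""
    joint_cancer = sens * prior               # P(+|cancer) P(cancer) = .0078
    joint_healthy = (1 - spec) * (1 - prior)  # P(+|~cancer) P(~cancer) = .0298
    return joint_cancer / (joint_cancer + joint_healthy)

post = map_posterior(0.008, sens=0.98, spec=0.97)
print(round(post, 2))  # 0.21: the MAP hypothesis is still "no cancer"
```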
Maximum Likelihood hypothesis
• Assume that, a priori, the hypotheses are equally probable:

  P(hi) = P(hj), ∀ hi, hj ∈ H

• The Maximum Likelihood hypothesis is then obtained as:

  h_ML = argmax_{h∈H} P(D|h)

• Now we just need to look for the hypothesis that best explains the data.
Maximum Likelihood Estimator
Let D = {x1, x2, …, xn} be the training set. Assuming the examples are drawn independently:

  h_ML = argmax_{h∈H} P(D|h) = argmax_{h∈H} ∏_i P(xi|h)

To find h_ML, set the derivative of the (log-)likelihood L(h) to zero:

  ∂L(h)/∂h = 0
A simple example
• Assumptions:
  – A coin has a probability p of heads, 1 − p of tails.
  – Observation: we toss the coin N times, and the result is a sequence of Hs and Ts containing M Hs.
• What is the value of p based on MLE, given the observation?

  L(θ) = log P(D|θ) = log[p^M (1 − p)^(N−M)] = M log p + (N − M) log(1 − p)

  dL(θ)/dp = d(M log p + (N − M) log(1 − p))/dp = M/p − (N − M)/(1 − p) = 0

  ⇒ p = M/N
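The closed form p = M/N can be sanity-checked against a brute-force grid search over the log-likelihood; the values M = 7, N = 10 are illustrative:

```python
import math

def log_likelihood(p, M, N):
    """L(p) = M log p + (N - M) log(1 - p)."""
    return M * math.log(p) + (N - M) * math.log(1 - p)

M, N = 7, 10
p_mle = M / N  # the closed-form MLE derived above
# The closed form should beat every alternative on a fine grid.
p_grid = max((i / 1000 for i in range(1, 1000)),
             key=lambda p: log_likelihood(p, M, N))
print(p_mle, p_grid)  # 0.7 0.7
```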
Bayesian Learning : Unbiased Coin
• Coin Flip
  – Sample space: Ω = {Head, Tail}
  – Scenario: the given coin is either fair or has a 60% bias in favor of Head
    • h1 ≡ fair coin: P(Head) = 0.5
    • h2 ≡ 60% bias towards Head: P(Head) = 0.6
  – Objective: to decide between the hypotheses
• A Priori Distribution on H
  – P(h1) = 0.75, P(h2) = 0.25
Bayesian Learning : Unbiased Coin
• Collection of Evidence
  – First piece of evidence: d = a single coin toss, which comes up Head
  – Q: What does the agent believe now?
  – A: Compute P(d) = P(d|h1) P(h1) + P(d|h2) P(h2)
• Bayesian Inference: Compute P(d) = P(d|h1) P(h1) + P(d|h2) P(h2)
  – P(Head) = 0.5 · 0.75 + 0.6 · 0.25 = 0.375 + 0.15 = 0.525
  – This is the probability of the observation d = Head
Bayesian Learning : Unbiased Coin
• Bayesian Learning
  – Now apply Bayes's Theorem:
    • P(h1|d) = P(d|h1) P(h1) / P(d) = 0.375 / 0.525 = 0.714
    • P(h2|d) = P(d|h2) P(h2) / P(d) = 0.15 / 0.525 = 0.286
    • Belief has been revised downwards for h1, upwards for h2
    • The agent still thinks that the fair coin is the more likely hypothesis
  – Suppose we were to use the ML approach (i.e., assume equal priors)
    • Belief in h2 is then revised upwards from 0.5
    • The data then supports the biased coin better
Bayesian Learning : Unbiased Coin
• More Evidence: a sequence D of 100 tosses with 70 heads and 30 tails
  – P(D) = (0.5)^70 · (0.5)^30 · 0.75 + (0.6)^70 · (0.4)^30 · 0.25
  – Now P(h1|D) << P(h2|D)
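The belief updates in this sequence of slides can be reproduced with a short sketch (the function name is illustrative):

```python
def posteriors(heads, tails, prior1=0.75, prior2=0.25):
    """Posteriors of h1 (fair coin) and h2 (P(Head) = 0.6) given the data."""
    like1 = 0.5 ** (heads + tails)
    like2 = (0.6 ** heads) * (0.4 ** tails)
    evidence = like1 * prior1 + like2 * prior2
    return like1 * prior1 / evidence, like2 * prior2 / evidence

print(posteriors(1, 0))  # (0.714..., 0.285...): the single-Head case above
p1, p2 = posteriors(70, 30)
print(p1 < p2)           # True: with 70 heads, P(h1|D) << P(h2|D)
```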
Brute Force MAP Hypothesis Learner
• Intuitive Idea: produce the most likely h given the observed D
• Algorithm Find-MAP-Hypothesis (D)
  – 1. FOR each hypothesis h ∈ H,
       calculate the conditional (i.e., posterior) probability:

         P(h|D) = P(D|h) P(h) / P(D)

  – 2. RETURN the hypothesis h_MAP with the highest conditional probability:

         h_MAP = argmax_{h∈H} P(h|D)
Outline
• Introduction
• Maximum Likelihood Estimation
Naïve Bayes Classification
Bayesian Classification
• Framework
  – Find the most probable classification (as opposed to the MAP hypothesis)
  – f: X → C (domain ≡ instance space, range ≡ finite set of values)
  – Instances x ∈ X can be described as a collection of features: x ≡ (a1, a2, …, an)
  – Performance element: Bayesian classifier
    • Given: an example
    • Output: the most probable value cj ∈ C

  c_MAP = argmax_{cj∈C} P(cj|x) = argmax_{cj∈C} P(cj | a1, a2, …, an)
        = argmax_{cj∈C} P(a1, a2, …, an | cj) P(cj)
Bayesian Classification
• Parameter Estimation Issues
  – P(cj) is easy to estimate: simply count the frequency of cj in D = {<x, c(x)>}
  – But it is infeasible to estimate P(a1, a2, …, an | cj): too many attribute combinations, most with 0 counts
  – We need to make assumptions that allow us to estimate P(x|c)
• Intuitive Idea
  – hMAP(x) is not necessarily the most probable classification!
  – Example:
    • Three possible hypotheses: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
    • Suppose that for a new instance x, h1(x) = +, h2(x) = –, h3(x) = –
    • What is the most probable classification of x?
Bayes Optimal Classification (BOC)
• Example:
  • P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1
  • P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0
  • P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0
• The Bayes optimal classification:

  c* = c_BOC = argmax_{cj∈C} Σ_{hi∈H} [P(cj|hi) · P(hi|D)]

• Result:

  Σ_{hi∈H} [P(+|hi) · P(hi|D)] = 0.4
  Σ_{hi∈H} [P(−|hi) · P(hi|D)] = 0.6

  so c* = −
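The BOC computation above can be sketched directly; the dictionary layout is illustrative:

```python
# Posterior over hypotheses, and each hypothesis's prediction distribution.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": {"+": 1.0, "-": 0.0},
              "h2": {"+": 0.0, "-": 1.0},
              "h3": {"+": 0.0, "-": 1.0}}

# Bayes optimal classification: weight each hypothesis's vote by its posterior.
score = {c: sum(prediction[h][c] * posterior[h] for h in posterior)
         for c in ("+", "-")}
label = max(score, key=score.get)
print(score, label)  # {'+': 0.4, '-': 0.6} -
```

Note that the MAP hypothesis h1 alone would predict +, while the optimal combined vote predicts −.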
New Issue
• BOC is computationally expensive. It computes the posterior probability for every h ∈ H and combines the predictions of all hypotheses to classify each new instance.
• Solutions:
  – Choose a single hypothesis h from H (e.g., the MAP hypothesis, or one drawn at random according to the posterior probability distribution over H, giving the Gibbs classifier) and use it to predict the novel instances!
  – We expect to simplify this estimation: Naïve Bayes
Naïve Bayes: Characteristics
• It is very difficult to compute the likelihood P(a1, a2, …, an | cj).
• Simplifying assumption (Naïve Bayes): the attribute values are conditionally independent given the target value.

  c_MAP = argmax_{cj∈C} P(cj|x) = argmax_{cj∈C} P(cj | a1, a2, …, an)
        = argmax_{cj∈C} P(a1, a2, …, an | cj) P(cj), where x = (a1, a2, …, an)

  P(a1, a2, …, an | cj) = ∏_i P(ai|cj)

  c_NB = argmax_{cj∈C} P(cj) ∏_i P(ai|cj)

• Results are comparable to ANNs and decision trees in some domains
• A moderate or large training set must be available
• Successful applications: classifying text documents
Naive Bayes’ Classifier
Given category cj, the ai are independent:

  p(x|cj) = p(a1|cj) p(a2|cj) … p(an|cj)

(Figure: a naive Bayes network; the class node cj has an arrow to each attribute node a1, a2, …, an, labeled p(a1|cj), p(a2|cj), …, p(an|cj).)
Naïve Bayes: Independence Issue
• Conditional Independence Assumption Often Violated
  – CI assumption: P(a1, a2, …, an | cj) = ∏_k P(ak|cj)
  – However, it works surprisingly well anyway
  – Note:
    • We don't need the estimated posterior P̂(cj|x) to be correct
    • We only need

      argmax_{cj∈C} P̂(cj) ∏_{k=1}^{n} P̂(ak|cj) = argmax_{cj∈C} P̂(cj) P̂(a1, a2, …, an | cj)
Naïve Bayes Algorithm
• Simple (Naïve) Bayes Assumption:

  P(a1, a2, …, an | cj) = ∏_k P(ak|cj)

• Simple (Naïve) Bayes Classifier:

  c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj)

• Algorithm Naïve-Bayes-Learn
  – FOR each target value cj
      P̂(cj) ← estimate[P(cj)]
      FOR each attribute value ak of each attribute x
        P̂(ak|cj) ← estimate[P(ak|cj)]
  – RETURN {P̂(cj)}, {P̂(ak|cj)}
Example
• Concept: PlayTennis

  c_NB = argmax_{cj∈C} P̂(cj) ∏_{k=1}^{n} P̂(ak|cj)

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
  1    Sunny     Hot          High      Light   No
  2    Sunny     Hot          High      Strong  No
  3    Overcast  Hot          High      Light   Yes
  4    Rain      Mild         High      Light   Yes
  5    Rain      Cool         Normal    Light   Yes
  6    Rain      Cool         Normal    Strong  No
  7    Overcast  Cool         Normal    Strong  Yes
  8    Sunny     Mild         High      Light   No
  9    Sunny     Cool         Normal    Light   Yes
  10   Rain      Mild         Normal    Light   Yes
  11   Sunny     Mild         Normal    Strong  Yes
  12   Overcast  Mild         High      Strong  Yes
  13   Overcast  Hot          Normal    Light   Yes
  14   Rain      Mild         High      Strong  No
Example
• Applying Naïve Bayes: the probabilities to estimate
– P(PlayTennis = {Yes, No}) 2 cases
– P(Outlook = {Sunny, Overcast, Rain} | PT = {Yes, No}) 6 cases
– P(Temp = {Hot, Mild, Cool} | PT = {Yes, No}) 6 cases
– P(Humidity = {High, Normal} | PT = {Yes, No}) 4 cases
– P(Wind = {Light, Strong} | PT = {Yes, No}) 4 cases
• Query: New Example x = <Sunny, Cool, High, Strong, ?>– Desired inference: P(PlayTennis = Yes | x) = 1 - P(PlayTennis = No | x)
– P(PlayTennis = Yes) = 9/14 = 0.64 P(PlayTennis = No) = 5/14 = 0.36
– P(Outlook = Sunny | PT = Yes) = 2/9 P(Outlook = Sunny | PT = No) = 3/5
– P(Temperature = Cool | PT = Yes) = 3/9 P(Temperature = Cool | PT = No) = 1/5
– P(Humidity = High | PT = Yes) = 3/9 P(Humidity = High | PT = No) = 4/5
– P(Wind = Strong | PT = Yes) = 3/9 P(Wind = Strong | PT = No) = 3/5
Example
• Inference
  – P(PlayTennis = Yes, <Sunny, Cool, High, Strong>) =
    P(Yes) P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) ≈ 0.0053
  – P(PlayTennis = No, <Sunny, Cool, High, Strong>) =
    P(No) P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No) ≈ 0.0206
  – So, c_NB = No
  – By normalization:
    P(PlayTennis = No | x) ≈ 0.0206 / (0.0053 + 0.0206) ≈ 0.795

  c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj)
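The full PlayTennis computation can be reproduced from the table with frequency (maximum-likelihood) estimates; `joint` is an illustrative helper name:

```python
from collections import Counter

data = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
labels = Counter(row[-1] for row in data)

def joint(x, c):
    """P(c) * prod_k P(a_k|c) with frequency estimates from the table."""
    p = labels[c] / len(data)
    for k, a in enumerate(x):
        p *= sum(1 for row in data if row[-1] == c and row[k] == a) / labels[c]
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {c: joint(x, c) for c in labels}
pred = max(scores, key=scores.get)
print(pred, round(scores["Yes"], 4), round(scores["No"], 4))  # No 0.0053 0.0206
```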
Naïve Bayes: Two classes
• The Naïve Bayes method gives us a way to make predictions.
• In the case of two classes, c ∈ {0, 1}, we predict c = 1 iff:

  [P(c=1) ∏_{k=1}^{n} P(ak | c=1)] / [P(c=0) ∏_{k=1}^{n} P(ak | c=0)] > 1

  (recall c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj))
Naïve Bayes: Two classes
Denote:

  pk = P(ak = 1 | c = 1),  qk = P(ak = 1 | c = 0)

Now, since the attributes are binary:

  P(ak = 1 | c = 1) = pk,  so P(ak | c = 1) = pk^ak (1 − pk)^(1−ak)
  P(ak = 1 | c = 0) = qk,  so P(ak | c = 0) = qk^ak (1 − qk)^(1−ak)

So we predict c = 1 iff:

  [P(c=1) ∏_{k=1}^{n} pk^ak (1 − pk)^(1−ak)] / [P(c=0) ∏_{k=1}^{n} qk^ak (1 − qk)^(1−ak)] > 1
Naïve Bayes: Two classes

Rewriting the ratio:

  [P(c=1) ∏_k pk^ak (1 − pk)^(1−ak)] / [P(c=0) ∏_k qk^ak (1 − qk)^(1−ak)]
  = [P(c=1) ∏_k (1 − pk)] / [P(c=0) ∏_k (1 − qk)] · ∏_k [pk(1 − qk) / (qk(1 − pk))]^ak > 1

Take the logarithm; we predict c = 1 iff:

  log[P(c=1) / P(c=0)] + Σ_k log[(1 − pk) / (1 − qk)] + Σ_k ak · log[pk(1 − qk) / (qk(1 − pk))] > 0

So naive Bayes is a linear separator with weights:

  wk = log[pk(1 − qk) / (qk(1 − pk))] = log[pk / qk] + log[(1 − qk) / (1 − pk)]

If pk = qk, then wk = 0 and the feature is irrelevant.
Naïve Bayes: Two classes
• In the case of two classes we have:

  log[P(c=1|x) / P(c=0|x)] = Σ_k wk ak − b

• But since P(c=0|x) = 1 − P(c=1|x), we get:

  P(c=1|x) = 1 / (1 + exp(−Σ_k wk ak + b))

• This is simply the sigmoid function used in the neural network representation.
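The equivalence between the direct product form and the linear/sigmoid form can be checked numerically; the parameter values below are illustrative (note that p2 = q2 makes w2 = 0, an irrelevant feature, as stated above):

```python
import math

p = [0.8, 0.5, 0.9]   # p_k = P(a_k = 1 | c = 1)
q = [0.3, 0.5, 0.2]   # q_k = P(a_k = 1 | c = 0)
prior1 = 0.4          # P(c = 1); P(c = 0) = 1 - prior1

# Weights and bias from the derivation above.
w = [math.log(pk * (1 - qk) / (qk * (1 - pk))) for pk, qk in zip(p, q)]
b = -(math.log(prior1 / (1 - prior1))
      + sum(math.log((1 - pk) / (1 - qk)) for pk, qk in zip(p, q)))

def sigmoid_form(x):
    s = sum(wk * ak for wk, ak in zip(w, x)) - b
    return 1 / (1 + math.exp(-s))

def direct_form(x):
    j1, j0 = prior1, 1 - prior1
    for ak, pk, qk in zip(x, p, q):
        j1 *= pk if ak else 1 - pk
        j0 *= qk if ak else 1 - qk
    return j1 / (j1 + j0)

x = (1, 0, 1)
print(w[1])  # 0.0: feature 2 is irrelevant since p2 = q2
print(abs(sigmoid_form(x) - direct_form(x)) < 1e-9)  # True: the forms agree
```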
Naïve Bayes as a Perceptron

(Figure: a perceptron with inputs a1, a2, …, ak, …, an and a constant input "−1" with initial threshold b; each input ak is connected through a weight wk = log[pk(1 − qk) / (qk(1 − pk))] (e.g., w1 = log[p1(1 − q1) / (q1(1 − p1))]) to a summation node Σ, whose sum is passed through the sigmoid 1 / (1 + e^(−sum)).)
Naïve Bayes: Zero Probability
• If we never see something (0 times) in the training set:

  P(ak|cj) = nk / n = 0 !

• How should we deal with this?
• We can smooth the estimate with the m-estimate:

  P(ak|cj) = (nk + m·p) / (n + m)

  where p is a prior distribution for ak and m is called the equivalent sample size: it can be interpreted as augmenting the n actual observations by m additional "virtual samples" distributed according to p.
Document Classification / Categorization
Assign labels to each document or web page:
• Labels are most often topics, such as Sina categories
  e.g., "体育" (sports), "财经" (finance), "教育" (education) …
• Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
• Labels may be opinions
  e.g., "like", "hate", "neutral"
• Labels may be domain-specific and binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"
Learning to Classify Text
• Naïve Bayes has been used heavily for text classification.
• Instance space D: text documents
• Text categories (the set of targets): C
• Training set: a set of examples (instances with labels)
• How do we classify a new text?

(Figure: document representation, with documents split into Group A and Group B.)
Document representation
1. Attributes: Each word position is an attribute that takes the value corresponding to the word in that position. For example if a document has 100 words, then there are 100 attributes.
2. Attributes: Consider the number of words in the dictionary ~50,000. Consider each of these words an attribute and simply count how many times they appear in the document.
Attribute-1
• Attributes: one attribute for each word position;
• Number of attributes = L (length of the longest document);
• Type of attribute = N (number of words in the dictionary);
• An instance: a list of length L;
• What are the probabilities we need to estimate?

  P(ai = wordk | cj), ∀ wordk

• Too many probabilities: (C x L x N ~ C x 100 x 50,000)
• New assumption: seeing a word in the document does not depend on its position:

  P(ai = wordk | cj) = P(am = wordk | cj), ∀ i, m

• This reduces the dimensionality to one attribute for each word.
• The number of probabilities to estimate is: C x N
Attribute-2
• Attributes: all the N words appearing in the training set; a document is represented as a bag of its words;
• Boolean: an instance is a list of length N; 0 means the word is not in x, 1 means the word is in x;
• Problem: examples are too long
  – List only the active features
• How many probabilities to estimate:
  – C x N
Estimating Probabilities
• How do we estimate P(wk|cj)?

  P(wk|cj) = (number of times word wk occurs in training texts with label cj) / (total number of times all words occur in training texts with label cj) = nk / n

• Sparsity of data is a problem:
  – if n is small, the estimate is not accurate
  – if nk is 0, it will dominate the estimate: we will never predict cj
• What if a word never appeared in the training set with label cj but appears in the test data?
Smoothing
• There are many ways to smooth;
• It is an empirical issue.

  Original:   P(wk|cj) = nk / n
  m-estimate: P(wk|cj) = (nk + m·p) / (n + m)
  Laplace:    P(wk|cj) = (nk + 1) / (n + |Vocabulary|)

  Laplace smoothing is the m-estimate with p = 1/|Vocabulary| and m = |Vocabulary|.
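The three estimators can be compared on an unseen word (nk = 0); the function names and counts are illustrative:

```python
def original(nk, n):
    return nk / n

def m_estimate(nk, n, p, m):
    return (nk + m * p) / (n + m)

def laplace(nk, n, vocab_size):
    return (nk + 1) / (n + vocab_size)

nk, n, V = 0, 100, 50     # a word never seen with this label
print(original(nk, n))    # 0.0: the raw estimate zeroes out the whole product
print(laplace(nk, n, V))  # small but nonzero
# Laplace is the m-estimate with p = 1/V and m = V:
print(abs(laplace(nk, n, V) - m_estimate(nk, n, p=1/V, m=V)) < 1e-12)  # True
```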
Algorithm Learn-Naïve-Bayes-Text (D, V)
– 1. Collect all words, punctuation, and other tokens that occur in the document set D
  • Vocabulary ← {all distinct words and tokens occurring in any document x ∈ D}
– 2. Calculate the required P(cj) and P(xi = wk | cj) probability terms
  • FOR each target value cj ∈ C DO
    – docs[j] ← {documents x ∈ D with c(x) = cj}
    – Text[j] ← Concatenation(docs[j]) // form a single document
    – n ← total number of word positions in Text[j]
    – P(cj) = |docs[j]| / |D|
    – FOR each word wk in Vocabulary
      » nk ← number of times word wk occurs in Text[j]
      » P(wk|cj) = (nk + 1) / (n + |Vocabulary|)
– 3. RETURN <{P(cj)}, {P(wk|cj)}>
Applying Naïve Bayes to Classify Text
• Function Classify-Naïve-Bayes-Text (x, Vocabulary)
  – Positions ← {word positions in document x that contain tokens found in Vocabulary}
  – RETURN

    c_NB = argmax_{cj∈C} P(cj) ∏_{i∈Positions} P(ai|cj)

• Purpose of Classify-Naïve-Bayes-Text
  – Returns the estimated target value for the new document
  – ai denotes the word found in the i-th position within x
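A minimal sketch of the learn/classify pair on a toy corpus, using Laplace smoothing as above (the corpus and the function names are illustrative):

```python
from collections import Counter

docs = [("the match was a great game of sport", "sports"),
        ("stocks fell as the market closed lower", "finance"),
        ("the team won the game in extra time", "sports")]
vocab = {w for text, _ in docs for w in text.split()}

def learn(docs):
    priors, cond = {}, {}
    for c in {lab for _, lab in docs}:
        texts = [t for t, lab in docs if lab == c]
        words = " ".join(texts).split()   # concatenate docs[j] into Text[j]
        counts, n = Counter(words), len(words)
        priors[c] = len(texts) / len(docs)
        # Laplace-smoothed P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, cond

def classify(x, priors, cond):
    scores = {c: priors[c] for c in priors}
    for w in x.split():
        if w in vocab:  # only positions whose token is in the vocabulary
            for c in scores:
                scores[c] *= cond[c][w]
    return max(scores, key=scores.get)

priors, cond = learn(docs)
print(classify("a great game", priors, cond))  # sports
```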
Word Sense Disambiguation
• Approach: look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier combines the evidence from all features.
• Naive Bayes is useful even though its assumptions are incorrect in the context of text processing:
  – The structure and linear ordering of words is ignored: a bag-of-words model.
  – The presence of one word is assumed to be independent of another, which is clearly untrue in text.
Word Sense Disambiguation
• Problem Definition
  – Given: m sentences, each containing a usage of a particular ambiguous word
  – Example: "The can will rust." (auxiliary verb versus noun)
  – Label: cj ≡ s ≡ correct word sense (e.g., s ∈ {auxiliary verb, noun})
  – Representation: m examples (labeled attribute vectors <(w1, w2, …, wn), s>)
  – Return: a classifier f: X → C that disambiguates new x ≡ (w1, w2, …, wn)
• Solution Approach: use Naïve Bayes
  P(s | w1, w2, …, wn) ∝ P(s) ∏_{i=1}^{n} P(wi | s)
Topic Detection
• The task is to identify the most salient topic in a given document:
  – select a topic t from the set of possible topics T;
  – compute

    v_NB = argmax_{t∈T} P(t) ∏_{i=1..N} P(wi | t)
Comments on Naïve Bayes
• Tends to work well despite the strong assumption of conditional independence.
• Experiments show it to be quite competitive with other classification methods.
• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
• Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
• Does not guarantee consistency with the training data.
• Typically handles noise well, since it does not even focus on completely fitting the training data.