Naïve Bayes - 计算语言学研究所


Page 1: Naïve Bayes - 计算语言学研究所


Naïve Bayes

Wang Houfeng, Institute of Computational Linguistics

Peking University

Page 2: Naïve Bayes - 计算语言学研究所


Outline

Introduction

• Maximum Likelihood Estimation

• Naïve Bayesian Classification

Page 3: Naïve Bayes - 计算语言学研究所


Introduction

Bayesian learning enables us to form predictions based on probabilities. It provides a framework for probabilistic reasoning. The advantages:

• It can handle noise in the data;
• "Prior knowledge" can be used in constructing a hypothesis;
• Predictions are probabilistic;
• The framework provides a view of optimal decision making.

Page 4: Naïve Bayes - 计算语言学研究所


Probability Definition

• A probability is a function from an event space to a real number between 0 and 1 (inclusive), where:
– 0 ≤ P(A) ≤ 1
  • 0 indicates impossibility
  • 1 indicates certainty
– P(Ω) = 1, where Ω is the sample space
– P(X) ≤ P(Y) for any X ⊆ Y
– If A1, A2, …, An are disjoint (Ai ∩ Aj = ∅ for i ≠ j), then

  P(∪_{i=1}^{n} A_i) = Σ_{i=1}^{n} P(A_i)

Page 5: Naïve Bayes - 计算语言学研究所


Interpretation of probability

• Relative Frequency
– Suppose that an experiment is performed n times and A occurs f times. The relative frequency of event A is:

  f/n = (number of times A occurs) / n

• If we let n get infinitely large,

  P(A) = lim_{n→∞} f/n

Page 6: Naïve Bayes - 计算语言学研究所


Interpretation of probability

• In frequentist statistics, probabilities are associated only with the data, i.e., outcomes of repeatable observations:

Probability = limiting frequency

Page 7: Naïve Bayes - 计算语言学研究所


Some Rules

• For any two events A and B, the probability of their union, P(A ∪ B), is

  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

[Venn diagram of two overlapping events A and B]

• A special case: when two events A and B are mutually exclusive, P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).

Page 8: Naïve Bayes - 计算语言学研究所


Independence

• Two events A and B are independent if and only if:

  P(B|A) = P(B)

or

  P(A|B) = P(A)

or

  P(A ∩ B) = P(A) P(B)

Page 9: Naïve Bayes - 计算语言学研究所


Probabilistic Classification

• Input: x = [x1, x2]^T;  Output: C ∈ {−1, 1}
• Prediction:

  choose C = 1 if P(C = 1 | x1, x2) > 0.5, and C = −1 otherwise

or equivalently,

  choose C = 1 if P(C = 1 | x1, x2) > P(C = −1 | x1, x2), and C = −1 otherwise

Page 10: Naïve Bayes - 计算语言学研究所


Basics of Probabilistic Learning

• Goal: find the best hypothesis from some space H of hypotheses, given the observed data D;
• Define "best" as: the most probable hypothesis in H;
• In order to do that, we need to assume a probability distribution over H;
• In addition, we need to know something about the relation between the observed data and the hypotheses.

Page 11: Naïve Bayes - 计算语言学研究所


Bayes Theorem

Hypothesis space: H; dataset: D. Four probabilities are introduced as follows:

• P(h): the prior probability of h ∈ H before any data is observed. It reflects background knowledge; if there is no such information, a uniform distribution is chosen.
• P(D): the probability of seeing data D; this is the evidence (it involves no knowledge of the hypothesis).
• P(D|h): the probability of the data given h, called the likelihood of h with respect to D.
• P(h|D): the posterior probability of h after having seen data D. This is the quantity we are after.

Page 12: Naïve Bayes - 计算语言学研究所


Bayes Theorem

Bayes theorem relates the posterior probability of a hypothesis given the data with the three probabilities mentioned before:

  P(h|D) = P(D|h) · P(h) / P(D)

where P(h|D) is the posterior probability, P(D|h) the likelihood, P(h) the prior probability, and P(D) the evidence.
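A minimal Python sketch of this formula over a finite hypothesis space; the prior and likelihood numbers below are the ones that reappear in the unbiased-coin example on pages 28–30:

```python
# Bayes' theorem over a finite hypothesis space: P(h|D) = P(D|h) P(h) / P(D),
# where the evidence P(D) = sum_h P(D|h) P(h).
def posterior(prior, likelihood):
    evidence = sum(prior[h] * likelihood[h] for h in prior)          # P(D)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}   # P(h|D)

prior = {"h1: fair": 0.75, "h2: 60% heads": 0.25}      # P(h)
likelihood = {"h1: fair": 0.5, "h2: 60% heads": 0.6}   # P(D|h) for one observed Head
print(posterior(prior, likelihood))   # h1 -> ~0.714, h2 -> ~0.286
```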

Page 13: Naïve Bayes - 计算语言学研究所


Hypotheses in Bayesian Inference

• Hypotheses h refer to processes that could have generated the data D

• Bayesian inference provides a distribution over these hypotheses, given D

• P(D|h) is the probability of D being generated by the process identified by h

• Hypotheses h are mutually exclusive: only one process could have generated D

Page 14: Naïve Bayes - 计算语言学研究所


The origin of Bayes’ rule

• For any two random variables:

  p(A, B) = p(A) p(B|A)
  p(A, B) = p(B) p(A|B)

⇒ p(B) p(A|B) = p(A) p(B|A)

⇒ p(A|B) = p(A) p(B|A) / p(B)

Page 15: Naïve Bayes - 计算语言学研究所


Bayes’ rule in odds form

D: data
h1, h2: models
P(h1|D): posterior probability that h1 generated the data
P(D|h1): likelihood of the data under model h1
P(h1): prior probability that h1 generated the data

  P(h1|D) / P(h2|D) = [P(h1, D)/P(D)] / [P(h2, D)/P(D)] = P(h1, D) / P(h2, D)
                    = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]

where P(D|h1)/P(D|h2) is the likelihood ratio and P(h1)/P(h2) is the ratio of priors.

Page 16: Naïve Bayes - 计算语言学研究所


Comparing two hypotheses

D: HHTHT
Hypotheses h1: "fair coin";  h2: "always heads"
P(D|h1) = 1/2^5,   P(h1) = 999/1000
P(D|h2) = 0,       P(h2) = 1/1000

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)] = infinity

Page 17: Naïve Bayes - 计算语言学研究所


Comparing two hypotheses

D: HHHHH
Hypotheses h1: "fair coin";  h2: "always heads"
P(D|h1) = 1/2^5,   P(h1) = 999/1000
P(D|h2) = 1,       P(h2) = 1/1000

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)] ≈ 30

Page 18: Naïve Bayes - 计算语言学研究所


Comparing two hypotheses

D: HHHHHHHHHH
h1, h2: "fair coin", "always heads"
P(D|h1) = 1/2^10,  P(h1) = 999/1000
P(D|h2) = 1,       P(h2) = 1/1000

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)] ≈ 1

Page 19: Naïve Bayes - 计算语言学研究所


h1 vs. h2

• Hypotheses h1:“fair coin”; Hypotheses h2: “always heads”

• Which one is better?
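One way to answer is to compute the posterior odds from the previous three slides. A small Python sketch, assuming the priors 999/1000 and 1/1000 given above and treating the "always heads" likelihood as 1 for all-heads data and 0 otherwise:

```python
# Posterior odds P(h1|D) / P(h2|D) for h1 = "fair coin", h2 = "always heads".
def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    like_h1 = 0.5 ** len(data)                      # fair coin
    like_h2 = 1.0 if set(data) <= {"H"} else 0.0    # always heads
    if like_h2 == 0.0:
        return float("inf")
    return (like_h1 * prior_h1) / (like_h2 * prior_h2)

for d in ["HHTHT", "HHHHH", "H" * 10]:
    print(d, posterior_odds(d))   # inf, ~31.2 (≈30), ~0.98 (≈1)
```

So with five straight heads the prior still favours the fair coin, but after ten straight heads the two hypotheses are about equally credible.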

Page 20: Naïve Bayes - 计算语言学研究所


For K (>2) Classes

  P(h_i | D) = P(D | h_i) P(h_i) / P(D)
             = P(D | h_i) P(h_i) / Σ_{k=1}^{K} P(D | h_k) P(h_k)

with P(h_i) ≥ 0 and Σ_{i=1}^{K} P(h_i) = 1.

Choose h_i if P(h_i | D) = max_k P(h_k | D).

Page 21: Naïve Bayes - 计算语言学研究所


Outline

• Introduction

Maximum Likelihood Estimation

• Naïve Bayesian Classification

Page 22: Naïve Bayes - 计算语言学研究所


Maximum A Posteriori

• Now we attempt to find the most probable hypothesis h ∈ H, given the observed data.
• A method that looks for the hypothesis with maximum P(h|D) is called a maximum a posteriori method, or MAP.

  h_MAP = argmax_{h∈H} P(h|D)
        = argmax_{h∈H} P(D|h) P(h) / P(D)
        = argmax_{h∈H} P(D|h) P(h)

where P(D) is independent of h.

Page 23: Naïve Bayes - 计算语言学研究所


Example: QA

• Query: Does the patient have cancer or not?
• Two hypotheses: the patient has cancer, or the patient does not have cancer;
• Prior knowledge: over the entire population of people, 0.008 have cancer;
• The lab test returns a correct positive result (⊕) in 98% of the cases in which cancer is actually present, and a correct negative result (⊖) in 97% of the cases in which cancer is actually not present:

  P(cancer) = .008,       P(¬cancer) = .992
  P(⊕|cancer) = .98,     P(⊖|cancer) = .02
  P(⊕|¬cancer) = .03,    P(⊖|¬cancer) = .97

• So, given a new patient with a positive lab test, should we diagnose the patient as having cancer or not?

Page 24: Naïve Bayes - 计算语言学研究所


Example: QA

• Which is the MAP hypothesis?

  P(cancer) = 0.008         P(¬cancer) = 0.992
  P(⊕|cancer) = 0.98       P(⊖|cancer) = 0.02
  P(⊕|¬cancer) = 0.03      P(⊖|¬cancer) = 0.97

  P(⊕|cancer) P(cancer) = (.98)(.008) = .0078
  P(⊕|¬cancer) P(¬cancer) = (.03)(.992) = .0298

  P(cancer|⊕) = P(⊕|cancer) P(cancer) / P(⊕) = .0078 / (.0078 + .0298) ≈ .21

• So, the MAP hypothesis is: h_MAP = ¬cancer
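A short Python sketch reproducing the computation above with the slide's numbers:

```python
# MAP diagnosis for a positive lab test (numbers from the slide).
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer       # ≈ 0.0078
score_not    = p_pos_given_not * p_not_cancer      # ≈ 0.0298

h_map = "cancer" if score_cancer > score_not else "not cancer"
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(h_map, round(p_cancer_given_pos, 2))         # not cancer 0.21
```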

Page 25: Naïve Bayes - 计算语言学研究所


Maximum Likelihood hypothesis

• Assume that, a priori, all hypotheses are equally probable:

  P(h_i) = P(h_j),  ∀ h_i, h_j ∈ H

• Then the maximum likelihood hypothesis is obtained:

  h_ML = argmax_{h∈H} P(D|h)

• Now we just need to look for the hypothesis that best explains the data.

Page 26: Naïve Bayes - 计算语言学研究所


Maximum Likelihood Estimator

Let D = {x_1, x_2, …, x_n} be the training set. Then

  h_ML = argmax_{h∈H} P(D|h) = argmax_{h∈H} ∏_i P(x_i | h)

To find h_ML, set the derivative with respect to h to zero:

  ∂P(D|h)/∂h = 0

Page 27: Naïve Bayes - 计算语言学研究所


A simple example

• Assume:
– A coin has a probability p of heads, 1 − p of tails.
– Observation: we toss the coin N times, and the result is a sequence of Hs and Ts containing M Hs.
• What is the value of p based on MLE, given the observation?

  L(θ) = log P(D|θ) = log[ p^M (1 − p)^{N−M} ] = M log p + (N − M) log(1 − p)

  dL(θ)/dp = d( M log p + (N − M) log(1 − p) )/dp = M/p − (N − M)/(1 − p) = 0

  ⇒ p = M/N
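The closed-form answer p = M/N can be checked numerically; the sketch below evaluates the log-likelihood on a grid and confirms that it peaks at M/N (the grid search is only for illustration):

```python
# The log-likelihood M*log(p) + (N-M)*log(1-p) is maximised at p = M/N.
import math

def log_likelihood(p, M, N):
    return M * math.log(p) + (N - M) * math.log(1 - p)

M, N = 7, 10                                   # 7 heads in 10 tosses
grid = [i / 1000 for i in range(1, 1000)]      # candidate values of p
p_hat = max(grid, key=lambda p: log_likelihood(p, M, N))
print(p_hat, M / N)                            # 0.7 0.7
```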

Page 28: Naïve Bayes - 计算语言学研究所


Bayesian Learning : Unbiased Coin

• Coin Flip– Sample space: Ω = {Head, Tail}

– Scenario: given coin is either fair or has a 60% bias in favor of Head

• h1 ≡ fair coin: P(Head) = 0.5

• h2 ≡ 60% bias towards Head: P(Head) = 0.6

– Objective: to decide between the hypotheses

• A Priori Distribution on H– P(h1) = 0.75, P(h2) = 0.25

Page 29: Naïve Bayes - 计算语言学研究所


Bayesian Learning : Unbiased Coin

• Collection of Evidence– First piece of evidence: d=a single coin toss, comes up Head

– Q: What does the agent believe now?

– A: Compute P(d) = P(d | h1) P(h1) + P(d | h2) P(h2)

• Bayesian Inference: Compute P(d) = P(d | h1) P(h1) + P(d | h2) P(h2)– P(Head) = 0.5 • 0.75 + 0.6 • 0.25 = 0.375 + 0.15 = 0.525

– This is the probability of the observation d = Head

Page 30: Naïve Bayes - 计算语言学研究所


Bayesian Learning : Unbiased Coin

• Bayesian Learning– Now apply Bayes’s Theorem

• P(h1 | d) = P(d | h1) P(h1) / P(d) = 0.375 / 0.525 = 0.714

• P(h2 | d) = P(d | h2) P(h2) / P(d) = 0.15 / 0.525 = 0.286

• Belief has been revised downwards for h1, upwards for h2

• The agent still thinks that the fair coin is the more likely hypothesis

– Suppose we were to use the ML approach (i.e., assume equal priors)

• Belief in h2 is revised upwards from 0.5

• The data then support the biased coin better

Page 31: Naïve Bayes - 计算语言学研究所


Bayesian Learning : Unbiased Coin

• More Evidence: a sequence D of 100 coin tosses with 70 heads and 30 tails
– P(D) = (0.5)^50 · (0.5)^50 · 0.75 + (0.6)^70 · (0.4)^30 · 0.25

– Now P(h1 | D) << P(h2 | D)
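A small sketch that computes both posteriors for this data with the priors from page 28, confirming that h2 now dominates:

```python
# Posteriors after observing 70 heads and 30 tails (the binomial coefficient cancels).
prior = {"h1: fair": 0.75, "h2: 60% heads": 0.25}
like = {"h1: fair": 0.5**70 * 0.5**30, "h2: 60% heads": 0.6**70 * 0.4**30}

evidence = sum(prior[h] * like[h] for h in prior)               # P(D)
posterior = {h: prior[h] * like[h] / evidence for h in prior}
print(posterior)   # h1 -> ~0.007, h2 -> ~0.993
```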

Page 32: Naïve Bayes - 计算语言学研究所


Brute Force MAP Hypothesis Learner

• Intuitive Idea: Produce Most Likely h Given Observed D

• Algorithm Find-MAP-Hypothesis (D)

– 1. FOR each hypothesis h ∈ H

Calculate the conditional (i.e., posterior) probability:

  P(h|D) = P(D|h) P(h) / P(D)

– 2. RETURN the hypothesis with the highest conditional probability:

  h_MAP = argmax_{h∈H} P(h|D)
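A brute-force sketch of Find-MAP-Hypothesis, assuming H is small enough to enumerate and that P(h) and P(D|h) are supplied as dictionaries (the example numbers are illustrative):

```python
# Brute-force MAP: score every h by P(D|h) * P(h); P(D) is constant and can be dropped.
def find_map_hypothesis(prior, likelihood):
    return max(prior, key=lambda h: likelihood[h] * prior[h])

prior = {"fair": 0.75, "biased": 0.25}            # P(h)
likelihood = {"fair": 0.5**5, "biased": 0.6**5}   # P(D|h), e.g. for five heads
print(find_map_hypothesis(prior, likelihood))     # fair
```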

Page 33: Naïve Bayes - 计算语言学研究所


Outline

• Introduction

• Maximum Likelihood Estimation

Naïve Bayes Classification

Page 34: Naïve Bayes - 计算语言学研究所


Bayesian Classification

• Framework– Find most probable classification (as opposed to MAP hypothesis)

– f: X → C (domain ≡ instance space, range ≡ finite set of values)

– Instances x ∈ X can be described as a collection of features x ≡ (a1, a2, …, an)

– Performance element: Bayesian classifier

• Given: an example

• Output: the most probable value cj ∈ C

  v_MAP = argmax_{c_j∈C} P(c_j | x) = argmax_{c_j∈C} P(c_j | a_1, a_2, …, a_n)
        = argmax_{c_j∈C} P(a_1, a_2, …, a_n | c_j) P(c_j)

Page 35: Naïve Bayes - 计算语言学研究所


Bayesian Classification

• Parameter Estimation Issues
– P(c_j) is easy to estimate: simply count the frequency of c_j in D = {<x, c(x)>}

– But it is infeasible to estimate P(a_1, a_2, …, a_n | c_j) directly: too many combinations with zero counts

– Need to make assumptions that allow us to estimate P(x | c)

• Intuitive Idea
– h_MAP(x) is not necessarily the most probable classification!

– Example
• Three possible hypotheses: P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3

• Suppose that for new instance x, h1(x) = +, h2(x) = –, h3(x) = –

• What is the most probable classification of x?

Page 36: Naïve Bayes - 计算语言学研究所


Bayes Optimal Classification (BOC)

• Example:
  P(h1 | D) = 0.4,  P(− | h1) = 0,  P(+ | h1) = 1
  P(h2 | D) = 0.3,  P(− | h2) = 1,  P(+ | h2) = 0
  P(h3 | D) = 0.3,  P(− | h3) = 1,  P(+ | h3) = 0

• Result:

  c*_BOC = argmax_{c_j∈C} Σ_{h_i∈H} P(c_j | h_i) P(h_i | D)

  Σ_{h_i∈H} P(+ | h_i) P(h_i | D) = 0.4
  Σ_{h_i∈H} P(− | h_i) P(h_i | D) = 0.6

  c* = argmax_{c_j∈C} Σ_{h_i∈H} P(c_j | h_i) P(h_i | D) = −
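A sketch of the Bayes optimal classifier on exactly this three-hypothesis example (the posteriors and votes are the slide's numbers):

```python
# Bayes optimal classification: weight each hypothesis's prediction by P(h|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # h(x) for the new instance x

def bayes_optimal(posterior, prediction, classes=("+", "-")):
    score = {c: sum(p for h, p in posterior.items() if prediction[h] == c)
             for c in classes}
    return max(score, key=score.get), score

print(bayes_optimal(posterior, prediction))   # ('-', {'+': 0.4, '-': 0.6})
```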

Page 37: Naïve Bayes - 计算语言学研究所


New Issue

• BOC is computationally expensive: it computes the posterior probability for every h ∈ H and combines the predictions of each hypothesis to classify each new instance.

• Solutions:
– Choose a single hypothesis h (e.g., MAP), or one drawn from H at random according to the posterior probability distribution over H (the Gibbs classifier), and use it to predict the novel instances!

– Alternatively, simplify the estimation itself: Naïve Bayes

Page 38: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Characteristics

• It is very difficult to compute the likelihood P(a_1, a_2, …, a_n | c_j) directly.
• Simplifying assumption (Naïve Bayes): the attribute values are conditionally independent given the target value.

  c_MAP = argmax_{c_j∈C} P(c_j | x) = argmax_{c_j∈C} P(c_j | a_1, a_2, …, a_n)
        = argmax_{c_j∈C} P(a_1, a_2, …, a_n | c_j) P(c_j),   where x = (a_1, a_2, …, a_n)

  Assumption:  P(a_1, a_2, …, a_n | c_j) = ∏_i P(a_i | c_j)

  Naïve Bayes classifier:  c_NB = argmax_{c_j∈C} P(c_j) ∏_i P(a_i | c_j)

• Results are comparable to ANNs and decision trees in some domains
• Works when a moderate or large training set is available
• Successful applications: classifying text documents
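A compact sketch of the resulting decision rule, given class priors and per-attribute conditional probability tables (the toy numbers at the bottom are made up; the PlayTennis slides that follow work through a real table):

```python
# Naive Bayes decision rule: c_NB = argmax_c P(c) * prod_i P(a_i | c).
def classify_nb(x, priors, cond):
    # cond[c][i] maps a value of attribute i to P(a_i = value | c)
    best_c, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for i, value in enumerate(x):
            score *= cond[c][i].get(value, 0.0)   # unseen value -> 0 (see smoothing later)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy usage with two boolean attributes and made-up probabilities:
priors = {"yes": 0.6, "no": 0.4}
cond = {"yes": [{True: 0.8, False: 0.2}, {True: 0.3, False: 0.7}],
        "no":  [{True: 0.1, False: 0.9}, {True: 0.6, False: 0.4}]}
print(classify_nb((True, False), priors, cond))   # yes
```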

Page 39: Naïve Bayes - 计算语言学研究所


Naive Bayes’ Classifier

Given category cj, ai are independent:

p(x| cj) = p(a1| cj) p(a2| cj) ... p(an| cj)

[Figure: graphical model with class node c_j and arrows to attribute nodes a_1, a_2, …, a_n, labeled p(a_1 | c_j), p(a_2 | c_j), …, p(a_n | c_j)]

Page 40: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Independence Issue

• Conditional Independence Assumption Often Violated

– CI assumption:

  P(a_1, a_2, …, a_n | c_j) = ∏_k P(a_k | c_j)

– However, it works surprisingly well anyway

– Note

• The estimated probabilities P̂(c_j | x) do not need to be correct

• We only need that

  argmax_{c_j∈C} P̂(c_j) ∏_{k=1}^{n} P̂(a_k | c_j) = argmax_{c_j∈C} P̂(c_j) P̂(a_1, a_2, …, a_n | c_j)

Page 41: Naïve Bayes - 计算语言学研究所


Naïve Bayes Algorithm

• Simple (Naïve) Bayes Assumption

  P(a_1, a_2, …, a_n | c_j) = ∏_k P(a_k | c_j)

• Simple (Naïve) Bayes Classifier

  c_NB = argmax_{c_j∈C} P(c_j) ∏_k P(a_k | c_j)

• Algorithm Naïve-Bayes-Learn
– FOR each target value c_j
    P̂(c_j) ← estimate [P(c_j)]
    FOR each attribute value a_k of each attribute x
      P̂(a_k | c_j) ← estimate [P(a_k | c_j)]
– RETURN { P̂(c_j) }, { P̂(a_k | c_j) }

Page 42: Naïve Bayes - 计算语言学研究所


Example

• Concept: PlayTennis

  c_NB = argmax_{c_j∈C} P̂(c_j) ∏_{k=1}^{n} P̂(a_k | c_j)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No

Page 43: Naïve Bayes - 计算语言学研究所


Example

• Applying Naïve Bayes requires estimating the following probabilities:

– P(PlayTennis = {Yes, No}) 2 cases

– P(Outlook = {Sunny, Overcast, Rain} | PT = {Yes, No}) 6 cases

– P(Temp = {Hot, Mild, Cool} | PT = {Yes, No}) 6 cases

– P(Humidity = {High, Normal} | PT = {Yes, No}) 4 cases

– P(Wind = {Light, Strong} | PT = {Yes, No}) 4 cases

• Query: new example x = <Sunny, Cool, High, Strong, ?>
– Desired inference: P(PlayTennis = Yes | x) = 1 − P(PlayTennis = No | x)

– P(PlayTennis = Yes) = 9/14 = 0.64 P(PlayTennis = No) = 5/14 = 0.36

– P(Outlook = Sunny | PT = Yes) = 2/9 P(Outlook = Sunny | PT = No) = 3/5

– P(Temperature = Cool | PT = Yes) = 3/9 P(Temperature = Cool | PT = No) = 1/5

– P(Humidity = High | PT = Yes) = 3/9 P(Humidity = High | PT = No) = 4/5

– P(Wind = Strong | PT = Yes) = 3/9 P(Wind = Strong | PT = No) = 3/5

Page 44: Naïve Bayes - 计算语言学研究所


Example

• Inference– P(PlayTennis = Yes, <Sunny, Cool, High, Strong>) =

P(Yes) P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes) ≈

0.0053

– P(PlayTennis = No, <Sunny, Cool, High, Strong>) =P(No) P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No) ≈ 0.0206

– So, vNB = No

– By normalization:

0.0206 /(0.0053 + 0.0206 ) ≈ 0.795

  c_NB = argmax_{c_j∈C} P(c_j) ∏_k P(a_k | c_j)
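The same inference written out as a short script, with the conditional probabilities estimated on the previous slide:

```python
# PlayTennis query x = <Sunny, Cool, High, Strong> with the slide's estimates.
p_yes, p_no = 9/14, 5/14
cond_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
cond_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}

x = ["Sunny", "Cool", "High", "Strong"]
score_yes, score_no = p_yes, p_no
for value in x:
    score_yes *= cond_yes[value]
    score_no  *= cond_no[value]

print(round(score_yes, 4), round(score_no, 4))      # 0.0053 0.0206
print(round(score_no / (score_yes + score_no), 3))  # 0.795 -> PlayTennis = No
```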

Page 45: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Two classes

• The Naïve Bayes classifier gives a rule for prediction:

  c_NB = argmax_{c_j∈C} P(c_j) ∏_k P(a_k | c_j)

• In the case of two classes, c ∈ {0, 1}, we predict c = 1 iff:

  [ P(c = 1) ∏_{k=1}^{n} P(a_k | c = 1) ] / [ P(c = 0) ∏_{k=1}^{n} P(a_k | c = 0) ] > 1

Page 46: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Two classes

Denote  p_k = P(a_k = 1 | c = 1)  and  q_k = P(a_k = 1 | c = 0).

Then, for binary attributes a_k ∈ {0, 1}:

  P(a_k | c = 1) = p_k^{a_k} (1 − p_k)^{1 − a_k}
  P(a_k | c = 0) = q_k^{a_k} (1 − q_k)^{1 − a_k}

So the condition for predicting c = 1 becomes:

  [ P(c = 1) ∏_{k=1}^{n} p_k^{a_k} (1 − p_k)^{1 − a_k} ] / [ P(c = 0) ∏_{k=1}^{n} q_k^{a_k} (1 − q_k)^{1 − a_k} ] > 1

Page 47: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Two classes

Dividing term by term:

  [ P(c = 1) / P(c = 0) ] · ∏_{k=1}^{n} [ p_k^{a_k} (1 − p_k)^{1 − a_k} ] / [ q_k^{a_k} (1 − q_k)^{1 − a_k} ] > 1

Take logarithms; we predict c = 1 iff:

  log[ P(c = 1) / P(c = 0) ] + Σ_k log[ (1 − p_k) / (1 − q_k) ] + Σ_k a_k ( log[ p_k / (1 − p_k) ] − log[ q_k / (1 − q_k) ] ) > 0

So naïve Bayes is a linear separator with weights

  w_k = log[ p_k / (1 − p_k) ] − log[ q_k / (1 − q_k) ] = log[ p_k (1 − q_k) / ( q_k (1 − p_k) ) ]

If p_k = q_k, then w_k = 0 and the feature is irrelevant.
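A sketch that builds the weights w_k and the bias from p_k, q_k and the class priors, and checks that the resulting linear rule agrees with the direct product form (the probabilities are made up for illustration):

```python
# Naive Bayes over boolean features written as a linear separator.
import math

p = [0.8, 0.5, 0.1]          # p_k = P(a_k = 1 | c = 1)  (made-up values)
q = [0.3, 0.5, 0.4]          # q_k = P(a_k = 1 | c = 0)  (made-up values)
prior1, prior0 = 0.4, 0.6    # P(c = 1), P(c = 0)

w = [math.log(pk / (1 - pk)) - math.log(qk / (1 - qk)) for pk, qk in zip(p, q)]
bias = math.log(prior1 / prior0) + sum(math.log((1 - pk) / (1 - qk))
                                       for pk, qk in zip(p, q))

def linear_predict(a):       # predict 1 iff the weighted sum is positive
    return 1 if sum(wk * ak for wk, ak in zip(w, a)) + bias > 0 else 0

def product_predict(a):      # the original product form of the decision rule
    s1 = prior1 * math.prod(pk if ak else 1 - pk for pk, ak in zip(p, a))
    s0 = prior0 * math.prod(qk if ak else 1 - qk for qk, ak in zip(q, a))
    return 1 if s1 > s0 else 0

for a in [(1, 0, 0), (0, 1, 1), (1, 1, 0)]:
    assert linear_predict(a) == product_predict(a)
print(w)   # w[1] == 0 because p and q agree on that feature: it is irrelevant
```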

Page 48: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Two classes

• In the case of two classes we have that:

  log[ P(c_j = 1 | x) / P(c_j = 0 | x) ] = Σ_k w_k a_k − b   ⇒   P(c_j = 1 | x) / P(c_j = 0 | x) = exp( Σ_k w_k a_k − b )

• but since P(c_j = 0 | x) = 1 − P(c_j = 1 | x),

• we thus get:

  P(c_j = 1 | x) = 1 / ( 1 + exp( −Σ_k w_k a_k + b ) )

• which is simply the sigmoid function used in the neural network representation.

Page 49: Naïve Bayes - 计算语言学研究所


Naïve Bayes as a Perceptron

[Figure: Naïve Bayes drawn as a perceptron. Inputs a_1, a_2, …, a_k, …, a_n feed a weighted sum Σ, together with a constant input "−1" whose weight b is the initial threshold; the sum is passed through the sigmoid 1/(1 + e^{−sum}). Each weight is w_k = log( p_k (1 − q_k) / ( q_k (1 − p_k) ) ), e.g. w_1 = log( p_1 (1 − q_1) / ( q_1 (1 − p_1) ) ).]

Page 50: Naïve Bayes - 计算语言学研究所


Naïve Bayes: Zero Probability

• If we never see something (it occurs 0 times) in the training set:

  P(a_k | c_j) = n_k / n = 0 !

• How should we deal with this?
• We can smooth the estimate with an m-estimate:

  P(a_k | c_j) = (n_k + m·p) / (n + m)

where p is a prior distribution over the values of a_k, and m is called the equivalent sample size: it can be interpreted as augmenting the n actual observations by an additional m "virtual" samples distributed according to p.

Page 51: Naïve Bayes - 计算语言学研究所


Document Classification/ Categorization

Assign labels to each document or web page:
• Labels are most often topics, such as Sina categories
  e.g., "体育" (sports), "财经" (finance), "教育" (education), …
• Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
• Labels may be opinions
  e.g., "like", "hate", "neutral"
• Labels may be domain-specific and binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"

Page 52: Naïve Bayes - 计算语言学研究所


Learning to Classify Text

• Naïve Bayes has been used heavily for text classification.
• Instance space D: text documents
• Text categories (the set of targets): C
• Training set: a set of examples (instances with labels)
• How do we classify a new text?

[Figure: document representation — training documents split into Group A and Group B]

Page 53: Naïve Bayes - 计算语言学研究所


Document representation

1. Attributes: Each word position is an attribute that takes the value corresponding to the word in that position. For example if a document has 100 words, then there are 100 attributes.

2. Attributes: Consider the number of words in the dictionary ~50,000. Consider each of these words an attribute and simply count how many times they appear in the document.

Page 54: Naïve Bayes - 计算语言学研究所


Attribute-1

• Attributes: one attribute for each word position;
• Number of attributes = L (length of the longest document);
• Type of attribute = N (number of words in the dictionary);
• An instance: a list of length L;
• What probabilities need to be estimated?

  P(a_i = word_k | c_j),  ∀ word_k

• Too many probabilities: C × L × N ≈ C × 100 × 50,000
• New assumption: seeing a word in the document does not depend on its position:

  P(a_i = word_k | c_j) = P(a_m = word_k | c_j),  ∀ i, m

• This reduces dimensionality to one attribute for each word.
• The number of probabilities to estimate is: C × N

Page 55: Naïve Bayes - 计算语言学研究所


Attribute-2

• Attributes: all the N words appearing in the training set; a document is represented as a bag of its words;

• Boolean: an instance is a list of length N, where 0 means the word is not in x and 1 means the word is in x;

• Problem: examples are too long
• Solution: list only the active features
• How many probabilities to estimate:

– C × N

Page 56: Naïve Bayes - 计算语言学研究所


Estimating Probabilities

• How to estimate P(w_k | c_j)?

  P(w_k | c_j) = n_k / n
               = (number of times word w_k occurs in training texts with label c_j) / (total number of word occurrences in training texts with label c_j)

• Sparsity of data is a problem:
– if n is small, the estimate is not accurate;
– if n_k is 0, it will dominate the estimate: we will never predict c_j. What if a word never appeared in the training set with label c_j but appears in the test data?

Page 57: Naïve Bayes - 计算语言学研究所


Smoothing

• There are many ways to smooth;
• It is an empirical issue.

  Original:    P(w_k | c_j) = n_k / n

  m-estimate:  P(w_k | c_j) = (n_k + m·p) / (n + m)

  Laplace:     P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|)

If m·p = 1 and m = |Vocabulary|, we have the Laplace estimate.
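A sketch of the three estimators side by side for a word with a zero count (the counts and vocabulary size are placeholders):

```python
# Unsmoothed, m-estimate, and Laplace estimates of P(w_k | c_j).
def unsmoothed(n_k, n):
    return n_k / n

def m_estimate(n_k, n, p, m):
    return (n_k + m * p) / (n + m)

def laplace(n_k, n, vocab_size):                  # m-estimate with p = 1/|V|, m = |V|
    return (n_k + 1) / (n + vocab_size)

n_k, n, V = 0, 500, 10_000   # a word that never occurs with label c_j
print(unsmoothed(n_k, n))                  # 0.0 -> would zero out the whole product
print(m_estimate(n_k, n, p=1/V, m=V))      # ~9.5e-05
print(laplace(n_k, n, V))                  # same value
```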

Page 58: Naïve Bayes - 计算语言学研究所


Algorithm Learn-Naïve-Bayes-Text (D, V )

– 1. Collect all words, punctuation, and other tokens that occur in document set D

• Vocabulary ← {all distinct words, tokens occurring in any document x ∈ D}

– 2. Calculate required P(cj) and P(xi = wk | cj) probability terms

• FOR each target value cj ∈ C DO

– Docs [ j ] ← {documents x ∈ D & c(x) = cj }

– Text [ j ] ← Concatenation (docs [ j ] ) // form a single document

– n ← total number of distinct word positions in text [ j ]

– FOR each word wk in Vocabulary

» nk ← number of times word wk occurs in text [ j ]

– 3. RETURN <{P(cj)}, {P(wk | cj)}>

  P(c_j) = |docs[ j ]| / |D|

  P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|)

Page 59: Naïve Bayes - 计算语言学研究所


Applying Naïve Bayes to Classify Text

• Function Classify-Naïve-Bayes-Text (x, Vocabulary)

– Positions ← {word positions in document x that contain tokens found in

Vocabulary}

– RETURN

• Purpose of Classify-Naïve-Bayes-Text

– Returns estimated target value for new document

– ai: denotes word found in the ith position within x

  c_NB = argmax_{c_j∈C} P(c_j) ∏_{i∈Positions} P(a_i | c_j)
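An end-to-end sketch of Learn-Naïve-Bayes-Text and Classify-Naïve-Bayes-Text with Laplace smoothing; the tiny documents and labels at the bottom are invented for illustration:

```python
# Multinomial Naive Bayes for text, following the two procedures above.
import math
from collections import Counter, defaultdict

def learn_nb_text(examples):                     # examples: list of (word_list, label)
    vocabulary = {w for words, _ in examples for w in words}
    prior, counts, totals = {}, defaultdict(Counter), Counter()
    for c in {label for _, label in examples}:
        class_docs = [words for words, label in examples if label == c]
        prior[c] = len(class_docs) / len(examples)            # P(c_j)
        for words in class_docs:
            counts[c].update(words)                           # n_k per word
            totals[c] += len(words)                           # n
    return vocabulary, prior, counts, totals

def classify_nb_text(x, vocabulary, prior, counts, totals):
    def log_p(w, c):                             # Laplace-smoothed log P(w_k | c_j)
        return math.log((counts[c][w] + 1) / (totals[c] + len(vocabulary)))
    scores = {c: math.log(prior[c]) + sum(log_p(w, c) for w in x if w in vocabulary)
              for c in prior}
    return max(scores, key=scores.get)

examples = [("good great fun".split(), "pos"),
            ("boring bad awful".split(), "neg"),
            ("great movie good plot".split(), "pos"),
            ("bad acting awful plot".split(), "neg")]
model = learn_nb_text(examples)
print(classify_nb_text("good fun plot".split(), *model))      # pos
```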

Page 60: Naïve Bayes - 计算语言学研究所


Word Sense Disambiguation

• Problem: look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier combines the evidence from all features.

• Naive Bayes is useful even though its assumption is incorrect in the context of text processing:
– The structure and linear ordering of words is ignored: a bag-of-words model.
– The presence of one word is assumed to be independent of another, which is clearly untrue in text.

Page 61: Naïve Bayes - 计算语言学研究所


Word Sense Disambiguation• Problem Definition

– Given: m sentences, each containing a usage of a particular ambiguous word

– Example: “The can will rust.” (auxiliary verb versus noun)

– Label: cj ≡ s ≡ correct word sense (e.g., s ∈ {auxiliary verb, noun})

– Representation: m examples (labeled attribute vectors <(w1, w2, …, wn), s>)

– Return: classifier f: X → C that disambiguates new x ≡ (w1, w2, …, wn)

• Solution Approach: Use Naïve Bayes

  P(s | w_1, w_2, …, w_n) ∝ P(s) ∏_{i=1}^{n} P(w_i | s)

Page 62: Naïve Bayes - 计算语言学研究所


Topic Detection

• The task is to identify the most salient topic in a given document

• select a topic, t, from the set of possible topics, T;
• compute

  v_NB = argmax_{t∈T} P(t) ∏_{i=1..N} P(w_i | t)

Page 63: Naïve Bayes - 计算语言学研究所


Comments on Naïve Bayes

• Tends to work well despite strong assumption of conditional independence.

• Experiments show it to be quite competitive with other classification methods.

• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.

• Does not perform any search of the hypothesis space. It directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.

• Does not guarantee consistency with the training data.
• Typically handles noise well, since it does not even try to fit the training data completely.