Naïve Bayes - Institute of Computational Linguistics
2007-4-3 Wang Houfeng, Icler, PKU 1
Naïve Bayes
Wang Houfeng
Institute of Computational Linguistics
Peking University
Outline
Introduction
• Maximum Likelihood Estimation
• Naïve Bayesian Classification
Introduction
Bayesian learning enables us to form predictions based on probabilities. It provides a framework for probabilistic reasoning. The advantages:
• It tolerates noise in the data;
• It can incorporate "prior knowledge" in constructing a hypothesis;
• Predictions are probabilistic;
• The framework provides a view of optimal decision making.
Probability Definition
• A probability is a function from an event space to a real number between 0 and 1 (inclusive), where:
  – 0 ≤ P(A) ≤ 1
    • 0 indicates impossibility
    • 1 indicates certainty
  – P(Ω) = 1, where Ω is the sample space
  – P(X) ≤ P(Y) for any X ⊆ Y
  – If A1, A2, …, An are pairwise disjoint (Ai ∩ Aj = ∅ for i ≠ j), then

    P(∪_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai)
Interpretation of probability
• Relative Frequency
  – Suppose that an experiment is performed n times and an event A occurs f times. The relative frequency of A is:

    f_n = (number of times A occurs) / n = f / n

  – If we let n get infinitely large:

    P(A) = lim_{n→∞} f_n
Interpretation of probability
• In frequentist statistics, probabilities are associated only with the data, i.e., outcomes of repeatable observations:
Probability = limiting frequency
Some Rules
• For any two events, A and B, the probability of their union, P(A ∪ B), is

  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• A Special Case: when two events A and B are mutually exclusive, P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).
Independence
• Two events A and B are independent if and only if:

  P(B|A) = P(B)
  or P(A|B) = P(A)
  or P(A ∩ B) = P(A) P(B)
Probabilistic Classification
• Input: x = [x1, x2]^T; Output: C ∈ {−1, 1}
• Prediction:

  choose C = 1 if P(C = 1 | x1, x2) > 0.5, otherwise choose C = −1

  or equivalently:

  choose C = 1 if P(C = 1 | x1, x2) > P(C = −1 | x1, x2), otherwise choose C = −1
Basics of Probabilistic Learning
• Goal: find the best hypothesis from some space H of hypotheses, given the observed data D;
• Define "best" to be: the most probable hypothesis in H;
• In order to do that, we need to assume a probability distribution over H;
• In addition, we need to know something about the relation between the observed data and the hypotheses.
Bayes Theorem
Hypothesis space: H; dataset: D. Four probabilities are introduced as follows:
• P(h): the prior probability of h ∈ H before any data is observed. It reflects background knowledge; if no information is available, a uniform distribution is chosen.
• P(D): the probability of seeing data D; it is the evidence (with no knowledge of the hypothesis).
• P(D|h): the probability of the data given h. It is called the likelihood of h with respect to D.
• P(h|D): the posterior probability of h after having seen data D, i.e., the probability that h is the target.
Bayes Theorem
Bayes theorem relates the posterior probability of a hypothesis given the data to the three probabilities mentioned before:

  P(h|D) = P(D|h) · P(h) / P(D)

where P(h|D) is the posterior probability, P(D|h) the likelihood, P(h) the prior probability, and P(D) the evidence.
Hypotheses in Bayesian
• Hypotheses h refers to processes that could have generated the data D
• Bayesian inference provides a distribution over these hypotheses, given D
• P(D|h) is the probability of D being generated by the process identified by h
• Hypotheses h are mutually exclusive: only one process could have generated D
The origin of Bayes’ rule
• For any two random variables:

  p(A, B) = p(A) p(B|A)
  p(A, B) = p(B) p(A|B)

  Equating the two factorizations:

  p(B) p(A|B) = p(A) p(B|A)

  p(A|B) = p(B|A) p(A) / p(B)
Bayes’ rule in odds form
D: data
h1, h2: models
P(h1|D): posterior probability that h1 generated the data
P(D|h1): likelihood of the data under model h1
P(h1): prior probability that h1 generated the data

  P(h1|D) / P(h2|D) = [P(h1, D) / P(D)] / [P(h2, D) / P(D)]
                    = P(h1, D) / P(h2, D)
                    = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]

where P(D|h1) / P(D|h2) is the likelihood ratio.
Comparing two hypotheses
D: HHTHT
Hypothesis h1: "fair coin"; hypothesis h2: "always heads"
P(D|h1) = 1/2^5    P(h1) = 999/1000
P(D|h2) = 0        P(h2) = 1/1000

P(h1|D) / P(h2|D) = infinity

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
Comparing two hypotheses
D: HHHHH
Hypothesis h1: "fair coin"; hypothesis h2: "always heads"
P(D|h1) = 1/2^5    P(h1) = 999/1000
P(D|h2) = 1        P(h2) = 1/1000

P(h1|D) / P(h2|D) ≈ 30

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
Comparing two hypotheses
D: HHHHHHHHHH
h1, h2: "fair coin", "always heads"
P(D|h1) = 1/2^10   P(h1) = 999/1000
P(D|h2) = 1        P(h2) = 1/1000

P(h1|D) / P(h2|D) ≈ 1

  P(h1|D) / P(h2|D) = [P(D|h1) P(h1)] / [P(D|h2) P(h2)]
h1 vs. h2
• Hypotheses h1:“fair coin”; Hypotheses h2: “always heads”
• Which one is better?
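The three comparisons above can be reproduced numerically. A minimal sketch (the function name `posterior_odds` and the dataset strings are illustrative):

```python
from math import inf

def posterior_odds(data, prior1=999/1000, prior2=1/1000):
    """Posterior odds P(h1|D)/P(h2|D): h1 = fair coin, h2 = always heads."""
    like1 = 0.5 ** len(data)                             # P(D|h1)
    like2 = 1.0 if all(c == "H" for c in data) else 0.0  # P(D|h2)
    num, den = like1 * prior1, like2 * prior2
    return num / den if den > 0 else inf

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))  # inf, then ~31 (≈30), then ~0.98 (≈1)
```

The prior odds of 999:1 favor the fair coin, so it takes more than five heads in a row before the "always heads" hypothesis catches up.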
For K (>2) Classes

  P(hi|D) = P(D|hi) P(hi) / P(D)
          = P(D|hi) P(hi) / Σ_{k=1}^{K} P(D|hk) P(hk)

  where P(hi) ≥ 0 and Σ_{i=1}^{K} P(hi) = 1

  choose hi if P(hi|D) = max_k P(hk|D)
Outline
• Introduction
Maximum Likelihood Estimation
• Naïve Bayesian Classification
Maximum A Posteriori
• Now we attempt to find the most probable hypothesis h ∈ H, given the observed data.
• A method that looks for the hypothesis with maximum P(h|D) is called a maximum a posteriori method, or MAP.

  h_MAP = argmax_{h∈H} P(h|D)
        = argmax_{h∈H} P(D|h) P(h) / P(D)
        = argmax_{h∈H} P(D|h) P(h)

  where P(D) is independent of h.
Example: QA
• Query: does the patient have cancer or not?
• Two hypotheses: the patient has cancer; the patient doesn't have cancer. The lab test has two outcomes: ⊕ (positive) and ⊖ (negative);
• Prior knowledge: over the entire population of people, .008 have cancer;
• The lab test returns a correct positive result in 98% of the cases in which cancer is actually present, and a correct negative result in 97% of the cases in which cancer is actually not present:
• P(cancer) = .008, P(¬cancer) = .992
• P(⊕|cancer) = .98, P(⊖|cancer) = .02
• P(⊕|¬cancer) = .03, P(⊖|¬cancer) = .97
• So, given a new patient with a positive lab test, should we diagnose the patient as having cancer or not?
Example: QA
• Which is the MAP hypothesis?

  P(cancer) = 0.008      P(¬cancer) = 0.992
  P(⊕|cancer) = 0.98     P(⊖|cancer) = 0.02
  P(⊕|¬cancer) = 0.03    P(⊖|¬cancer) = 0.97

  P(⊕|cancer) P(cancer) = (.98)(.008) = .0078
  P(⊕|¬cancer) P(¬cancer) = (.03)(.992) = .0298

  P(cancer|⊕) = P(⊕|cancer) P(cancer) / P(⊕) = .0078 / (.0078 + .0298) ≈ .21

• So, the MAP hypothesis is: h_MAP = ¬cancer
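The arithmetic above can be checked with a short sketch; `map_posterior`, `sens` (for P(⊕|cancer)) and `spec` (for P(⊖|¬cancer)) are illustrative names:

```python
def map_posterior(prior, sens, spec):
    """P(cancer | positive test) by Bayes' rule."""
    joint_cancer = sens * prior               # P(+|cancer) P(cancer) = .0078
    joint_healthy = (1 - spec) * (1 - prior)  # P(+|~cancer) P(~cancer) = .0298
    return joint_cancer / (joint_cancer + joint_healthy)

post = map_posterior(0.008, sens=0.98, spec=0.97)
print(round(post, 2))  # 0.21: the MAP hypothesis is still "no cancer"
```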
Maximum Likelihood hypothesis
• Assume that, a priori, the hypotheses are equally probable:

  P(hi) = P(hj), ∀ hi, hj ∈ H

• The Maximum Likelihood hypothesis is then obtained as:

  h_ML = argmax_{h∈H} P(D|h)

• Now we just need to look for the hypothesis that best explains the data.
Maximum Likelihood Estimator
Let D = {x1, x2, …, xn} be the training set. Assuming the examples are drawn independently:

  h_ML = argmax_{h∈H} P(D|h) = argmax_{h∈H} ∏_i P(xi|h)

To find h_ML, set the derivative of the (log-)likelihood L(h) to zero:

  ∂L(h)/∂h = 0
A simple example
• Assumptions:
  – A coin has a probability p of heads, 1 − p of tails.
  – Observation: we toss the coin N times, and the result is a sequence of Hs and Ts containing M Hs.
• What is the value of p based on MLE, given the observation?

  L(θ) = log P(D|θ) = log[p^M (1 − p)^(N−M)] = M log p + (N − M) log(1 − p)

  dL(θ)/dp = d(M log p + (N − M) log(1 − p))/dp = M/p − (N − M)/(1 − p) = 0

  ⇒ p = M/N
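The closed form p = M/N can be sanity-checked against a brute-force grid search over the log-likelihood; the values M = 7, N = 10 are illustrative:

```python
import math

def log_likelihood(p, M, N):
    """L(p) = M log p + (N - M) log(1 - p)."""
    return M * math.log(p) + (N - M) * math.log(1 - p)

M, N = 7, 10
p_mle = M / N  # the closed-form MLE derived above
# The closed form should beat every alternative on a fine grid.
p_grid = max((i / 1000 for i in range(1, 1000)),
             key=lambda p: log_likelihood(p, M, N))
print(p_mle, p_grid)  # 0.7 0.7
```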
Bayesian Learning : Unbiased Coin
• Coin Flip
  – Sample space: Ω = {Head, Tail}
  – Scenario: the given coin is either fair or has a 60% bias in favor of Head
    • h1 ≡ fair coin: P(Head) = 0.5
    • h2 ≡ 60% bias towards Head: P(Head) = 0.6
  – Objective: to decide between the hypotheses
• A Priori Distribution on H
  – P(h1) = 0.75, P(h2) = 0.25
Bayesian Learning : Unbiased Coin
• Collection of Evidence
  – First piece of evidence: d = a single coin toss, which comes up Head
  – Q: What does the agent believe now?
  – A: Compute P(d) = P(d|h1) P(h1) + P(d|h2) P(h2)
• Bayesian Inference: Compute P(d) = P(d|h1) P(h1) + P(d|h2) P(h2)
  – P(Head) = 0.5 · 0.75 + 0.6 · 0.25 = 0.375 + 0.15 = 0.525
  – This is the probability of the observation d = Head
Bayesian Learning : Unbiased Coin
• Bayesian Learning
  – Now apply Bayes's Theorem:
    • P(h1|d) = P(d|h1) P(h1) / P(d) = 0.375 / 0.525 = 0.714
    • P(h2|d) = P(d|h2) P(h2) / P(d) = 0.15 / 0.525 = 0.286
    • Belief has been revised downwards for h1, upwards for h2
    • The agent still thinks that the fair coin is the more likely hypothesis
  – Suppose we were to use the ML approach (i.e., assume equal priors)
    • Belief in h2 is then revised upwards from 0.5
    • The data then supports the biased coin better
Bayesian Learning : Unbiased Coin
• More Evidence: a sequence D of 100 tosses with 70 heads and 30 tails
  – P(D) = (0.5)^70 · (0.5)^30 · 0.75 + (0.6)^70 · (0.4)^30 · 0.25
  – Now P(h1|D) << P(h2|D)
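The belief updates in this sequence of slides can be reproduced with a short sketch (the function name is illustrative):

```python
def posteriors(heads, tails, prior1=0.75, prior2=0.25):
    """Posteriors of h1 (fair coin) and h2 (P(Head) = 0.6) given the data."""
    like1 = 0.5 ** (heads + tails)
    like2 = (0.6 ** heads) * (0.4 ** tails)
    evidence = like1 * prior1 + like2 * prior2
    return like1 * prior1 / evidence, like2 * prior2 / evidence

print(posteriors(1, 0))  # (0.714..., 0.285...): the single-Head case above
p1, p2 = posteriors(70, 30)
print(p1 < p2)           # True: with 70 heads, P(h1|D) << P(h2|D)
```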
Brute Force MAP Hypothesis Learner
• Intuitive Idea: produce the most likely h given the observed D
• Algorithm Find-MAP-Hypothesis (D)
  – 1. FOR each hypothesis h ∈ H,
       calculate the conditional (i.e., posterior) probability:

         P(h|D) = P(D|h) P(h) / P(D)

  – 2. RETURN the hypothesis h_MAP with the highest conditional probability:

         h_MAP = argmax_{h∈H} P(h|D)
Outline
• Introduction
• Maximum Likelihood Estimation
Naïve Bayes Classification
Bayesian Classification
• Framework
  – Find the most probable classification (as opposed to the MAP hypothesis)
  – f: X → C (domain ≡ instance space, range ≡ finite set of values)
  – Instances x ∈ X can be described as a collection of features: x ≡ (a1, a2, …, an)
  – Performance element: Bayesian classifier
    • Given: an example
    • Output: the most probable value cj ∈ C

  c_MAP = argmax_{cj∈C} P(cj|x) = argmax_{cj∈C} P(cj | a1, a2, …, an)
        = argmax_{cj∈C} P(a1, a2, …, an | cj) P(cj)
Bayesian Classification
• Parameter Estimation Issues
  – P(cj) is easy to estimate: simply count the frequency of cj in D = {<x, c(x)>}
  – But it is infeasible to estimate P(a1, a2, …, an | cj): too many attribute combinations, most with 0 counts
  – We need to make assumptions that allow us to estimate P(x|c)
• Intuitive Idea
  – hMAP(x) is not necessarily the most probable classification!
  – Example:
    • Three possible hypotheses: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
    • Suppose that for a new instance x, h1(x) = +, h2(x) = –, h3(x) = –
    • What is the most probable classification of x?
Bayes Optimal Classification (BOC)
• Example:
  • P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1
  • P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0
  • P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0
• The Bayes optimal classification:

  c* = c_BOC = argmax_{cj∈C} Σ_{hi∈H} [P(cj|hi) · P(hi|D)]

• Result:

  Σ_{hi∈H} [P(+|hi) · P(hi|D)] = 0.4
  Σ_{hi∈H} [P(−|hi) · P(hi|D)] = 0.6

  so c* = −
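The BOC computation above can be sketched directly; the dictionary layout is illustrative:

```python
# Posterior over hypotheses, and each hypothesis's prediction distribution.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": {"+": 1.0, "-": 0.0},
              "h2": {"+": 0.0, "-": 1.0},
              "h3": {"+": 0.0, "-": 1.0}}

# Bayes optimal classification: weight each hypothesis's vote by its posterior.
score = {c: sum(prediction[h][c] * posterior[h] for h in posterior)
         for c in ("+", "-")}
label = max(score, key=score.get)
print(score, label)  # {'+': 0.4, '-': 0.6} -
```

Note that the MAP hypothesis h1 alone would predict +, while the optimal combined vote predicts −.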
New Issue
• BOC is computationally expensive. It computes the posterior probability for every h ∈ H and combines the predictions of all hypotheses to classify each new instance.
• Solutions:
  – Choose a single hypothesis h from H (e.g., the MAP hypothesis, or one drawn at random according to the posterior probability distribution over H, giving the Gibbs classifier) and use it to predict the novel instances!
  – We expect to simplify this estimation: Naïve Bayes
Naïve Bayes: Characteristics
• It is very difficult to compute the likelihood P(a1, a2, …, an | cj).
• Simplifying assumption (Naïve Bayes): the attribute values are conditionally independent given the target value.

  c_MAP = argmax_{cj∈C} P(cj|x) = argmax_{cj∈C} P(cj | a1, a2, …, an)
        = argmax_{cj∈C} P(a1, a2, …, an | cj) P(cj), where x = (a1, a2, …, an)

  P(a1, a2, …, an | cj) = ∏_i P(ai|cj)

  c_NB = argmax_{cj∈C} P(cj) ∏_i P(ai|cj)

• Results are comparable to ANNs and decision trees in some domains
• A moderate or large training set must be available
• Successful applications: classifying text documents
Naive Bayes’ Classifier
Given category cj, the ai are independent:

  p(x|cj) = p(a1|cj) p(a2|cj) … p(an|cj)

(Figure: a naive Bayes network; the class node cj has an arrow to each attribute node a1, a2, …, an, labeled p(a1|cj), p(a2|cj), …, p(an|cj).)
Naïve Bayes: Independence Issue
• Conditional Independence Assumption Often Violated
  – CI assumption: P(a1, a2, …, an | cj) = ∏_k P(ak|cj)
  – However, it works surprisingly well anyway
  – Note:
    • We don't need the estimated posterior P̂(cj|x) to be correct
    • We only need

      argmax_{cj∈C} P̂(cj) ∏_{k=1}^{n} P̂(ak|cj) = argmax_{cj∈C} P̂(cj) P̂(a1, a2, …, an | cj)
Naïve Bayes Algorithm
• Simple (Naïve) Bayes Assumption:

  P(a1, a2, …, an | cj) = ∏_k P(ak|cj)

• Simple (Naïve) Bayes Classifier:

  c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj)

• Algorithm Naïve-Bayes-Learn
  – FOR each target value cj
      P̂(cj) ← estimate[P(cj)]
      FOR each attribute value ak of each attribute x
        P̂(ak|cj) ← estimate[P(ak|cj)]
  – RETURN {P̂(cj)}, {P̂(ak|cj)}
Example
• Concept: PlayTennis

  c_NB = argmax_{cj∈C} P̂(cj) ∏_{k=1}^{n} P̂(ak|cj)

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
  1    Sunny     Hot          High      Light   No
  2    Sunny     Hot          High      Strong  No
  3    Overcast  Hot          High      Light   Yes
  4    Rain      Mild         High      Light   Yes
  5    Rain      Cool         Normal    Light   Yes
  6    Rain      Cool         Normal    Strong  No
  7    Overcast  Cool         Normal    Strong  Yes
  8    Sunny     Mild         High      Light   No
  9    Sunny     Cool         Normal    Light   Yes
  10   Rain      Mild         Normal    Light   Yes
  11   Sunny     Mild         Normal    Strong  Yes
  12   Overcast  Mild         High      Strong  Yes
  13   Overcast  Hot          Normal    Light   Yes
  14   Rain      Mild         High      Strong  No
Example
• Applying Naïve Bayes: the probabilities to estimate
– P(PlayTennis = {Yes, No}) 2 cases
– P(Outlook = {Sunny, Overcast, Rain} | PT = {Yes, No}) 6 cases
– P(Temp = {Hot, Mild, Cool} | PT = {Yes, No}) 6 cases
– P(Humidity = {High, Normal} | PT = {Yes, No}) 4 cases
– P(Wind = {Light, Strong} | PT = {Yes, No}) 4 cases
• Query: New Example x = <Sunny, Cool, High, Strong, ?>– Desired inference: P(PlayTennis = Yes | x) = 1 - P(PlayTennis = No | x)
– P(PlayTennis = Yes) = 9/14 = 0.64 P(PlayTennis = No) = 5/14 = 0.36
– P(Outlook = Sunny | PT = Yes) = 2/9 P(Outlook = Sunny | PT = No) = 3/5
– P(Temperature = Cool | PT = Yes) = 3/9 P(Temperature = Cool | PT = No) = 1/5
– P(Humidity = High | PT = Yes) = 3/9 P(Humidity = High | PT = No) = 4/5
– P(Wind = Strong | PT = Yes) = 3/9 P(Wind = Strong | PT = No) = 3/5
Example
• Inference
  – P(PlayTennis = Yes, <Sunny, Cool, High, Strong>) =
    P(Yes) P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) ≈ 0.0053
  – P(PlayTennis = No, <Sunny, Cool, High, Strong>) =
    P(No) P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No) ≈ 0.0206
  – So, c_NB = No
  – By normalization:
    P(PlayTennis = No | x) ≈ 0.0206 / (0.0053 + 0.0206) ≈ 0.795

  c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj)
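The full PlayTennis computation can be reproduced from the table with frequency (maximum-likelihood) estimates; `joint` is an illustrative helper name:

```python
from collections import Counter

data = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
labels = Counter(row[-1] for row in data)

def joint(x, c):
    """P(c) * prod_k P(a_k|c) with frequency estimates from the table."""
    p = labels[c] / len(data)
    for k, a in enumerate(x):
        p *= sum(1 for row in data if row[-1] == c and row[k] == a) / labels[c]
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {c: joint(x, c) for c in labels}
pred = max(scores, key=scores.get)
print(pred, round(scores["Yes"], 4), round(scores["No"], 4))  # No 0.0053 0.0206
```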
Naïve Bayes: Two classes
• The Naïve Bayes method gives us a way to make predictions.
• In the case of two classes, c ∈ {0, 1}, we predict c = 1 iff:

  [P(c=1) ∏_{k=1}^{n} P(ak | c=1)] / [P(c=0) ∏_{k=1}^{n} P(ak | c=0)] > 1

  (recall c_NB = argmax_{cj∈C} P(cj) ∏_k P(ak|cj))
Naïve Bayes: Two classes
Denote:

  pk = P(ak = 1 | c = 1),  qk = P(ak = 1 | c = 0)

Now, since the attributes are binary:

  P(ak = 1 | c = 1) = pk,  so P(ak | c = 1) = pk^ak (1 − pk)^(1−ak)
  P(ak = 1 | c = 0) = qk,  so P(ak | c = 0) = qk^ak (1 − qk)^(1−ak)

So we predict c = 1 iff:

  [P(c=1) ∏_{k=1}^{n} pk^ak (1 − pk)^(1−ak)] / [P(c=0) ∏_{k=1}^{n} qk^ak (1 − qk)^(1−ak)] > 1
Naïve Bayes: Two classes

Rewriting the ratio:

  [P(c=1) ∏_k pk^ak (1 − pk)^(1−ak)] / [P(c=0) ∏_k qk^ak (1 − qk)^(1−ak)]
  = [P(c=1) ∏_k (1 − pk)] / [P(c=0) ∏_k (1 − qk)] · ∏_k [pk(1 − qk) / (qk(1 − pk))]^ak > 1

Take the logarithm; we predict c = 1 iff:

  log[P(c=1) / P(c=0)] + Σ_k log[(1 − pk) / (1 − qk)] + Σ_k ak · log[pk(1 − qk) / (qk(1 − pk))] > 0

So naive Bayes is a linear separator with weights:

  wk = log[pk(1 − qk) / (qk(1 − pk))] = log[pk / qk] + log[(1 − qk) / (1 − pk)]

If pk = qk, then wk = 0 and the feature is irrelevant.
Naïve Bayes: Two classes
• In the case of two classes we have:

  log[P(c=1|x) / P(c=0|x)] = Σ_k wk ak − b

• But since P(c=0|x) = 1 − P(c=1|x), we get:

  P(c=1|x) = 1 / (1 + exp(−Σ_k wk ak + b))

• This is simply the sigmoid function used in the neural network representation.
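The equivalence between the direct product form and the linear/sigmoid form can be checked numerically; the parameter values below are illustrative (note that p2 = q2 makes w2 = 0, an irrelevant feature, as stated above):

```python
import math

p = [0.8, 0.5, 0.9]   # p_k = P(a_k = 1 | c = 1)
q = [0.3, 0.5, 0.2]   # q_k = P(a_k = 1 | c = 0)
prior1 = 0.4          # P(c = 1); P(c = 0) = 1 - prior1

# Weights and bias from the derivation above.
w = [math.log(pk * (1 - qk) / (qk * (1 - pk))) for pk, qk in zip(p, q)]
b = -(math.log(prior1 / (1 - prior1))
      + sum(math.log((1 - pk) / (1 - qk)) for pk, qk in zip(p, q)))

def sigmoid_form(x):
    s = sum(wk * ak for wk, ak in zip(w, x)) - b
    return 1 / (1 + math.exp(-s))

def direct_form(x):
    j1, j0 = prior1, 1 - prior1
    for ak, pk, qk in zip(x, p, q):
        j1 *= pk if ak else 1 - pk
        j0 *= qk if ak else 1 - qk
    return j1 / (j1 + j0)

x = (1, 0, 1)
print(w[1])  # 0.0: feature 2 is irrelevant since p2 = q2
print(abs(sigmoid_form(x) - direct_form(x)) < 1e-9)  # True: the forms agree
```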
Naïve Bayes as a Perceptron

(Figure: a perceptron with inputs a1, a2, …, ak, …, an and a constant input "−1" with initial threshold b; each input ak is connected through a weight wk = log[pk(1 − qk) / (qk(1 − pk))] (e.g., w1 = log[p1(1 − q1) / (q1(1 − p1))]) to a summation node Σ, whose sum is passed through the sigmoid 1 / (1 + e^(−sum)).)
Naïve Bayes: Zero Probability
• If we never see something (0 times) in the training set:

  P(ak|cj) = nk / n = 0 !

• How should we deal with this?
• We can smooth the estimate with the m-estimate:

  P(ak|cj) = (nk + m·p) / (n + m)

  where p is a prior distribution for ak and m is called the equivalent sample size: it can be interpreted as augmenting the n actual observations by m additional "virtual samples" distributed according to p.
Document Classification / Categorization
Assign labels to each document or web page:
• Labels are most often topics, such as Sina categories
  e.g., "体育" (sports), "财经" (finance), "教育" (education) …
• Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
• Labels may be opinions
  e.g., "like", "hate", "neutral"
• Labels may be domain-specific and binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"
Learning to Classify Text
• Naïve Bayes has been used heavily for text classification.
• Instance space D: text documents
• Text categories (the set of targets): C
• Training set: a set of examples (instances with labels)
• How do we classify a new text?

(Figure: document representation, with documents split into Group A and Group B.)
Document representation
1. Attributes: Each word position is an attribute that takes the value corresponding to the word in that position. For example if a document has 100 words, then there are 100 attributes.
2. Attributes: Consider the number of words in the dictionary ~50,000. Consider each of these words an attribute and simply count how many times they appear in the document.
Attribute-1
• Attributes: one attribute for each word position;
• Number of attributes = L (length of the longest document);
• Type of attribute = N (number of words in the dictionary);
• An instance: a list of length L;
• What are the probabilities we need to estimate?

  P(ai = wordk | cj), ∀ wordk

• Too many probabilities: (C x L x N ~ C x 100 x 50,000)
• New assumption: seeing a word in the document does not depend on its position:

  P(ai = wordk | cj) = P(am = wordk | cj), ∀ i, m

• This reduces the dimensionality to one attribute for each word.
• The number of probabilities to estimate is: C x N
Attribute-2
• Attributes: all the N words appearing in the training set; a document is represented as a bag of its words;
• Boolean: an instance is a list of length N; 0 means the word is not in x, 1 means the word is in x;
• Problem: examples are too long
  – List only the active features
• How many probabilities to estimate:
  – C x N
Estimating Probabilities
• How do we estimate P(wk|cj)?

  P(wk|cj) = (number of times word wk occurs in training texts with label cj) / (total number of times all words occur in training texts with label cj) = nk / n

• Sparsity of data is a problem:
  – if n is small, the estimate is not accurate
  – if nk is 0, it will dominate the estimate: we will never predict cj
• What if a word never appeared in the training set with label cj but appears in the test data?
Smoothing
• There are many ways to smooth;
• It is an empirical issue.

  Original:   P(wk|cj) = nk / n
  m-estimate: P(wk|cj) = (nk + m·p) / (n + m)
  Laplace:    P(wk|cj) = (nk + 1) / (n + |Vocabulary|)

  Laplace smoothing is the m-estimate with p = 1/|Vocabulary| and m = |Vocabulary|.
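The three estimators can be compared on an unseen word (nk = 0); the function names and counts are illustrative:

```python
def original(nk, n):
    return nk / n

def m_estimate(nk, n, p, m):
    return (nk + m * p) / (n + m)

def laplace(nk, n, vocab_size):
    return (nk + 1) / (n + vocab_size)

nk, n, V = 0, 100, 50     # a word never seen with this label
print(original(nk, n))    # 0.0: the raw estimate zeroes out the whole product
print(laplace(nk, n, V))  # small but nonzero
# Laplace is the m-estimate with p = 1/V and m = V:
print(abs(laplace(nk, n, V) - m_estimate(nk, n, p=1/V, m=V)) < 1e-12)  # True
```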
Algorithm Learn-Naïve-Bayes-Text (D, V)
– 1. Collect all words, punctuation, and other tokens that occur in the document set D
  • Vocabulary ← {all distinct words and tokens occurring in any document x ∈ D}
– 2. Calculate the required P(cj) and P(xi = wk | cj) probability terms
  • FOR each target value cj ∈ C DO
    – docs[j] ← {documents x ∈ D with c(x) = cj}
    – Text[j] ← Concatenation(docs[j]) // form a single document
    – n ← total number of word positions in Text[j]
    – P(cj) = |docs[j]| / |D|
    – FOR each word wk in Vocabulary
      » nk ← number of times word wk occurs in Text[j]
      » P(wk|cj) = (nk + 1) / (n + |Vocabulary|)
– 3. RETURN <{P(cj)}, {P(wk|cj)}>
Applying Naïve Bayes to Classify Text
• Function Classify-Naïve-Bayes-Text (x, Vocabulary)
  – Positions ← {word positions in document x that contain tokens found in Vocabulary}
  – RETURN

    c_NB = argmax_{cj∈C} P(cj) ∏_{i∈Positions} P(ai|cj)

• Purpose of Classify-Naïve-Bayes-Text
  – Returns the estimated target value for the new document
  – ai denotes the word found in the i-th position within x
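A minimal sketch of the learn/classify pair on a toy corpus, using Laplace smoothing as above (the corpus and the function names are illustrative):

```python
from collections import Counter

docs = [("the match was a great game of sport", "sports"),
        ("stocks fell as the market closed lower", "finance"),
        ("the team won the game in extra time", "sports")]
vocab = {w for text, _ in docs for w in text.split()}

def learn(docs):
    priors, cond = {}, {}
    for c in {lab for _, lab in docs}:
        texts = [t for t, lab in docs if lab == c]
        words = " ".join(texts).split()   # concatenate docs[j] into Text[j]
        counts, n = Counter(words), len(words)
        priors[c] = len(texts) / len(docs)
        # Laplace-smoothed P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, cond

def classify(x, priors, cond):
    scores = {c: priors[c] for c in priors}
    for w in x.split():
        if w in vocab:  # only positions whose token is in the vocabulary
            for c in scores:
                scores[c] *= cond[c][w]
    return max(scores, key=scores.get)

priors, cond = learn(docs)
print(classify("a great game", priors, cond))  # sports
```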
Word Sense Disambiguation
• Approach: look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier combines the evidence from all features.
• Naive Bayes is useful even though its assumptions are incorrect in the context of text processing:
  – The structure and linear ordering of words is ignored: a bag-of-words model.
  – The presence of one word is assumed to be independent of another, which is clearly untrue in text.
Word Sense Disambiguation
• Problem Definition
  – Given: m sentences, each containing a usage of a particular ambiguous word
  – Example: "The can will rust." (auxiliary verb versus noun)
  – Label: cj ≡ s ≡ correct word sense (e.g., s ∈ {auxiliary verb, noun})
  – Representation: m examples (labeled attribute vectors <(w1, w2, …, wn), s>)
  – Return: a classifier f: X → C that disambiguates new x ≡ (w1, w2, …, wn)
• Solution Approach: use Naïve Bayes
  P(s | w1, w2, …, wn) ∝ P(s) ∏_{i=1}^{n} P(wi | s)
Topic Detection
• The task is to identify the most salient topic in a given document:
  – select a topic t from the set of possible topics T;
  – compute

    v_NB = argmax_{t∈T} P(t) ∏_{i=1..N} P(wi | t)
Comments on Naïve Bayes
• Tends to work well despite the strong assumption of conditional independence.
• Experiments show it to be quite competitive with other classification methods.
• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
• Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
• Does not guarantee consistency with the training data.
• Typically handles noise well, since it does not even focus on completely fitting the training data.