Probability Review and Naïve Bayes - Wei Xu · Probability Review and Naïve Bayes Some slides...

Instructor: Wei Xu · Probability Review and Naïve Bayes · Some slides adapted from Dan Jurafsky and Brendan O’Connor



Instructor: Wei Xu

Probability Review and Naïve Bayes

Some slides adapted from Dan Jurafsky and Brendan O’Connor


What is Probability?

• “The probability the coin will land heads is 0.5”
  – Q: what does this mean?
• 2 Interpretations:
  – Frequentist (Repeated trials)
    • If we flip the coin many times…
  – Bayesian
    • We believe there is equal chance of heads/tails
    • Advantage: events that do not have long-term frequencies
      E.g. What is the probability the polar ice caps will melt by 2050?


Probability Review

Conditional Probability

Chain Rule
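
For reference, the standard textbook forms of these two identities can be written as follows. The event names A, B, C are assumed here for illustration; they do not come from the slide.

```latex
% Conditional probability (defined when P(B) > 0)
P(A \mid B) = \frac{P(A, B)}{P(B)}

% Chain rule: rearranging the definition, then applying it repeatedly
P(A, B) = P(A \mid B)\, P(B)
P(A, B, C) = P(A \mid B, C)\, P(B \mid C)\, P(C)
```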


Probability Review

Disjunction / Union:

Negation:
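
The usual statements of these two rules, again with assumed event names A and B, are:

```latex
% Disjunction / union: subtract the overlap so it is not counted twice
P(A \cup B) = P(A) + P(B) - P(A \cap B)

% Negation: an event and its complement together have probability 1
P(\neg A) = 1 - P(A)
```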


Bayes Rule

Generative Model of How Hypothesis Causes Data

Bayesian Inference

Hypothesis (Unknown)

Data (Observed Evidence)

• Bayes Rule tells us how to “flip” the conditional probabilities
• Reason about effects to causes
• Useful if you assume a generative model for your data


Bayes Rule

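$$\underbrace{P(H \mid D)}_{\text{Posterior}} = \frac{\overbrace{P(D \mid H)}^{\text{Likelihood}}\;\overbrace{P(H)}^{\text{Prior}}}{\underbrace{P(D)}_{\text{Normalizer}}}$$

where H is the hypothesis (unknown) and D is the data (observed evidence).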


Bayes Rule

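$$\underbrace{P(H \mid D)}_{\text{Posterior}} \;\propto\; \underbrace{P(D \mid H)}_{\text{Likelihood}}\;\underbrace{P(H)}_{\text{Prior}}$$

Proportional To: without the normalizer, the right-hand side doesn’t sum to 1.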



Bayes Rule Example

• There is a disease that affects a tiny fraction of the population (0.001%)

• Symptoms include a headache and stiff neck. 50% of patients with the disease have these symptoms

• 5% of the general population has these symptoms.

Q: Assume you have the symptoms. What is your probability of having the disease?
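
One way to work the question is to plug the three numbers from the slide directly into Bayes Rule. The sketch below does that; the function and variable names are made up for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes Rule: P(H | D) = P(D | H) * P(H) / P(D)."""
    return likelihood * prior / evidence


p_disease = 0.001 / 100            # 0.001% of the population has the disease
p_symptoms_given_disease = 0.5     # 50% of patients show the symptoms
p_symptoms = 0.05                  # 5% of the general population shows them

p_disease_given_symptoms = posterior(
    p_disease, p_symptoms_given_disease, p_symptoms
)
print(f"P(disease | symptoms) = {p_disease_given_symptoms:.4%}")
```

The posterior comes out at about 1 in 10,000: the symptoms raise the probability tenfold over the prior, but the disease is so rare that it stays very unlikely.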


Another Bayes Rule Example

• The well-known OJ Simpson murder trial


Another Bayes Rule Example

• The prosecution presented evidence that Simpson had been violent toward his wife, and argued that a pattern of spousal abuse reflected a motive to kill.
• The defense attorney, Alan Dershowitz, argued that:
  - there was only one woman murdered for every 2500 women who were subjected to spousal abuse, and that any history of Simpson being violent toward his wife was irrelevant to the trial.
• In effect, both sides were asking the jury to consider the probability that a man murdered his ex-wife, given that he previously battered her.

What do you think? Discuss with your neighbors


Another Bayes Rule Example

• The defense attorney, Alan Dershowitz, argued that:
  - there was only one woman murdered for every 2500 women who were subjected to spousal abuse, and that any history of Simpson being violent toward his wife was irrelevant to the trial.
• In 1994, 5000 women were murdered, 1500 of them by their husbands. Assuming a population of 100 million women:
  - $P(\text{Murder} \mid \neg\text{Guilt}) = 3500 / (100 \times 10^6) \approx 1/30000$

What do you have now? Discuss with your neighbors
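
A rough way to combine these numbers is sketched below. It rests on assumptions that are not on the slides: the defense's one-in-2500 figure from the earlier slide is used as the prior P(Guilt | Abuse), P(Murder | Guilt) is taken to be 1, and the rate of murder by someone other than the husband is assumed not to depend on the abuse history. The function name is hypothetical.

```python
def posterior_guilt(prior_guilt, p_murder_given_guilt, p_murder_given_not_guilt):
    """P(Guilt | Murder) by Bayes Rule, with the normalizer expanded over
    the two hypotheses Guilt and not-Guilt."""
    numerator = p_murder_given_guilt * prior_guilt
    denominator = numerator + p_murder_given_not_guilt * (1.0 - prior_guilt)
    return numerator / denominator


prior = 1 / 2500                      # assumed prior: defense's abuse-to-murder rate
p_murder_not_guilt = 3500 / 100e6     # murdered by someone other than the husband

p_guilt = posterior_guilt(prior, 1.0, p_murder_not_guilt)
print(f"P(Guilt | Murder, Abuse) is roughly {p_guilt:.2f}")
```

Under these assumptions the posterior is above 0.9, so once the murder itself is taken as observed evidence, the abuse history is far from irrelevant.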


Text Classification


Is this Spam?


Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]


What is the subject of this article?

MEDLINE Article → ? → MeSH Subject Category Hierarchy

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …


Positive or negative movie review?

• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.


Text Classification: definition

• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C


Classification Methods: Hand-coded rules

• Rules based on combinations of words or other features
  – spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
  – If rules carefully refined by expert
• Running time is usually fast
• But building and maintaining these rules is expensive
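
The spam rule on this slide can be written as a few lines of code. This is only a sketch: the function name, the lower-casing, and the example addresses are assumptions made for illustration.

```python
def is_spam(sender, body, blacklist):
    """Hand-coded rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    has_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_phrases


blacklist = {"promo@example.com"}   # hypothetical black-listed address
print(is_spam("friend@example.org", "You have been selected to win 1000 dollars", blacklist))
print(is_spam("friend@example.org", "Lunch tomorrow?", blacklist))
```

Every new spam pattern needs another hand-written clause like these, which is where the maintenance cost comes from.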


Classification Methods: Supervised Machine Learning

• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
  – a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  – a learned classifier γ: d → c


Classification Methods: Supervised Machine Learning

• Any kind of classifier
  – Naïve Bayes
  – Logistic regression
  – Support-vector machines
  – k-Nearest Neighbors
  – …


Naïve Bayes Intuition

• Simple (“naïve”) classification method based on Bayes rule

• Relies on very simple representation of document:
  – Bag of words


Bag of words for document classification

Test document: parser, language, label, translation, … → which class?

Class               Typical words
Machine Learning    learning, training, algorithm, shrinkage, network, ...
NLP                 parser, tag, training, translation, language, ...
Garbage Collection  garbage, collection, memory, optimization, region, ...
Planning            planning, temporal, reasoning, plan, language, ...
GUI                 ...
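
A bag of words can be sketched as a word-count table. The tokenizer below (lower-casing and a simple regular expression) is an assumption for illustration, not something the slides specify.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts, discarding word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


bow = bag_of_words("The parser and the tagger share the language model")
print(bow.most_common(3))

# Two documents with the same words in a different order get the same bag.
same = bag_of_words("language parser") == bag_of_words("parser language")
```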


Bayes’ Rule Applied to Documents and Classes

• For a document d and a class c

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$


Naïve Bayes Classifier (I)

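$$
\begin{aligned}
c_{MAP} &= \arg\max_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
        &= \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes Rule} \\
        &= \arg\max_{c \in C} P(d \mid c)\,P(c) && \text{Dropping the denominator}
\end{aligned}
$$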


Naïve Bayes Classifier (II)

$$
\begin{aligned}
c_{MAP} &= \arg\max_{c \in C} P(d \mid c)\,P(c) \\
        &= \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$


Naïve Bayes Classifier (IV)

$$c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

• $P(c)$: How often does this class occur?
  – We can just count the relative frequencies in a corpus
• $P(x_1, x_2, \ldots, x_n \mid c)$: $O(|X|^n \cdot |C|)$ parameters
  – Could only be estimated if a very, very large number of training examples was available.


Multinomial Naïve Bayes Independence Assumptions

$P(x_1, x_2, \ldots, x_n \mid c)$

• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c_j$.
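
Taken together, the two assumptions let the joint likelihood be written as a product of per-feature terms. This is the standard Naïve Bayes factorization, written here in the slide's own symbols:

```latex
P(x_1, x_2, \ldots, x_n \mid c)
  = P(x_1 \mid c)\, P(x_2 \mid c) \cdots P(x_n \mid c)
  = \prod_{i=1}^{n} P(x_i \mid c)
```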


Multinomial Naïve Bayes Classifier

$$c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$


Applying Multinomial Naive Bayes to Text Classification

positions ← all word positions in test document

$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
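
The decision rule can be sketched as a loop over classes and word positions. Two choices below are assumptions rather than part of the slide: scores are summed as log probabilities (this leaves the argmax unchanged and avoids underflow on long documents), and words with no entry in the table are skipped. The toy probabilities are invented for illustration.

```python
import math


def classify(tokens, priors, likelihoods):
    """Return the class maximising P(c) * product over positions of P(x_i | c)."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:                      # one factor per word position
            if w in likelihoods[c]:           # skip out-of-vocabulary words
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical parameters for two classes
priors = {"nlp": 0.5, "planning": 0.5}
likelihoods = {
    "nlp": {"parser": 0.4, "language": 0.4, "plan": 0.2},
    "planning": {"parser": 0.1, "language": 0.3, "plan": 0.6},
}
print(classify(["parser", "language", "parser"], priors, likelihoods))
print(classify(["plan", "plan"], priors, likelihoods))
```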


Multinomial Naïve Bayes: Learning

• First attempt: maximum likelihood estimates
  – simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$
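
Both maximum likelihood estimates reduce to counting. A minimal sketch, assuming the training data is a list of (tokens, label) pairs; that input format and the toy corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict


def mle_estimates(labeled_docs):
    """Relative-frequency estimates of P(c) and P(w | c)."""
    doc_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)     # pool all words of the class

    n_docs = len(labeled_docs)
    priors = {c: doc_counts[c] / n_docs for c in doc_counts}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())          # sum over w in V of count(w, c)
        likelihoods[c] = {w: k / total for w, k in counts.items()}
    return priors, likelihoods


toy = [
    (["great", "plot"], "pos"),
    (["great", "fun"], "pos"),
    (["boring", "plot"], "neg"),
]
mle_priors, mle_likelihoods = mle_estimates(toy)
```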


Parameter Estimation

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

fraction of times word $w_i$ appears among all words in documents of topic $c_j$

• Create mega-document for topic j by concatenating all docs in this topic
  – Use frequency of w in mega-document


Problem with Maximum Likelihood

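• What if we have seen no training documents that contain the word fantastic and are classified in the topic positive (thumbs-up)?

$$\hat{P}(\text{fantastic} \mid \text{positive}) = \frac{count(\text{fantastic}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0$$

• Zero probabilities cannot be conditioned away, no matter the other evidence!

$$c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$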


Laplace (add-1) smoothing for Naïve Bayes

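Maximum likelihood estimate:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

With add-1 smoothing:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \bigl(count(w, c) + 1\bigr)} = \frac{count(w_i, c) + 1}{\Bigl(\sum_{w \in V} count(w, c)\Bigr) + |V|}$$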
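
The effect of add-1 smoothing on an unseen word can be checked with a small sketch. The counts and the four-word vocabulary below are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def smoothed_likelihood(word, class_counts, vocab_size, k=1.0):
    """Add-k estimate of P(word | class); k = 1 gives Laplace smoothing."""
    total = sum(class_counts.values())
    return (class_counts[word] + k) / (total + k * vocab_size)


positive_counts = Counter({"great": 3, "plot": 1})
vocab = {"great", "plot", "fantastic", "boring"}

# Unsmoothed estimate for a word never seen with the class: 0 / 4
p_mle = positive_counts["fantastic"] / sum(positive_counts.values())
# Smoothed estimate: (0 + 1) / (4 + 4)
p_add1 = smoothed_likelihood("fantastic", positive_counts, len(vocab))
print(p_mle, p_add1)
```

The smoothed estimates still sum to 1 over the vocabulary, because the denominator grows by exactly the amount added across all numerators.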


Multinomial Naïve Bayes: Learning

• Calculate P(cj) terms
  – For each cj in C do
      docsj ← all docs with class = cj
      $$P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$$


Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate P(wk | cj) terms
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      $$P(w_k \mid c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$$
    where n is the total number of word tokens in Textj
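
The learning steps for the priors and the word probabilities can be put together in one short routine. This is a sketch under assumptions: training data is a list of (tokens, label) pairs, and the function name and toy corpus are invented for illustration.

```python
from collections import Counter


def train_multinomial_nb(labeled_docs, alpha=1.0):
    """Estimate P(c_j) and smoothed P(w_k | c_j) from (tokens, label) pairs."""
    # From training corpus, extract Vocabulary
    vocabulary = {w for tokens, _ in labeled_docs for w in tokens}
    classes = {label for _, label in labeled_docs}

    priors, cond_probs = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in labeled_docs if label == c]
        priors[c] = len(docs_c) / len(labeled_docs)
        # Text_j: all documents of the class pooled into one bag of counts
        text_c = Counter(w for tokens in docs_c for w in tokens)
        n = sum(text_c.values())              # total word tokens in Text_j
        cond_probs[c] = {
            w: (text_c[w] + alpha) / (n + alpha * len(vocabulary))
            for w in vocabulary
        }
    return vocabulary, priors, cond_probs


corpus = [
    (["great", "plot"], "pos"),
    (["boring", "plot"], "neg"),
    (["great", "fun"], "pos"),
]
vocabulary, nb_priors, nb_cond = train_multinomial_nb(corpus, alpha=1.0)
```

With alpha set to 1 this is Laplace smoothing; values between 0 and 1 smooth less aggressively.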