Lecture 4: N-grams and Smoothing
Topics: Python NLTK, N-grams, Smoothing
Readings: Chapter 4 – Jurafsky and Martin
January 23, 2013
CSCE 771 Natural Language Processing
Last Time: Slides from Lecture 1, 30-
Regular expressions in Python (grep, vi, emacs, word); Eliza
Morphology
Today: Smoothing N-gram models: Laplace (plus 1), Good-Turing discounting, Katz backoff, Kneser-Ney
Problem
Let’s assume we’re using N-grams.
How can we assign a probability to a sequence where one of the component n-grams has a value of zero?
Assume all the words are known and have been seen. Either go to a lower order n-gram (back off from bigrams to unigrams) or replace the zero with something else.
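As a rough illustration (not from the lecture), here is a tiny Python sketch of why this matters: with plain maximum-likelihood bigram estimates, a single unseen bigram drives the probability of the whole sentence to zero. The toy corpus and sentences are invented for the example.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "i want chinese food </s> i want english food </s>".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w_prev, w):
    """Maximum-likelihood bigram estimate P(w | w_prev) = c(w_prev w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(words):
    """Product of bigram probabilities over a sentence."""
    prob = 1.0
    for w_prev, w in zip(words, words[1:]):
        prob *= p_mle(w_prev, w)
    return prob

print(sentence_prob("i want chinese food".split()))  # 0.5: every bigram was seen
print(sentence_prob("i want food </s>".split()))     # 0.0: "want food" never occurred
```

Every word in the second sentence is in the vocabulary; it is the single zero-count bigram that zeroes out the whole product.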
Smoothing
Smoothing – reevaluating some of the zero and low probability N-grams and assigning them non-zero values
Add-One (Laplace)
Make the zero counts 1; really, start counting at 1.
Rationale: They’re just events you haven’t seen yet. If you had seen them, chances are you would only have seen them once… so make the count equal to 1.
Add-One Smoothing
Terminology
N – number of total words
V – vocabulary size == number of distinct words
Maximum Likelihood estimate
P(w_i) = \frac{c(w_i)}{\sum_x c(w_x)}
Adjusted counts “C*”
Terminology
N – number of total words
V – vocabulary size == number of distinct words
Adjusted count C*:
c_i^* = (c_i + 1) \frac{N}{N + V}

Adjusted probabilities:
p_i^* = \frac{c_i^*}{N}
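A minimal Python sketch of these two formulas, applied to a made-up unigram count table (the words and counts are hypothetical, not the BERP data):

```python
from collections import Counter

def add_one_unigram(counts, vocab):
    """Add-one (Laplace) smoothing following the slide:
        adjusted count  c* = (c + 1) * N / (N + V)
        adjusted prob   p* = c* / N
    counts: Counter of observed words; vocab: full vocabulary (unseen words count 0)."""
    N = sum(counts.values())   # total number of word tokens
    V = len(vocab)             # number of distinct word types
    c_star = {w: (counts[w] + 1) * N / (N + V) for w in vocab}
    p_star = {w: c_star[w] / N for w in vocab}
    return c_star, p_star

# Hypothetical counts for illustration only.
counts = Counter({"want": 927, "to": 608, "eat": 75})
vocab = set(counts) | {"spinach"}          # in the vocabulary but never observed
c_star, p_star = add_one_unigram(counts, vocab)
print(round(c_star["spinach"], 3), round(p_star["spinach"], 6))  # non-zero after smoothing
```

Note that p* = c*/N simplifies to (c + 1)/(N + V), which is the form most textbooks quote.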
Discounting View
Discounting – lowering some of the larger non-zero counts to get the “probability” to assign to the zero entries
d_c – the discount applied to each count
The discounted probabilities can then be directly calculated
d_c = \frac{c^*}{c}

p_i^* = \frac{c_i + 1}{N + V}
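As a worked instance of the discount, take the bigram analogue of the add-one formula (N replaced by the count of the preceding word); the numbers below are merely illustrative, not the actual BERP figures:

c^*(w_{n-1} w_n) = \frac{\bigl(c(w_{n-1} w_n) + 1\bigr)\, c(w_{n-1})}{c(w_{n-1}) + V}

With hypothetical counts c(want to) = 608, c(want) = 927, and V = 1616:

c^* = \frac{609 \times 927}{927 + 1616} \approx 222, \qquad d_c = \frac{c^*}{c} \approx \frac{222}{608} \approx 0.37

so roughly two thirds of this seen bigram's count mass is taken away and redistributed over the zero entries.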
Original BERP Counts (fig 4.1)
Berkeley Restaurant Project data
V = 1616
Figure 4.5 Add-one counts (Laplace)
Counts
Probabilities
Figure 6.6 Add-one counts & probabilities
Counts
Probabilities
Add-One Smoothed bigram counts
Good-Turing Discounting
Singleton – a word that occurs only once
Good-Turing: Estimate the probability of words that occur zero times with the probability of a singleton
Generalize from words to bigrams, trigrams … events
Calculating Good-Turing
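The slide leaves the computation to the reading; a minimal Python sketch of the standard (unsmoothed) Good-Turing re-estimate from Jurafsky and Martin, c* = (c + 1) N_{c+1} / N_c with total unseen mass N_1 / N, might look like this (the toy counts are invented):

```python
from collections import Counter

def good_turing(counts):
    """Unsmoothed Good-Turing re-estimates:
        c* = (c + 1) * N_{c+1} / N_c,   P(all unseen events) = N_1 / N
    where N_c = number of types observed exactly c times.
    A real implementation smooths the N_c curve; here we fall back to the
    raw count whenever N_{c+1} is zero."""
    N = sum(counts.values())
    n_c = Counter(counts.values())            # frequency-of-frequency table
    c_star = {}
    for item, c in counts.items():
        if n_c.get(c + 1):
            c_star[item] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            c_star[item] = c                   # no N_{c+1}: keep the raw count
    p_unseen = n_c.get(1, 0) / N               # mass reserved for zero counts
    return c_star, p_unseen

# Invented toy counts (these could just as well be bigram or trigram types).
counts = Counter({"a": 3, "b": 2, "c": 1, "d": 1, "e": 1})
print(good_turing(counts))
```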
Witten-Bell
Think about the occurrence of an unseen item (word, bigram, etc.) as an event.
The probability of such an event can be measured in a corpus by just looking at how often it happens.
Just take the single word case first.
Assume a corpus of N tokens and T types.
How many times was an as yet unseen type encountered?
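One way to answer (the usual Witten-Bell argument, not spelled out on the slide): each of the T types was itself unseen at the moment it first appeared, so "new type" events occurred T times among N + T events, giving a total unseen-event probability of

P(\text{unseen}) = \frac{T}{N + T}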
Witten-Bell
First compute the probability of an unseen event.
Then distribute that probability mass equally among the as yet unseen events. That should strike you as odd for a number of reasons: in the case of words… in the case of bigrams…
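In the single-word case this works out to the standard Witten-Bell unigram form (following Jurafsky and Martin), where Z is the number of vocabulary words with zero count:

p_i^* = \frac{T}{Z\,(N + T)} \quad \text{if } c_i = 0

p_i^* = \frac{c_i}{N + T} \quad \text{if } c_i > 0

Dividing the unseen mass T/(N + T) equally over the Z unseen types is exactly the "equal distribution" step above, and it is what the next slide argues is too crude for bigrams.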
Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous: P(x|the) vs P(x|going)
So distribute the mass assigned to the zero-count bigrams according to their promiscuity
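Concretely, the per-context (bigram) version of Witten-Bell conditions every quantity on the preceding word: with N(w_{i-1}) bigram tokens starting with w_{i-1}, T(w_{i-1}) distinct continuation types, and Z(w_{i-1}) vocabulary words that never follow it, the usual formulation (again as in Jurafsky and Martin) is

p^*(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} \quad \text{if } c(w_{i-1} w_i) > 0

p^*(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)} \quad \text{if } c(w_{i-1} w_i) = 0

This is how the mass handed to zero-count bigrams ends up depending on how promiscuous the conditioning word is.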
Witten-Bell
Finally, renormalize the whole table so that you still have a valid probability
Original BERP Counts;
Now the Add-1 counts
Witten-Bell Smoothed and Reconstituted
Add-One Smoothed BERP Reconstituted