Lecture 4: N-grams and Smoothing
Topics: Python NLTK, N-grams, Smoothing
Readings: Chapter 4 – Jurafsky and Martin
January 23, 2013
CSCE 771 Natural Language Processing
Last Time: Slides from Lecture 1, 30-
Regular expressions in Python (grep, vi, emacs, word); Eliza
Morphology
Today: Smoothing N-gram models: Laplace (plus 1), Good-Turing discounting, Katz backoff, Kneser-Ney
Problem
Let’s assume we’re using N-grams.
How can we assign a probability to a sequence where one of the component n-grams has a value of zero?
Assume all the words are known and have been seen. Either go to a lower order n-gram (back off from bigrams to unigrams) or replace the zero with something else.
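As a rough illustration (not from the lecture), here is a tiny Python sketch of why this matters: with plain maximum-likelihood bigram estimates, a single unseen bigram drives the probability of the whole sentence to zero. The toy corpus and sentences are invented for the example.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "i want chinese food </s> i want english food </s>".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w_prev, w):
    """Maximum-likelihood bigram estimate P(w | w_prev) = c(w_prev w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(words):
    """Product of bigram probabilities over a sentence."""
    prob = 1.0
    for w_prev, w in zip(words, words[1:]):
        prob *= p_mle(w_prev, w)
    return prob

print(sentence_prob("i want chinese food".split()))  # 0.5: every bigram was seen
print(sentence_prob("i want food </s>".split()))     # 0.0: "want food" never occurred
```

Every word in the second sentence is in the vocabulary; it is the single zero-count bigram that zeroes out the whole product.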
Smoothing
Smoothing – reevaluating some of the zero and low probability N-grams and assigning them non-zero values
Add-One (Laplace)
Make the zero counts 1; really, start counting at 1.
Rationale: They’re just events you haven’t seen yet. If you had seen them, chances are you would only have seen them once… so make the count equal to 1.
Add-One Smoothing
Terminology
N – number of total words
V – vocabulary size == number of distinct words
Maximum Likelihood estimate
P(w_i) = \frac{c(w_i)}{\sum_x c(w_x)}
Adjusted counts “C*”
Terminology
N – number of total words
V – vocabulary size == number of distinct words
Adjusted count C*:
c_i^* = (c_i + 1) \frac{N}{N + V}

Adjusted probabilities:
p_i^* = \frac{c_i^*}{N}
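A minimal Python sketch of these two formulas, applied to a made-up unigram count table (the words and counts are hypothetical, not the BERP data):

```python
from collections import Counter

def add_one_unigram(counts, vocab):
    """Add-one (Laplace) smoothing following the slide:
        adjusted count  c* = (c + 1) * N / (N + V)
        adjusted prob   p* = c* / N
    counts: Counter of observed words; vocab: full vocabulary (unseen words count 0)."""
    N = sum(counts.values())   # total number of word tokens
    V = len(vocab)             # number of distinct word types
    c_star = {w: (counts[w] + 1) * N / (N + V) for w in vocab}
    p_star = {w: c_star[w] / N for w in vocab}
    return c_star, p_star

# Hypothetical counts for illustration only.
counts = Counter({"want": 927, "to": 608, "eat": 75})
vocab = set(counts) | {"spinach"}          # in the vocabulary but never observed
c_star, p_star = add_one_unigram(counts, vocab)
print(round(c_star["spinach"], 3), round(p_star["spinach"], 6))  # non-zero after smoothing
```

Note that p* = c*/N simplifies to (c + 1)/(N + V), which is the form most textbooks quote.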
Discounting View
Discounting – lowering some of the larger non-zero counts to get the “probability” to assign to the zero entries
d_c – the discount applied to each count
The discounted probabilities can then be directly calculated
d_c = \frac{c^*}{c}

p_i^* = \frac{c_i + 1}{N + V}
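As a worked instance of the discount, take the bigram analogue of the add-one formula (N replaced by the count of the preceding word); the numbers below are merely illustrative, not the actual BERP figures:

c^*(w_{n-1} w_n) = \frac{\bigl(c(w_{n-1} w_n) + 1\bigr)\, c(w_{n-1})}{c(w_{n-1}) + V}

With hypothetical counts c(want to) = 608, c(want) = 927, and V = 1616:

c^* = \frac{609 \times 927}{927 + 1616} \approx 222, \qquad d_c = \frac{c^*}{c} \approx \frac{222}{608} \approx 0.37

so roughly two thirds of this seen bigram's count mass is taken away and redistributed over the zero entries.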
Original BERP Counts (fig 4.1)
Berkeley Restaurant Project data
V = 1616
Figure 4.5 Add-one counts (Laplace)
Counts
Probabilities
Figure 6.6 Add-one counts & probabilities
Counts
Probabilities
Add-One Smoothed bigram counts
Good-Turing Discounting
Singleton – a word that occurs only once
Good-Turing: Estimate the probability of words that occur zero times with the probability of a singleton
Generalize from words to bigrams, trigrams … events
Calculating Good-Turing
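The slide leaves the computation to the reading; a minimal Python sketch of the standard (unsmoothed) Good-Turing re-estimate from Jurafsky and Martin, c* = (c + 1) N_{c+1} / N_c with total unseen mass N_1 / N, might look like this (the toy counts are invented):

```python
from collections import Counter

def good_turing(counts):
    """Unsmoothed Good-Turing re-estimates:
        c* = (c + 1) * N_{c+1} / N_c,   P(all unseen events) = N_1 / N
    where N_c = number of types observed exactly c times.
    A real implementation smooths the N_c curve; here we fall back to the
    raw count whenever N_{c+1} is zero."""
    N = sum(counts.values())
    n_c = Counter(counts.values())            # frequency-of-frequency table
    c_star = {}
    for item, c in counts.items():
        if n_c.get(c + 1):
            c_star[item] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            c_star[item] = c                   # no N_{c+1}: keep the raw count
    p_unseen = n_c.get(1, 0) / N               # mass reserved for zero counts
    return c_star, p_unseen

# Invented toy counts (these could just as well be bigram or trigram types).
counts = Counter({"a": 3, "b": 2, "c": 1, "d": 1, "e": 1})
print(good_turing(counts))
```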
Witten-Bell
Think about the occurrence of an unseen item (word, bigram, etc.) as an event.
The probability of such an event can be measured in a corpus by just looking at how often it happens.
Just take the single word case first.
Assume a corpus of N tokens and T types.
How many times was an as yet unseen type encountered?
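One way to answer (the usual Witten-Bell argument, not spelled out on the slide): each of the T types was itself unseen at the moment it first appeared, so "new type" events occurred T times among N + T events, giving a total unseen-event probability of

P(\text{unseen}) = \frac{T}{N + T}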
Witten-Bell
First compute the probability of an unseen event.
Then distribute that probability mass equally among the as yet unseen events. That should strike you as odd for a number of reasons: in the case of words… in the case of bigrams…
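In the single-word case this works out to the standard Witten-Bell unigram form (following Jurafsky and Martin), where Z is the number of vocabulary words with zero count:

p_i^* = \frac{T}{Z\,(N + T)} \quad \text{if } c_i = 0

p_i^* = \frac{c_i}{N + T} \quad \text{if } c_i > 0

Dividing the unseen mass T/(N + T) equally over the Z unseen types is exactly the "equal distribution" step above, and it is what the next slide argues is too crude for bigrams.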
Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous: P(x|the) vs P(x|going)
So distribute the mass assigned to the zero-count bigrams according to their promiscuity
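Concretely, the per-context (bigram) version of Witten-Bell conditions every quantity on the preceding word: with N(w_{i-1}) bigram tokens starting with w_{i-1}, T(w_{i-1}) distinct continuation types, and Z(w_{i-1}) vocabulary words that never follow it, the usual formulation (again as in Jurafsky and Martin) is

p^*(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} \quad \text{if } c(w_{i-1} w_i) > 0

p^*(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)} \quad \text{if } c(w_{i-1} w_i) = 0

This is how the mass handed to zero-count bigrams ends up depending on how promiscuous the conditioning word is.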
Witten-Bell
Finally, renormalize the whole table so that you still have a valid probability
Original BERP Counts;
Now the Add-1 counts
Witten-Bell Smoothed and Reconstituted
Add-One Smoothed BERP Reconstituted