Language Modeling 2
CS287
Review: Windowed Setup
Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase

[ the dog walks to the ]

it should score higher under s than

[ the dog house to the ]
[ the dog cats to the ]
[ the dog skips to the ]
...
Review: Continuous Bag-of-Words (CBOW)
ŷ = softmax(((∑_i x_i W^0) / (d_win − 1)) W^1)

- Attempt to predict the middle word.

[ the dog walks to the ]

Example: CBOW

x = (v(w_3) + v(w_4) + v(w_6) + v(w_7)) / (d_win − 1)

y = δ(w_5)

W^1 is no longer partitioned by row (order is lost).
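As a concrete illustration, here is a minimal numpy sketch of this predictor; the dimensions, random initialization, and names (W0, W1) are assumptions for the example, not the course's actual code.

```python
import numpy as np

V, d_emb, d_win = 10000, 50, 5
W0 = 0.1 * np.random.randn(V, d_emb)   # word embeddings, one row per type
W1 = 0.1 * np.random.randn(d_emb, V)   # output projection

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def cbow_predict(context_ids):
    """Average the context embeddings, then project to the vocabulary."""
    x = W0[context_ids].sum(axis=0) / (d_win - 1)   # (sum_i v(w_i)) / (d_win - 1)
    return softmax(x @ W1)             # distribution over the middle word

# [ the dog _ to the ]: predict w5 from w3, w4, w6, w7
y_hat = cbow_predict([3, 4, 6, 7])     # indices are stand-ins for word ids
```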
Review: Softmax Issues
Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑_{c∈C} exp(z_c)

log softmax(z) = z − log ∑_{c∈C} exp(z_c)

- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types.
- Note: the largest dataset is 6 billion words.
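A minimal sketch of the identity above with the standard log-sum-exp stabilization; the point to notice is that the normalizer touches every class in C, which is what makes a large vocabulary expensive.

```python
import numpy as np

def log_softmax(z):
    m = z.max()                                   # log-sum-exp trick
    return z - (m + np.log(np.exp(z - m).sum()))  # z - log sum_c exp(z_c)

z = np.random.randn(1_000_000)  # one logit per vocabulary type
lp = log_softmax(z)             # O(|C|) work for a single prediction
```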
Quiz
We have trained a depth-two balanced softmax tree. Give an algorithm and its time complexity for each of the following:

- Computing the optimal class decision.
- Greedily approximating the optimal class decision.
- Appropriately sampling a word.
- Computing the full distribution over classes.
Answers I
- Computing the optimal class decision, i.e. argmax_y p(y|x)?
  - Requires full enumeration, O(|V|):

    argmax_y p(y|x) = argmax_y p(y|C, x) p(C|x)

- Greedily approximating the optimal class decision?
  - Walk the tree, O(√|V|):

    c* = argmax_c p(C = c|x)
    y* = argmax_y p(y|C = c*, x)
Answers II
- Appropriately sampling a word, y ∼ p(y|x)?
  - Walk the tree, O(√|V|):

    c ∼ p(C|x)
    y ∼ p(y|C = c, x)

- Computing the full distribution over classes?
  - Enumerate, O(|V|):

    p(y = δ(w)|x) = p(C = c|x) p(y|C = c, x) for all words w (with c the class of w)
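To make the complexities concrete, here is a hypothetical two-level softmax with √|V| classes of √|V| words each; the random logits are stand-ins for a trained model's scores.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V = 10000
C = int(V ** 0.5)                                # 100 classes of 100 words
rng = np.random.default_rng(0)
class_logits = rng.standard_normal(C)            # scores for p(C|x)
word_logits = rng.standard_normal((C, V // C))   # scores for p(y|C, x)
p_class = softmax(class_logits)

def greedy_decision():                           # walk the tree: O(sqrt(|V|))
    c = p_class.argmax()
    return c, softmax(word_logits[c]).argmax()

def sample_word():                               # ancestral sampling: O(sqrt(|V|))
    c = rng.choice(C, p=p_class)
    return c, rng.choice(V // C, p=softmax(word_logits[c]))

def full_distribution():                         # exact p(y|x): O(|V|)
    per_class = np.apply_along_axis(softmax, 1, word_logits)
    return (p_class[:, None] * per_class).ravel()

assert abs(full_distribution().sum() - 1.0) < 1e-6
```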
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
Language Modeling Task
Given a sequence of text, give a probability distribution over the next word.

The Shannon game: estimate the probability of the next letter/word given the previous ones.
THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
READING LAMP ON THE DESK SHED GLOW ON
POLISHED
Language Modeling
Shannon (1948), A Mathematical Theory of Communication
We may consider a discrete source, therefore, to be
represented by a stochastic process. Conversely, any
stochastic process which produces a discrete sequence of
symbols chosen from a finite set may be considered a discrete
source. This will include such cases as:
1. Natural written languages such as English, German,
Chinese. ...
Shannon’s Babblers I
4. Third-order approximation (trigram structure as in English).
IN NO 1ST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE
5. First-Order Word Approximation. Rather than continue
with tetragram, ..., n-gram structure it is easier and better to
jump at this point to word units. Here words are chosen
independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR
COME CAN DIFFERENT NATURAL HERE HE THE A IN
CAME THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE MESSAGE HAD BE THESE.
Shannon’s Babblers II
6. Second-Order Word Approximation. The word transition
probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS
THAT THE TIME OF WHO EVER TOLD THE PROBLEM
FOR AN UNEXPECTED
The resemblance to ordinary English text increases quite
noticeably at each of the above steps.
Smoothness Image/Language
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. - Sherlock Holmes, A Scandal in Bohemia
Repair Image / Language (Chatterjee et al., 2009)
It is a capital mistake to theorize before one has . . .
Repair Image / Language (Chatterjee et al., 2009)
108 938 285 28 184 29 593 219 58 772 . . .
Language Modeling
Crucially important for:

- Speech Recognition
- Machine Translation
- Many deep learning applications:
  - Captioning
  - Dialogue
  - Summarization
  - ...
Statistical Model of Speech
How permanent are their records?
Statistical Model of Speech
Transcription: Help peppermint on their records
Statistical Model of Speech
Transcription: How permanent are their records
Language Modeling Formally
Goal: compute the probability of a sentence.

- Factorization:

  p(w_1, . . . , w_n) = ∏_{t=1}^{n} p(w_t | w_1, . . . , w_{t−1})

- Estimate the probability of the next word, conditioned on the prefix,

  p(w_t | w_1, . . . , w_{t−1}; θ)
Machine Learning Setup
Multi-class prediction problem,

(x_1, y_1), . . . , (x_n, y_n)

- y_i: the one-hot next word
- x_i: representation of the prefix (w_1, . . . , w_{t−1})

Challenges:

- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).
Problem Metric
Previously, we used accuracy as a metric. Language modeling uses a version of the average negative log-likelihood.

- For test data w_1, . . . , w_n:

  NLL = −(1/n) ∑_{i=1}^{n} log p(w_i | w_1, . . . , w_{i−1})

We actually report perplexity,

  perp = exp(−(1/n) ∑_{i=1}^{n} log p(w_i | w_1, . . . , w_{i−1}))

This requires modeling the full distribution, as opposed to the argmax (hinge loss).
Perplexity: Intuition
- Effective uniform distribution size.
- If words were uniform: perp = |V| = 10000.
- Using a unigram model: perp ≈ 400.
- log2 perp gives the average number of bits needed per word.
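A minimal sketch of the metric; the per-word probabilities below are made up for illustration.

```python
import math

def perplexity(word_probs):
    """word_probs[i] = p(w_i | w_1, ..., w_{i-1}) under the model."""
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

print(perplexity([1 / 10000] * 20))       # uniform over 10,000 types -> 10000.0
print(perplexity([0.1, 0.2, 0.05, 0.3]))  # sharper predictions -> lower perplexity
```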
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
n-gram Models
In practice the representation does not use the whole prefix (w_1, . . . , w_{t−1}):

p(w_i | w_1, . . . , w_{i−1}) ≈ p(w_i | w_{i−n+1}, . . . , w_{i−1}; θ)

- We call this an n-gram model.
- w_{i−n+1}, . . . , w_{i−1}: the context
Bigram Models
Back to count-based multinomial estimation.

- Feature set F: the previous word
- Input vector x is sparse
- Count matrix F ∈ R^{V×V}
- Counts come from the training data:

  F_{c,w} = ∑_i 1(w_{i−1} = c, w_i = w)

  p_ML(w_i = w | w_{i−1} = c; θ) = F_{c,w} / F_{c,•}
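A toy sketch of this estimator, with the sparse counts stored as nested dictionaries rather than a dense V × V matrix:

```python
from collections import defaultdict

F = defaultdict(lambda: defaultdict(int))   # F[c][w] = count of bigram (c, w)

def train(tokens):
    for prev, cur in zip(tokens, tokens[1:]):
        F[prev][cur] += 1

def p_ml(w, c):
    """p_ML(w | c) = F_{c,w} / F_{c,.}"""
    total = sum(F[c].values())
    return F[c][w] / total if total else 0.0

train("the dog walks to the park the dog runs".split())
print(p_ml("dog", "the"))   # 2/3: "the" is followed by dog, park, dog
```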
Trigram Models
- Feature set F: the previous two words (conjunction)
- Input vector x is sparse
- Count tensor F ∈ R^{V×V×V}

  F_{c,w} = ∑_i 1(w_{i−2:i−1} = c, w_i = w)

  p_ML(w_i = w | w_{i−2:i−1} = c; θ) = F_{c,w} / F_{c,•}
Notation
- c = w_{i−n+1:i−1}: the context
- c′ = w_{i−n+2:i−1}: the context without its first word
- c′′ = w_{i−n+3:i−1}: the context without its first two words
- [x]+ = max{0, x}: positive part (ReLU)
- F_{•,w} = ∑_c F_{c,w}: sum over a dimension
- Maximum likelihood:

  p_ML(w|c) = F_{c,w} / F_{c,•}

- Non-zero count indicator, for all c, w:

  N_{c,w} = 1(F_{c,w} > 0)
NGram Models
It is common to go up to 5-grams,

F_{c,w} = ∑_i 1(w_{i−5+1:i−1} = c, w_i = w)

The count matrix becomes very sparse at 5-grams.
Google 1T
Number of tokens                 1,024,908,267,229
Number of sentences                 95,119,665,584
Size compressed (counts only)                24 GB
Number of unigrams                      13,588,391
Number of bigrams                      314,843,401
Number of trigrams                     977,069,902
Number of fourgrams                  1,313,818,354
Number of fivegrams                  1,176,470,663
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
NGram Models
- Maximum likelihood models work terribly.
- NGram models are sparse; we need to handle unseen cases.
- The presentation follows the work of Chen and Goodman (1999).
Count Modifications
- Laplace smoothing:

  F_{c,w} = F_{c,w} + α for all c, w

- Good-Turing smoothing (Good, 1953):

  F_{c,w} = (F_{c,w} + 1) · hist(F_{c,w} + 1) / hist(F_{c,w}) for all c, w

  where hist(n) is the number of n-gram types observed exactly n times.
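A sketch of the Laplace case applied to bigram counts; the α value and toy counts are arbitrary.

```python
def p_laplace(w, c, F, vocab_size, alpha=0.1):
    """F is a dict of dicts: F[c][w] = count of bigram (c, w)."""
    row = F.get(c, {})
    return (row.get(w, 0) + alpha) / (sum(row.values()) + alpha * vocab_size)

F = {"the": {"dog": 2, "park": 1}}
print(p_laplace("cat", "the", F, vocab_size=10000))  # unseen, but non-zero
```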
Real Issues: N-Gram Sparsity

- Histogram of trigram counts [figure omitted]
Idea 1: Interpolation
For trigrams:

p_interp(w|c) = λ_1 p_ML(w|c) + λ_2 p_ML(w|c′) + λ_3 p_ML(w|c′′)

Ensure that the λs form a convex combination:

∑_i λ_i = 1 and λ_i ≥ 0 for all i
Idea 1: Interpolation (Jelinek-Mercer Smoothing)
This can be written recursively,

p_interp(w|c) = λ p_ML(w|c) + (1 − λ) p_interp(w|c′)

Ensure that λ forms a convex combination:

0 ≤ λ ≤ 1
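A sketch of the recursive form; p_ml is assumed to be a callable returning the maximum-likelihood estimate (0.0 for unseen events), and the single λ stands in for tuned values.

```python
def p_interp(w, context, p_ml, lam=0.7):
    """context is a tuple of preceding words; drop its first word at
    each level of the recursion (c, c', c'', ...)."""
    if not context:
        return p_ml(w, ())   # unigram base case
    return (lam * p_ml(w, context)
            + (1 - lam) * p_interp(w, context[1:], p_ml, lam))
```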
How to set parameters: Validation Tuning
- Treat the λs as hyper-parameters.
- Can estimate λ using EM (or gradients directly).
- Alternatively: grid search on held-out data.
- Can add more parameters λ(c, w).
- Language models are often stored with a prob and a backoff value (λ).
How to set parameters: Witten-Bell
Define λ(c, w) as a function of c and w:

(1 − λ) = N_{c,•} / (N_{c,•} + F_{c,•})

- N_{c,•}: the number of unique word types seen with context c

p_wb(w|c) = (F_{c,w} + N_{c,•} · p_wb(w|c′)) / (F_{c,•} + N_{c,•})

The interpolation weight estimates the probability of a new event as proportional to that of seeing a new word type after c. A sketch follows.
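Here is that sketch; F is assumed to map a context tuple to a dict of next-word counts, and p_lower is the already-estimated lower-order model.

```python
def p_wb(w, context, F, p_lower):
    row = F.get(context, {})            # counts F_{c,w} for this context
    F_total = sum(row.values())         # F_{c,.}
    N = len(row)                        # N_{c,.}: unique types seen after c
    if F_total + N == 0:
        return p_lower(w, context[1:])  # context never seen: back off entirely
    return (row.get(w, 0) + N * p_lower(w, context[1:])) / (F_total + N)
```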
Idea 2: Absolute Discounting
Similar in form to interpolation,

p_abs(w|c) = λ(w, c) p_ML(w|c) + η(c) p_abs(w|c′)

Delete counts and redistribute:

p_abs(w|c) = [F_{c,w} − D]+ / F_{c,•} + η(c) p_abs(w|c′)

We need, for a given c,

∑_w (λ(w, c) p_1(w) + η(c) p_2(w)) = ∑_w λ(w, c) p_1(w) + η(c) = 1

In-class: to ensure a valid distribution, what are λ and η with D = 1?
Answer
λ(w, c) = [F_{c,w} − D]+ / F_{c,w}

η(c) = 1 − ∑_w [F_{c,w} − D]+ / F_{c,•} = ∑_w (F_{c,w} − [F_{c,w} − D]+) / F_{c,•}

Total counts minus discounted counts:

η(c) = D × N_{c,•} / F_{c,•}
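A sketch of absolute discounting with the η(c) just derived; D = 0.75 is a typical but arbitrary choice here.

```python
def p_abs(w, context, F, p_lower, D=0.75):
    row = F.get(context, {})
    F_total = sum(row.values())                        # F_{c,.}
    if F_total == 0:
        return p_lower(w, context[1:])
    N = len(row)                                       # N_{c,.}: types with F_{c,w} > 0
    discounted = max(row.get(w, 0) - D, 0) / F_total   # [F_{c,w} - D]+ / F_{c,.}
    eta = D * N / F_total                              # mass freed by the discounts
    return discounted + eta * p_lower(w, context[1:])
```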
Lower-Order Smoothing Issue
Consider the example: San Francisco and Los Francisco.

We would like:

- p(w_{i−1} = San, w_i = Francisco) to be high
- p(w_{i−1} = Los, w_i = Francisco) to be very low

However, interpolation alone doesn't ensure this. Why? The lower-order term

(1 − λ) p(w_i = Francisco)

could be quite high, from seeing San Francisco many times.
Kneser-Ney Smoothing
- Uses absolute discounting instead of ML.
- However, we want the distribution to match the ML marginals on lower-order terms.
- Ensure this by marginalizing out and matching the lower-order distribution.
- Uses a different function, KN′, for the lower-order term.
- The most commonly used smoothing technique (details follow).
![Page 46: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/46.jpg)
Review: Chain-Rule and Marginalization
Chain rule:

p(X, Y) = p(X|Y) p(Y)

Marginalization:

p(X, Y) = ∑_{z∈Z} p(X, Y, Z = z)

Marginal matching constraint: two distributions can differ,

p_A(X, Y) ≠ p_B(X, Y)

while still matching a marginal,

∑_y p_A(X, Y = y) = p_B(X)
![Page 47: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/47.jpg)
Kneser-Ney Smoothing
Main idea: match the ML marginals for all c′,

∑_{c_1} p_KN(c_1, c′, w) = ∑_{c_1} p_KN(w|c) p_ML(c) = p_ML(c′, w)

Equivalently, in counts,

∑_{c_1} p_KN(w|c) F_{c,•} / F_{•,•} = F_{c′w,•} / F_{•,•}

where the context is c = [ c_1 c′ ] and the full ngram is [ c_1 c′ w ].
![Page 48: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/48.jpg)
Kneser-Ney Smoothing
Start from the absolute-discounting form,

p_KN(w|c) = [F_{c,w} − D]+ / F_{c,•} + η(c) p_KN′(w|c′)

and substitute into the marginal constraint:

F_{c′w,•} = ∑_{c_1} F_{c,•} p_KN(w|c)
          = ∑_{c_1} F_{c,•} ([F_{c,w} − D]+ / F_{c,•} + (D / F_{c,•}) N_{c,•} p_KN′(w|c′))
          = ∑_{c_1 : N_{c,w} > 0} (F_{c,w} − D) + ∑_{c_1} D N_{c,•} p_KN′(w|c′)
          = F_{c′,w} − N_{•c′,w} D + D N_{•c′,•} p_KN′(w|c′)
![Page 49: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/49.jpg)
Kneser-Ney Smoothing
The final equation for the unigram case:

p_KN′(w) = N_{•,w} / N_{•,•}

- Intuition: the probability of a unique bigram ending in w.

p_KN′(w|c′) = N_{•c′,w} / N_{•c′,•}

- Intuition: the probability of a unique ngram (with middle c′) ending in w.
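A sketch of the continuation unigram on a toy corpus; note that "francisco" ends up with a low continuation probability even though it is frequent, because it only ever follows "san".

```python
from collections import defaultdict

def kn_unigram(tokens):
    preceders = defaultdict(set)                 # distinct contexts before each word
    for prev, cur in zip(tokens, tokens[1:]):
        preceders[cur].add(prev)
    N_total = sum(len(s) for s in preceders.values())           # N_{.,.}
    return {w: len(s) / N_total for w, s in preceders.items()}  # N_{.,w} / N_{.,.}

corpus = "san francisco san francisco san francisco new york new jersey".split()
print(kn_unigram(corpus))
```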
![Page 50: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/50.jpg)
Modified Kneser-Ney
Set D based on F_{c,w}:

D =  0      if F_{c,w} = 0
     D_1    if F_{c,w} = 1
     D_2    if F_{c,w} = 2
     D_{3+} if F_{c,w} ≥ 3

- Modifications to the formula are in the paper.
![Page 51: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/51.jpg)
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
![Page 52: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/52.jpg)
Language Model Issues
1. LMs become very sparse.
   - Obviously cannot use dense matrices (|V| × |V| × |V| > 10^12).
2. LMs are very big.
3. Lookup speed is crucial.
![Page 53: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/53.jpg)
Trie Data structure
ROOT
├─ a
│  ├─ mouse (c7)
│  ├─ cat
│  │  └─ meows (c6)
│  └─ dog
│     └─ runs (c5)
└─ the
   ├─ mouse (c4)
   ├─ cat
   │  └─ runs (c3)
   └─ dog
      ├─ sleeps (c2)
      └─ barks (c1)
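A minimal sketch of such a trie (an illustration, not any toolkit's actual structure):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.count = 0       # count of the n-gram ending at this node

def insert(root, ngram, count):
    node = root
    for word in ngram:
        node = node.children.setdefault(word, TrieNode())
    node.count = count

def lookup(root, ngram):
    node = root
    for word in ngram:
        node = node.children.get(word)
        if node is None:
            return 0         # unseen n-gram
    return node.count

root = TrieNode()
insert(root, ("the", "dog", "barks"), 7)      # counts here are made up
print(lookup(root, ("the", "dog", "barks")))  # 7
```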
Issues?
![Page 54: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/54.jpg)
Reverse Trie Data structure
ROOT
├─ runs (c)
│  ├─ cat (c)
│  │  └─ the (c)
│  └─ dog (c)
│     └─ a (c)
├─ meows (c)
│  └─ cat (c)
│     └─ a (c)
├─ sleeps (c)
│  └─ dog (c)
│     └─ the (c)
├─ barks (c)
│  └─ dog (c)
│     └─ the (c)
└─ mouse (c)
   ├─ the (c)
   └─ a (c)
Used in several standard language modeling toolkits.
![Page 55: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/55.jpg)
Efficiency: Hash Tables
- KenLM finds it more efficient to hash ngrams directly.
- All ngrams of a given order are kept in a hash table.
- Fast linear-probing hash tables make this work well.
![Page 56: Language Modeling 2 - GitHub Pages · 2017-02-10 · Review: Softmax Issues Use a softmax to force a distribution, softmax(z) = exp(z) å c2C exp(z c) log softmax(z) = z log å c2C](https://reader033.fdocuments.in/reader033/viewer/2022042804/5f515cd6e5f918157102d8db/html5/thumbnails/56.jpg)
Quantization
- Memory issues are often a concern.
- Probabilities are kept in log space.
- KenLM quantizes from 32 bits down to smaller sizes (others use 8 bits).
- Quantization is done by sorting, binning, and averaging (sketched below).
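A sketch of sort/bin/average quantization under the assumption of equal-size bins; the exact binning a real toolkit uses may differ.

```python
import numpy as np

def quantize(values, bits=8):
    order = np.argsort(values)
    bins = np.array_split(order, 2 ** bits)         # equal-size bins over sorted values
    codebook = np.array([values[b].mean() for b in bins])
    codes = np.empty(len(values), dtype=np.uint16)  # small index instead of 32 bits
    for i, b in enumerate(bins):
        codes[b] = i
    return codes, codebook                          # values ~ codebook[codes]

logp = np.log(np.random.rand(100000))               # fake log-probabilities
codes, codebook = quantize(logp)
print(np.abs(codebook[codes] - logp).mean())        # small reconstruction error
```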
Finite State Automata
Conclusion
Today,

- Language modeling
- Various ideas in smoothing
- Tricks for efficient implementation

Next time,

- Neural network language models