ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized...
![Page 1: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/1.jpg)
[Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
[Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
2014/7/12 ACL Reading @ PFI
Nakatani Shuyo, Cybozu Labs Inc.
![Page 2: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/2.jpg)
Kneser-Ney Smoothing [Kneser+ 1995]
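• Discounting & Interpolation

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\ 0\}}{c(w_{i-n+1}^{i-1})} + \frac{D}{c(w_{i-n+1}^{i-1})}\, N_{1+}(w_{i-n+1}^{i-1}\,\cdot)\, P(w_i \mid w_{i-n+2}^{i-1})$$

• where $w_m^n = w_m \cdots w_n$ and $N_{1+}(w_m^n\,\cdot) = |\{w_i \mid c(w_m^n w_i) > 0\}|$
– $N_{1+}(w_{i-n+1}^{i-1}\,\cdot)$: number of discounts applied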
![Page 3: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/3.jpg)
Modified KN-Smoothing [Chen+ 1999]
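$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i}))}{c(w_{i-n+1}^{i-1})} + \gamma(w_{i-n+1}^{i-1})\, P(w_i \mid w_{i-n+2}^{i-1})$$

• where

$$D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases} \qquad \gamma(w_{i-n+1}^{i-1}) = \frac{[\text{amount of discounting}]}{c(w_{i-n+1}^{i-1})}$$

– Weighted discounting ($D_1$, $D_2$, $D_{3+}$ are estimated by leave-one-out cross-validation)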
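As a small illustration of the count-dependent discount, the sketch below computes $D(c)$ and the interpolation weight $\gamma$ for one context as the total discounted mass divided by the context count. The function names and the discount values in the example are hypothetical, not taken from [Chen+ 1999].

```python
def discount(c, d1, d2, d3plus):
    """Count-dependent discount D(c) of Modified KN smoothing."""
    if c == 0:
        return 0.0
    if c == 1:
        return d1
    if c == 2:
        return d2
    return d3plus


def gamma(follower_counts, d1, d2, d3plus):
    """Interpolation weight for one context.

    follower_counts holds c(u w) for every word w observed after context u;
    the weight is [amount of discounting] / c(u).
    """
    total = sum(follower_counts)
    discounted = sum(discount(c, d1, d2, d3plus) for c in follower_counts)
    return discounted / total


# Hypothetical discounts, chosen only to keep the arithmetic easy to follow.
counts = [1, 2, 5]
g = gamma(counts, d1=0.5, d2=1.0, d3plus=1.5)
kept = sum(c - discount(c, 0.5, 1.0, 1.5) for c in counts) / sum(counts)
```

Here `kept + g` equals 1, so the mass removed by discounting is exactly the mass handed to the lower-order model.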
![Page 4: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/4.jpg)
[Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
• When each sentence has a fractional weight
– Domain adaptation
– EM algorithm on word alignment
• Proposes KN smoothing using expected fractional counts
I’m interested in it!
![Page 5: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/5.jpg)
Model
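• $\boldsymbol{u}$ means $w_{i-n+1}^{i-1}$, and $\boldsymbol{u}'$ means $w_{i-n+2}^{i-1}$
• A sequence $\boldsymbol{u}w$ occurs $k$ times, and each occurrence has a probability $p_i$ ($i = 1, \cdots, k$) as its weight,
• then the count $c(\boldsymbol{u}w)$ follows a Poisson binomial distribution.
• $p(c(\boldsymbol{u}w) = r) = s(k, r)$, where

$$s(k, r) = \begin{cases} 1 & \text{if } k = r = 0 \\ s(k-1, r)\,(1 - p_k) + s(k-1, r-1)\,p_k & \text{if } 0 \le r \le k \\ 0 & \text{otherwise} \end{cases}$$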
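The recurrence for $s(k, r)$ can be evaluated bottom-up, one occurrence at a time. The following is a minimal sketch under that reading; the function name `poisson_binomial_pmf` is an assumption for illustration.

```python
def poisson_binomial_pmf(weights):
    """Return [p(c = 0), ..., p(c = k)] for k weighted occurrences.

    weights[i] is the probability that occurrence i+1 really counts.
    After processing the first m weights, dist[r] equals s(m, r).
    """
    dist = [1.0]  # s(0, 0) = 1
    for p in weights:
        nxt = [0.0] * (len(dist) + 1)
        for r, mass in enumerate(dist):
            nxt[r] += mass * (1.0 - p)  # this occurrence is not counted
            nxt[r + 1] += mass * p      # this occurrence is counted
        dist = nxt
    return dist


pmf = poisson_binomial_pmf([0.5, 0.5])
```

With two occurrences of weight 0.5 each, the count is 0, 1 or 2 with probabilities 0.25, 0.5 and 0.25. The loop takes $O(k^2)$ time for $k$ occurrences.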
![Page 6: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/6.jpg)
MLE on this model
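• Expectations
– $\mathbb{E}[c(\boldsymbol{u}w)] = \sum_r r \cdot p(c(\boldsymbol{u}w) = r)$
– $\mathbb{E}[N_r(\boldsymbol{u}\,\cdot)] = \sum_w p(c(\boldsymbol{u}w) = r)$
– $\mathbb{E}[N_{r+}(\boldsymbol{u}\,\cdot)] = \sum_w p(c(\boldsymbol{u}w) \ge r)$
• Maximize the (expected) likelihood
– $\mathbb{E}[L] = \mathbb{E}\left[\sum_{\boldsymbol{u}w} c(\boldsymbol{u}w) \log p(w \mid \boldsymbol{u})\right] = \sum_{\boldsymbol{u}w} \mathbb{E}[c(\boldsymbol{u}w)] \log p(w \mid \boldsymbol{u})$
– obtain $p_{\mathrm{MLE}}(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[c(\boldsymbol{u}w)]}{\mathbb{E}[c(\boldsymbol{u}\,\cdot)]}$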
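For a Poisson binomial count, $\mathbb{E}[c(\boldsymbol{u}w)]$ is simply the sum of the occurrence weights, so the MLE above reduces to a ratio of weight sums. A minimal sketch for bigrams follows; the input layout (a dict from `(u, w)` to a list of weights) and the function name are assumptions made for this example, and every context is assumed to have a positive total weight.

```python
from collections import defaultdict


def mle_from_weighted_bigrams(occurrences):
    """occurrences maps (u, w) to the list of weights p_i of its occurrences."""
    # E[c(u w)] = sum of the occurrence weights
    e_c = {uw: sum(ps) for uw, ps in occurrences.items()}
    # E[c(u .)] = sum over w of E[c(u w)]
    e_ctx = defaultdict(float)
    for (u, _), value in e_c.items():
        e_ctx[u] += value
    return {(u, w): value / e_ctx[u] for (u, w), value in e_c.items()}


occ = {("the", "cat"): [1.0, 0.5], ("the", "dog"): [0.5]}
p_mle = mle_from_weighted_bigrams(occ)
```

In this toy input, "the cat" has expected count 1.5 and "the dog" 0.5, giving probabilities 0.75 and 0.25.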
![Page 7: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/7.jpg)
Expected Kneser-Ney
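• $\tilde{c}(\boldsymbol{u}w) = \max\{0,\ c(\boldsymbol{u}w) - D\} + N_{1+}(\boldsymbol{u}\,\cdot)\, D\, p'(w \mid \boldsymbol{u}')$
• So, $\mathbb{E}[\tilde{c}(\boldsymbol{u}w)] = \mathbb{E}[c(\boldsymbol{u}w)] - p(c(\boldsymbol{u}w) > 0)\, D + \mathbb{E}[N_{1+}(\boldsymbol{u}\,\cdot)]\, D\, p'(w \mid \boldsymbol{u}')$
– where $p'(w \mid \boldsymbol{u}') = \dfrac{\mathbb{E}[N_{1+}(\cdot\,\boldsymbol{u}'w)]}{\mathbb{E}[N_{1+}(\cdot\,\boldsymbol{u}'\,\cdot)]}$
• then $p(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[\tilde{c}(\boldsymbol{u}w)]}{\mathbb{E}[\tilde{c}(\boldsymbol{u}\,\cdot)]}$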
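To make the expected smoothed count concrete, here is a rough sketch for the bigram case only, where $\boldsymbol{u}'$ is empty and $p'(w)$ is a ratio of expected type counts. It is not the authors' implementation: the single discount value, the input layout and the function name are assumptions, the lowest-order distribution is left unsmoothed, and at least one weight is assumed to be positive.

```python
from collections import defaultdict
from math import prod


def expected_kn_bigram(occurrences, discount=0.75):
    """occurrences maps (u, w) to a list of occurrence weights in [0, 1].

    Returns {u: {w: p(w | u)}} over every w seen as a second word.
    """
    e_c = {uw: sum(ps) for uw, ps in occurrences.items()}  # E[c(u w)]
    # p(c(u w) > 0) = 1 - probability that no occurrence counts
    p_pos = {uw: 1.0 - prod(1.0 - p for p in ps)
             for uw, ps in occurrences.items()}

    n1_ctx = defaultdict(float)   # E[N1+(u .)]
    n1_word = defaultdict(float)  # E[N1+(. w)]
    for (u, w), q in p_pos.items():
        n1_ctx[u] += q
        n1_word[w] += q
    n1_all = sum(n1_word.values())  # E[N1+(. .)]

    model = {}
    for u in n1_ctx:
        smoothed = {}
        for w in n1_word:
            lower = n1_word[w] / n1_all  # p'(w | u') with empty u'
            smoothed[w] = (e_c.get((u, w), 0.0)
                           - discount * p_pos.get((u, w), 0.0)
                           + n1_ctx[u] * discount * lower)
        total = sum(smoothed.values())  # E[c~(u .)]
        model[u] = {w: value / total for w, value in smoothed.items()}
    return model


occ = {("the", "cat"): [1.0, 0.5], ("the", "dog"): [0.5], ("a", "dog"): [0.8]}
model = expected_kn_bigram(occ)
```

The bigram "a cat" never occurs in the toy input, yet `model["a"]["cat"]` is positive because the discounted mass is redistributed through $p'$.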
![Page 8: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/8.jpg)
Language model adaptation
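• Our corpus consists of
– large general-domain data and
– small specific-domain data
• Weight of a sentence $\boldsymbol{w}$:
– $p(\boldsymbol{w} \text{ is in-domain}) = \dfrac{1}{1 + \exp(-H(\boldsymbol{w}))}$
– where $H(\boldsymbol{w}) = \dfrac{\log p_{\mathrm{in}}(\boldsymbol{w}) - \log p_{\mathrm{out}}(\boldsymbol{w})}{|\boldsymbol{w}|}$
– $p_{\mathrm{in}}$: in-domain language model, $p_{\mathrm{out}}$: out-of-domain language model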
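The sentence weight is a logistic function of the per-word log-probability difference between the two language models. A minimal sketch, assuming the two sentence log-probabilities have already been computed by some language models; the function name and arguments are hypothetical.

```python
import math


def in_domain_weight(logp_in, logp_out, length):
    """p(sentence is in-domain) from sentence log-probabilities.

    logp_in / logp_out: log-probability of the sentence under the in-domain
    and out-of-domain models; length: number of words in the sentence.
    """
    h = (logp_in - logp_out) / length
    # Two algebraically equal forms of 1 / (1 + exp(-h)); each avoids overflow.
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)


neutral = in_domain_weight(-10.0, -10.0, 5)
likely_in = in_domain_weight(-8.0, -12.0, 4)
```

A sentence that both models score equally gets weight 0.5, and a sentence the in-domain model prefers gets a weight above 0.5.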
![Page 9: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/9.jpg)
• Figure 1: On the language model adaptation task, expected KN outperforms all other methods across all sizes of selected subsets. Integral KN is applied to unweighted instances, while fractional WB, fractional KN and expected KN are applied to weighted instances. (via [Zhang+ ACL2014])
– Subsets are selected from general-domain data
– In-domain data: training 54k, testing 3k
– Values marked on the plot: 192, 162, 156, 148
Why isn't Modified KN included as a baseline?
![Page 10: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/10.jpg)
[Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
• Higher-order n-grams are very sparse
– Especially noticeable on small data (e.g. domain-specific data!)
• Improves performance on small data by combining skipped n-grams with Modified KN smoothing
– Perplexity is reduced by 25.7% for very small training data of only 736KB of text
![Page 11: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/11.jpg)
“Generalized Language Models”
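• $\partial_3 w_1 w_2 w_3 w_4 = w_1 w_2\, \_\, w_4$
– “_” means a word placeholder

$$P_{\mathrm{GLM}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i}))}{c(w_{i-n+1}^{i-1})} + \gamma_{\mathrm{high}}(w_{i-n+1}^{i-1}) \sum_{j=1}^{n-1} \frac{1}{n-1}\, \hat{P}_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1})$$

$$\hat{P}_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1}) = \frac{N_{1+}(\partial_j w_{i-n}^{i}) - D(c(\partial_j w_{i-n+1}^{i}))}{N_{1+}(\partial_j w_{i-n+1}^{i-1}\, \ast)} + \gamma_{\mathrm{mid}}(\partial_j w_{i-n+1}^{i-1}) \sum_{k=1,\, k \neq j}^{n-1} \frac{1}{n-2}\, \hat{P}_{\mathrm{GLM}}(w_i \mid \partial_j \partial_k w_{i-n+1}^{i-1})$$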
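The skip operator $\partial_j$ and the set of lower-order contexts that each interpolation step averages over can be sketched as below. This only enumerates contexts; it does not compute any probabilities. The function names and the tuple representation are assumptions for illustration, and a real word equal to the placeholder string would be misread as a skip.

```python
def skip(context, j, placeholder="_"):
    """Apply the skip operator: replace the j-th word (1-based) by a placeholder."""
    if not 1 <= j <= len(context):
        raise ValueError("j out of range")
    return context[:j - 1] + (placeholder,) + context[j:]


def skipped_contexts(context, placeholder="_"):
    """All contexts obtained by skipping one position that still holds a word."""
    return [skip(context, j, placeholder)
            for j, word in enumerate(context, start=1)
            if word != placeholder]


example = skip(("w1", "w2", "w3", "w4"), 3)
first_level = skipped_contexts(("a", "b", "c"))
second_level = skipped_contexts(first_level[0])
```

For a context of $n - 1 = 3$ words, the first step yields three skipped contexts (weight $1/(n-1)$ each), and each of those yields two further contexts (weight $1/(n-2)$ each), matching the sums over $j$ and over $k \neq j$.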
![Page 12: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/12.jpg)
• The bold arrows correspond to interpolation of models in traditional modified Kneser-Ney smoothing. The lighter arrows illustrate the additional interpolations introduced by our generalized language models. (via [Pickhardt+ ACL2014])
![Page 13: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/13.jpg)
• Shrunk training data sets for the English Wikipedia
Small domain-specific data
![Page 14: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/14.jpg)
Space Complexity
model size = 9.5GB # of entries = 427M
model size = 15GB # of entries = 742M
![Page 15: ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Comination of Skipped n-grams and Modified Kneser-Ney Smoothing"](https://reader034.fdocuments.in/reader034/viewer/2022042613/5463d72ab4af9f623f8b46e1/html5/thumbnails/15.jpg)
References
• [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
• [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
• [Kneser+ 1995] Improved backing-off for m-gram language modeling
• [Chen+ 1999] An Empirical Study of Smoothing Techniques for Language Modeling