Post on 11-Apr-2018
Lecture 4: n-grams in NLP
LING 1330/2330: Introduction to Computational Linguistics
Na-Rae Han
Objectives
Frequent n-grams in English
n-grams and statistical NLP
n-grams and conditional probability
Large n-gram resources
2/2/2017 2
For fun: most frequent bigrams?
2551888 of the
1887475 in the
1041011 to the
861798 on the
676658 and the
648408 to be
578806 for the
561171 at the
498217 in a
479627 do n't
455367 with the
451460 from the
443547 of a
395939 that the
362176 is a
361879 going to
335255 by the
330828 as a
319846 with a
317431 I think
Source: http://www.ngrams.info/download_coca.asp
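A frequency table like the one above can be built in a few lines of Python. A minimal sketch using only the standard library (the toy sentence and whitespace tokenization are my own illustration, not how COCA's list was produced):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each adjacent pair of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy example; a real list like the one above comes from a large corpus.
tokens = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(tokens)
for pair, n in counts.most_common(3):
    print(n, *pair)
```

The most frequent pair prints first, here `2 the cat`; on a real corpus the top of the list looks like the "of the" / "in the" ranking above.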
Most frequent trigrams?
198630 I do n't
140305 one of the
129406 a lot of
117289 the United States
79825 do n't know
76782 out of the
75015 as well as
73540 going to be
61373 I did n't
61132 to be a
Source: http://www.ngrams.info/download_coca.asp
n-grams and statistical NLP
You have a good intuition as a native speaker.
Beyond intuition, it is possible to obtain a highly detailed & accurate set of n-gram statistics.
How? Through corpus data.
Corpus-sourced, large-scale n-grams are one of the biggest contributors to the recent advancement of statistical natural language processing (NLP) technologies.
Used for: spelling correction, machine translation, speech recognition, information extraction...
JUST ABOUT ANY NLP APPLICATION
n-grams vs. conditional probability
Suppose 'is' is the current word. What is the most likely next word?
How likely are 'you' and 'your' as the next word?
Questions of conditional probability
Can be answered through n-gram data
Unigram count (source: http://norvig.com/ngrams/count_1w.txt):
'is' occurs 4,705,743,816 times (1)
Bigram counts (source: http://norvig.com/ngrams/count_2w.txt):
is a (2) 476718990
is the 306482559
is not 276753375
is an 98762170
is to 97276807
…
is your (3) 17051576
…
is you (4) 1826931
'a' is the most likely next word with (2) / (1) = 0.10 probability.
'your' as the next word has (3) / (1) = 0.0036 probability.
'you' as the next word has (4) / (1) = 0.000388 probability.
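The conditional probabilities above are just ratios of counts. A minimal sketch, with the counts copied from this slide (the function name is mine):

```python
# Unigram and bigram counts quoted above, from Norvig's
# count_1w.txt and count_2w.txt.
COUNT_IS = 4_705_743_816            # (1): occurrences of 'is'
BIGRAMS = {
    "a":    476_718_990,            # (2): 'is a'
    "your":  17_051_576,            # (3): 'is your'
    "you":    1_826_931,            # (4): 'is you'
}

def p_next(word):
    """P(word | 'is') = count('is' word) / count('is')."""
    return BIGRAMS[word] / COUNT_IS

for w in ("a", "your", "you"):
    print(f"P({w} | is) = {p_next(w):.6f}")
```

This reproduces the figures on the slide: about 0.10 for 'a', 0.0036 for 'your', and 0.000388 for 'you'.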
Extremely large
"All our N-gram are Belong to You"
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Web 1T 5-Gram, released in August 2006 through LDC (Linguistic Data Consortium)
1- to 5-grams
Compiled from 1 trillion words of running web text
24 GB of compressed text
Source of Norvig's 1- and 2-gram frequency lists
Publication of this data triggered huge advances in NLP technologies and applications.
Even larger
Google Books Ngram Corpus
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Basis for Google Books Ngram Viewer
1- to 5-grams
Freely downloadable (for those who can)
Compiled from over 5 million books, published up to 2008
Data includes publication dates; good for charting historical trends
Books were digitized using OCR
In multiple languages
American/British English, Chinese, French, German, Hebrew, Italian, Russian, Spanish
Large-ish
COCA n-gram lists
http://www.ngrams.info/download_coca.asp
Word 2- to 5-grams, each list containing the top ~1 million entries
Based on COCA (The Corpus of Contemporary American English) (http://corpus.byu.edu/coca/), 520 million words as of Jan 2017
COCA's full unigram list is not free.
COCA's top 5000 words/lemmas
http://www.wordfrequency.info/free.asp
Contains lemma and POS of top 5,000 words
Excerpted, manageable
"Natural Language Corpus Data", a chapter in the book Beautiful Data
by Peter Norvig
http://norvig.com/ngrams/
Has lists of large-scale English n-gram data at the character level (1- & 2-grams) and the word level (1-, 2-, and 3-grams)
Data derived/excerpted from Google Web 1T 5-Gram corpus
¼ million most frequent bigrams
Google's original data contains 315 million bigrams
1-grams/word list: Norvig vs. ENABLE
count_1w.txt (top of list):
the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709
is 4705743816
on 3750423199
that 3400031103
by 3350048871
this 3228469771
with 3183110675
i 3086225277
count_1w.txt (bottom of list):
goofel 12711
gooek 12711
gooddg 12711
gooblle 12711
gollgo 12711
golgw 12711
enable1.txt (first and last entries):
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardvarks aardwolf aardwolves aargh aarrgh
…
zymotic zymurgies zymurgy zyzzyva zyzzyvas
Total # of entries: 333K (Norvig) vs. 173K (ENABLE)
Usefulness?
Overlap?
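One way to answer the overlap question is to load each file into a set and intersect. A sketch assuming the file formats shown above (count_1w.txt: word then count per line; enable1.txt: one word per line); the function names are mine:

```python
def words_from_count_lines(lines):
    """First whitespace-separated field of each nonempty line
    (count_1w.txt format: word, then count)."""
    return {line.split()[0] for line in lines if line.strip()}

def words_from_plain_lines(lines):
    """One word per nonempty line (enable1.txt format)."""
    return {line.strip() for line in lines if line.strip()}

# Usage, assuming both files have been downloaded locally:
# norvig = words_from_count_lines(open("count_1w.txt"))
# enable = words_from_plain_lines(open("enable1.txt"))
# print("shared:", len(norvig & enable))
# print("Norvig only:", len(norvig - enable))   # web junk like 'goofel'
# print("ENABLE only:", len(enable - norvig))   # rare dictionary words
```

The set differences make each list's character visible: Norvig's web-derived list contains typos and junk tokens absent from ENABLE, while ENABLE contains rare but legitimate dictionary words too infrequent for the web list's cutoff.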
2-grams: Norvig vs. COCA
count_2w.txt (Norvig):
you get 25183570
you getting 430987
you give 3512233
you go 8889243
you going 2100506
you gone 210111
you gonna 416217
you good 441878
you got 4699128
you gotta 668275
you graduate 117698
you grant 103633
you great 450637
you grep 120367
you grew 102321
you grow 398329
you guess 186565
you guessed 295086
you guys 5968988
you had 7305583
you hand 120379
you handle 336799
you hang 144949
you happen 627632
you happy 603963
w2_.txt (COCA):
39509 you get
30 you gets
31 you gettin
861 you getting
263 you girls
24 you git
5690 you give
138 you given
169 you giving
182 you glad
46 you glance
23594 you go
70 you god
54 you goddamn
115 you goin
9911 you going
1530 you gon
262 you gone
444 you good
25 you google
19843 you got
Compiled from: 1 trillion words
vs. 500 million words
2-grams: Norvig vs. COCA
Total # of entries: ¼ million (Norvig) vs. 1 million (COCA)
Not Google's fault: Norvig took only the top 0.1% of Web 1T's 315 million bigrams.
Usefulness?
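Because the two lists come from corpora of very different sizes (~1 trillion vs. ~500 million words), raw counts are not directly comparable; normalizing to a rate per million words puts them on the same scale. A sketch using the 'you get' counts shown above (the function name is mine):

```python
def per_million(count, corpus_size):
    """Normalize a raw count to a rate per million words."""
    return count / corpus_size * 1_000_000

# 'you get' counts from the two lists above
norvig_rate = per_million(25_183_570, 1_000_000_000_000)  # Web 1T: ~1 trillion words
coca_rate   = per_million(39_509, 500_000_000)            # COCA list: ~500 million words
print(f"'you get': {norvig_rate:.1f} vs {coca_rate:.1f} per million words")
```

Even after normalization the rates differ (about 25 vs. 79 per million), a reminder that the two corpora also differ in genre and in how the text was collected and cleaned.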
Know your data
When using publicly available resources, you must evaluate and understand the data.
Origin?
Domain & genre?
Size?
Traits?
Merits and limitations?
Fit with your project?