N-gram Search Engine on Wikipedia Satoshi Sekine (NYU) Kapil Dalwani (JHU)
July 30th, 2009 Lexical Knowledge from Ngrams
N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU), Kapil Dalwani (JHU)
Hammer : Fast and multi-functional n-gram search engine
ngrams
Search ngram:
FAST
INPUT: token, POS, chunk, NE
OUTPUT: frequency to text
Characteristics
• Search up to 7-grams with wildcards
• Multi-level input
  – Token, POS, chunk, NE, combinations
  – NOT, OR for POS, chunk, NE
• Multi-level output
  – Token, POS, chunk, NE
  – Document information
  – Original sentences, KWIC, n-gram
• Display
  – Show the results in order of frequency
• Running environment
  – Single CPU, PC-Linux, 400MB process, 500GB disk
Demo
• http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2
Available for you
• Web system
  – At NYU: http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB hard drive
Implementation: Overview
• Data
  – Wikipedia text
  – Wikipedia POS, chunk, NE
  – N-gram data
• Indexes
  – Inverted index for n-gram data
  – Suffix array for text
  – POS, chunk, NE for n-gram data
• Processing of a search request
  1. Search candidates
  2. Filtering
  3. Display
Implementation: Overview (step 1: Search candidates)
From n-gram to Inverted Index
• Example: 3-grams

  Ngram ID   Position=1   Position=2   Position=3
  1          A            B            C
  2          A            B            B
  3          B            A            C

• Posting list

  A pos=1: 1, 2
  A pos=2: 3
  B pos=1: 3
  B pos=2: 1, 2
  B pos=3: 2
  C pos=3: 1, 3
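The conversion from the 3-gram table to posting lists can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `build_positional_index` and the in-memory dictionary are hypothetical and are not taken from the Hammer implementation.

```python
from collections import defaultdict


def build_positional_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs.

    N-gram IDs and positions are 1-based, as in the slide's table.
    """
    index = defaultdict(list)
    for ngram_id, tokens in enumerate(ngrams, start=1):
        for pos, token in enumerate(tokens, start=1):
            # IDs are visited in increasing order, so each list stays sorted.
            index[(token, pos)].append(ngram_id)
    return dict(index)


# The three 3-grams from the example table.
ngrams = [("A", "B", "C"), ("A", "B", "B"), ("B", "A", "C")]
index = build_positional_index(ngrams)

for (token, pos), ids in sorted(index.items()):
    print(f"{token} pos={pos}: {ids}")
```

Keying the lists on the pair (token, position) rather than on the token alone is what lets a request such as (A B C) be answered by intersecting one list per position.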
Posting list
• Wide variation in posting list size (7-grams: 1.27B)
  – “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  – conscipcuous, consiety, Mizuk (1 each)
• 3 types for faster speed and smaller index size
  – Bitmap (freq > 1%): for #EOS#, 1.27B bits (bitmap) <-> 3.2B bits (list)
  – List of ngramIDs
  – Encoded into pointer (freq = 1)
• Examples
  – List: C pos=3 -> 1, 3
  – Bitmap: 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1
  – Pointer: C pos=3 -> 5
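The choice among the three posting list types can be sketched as follows. The 1% threshold and the frequency-1 case come from the slide; the function names, the tuple encoding, and the byte layout of the bitmap are assumptions made for illustration. The slide's numbers are consistent with 32-bit IDs: 100,906,888 IDs × 32 bits is about 3.2B bits, against one bit per 7-gram (1.27B bits) for the bitmap.

```python
def encode_posting_list(ngram_ids, total_ngrams, bitmap_threshold=0.01):
    """Pick one of three representations for a sorted list of 1-based IDs.

    Returns a (kind, payload) pair; this tuple layout is hypothetical.
    """
    if len(ngram_ids) == 1:
        # Frequency 1: store the single ID where a pointer to a list would go.
        return ("pointer", ngram_ids[0])
    if len(ngram_ids) / total_ngrams > bitmap_threshold:
        # Frequent token: one bit per n-gram ID.
        bits = bytearray((total_ngrams + 7) // 8)
        for ngram_id in ngram_ids:
            i = ngram_id - 1
            bits[i // 8] |= 1 << (i % 8)
        return ("bitmap", bytes(bits))
    return ("list", list(ngram_ids))


def contains(encoded, ngram_id):
    """Membership test that works for all three representations."""
    kind, payload = encoded
    if kind == "pointer":
        return payload == ngram_id
    if kind == "bitmap":
        i = ngram_id - 1
        return 0 <= i < 8 * len(payload) and bool((payload[i // 8] >> (i % 8)) & 1)
    return ngram_id in payload


rare = encode_posting_list([5], total_ngrams=1000)
common = encode_posting_list(list(range(1, 200, 2)), total_ngrams=1000)
medium = encode_posting_list([3, 700], total_ngrams=1000)
print(rare[0], common[0], medium[0])
```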
Search
• Given an n-gram request (A B C)
  – Get posting lists for A, B and C
  – Search intersections of posting lists
  – Use “look ahead” to speed up the search
• Look-ahead size = sqrt(size of posting list), Moffat and Zobel (1996)
• Example (SKIP: jump ahead in the longer list)

  4 33 34 55 76 80 89 92 99
  4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
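The intersection with look-ahead can be illustrated with the two lists from the slide. The sketch below is a simplified version of the idea, not the Hammer code: the function name is hypothetical, and the skip is computed on the fly from the list length rather than stored in the index.

```python
import math


def intersect_with_lookahead(short, long):
    """Intersect two sorted ID lists, skipping through the longer one.

    The look-ahead distance is the integer square root of len(long).
    """
    skip = max(1, math.isqrt(len(long)))
    result = []
    j = 0
    for target in short:
        # Jump ahead while the element `skip` positions away is not past target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the current block.
        while j < len(long) and long[j] < target:
            j += 1
        if j == len(long):
            break
        if long[j] == target:
            result.append(target)
    return result


a = [4, 33, 34, 55, 76, 80, 89, 92, 99]
b = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
common_ids = intersect_with_lookahead(a, b)
print(common_ids)
```

For a three-token request, the same function can be applied pairwise, starting from the two shortest posting lists so that the intermediate result stays small.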
Implementation: Overview (step 2: Filtering)
Filtering
• Not all candidate ngramIDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
  – Reduces the index by more than 200GB
• Figure: n-gram “A B” (Freq=123) with labels NN, VB, PERSON, LOC and frequencies Freq=10, Freq=5
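The filtering step can be sketched as a check of each candidate n-gram against the POS constraints of the request, with tags stored as small integer IDs. Everything concrete here is assumed for illustration: the tag table `POS_IDS`, the `ANNOTATIONS` store, the constraint format, and the frequencies are hypothetical and do not reflect Hammer's actual data layout.

```python
# Hypothetical tag-to-ID table; storing IDs instead of strings keeps the index small.
POS_IDS = {"NN": 0, "VB": 1, "JJ": 2}

# Hypothetical store: n-gram ID -> list of (POS-ID pattern, frequency).
ANNOTATIONS = {
    1: [((0, 1), 10), ((0, 0), 5)],
    2: [((2, 0), 7)],
}


def matches(constraint, tag_id):
    """A constraint is None (wildcard) or a pair (negated, set of tag IDs).

    A set with several IDs expresses OR; negated=True expresses NOT.
    """
    if constraint is None:
        return True
    negated, tag_ids = constraint
    return (tag_id in tag_ids) != negated


def filter_candidates(candidate_ids, constraints):
    """Keep candidates with at least one matching pattern; sum their frequencies."""
    results = []
    for ngram_id in candidate_ids:
        freq = sum(
            f
            for tags, f in ANNOTATIONS.get(ngram_id, [])
            if all(matches(c, t) for c, t in zip(constraints, tags))
        )
        if freq > 0:
            results.append((ngram_id, freq))
    # Most frequent first, as in the display step.
    return sorted(results, key=lambda r: -r[1])


# Request: position 1 is NN, position 2 is NN OR VB.
nn_then_nn_or_vb = [(False, {POS_IDS["NN"]}), (False, {POS_IDS["NN"], POS_IDS["VB"]})]
# Request: position 1 is NOT NN, position 2 is unconstrained.
not_nn_then_any = [(True, {POS_IDS["NN"]}), None]

print(filter_candidates([1, 2], nn_then_nn_or_vb))
print(filter_candidates([1, 2], not_nn_then_any))
```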
Implementation: Overview (step 3: Display)
Display
• N-grams are displayed in descending order of frequency
  – N-gram IDs are ordered by frequency
• Sentences are searched using the suffix array
• POS, chunk, NE are displayed with sentence, KWIC, n-gram
• Doc ID, title of the Wikipedia article (and possibly features of the doc) are displayed with sentences and KWIC
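Looking up sentences through a suffix array can be sketched on a toy token sequence. This is a minimal illustration, assuming a token-level suffix array held in memory; the function names are hypothetical, and the naive construction below would not scale to the 1.7G-word Wikipedia text.

```python
def build_suffix_array(tokens):
    """Token-level suffix array: start positions sorted by the suffix they begin.

    Naive construction, suitable for small examples only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, phrase):
    """Return the sorted start positions of `phrase`, using two binary searches."""
    n = len(phrase)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is >= phrase
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is > phrase
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def kwic(tokens, start, length, width=3):
    """Format a keyword-in-context line with `width` tokens on each side."""
    left = " ".join(tokens[max(0, start - width):start])
    key = " ".join(tokens[start:start + length])
    right = " ".join(tokens[start + length:start + length + width])
    return f"{left} [{key}] {right}".strip()


tokens = "the cat sat on the mat near the cat".split()
suffix_array = build_suffix_array(tokens)
hits = find_occurrences(tokens, suffix_array, ["the", "cat"])
for start in hits:
    print(kwic(tokens, start, 2))
```

All occurrences of a phrase are adjacent in the suffix array, so two binary searches are enough to locate every matching position, which can then be mapped back to its sentence and document.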
Size of data
• Components shown in the figure
  – Wikipedia text
  – Wikipedia POS, chunk, NE
  – N-gram data
  – Inverted index for n-gram data
  – Suffix array for text
  – POS, chunk, NE for n-gram data
  – Others
• Sizes (in the order extracted from the figure): 108 GB, 6 GB, 8 GB, 8 GB, 260 GB, 100 GB; Others: 40 GB
• Total: 530 GB
• Text: 1.7G words, 200M sentences, 2.4M articles
• N-grams: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B
Future Work
• Other information (e.g. parse, coref, relation, genre, discourse…)
• Longer n-grams
• Compress index, dictionary
• Ease the indexing load
  – Now we need a big-memory machine
  – Distributed indexing
• Union operation for tokens
Available for you
• Web demo
  – At NYU: http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB hard drive