N-gram Search Engine on Wikipedia

Satoshi Sekine (NYU), Kapil Dalwani (JHU)

July 30th, 2009

Lexical Knowledge from Ngrams


Hammer: Fast and multi-functional n-gram search engine

• Search n-grams: FAST
• INPUT: token, POS, chunk, NE
• OUTPUT: frequency to text


Characteristics

• Search up to 7-grams with wildcards
• Multi-level input
– Token, POS, chunk, NE, combinations
– NOT, OR for POS, chunk, NE
• Multi-level output
– Token, POS, chunk, NE
– Document information
– Original sentences, KWIC, ngram
• Display
– Show the results in order of frequency
• Running environment
– Single CPU, PC-Linux, 400MB process, 500GB disk
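The multi-level input listed above (tokens, wildcards, and NOT/OR constraints on POS) can be pictured as one constraint per n-gram position. The following is a minimal sketch under that assumption; the `Slot` class, its field names, and `ngram_matches` are hypothetical and are not the engine's actual query syntax.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Slot:
    """One position of a query (hypothetical structure, for illustration only)."""
    token: Optional[str] = None                   # None acts as a wildcard
    pos_any_of: Optional[FrozenSet[str]] = None   # OR over POS tags; None = any
    pos_none_of: FrozenSet[str] = frozenset()     # NOT over POS tags

    def matches(self, token: str, pos: str) -> bool:
        if self.token is not None and self.token != token:
            return False
        if self.pos_any_of is not None and pos not in self.pos_any_of:
            return False
        return pos not in self.pos_none_of


def ngram_matches(query: Sequence[Slot],
                  ngram: Sequence[Tuple[str, str]]) -> bool:
    """True if every (token, POS) pair satisfies the slot at the same position."""
    return len(query) == len(ngram) and all(
        slot.matches(token, pos) for slot, (token, pos) in zip(query, ngram)
    )


# "New * <NN or NNP>": fixed token, wildcard, then an OR constraint on POS.
QUERY = [Slot(token="New"), Slot(), Slot(pos_any_of=frozenset({"NN", "NNP"}))]
NGRAM = [("New", "JJ"), ("York", "NNP"), ("City", "NNP")]
```

Chunk and NE constraints would follow the same pattern with additional fields per slot.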


Demo

• http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2


Available for you

• Web system
– At NYU: http://nlp.cs.nyu.edu/nsearch
– At JHU?
• USB hard drive


Implementation: Overview

• Steps for a search request
– 1. Search candidates
– 2. Filtering
– 3. Display
• Components in the diagram
– Wikipedia text
– Wikipedia POS, chunk, NE
– N-gram data
– Inverted index for n-gram data
– Suffix array for text
– POS, chunk, NE for n-gram data






From n-gram to Inverted Index

• Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

• Posting list

Token, position   Ngram IDs
A pos=1           1 2
A pos=2           3
B pos=1           3
B pos=2           1 2
B pos=3           2
C pos=3           1 3
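The 3-gram example above can be reproduced with a few lines of code. This is a minimal in-memory sketch, assuming each posting list is keyed by a (token, position) pair; the function and variable names are illustrative and say nothing about how the engine stores its index on disk.

```python
from collections import defaultdict


def build_inverted_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs containing it there."""
    index = defaultdict(list)
    for ngram_id, tokens in sorted(ngrams.items()):
        for position, token in enumerate(tokens, start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)


# The three 3-grams from the slide.
NGRAMS = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
INDEX = build_inverted_index(NGRAMS)
```

Because n-gram IDs are visited in ascending order, every posting list comes out sorted, which is what the intersection step relies on.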


Posting list

• Wide variation of posting list size (in 7-gram: 1.27B)
– “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
– conscipcuous, consiety, Mizuk, (1)
• 3 types for faster speed and smaller index size
– Bitmap (freq > 1%): #EOS# 1.27B bits (bitmap) <-> 3.2B bits (list)
– List of ngramIDs
– Encoded into pointer (freq = 1)

Examples from the figure:
– List: C pos=3 -> 1 3
– Bitmap: 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1
– Pointer: C pos=3 -> 5
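The size trade-off behind the three posting-list types can be checked with simple arithmetic. The sketch below assumes 32-bit n-gram IDs (an assumption, chosen because 100,906,888 IDs x 32 bits is about 3.2B bits, matching the slide) and uses the approximate 7-gram count of 1.27B; the function names are illustrative.

```python
ID_BITS = 32                 # assumption: one n-gram ID takes 32 bits in a list
BITMAP_THRESHOLD = 0.01      # from the slide: bitmap when freq > 1%
TOTAL_7GRAMS = 1_270_000_000  # approximate count quoted on the slide


def list_bits(freq):
    """Bits needed to store `freq` n-gram IDs as a plain list."""
    return freq * ID_BITS


def bitmap_bits(total_ngrams):
    """Bits needed for a bitmap with one bit per n-gram."""
    return total_ngrams


def choose_representation(freq, total_ngrams):
    """Pick one of the three posting-list types named on the slide."""
    if freq == 1:
        return "pointer"   # the single ID is stored in place of a pointer
    if freq > BITMAP_THRESHOLD * total_ngrams:
        return "bitmap"
    return "list"
```

For "#EOS#", `list_bits(100_906_888)` is 3,229,020,416 bits against 1,270,000,000 bits for the bitmap, so the bitmap is less than half the size.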


Search

• Given an n-gram request (A B C)
– Get posting lists for A, B and C
– Search intersections of posting lists
– Use “look ahead” to speed up the search
  • Look ahead size = Sqrt(size of posting list) (Moffat and Zobel, 1996)

Example posting lists (the figure marks a SKIP):
4 33 34 55 76 80 89 92 99
4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
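The intersection with look-ahead can be sketched as follows. This is one simple way to apply a skip of size sqrt(list length) while walking the longer list; it is an illustration under that assumption, not the engine's actual code, and `intersect_with_skips` is a hypothetical name.

```python
import math


def intersect_with_skips(list_a, list_b):
    """Intersect two sorted lists of unique IDs, skipping ahead in the longer one."""
    short, long_ = (list_a, list_b) if len(list_a) <= len(list_b) else (list_b, list_a)
    skip = max(1, math.isqrt(len(long_)))  # look-ahead size = sqrt(list size)
    result = []
    pos = 0
    for target in short:
        # Look ahead: jump a whole block while its far end is still too small.
        while pos + skip < len(long_) and long_[pos + skip] < target:
            pos += skip
        # Finish with a short linear scan inside the block.
        while pos < len(long_) and long_[pos] < target:
            pos += 1
        if pos == len(long_):
            break
        if long_[pos] == target:
            result.append(target)
    return result


# The two posting lists shown on the slide.
SHORT_LIST = [4, 33, 34, 55, 76, 80, 89, 92, 99]
LONG_LIST = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
COMMON = intersect_with_skips(SHORT_LIST, LONG_LIST)
```

For a three-token request, the same function can be applied pairwise, starting with the two shortest posting lists.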


[Overview diagram repeated: 1. Search candidates, 2. Filtering]


Filtering

• Not all candidate ngramIDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
– Reduces the index by more than 200GB

[Figure: n-gram “A B” (Freq=123) with tags NN, VB, PERSON, LOC and frequencies Freq=10, Freq=5]
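Storing annotations as IDs rather than strings is what makes the index smaller. The sketch below shows the idea with one byte per tag; the tag inventories and the two-bytes-per-token layout are assumptions for illustration, since the slides do not give the actual encoding.

```python
# Hypothetical tag inventories; the real tag sets are not given in the slides.
POS_TAGS = ["NN", "VB", "DT", "JJ"]
NE_TAGS = ["O", "PERSON", "LOC"]
POS_ID = {tag: i for i, tag in enumerate(POS_TAGS)}
NE_ID = {tag: i for i, tag in enumerate(NE_TAGS)}


def encode(pos_tags, ne_tags):
    """Pack one (POS ID, NE ID) pair per token: two bytes per token."""
    packed = bytearray()
    for pos, ne in zip(pos_tags, ne_tags):
        packed.append(POS_ID[pos])
        packed.append(NE_ID[ne])
    return bytes(packed)


def decode(packed):
    """Recover the tag strings from the packed ID pairs."""
    pos_tags = [POS_TAGS[b] for b in packed[0::2]]
    ne_tags = [NE_TAGS[b] for b in packed[1::2]]
    return pos_tags, ne_tags


PACKED = encode(["NN", "VB"], ["PERSON", "O"])
```

Filtering can then compare small integers directly, without decoding back to strings.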



[Overview diagram repeated: 2. Filtering, 3. Display]


Display

• N-grams are displayed in descending order of frequency
– N-gram IDs are ordered by frequency
• Sentences are searched using a suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID and Wikipedia title (and possibly other features of the doc) are displayed with sentences and KWIC
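Looking up sentences for a matched n-gram with a suffix array can be sketched as below. This is a naive token-level version for small inputs (the sort compares whole suffixes, which would not scale to 1.7G words); the function names and the KWIC formatting are illustrative assumptions.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, query):
    """Binary-search the suffix array for every position where `query` starts."""
    query = list(query)
    m = len(query)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def kwic(tokens, position, length, width=2):
    """Format a keyword-in-context line with `width` tokens on each side."""
    left = " ".join(tokens[max(0, position - width):position])
    key = " ".join(tokens[position:position + length])
    right = " ".join(tokens[position + length:position + length + width])
    return f"{left} [{key}] {right}".strip()


TOKENS = "the cat sat on the mat the cat".split()
SUFFIX_ARRAY = build_suffix_array(TOKENS)
HITS = find_occurrences(TOKENS, SUFFIX_ARRAY, ["the", "cat"])
```

Each hit position can be passed to `kwic` to produce the context line shown next to the n-gram.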


Size of data

• Components in the figure: Wikipedia text; N-gram data; Suffix array for text; POS, chunk, NE for N-gram data; Others
• Sizes in the figure: 108 GB, 6 GB, 8 GB, 8 GB, 260 GB, 100 GB; Others: 40 GB
• Total: 530 GB
• Text: 1.7G words, 200M sentences, 2.4M articles
• N-grams: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B


Future Work

• Other information (ex: parse, coref, relation, genre, discourse…)
• Longer n-grams
• Compress index, dictionary
• Ease the indexing load
– Now we need a big-memory machine
– Distributed indexing
• Union operation for tokens


Available for you

• Web demo
– At NYU: http://nlp.cs.nyu.edu/nsearch
– At JHU?
• USB hard drive
