TinyLex: Static N-Gram Index Pruning with Perfect Recall
Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.
TinyLex: Static N-Gram Index Pruning - Derrick Coetzee
Consider searching for a subsequence in a collection of genome sequences:
…gcaagctttatagtgacaacaataaggtatcactcggtt…
N-gram inverted indexes are the traditional solution, but have 10-100 times more terms than ordinary word-based inverted indexes
TinyLex indexes achieve similar query performance with 7-17 times fewer terms
TinyLex provides good worst-case query performance
Motivation
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Inverted indexes
each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Inverted indexes
Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
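The word index and the "sack and cat" query above can be reproduced with a short Python sketch. The `normalize` helper is a toy stand-in for a real stemmer and is an assumption made only for this example, as are the function names.

```python
import re
from collections import defaultdict

# The four example documents from the slide.
DOCS = {
    1: "Each wife had seven sacks,",
    2: "Each sack had seven cats,",
    3: "Each cat had seven kits.",
    4: "Kits, cats, sacks, and wives.",
}

def normalize(word):
    """Toy normalizer (an assumption, not a real stemmer):
    lowercase, map '...ves' to '...fe', and strip a plural 's'."""
    word = word.lower()
    if word.endswith("ves"):
        return word[:-3] + "fe"
    if word.endswith("s"):
        return word[:-1]
    return word

def build_inverted_index(docs):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[A-Za-z]+", text):
            index[normalize(word)].add(doc_id)
    return index

def query_and(index, *words):
    """Documents containing every query word: intersect the posting lists."""
    postings = [index.get(normalize(w), set()) for w in words]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(DOCS)
result = query_and(index, "sack", "cat")
print(sorted(result))
```

Unlike the slide, this sketch also indexes the word "and"; a real system would typically apply a stop-word list and a proper stemmer.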
Partial word or punctuation queries
◦ Searching a dictionary for all words ending in “ment”
◦ Searching for <b> in HTML files
◦ Searching for "%s" in C source files
◦ Searching for x^2/2 in LaTeX source files
Searching East Asian language text
◦ No spaces, word extraction is complex
Phrase searching
Limitations of inverted indexes
Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...
Example query: Documents containing subsequence “cact”
Limitations of inverted indexes
Simplified example: Two-letter alphabet
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
N-gram inverted indexes
aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
N-gram inverted indexes
Query: aaba
aaba → aab and aba
1. babbbbabab 2. aababaaabb 3. babababaab (false positive) 4. bbbbaabbbb
N-gram inverted indexes
Query: aaba → aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
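The false positive above can be reproduced with a minimal sketch of a fixed-length n-gram index. The function names are illustrative assumptions; the verification step simply rescans each candidate document for the query string.

```python
# The four example documents over the two-letter alphabet.
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

def ngrams(text, n):
    """All distinct substrings of length n (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_ngram_index(docs, n):
    """Map each n-gram to the set of documents in which it occurs."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index

def candidates(index, query, n, all_ids):
    """Intersect the posting lists of every n-gram of the query."""
    result = set(all_ids)
    for gram in ngrams(query, n):
        result &= index.get(gram, set())
    return result

def search(docs, index, query, n):
    """Return (true matches, false positives) after verifying each candidate."""
    cands = candidates(index, query, n, docs.keys())
    hits = {d for d in cands if query in docs[d]}
    return hits, cands - hits

index = build_ngram_index(DOCS, 3)
hits, false_positives = search(DOCS, index, "aaba", 3)
print(sorted(hits), sorted(false_positives))
```

Document 3 contains both aab and aba but never as the overlapping sequence aaba, which is why the index alone cannot rule it out.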
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Selecting n-gram length
a: {1, 2, 3, 4}
b: {1, 2, 3, 4}
Small number of terms
Slow queries
• Long posting lists
• Too many false positives
length = 1
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Selecting n-gram length
aababa: {2}
aabbbb: {4}
abaaab: {2}
ababaa: {2, 3}
ababab: {3}
abbbba: {1}
baaabb: {2}
baabbb: {4}
babaaa: {2}
babaab: {3}
bababa: {3}
babbbb: {1}
bbaabb: {4}
bbabab: {1}
bbbaab: {4}
bbbaba: {1}
bbbbaa: {4}
bbbbab: {1}
Fast queries
Too many terms
Queries must be ≥6 characters
length = 6
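The trade-off between n-gram length, number of terms, and posting list length can be checked on the example collection with a small sketch. The helper name `index_stats` is an assumption made for illustration.

```python
# The four example documents over the two-letter alphabet.
DOCS = ["babbbbabab", "aababaaabb", "babababaab", "bbbbaabbbb"]

def index_stats(docs, n):
    """Return (number of terms, mean posting list length) for an n-gram index."""
    postings = {}
    for doc_id, text in enumerate(docs, start=1):
        for i in range(len(text) - n + 1):
            postings.setdefault(text[i:i + n], set()).add(doc_id)
    terms = len(postings)
    mean_len = sum(len(p) for p in postings.values()) / terms if terms else 0.0
    return terms, mean_len

for n in (1, 3, 6):
    terms, mean_len = index_stats(DOCS, n)
    print(f"length = {n}: {terms} terms, mean posting list length {mean_len:.2f}")
```

Short n-grams give few terms with long posting lists; long n-grams give short posting lists at the cost of many more terms.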
Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions
Overview
Goal: fewer terms without sacrificing query performance
Consider the n-grams “juggl” and “uggle”
◦ Almost exactly the same posting list in a typical English language collection
◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”
TinyLex
juggl: {2, 7, 33}
uggle: {2, 7, 33}
uggl: {2, 7, 33}
Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.
Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.
Allow variable-length n-grams.
TinyLex
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
TinyLex: Example
aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}
In this example t = 1. At most 1 false positive is allowed for any query.
Only 10 terms!
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
TinyLex: Example
Query: abaab → aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
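A query against a variable-length index can be sketched as follows. The index is hard-coded from the example, and the lookup strategy is an assumption: it intersects the posting lists of every indexed term that is a substring of the query, whereas the slide shows only aba and baab. The extra term aa is redundant here, so the result is the same.

```python
# The 10-term example index (t = 1), copied from the slide.
INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4}, "aaa": {2}, "aba": {1, 2, 3},
    "bab": {1, 2, 3}, "bba": {1, 4}, "bbb": {1, 4}, "aaba": {2},
    "baab": {3, 4}, "babb": {1},
}
ALL_DOCS = {1, 2, 3, 4}

def substrings(query):
    """All distinct non-empty substrings of the query."""
    return {query[i:j]
            for i in range(len(query))
            for j in range(i + 1, len(query) + 1)}

def candidates(index, query, all_docs):
    """Intersect the posting lists of every indexed substring of the query.
    If no substring is indexed, every document is a candidate."""
    result = set(all_docs)
    for term in substrings(query):
        if term in index:
            result &= index[term]
    return result

print(sorted(candidates(INDEX, "abaab", ALL_DOCS)))
```

Because the terms have different lengths, a query of any length can be answered, including queries shorter than every indexed term.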
The construction guarantees that if the query term occurs in the collection, it will have at most t – 1 false positives (zero in this case).
If we observe t false positives, we can halt immediately.
TinyLex: Nonoccurring terms
TinyLex: Nonoccurring terms
Query: bbbbb → bbb and bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} ∩ {1, 4} = {1, 4}
1. babbbbabab (false positive)
...can’t happen unless the query result is empty. Halt.
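The halting rule can be sketched as a verification loop over the candidate documents. This is an illustration under one stated assumption: the index guarantees fewer than t false positives for any query that occurs in the collection, so seeing t of them proves the query does not occur. The function name and return shape are hypothetical.

```python
# The four example documents over the two-letter alphabet.
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

def verify_with_halt(query, candidate_ids, docs, t):
    """Check candidates against the documents, halting after t false positives.

    Assumes the index guarantees fewer than t false positives for any
    query that occurs in the collection. Returns (matches, documents read).
    """
    hits = []
    false_positives = 0
    docs_read = 0
    for doc_id in sorted(candidate_ids):
        docs_read += 1
        if query in docs[doc_id]:
            hits.append(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                # The query cannot occur anywhere, so the result is empty.
                return [], docs_read
    return hits, docs_read

# Candidates {1, 4} come from intersecting the bbb posting lists above.
print(verify_with_halt("bbbbb", {1, 4}, DOCS, t=1))
```

With t = 1 the search stops after reading document 1, without ever opening document 4.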
Achieves query performance similar to that of classical n-gram indexes that have a much larger number of terms
Worst-case bound on number of false positives
Query can be any length
TinyLex: Benefits
The problem:
◦ Input: a set of documents, a threshold t
◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t – 1 false positives
Constructing a TinyLex index
Basic construction: For each n-gram length from 1 to max:
◦ Make a list of all n-grams in the collection and what documents they occur in.
◦ Perform a query on each term using the partially constructed index.
◦ If a term has too many false positives, add it to the index.
Constructing a TinyLex index
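The basic construction can be sketched in a few lines of Python. This is a naive illustration under assumptions, not the paper's optimized procedure: the function names are hypothetical, a query intersects the posting lists of all indexed substrings of the term, and `max_len` is set to 4 to match the worked example.

```python
# The four example documents over the two-letter alphabet.
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

def all_ngrams(docs, n):
    """Map every n-gram in the collection to the documents it occurs in."""
    postings = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            postings.setdefault(text[i:i + n], set()).add(doc_id)
    return postings

def query(index, term, all_ids):
    """Intersect posting lists of all indexed substrings of term.
    With no indexed substring, every document is a candidate."""
    result = set(all_ids)
    for i in range(len(term)):
        for j in range(i + 1, len(term) + 1):
            posting = index.get(term[i:j])
            if posting is not None:
                result &= posting
    return result

def build_tinylex(docs, t, max_len):
    """Add an n-gram to the index when querying it with the partially
    constructed index yields at least t false positives."""
    index = {}
    all_ids = set(docs)
    for n in range(1, max_len + 1):
        additions = {}
        for term, actual in all_ngrams(docs, n).items():
            false_positives = len(query(index, term, all_ids)) - len(actual)
            if false_positives >= t:
                additions[term] = actual
        index.update(additions)
    return index

index = build_tinylex(DOCS, t=1, max_len=4)
for term in sorted(index, key=lambda s: (len(s), s)):
    print(term, sorted(index[term]))
```

A larger `max_len` may add longer terms to the index. Enumerating every n-gram at every length is what makes this naive version slow.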
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
1-grams (index empty)
Term  Query result  Actual
a     {1,2,3,4}     {1,2,3,4}
b     {1,2,3,4}     {1,2,3,4}
t = 1
If the difference between the query result size and the actual posting list size is at least 1, add it to the index.
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
2-grams (index empty)
Term  Query result  Actual
aa    {1,2,3,4}     {2,3,4}
ab    {1,2,3,4}     {1,2,3,4}
ba    {1,2,3,4}     {1,2,3,4}
bb    {1,2,3,4}     {1,2,4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
Index after 2-grams:
aa: {2,3,4}
bb: {1,2,4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
Index so far:
aa: {2,3,4}
bb: {1,2,4}
3-grams
Term  Query result  Actual
aaa   {2,3,4}       {2}
aab   {2,3,4}       {2,3,4}
aba   {1,2,3,4}     {1,2,3}
abb   {1,2,4}       {1,2,4}
baa   {2,3,4}       {2,3,4}
bab   {1,2,3,4}     {1,2,3}
bba   {1,2,4}       {1,4}
bbb   {1,2,4}       {1,4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
Index after 3-grams:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
Index so far:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
4-grams
Term  Query result  Actual
aaab  {2}           {2}
aaba  {2,3}         {2}
aabb  {2,4}         {2,4}
abaa  {2,3}         {2,3}
abab  {1,2,3}       {1,2,3}
abbb  {1,4}         {1,4}
baaa  {2}           {2}
baab  {2,3,4}       {3,4}
baba  {1,2,3}       {1,2,3}
babb  {1,2}         {1}
bbaa  {4}           {4}
bbab  {1}           {1}
bbba  {1,4}         {1,4}
bbbb  {1,4}         {1,4}
1. babbbbabab 2. aababaaabb 3. babababaab 4. bbbbaabbbb
Construction: Example
Index after 4-grams:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
aaba: {2}
baab: {3,4}
babb: {1}
Results
Test set: 100MB TREC WSJ collection
37000 documents, English text
Same query performance with 7-17 times fewer terms
[Figure: mean query time (ms) versus number of terms (log scale, 1E+3 to 1E+6); series: TinyLex index, Classical n-gram index]
Results
Overall compressed index size 2-20% smaller
TinyLex index has more information per term
[Figure: mean query time (ms) versus index size (MB); series: TinyLex index, Classical n-gram index]
Results
Dramatic 50x improvement in worst-case query performance for long queries
[Figure: worst query time (ms) versus query length in characters; series: 6-grams, TinyLex index of same size]
See paper for:
Applications to phrase searching using variable-length word n-grams
Making the construction more efficient
Performance on genome sequences
Empirical evaluation of scaling
Suffix arrays (Manber and Myers 1991)
◦ Faster queries, but indexes 3-10 times larger
agrep and GLIMPSE (Wu and Manber 1994)
◦ More general queries, but relies on a word concept
n-Gram/2L (Kim et al. 2005)
◦ Orthogonal; examines fewer document offsets
“Growing an n-gram language model” (Siivola and Pellom 2005)
◦ Similar idea applied to language modeling
Related work
Faster construction time
◦ Currently about 10 times slower to construct than a classical n-gram index.
Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
Generalize to dynamic collections
Future work
N-gram indexes enable practical queries for subsequences
TinyLex indexes achieve similar query performance to classical n-gram indexes with 7-17 times fewer terms
TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives
Conclusions
Questions?