A Fast Regular Expression Indexing Engine Junghoo “John” Cho (UCLA) Sridhar Rajagopalan (IBM)
Junghoo "John" Cho (UCLA Computer Science) 2
Problem
How can we match a regular expression fast?
  Large text corpus
  Several days to match a simple regular expression!
Our solution
  Use an index!
Motivation
Advanced search interface
  What is the middle name of Thomas Edison?
  State of the art: keyword-based
    Thomas Edison
  Regular expression
    Thomas [a-z]+ Edison
Data extraction [Brin 98]
Outline
Index key selection
  Useful gram
  Algorithm for key selection
Other issues
Experiments
Motivating example
All mp3 URLs on the Web:
  <a href=("|')?.*\.mp3("|')?>
Every matching string contains “mp3”.
Questions:
  Should we index “mp3”?
  Should we index “<a href=”?
Which index entries?
Solution 1: Inverted index (English words)
  Cannot handle many regular expressions
Solution 2: k-grams for k = 1, 2, …, 10
  Index too large (10 times as large!)
Our solution: multigram index
Main idea
“mp3” is helpful.
  Not many pages have it.
“<a href=” is not.
  All pages have it.
We index only “useful” grams.
Gram selectivity
Sel(x): selectivity of gram x
  Sel(x) = M(x)/N
  M(x): number of pages containing gram x
  N: total number of pages
C-useful gram: any gram x with Sel(x) < C
  C: system parameter
    random access vs. sequential access time
We index only “C-useful” grams
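The selectivity definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is a small in-memory list of page strings, and the function names `selectivity` and `is_c_useful`, the sample pages, and the threshold 0.5 are hypothetical, not taken from the slides.

```python
def selectivity(gram, pages):
    """Sel(x) = M(x) / N for gram x over a list of page strings."""
    containing = sum(1 for page in pages if gram in page)  # M(x)
    return containing / len(pages)                         # N = len(pages)


def is_c_useful(gram, pages, c):
    """A gram is C-useful when its selectivity is strictly below C."""
    return selectivity(gram, pages) < c


# Hypothetical four-page corpus: every page has a link, one links to an mp3.
pages = [
    '<a href="a.mp3">song</a>',
    '<a href="index.html">home</a>',
    '<a href="about.html">about</a>',
    '<a href="news.html">news</a>',
]

print(selectivity("mp3", pages))       # gram found in 1 of 4 pages
print(selectivity("<a href=", pages))  # gram found in all 4 pages
```

With this toy corpus, “mp3” is C-useful for C = 0.5 while “<a href=” is not, which mirrors the Main idea slide.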
Minimal useful gram
“Unix is great”
  If “Unix” is useful, then “Unix i”, “Unix is”, “Unix is g”, … are all useful.
  “Unix” is the minimal useful gram.
We index only the minimal useful grams.
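Why extensions of a useful gram stay useful: every page that contains “Unix i” also contains “Unix”, so the longer gram can never appear in more pages than the shorter one. The sketch below checks minimality under one assumption that the slides suggest but do not spell out here, namely that “minimal” means no proper prefix of the gram is itself useful. The function names, sample pages, and threshold are hypothetical.

```python
def selectivity(gram, pages):
    """Fraction of pages that contain the gram."""
    return sum(1 for page in pages if gram in page) / len(pages)


def is_minimal_useful(gram, pages, c):
    """Useful gram whose longest proper prefix is not useful.

    Checking only the longest proper prefix is enough: if it is not
    useful, no shorter prefix can be useful, because a shorter prefix
    appears in at least as many pages.
    """
    if selectivity(gram, pages) >= c:
        return False                      # gram itself is not useful
    prefix = gram[:-1]
    return prefix == "" or selectivity(prefix, pages) >= c


# Hypothetical corpus where "Uni" is common but "Unix" is rare.
pages = ["Unix is great", "University", "Unicode", "United"]

print(is_minimal_useful("Unix", pages, 0.5))
print(is_minimal_useful("Unix i", pages, 0.5))
```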
Advantages
Versatile
  We can look up “Unix” for all grams like “Unix i”, “Unix is g”, etc.
Easy to find
  Reduction to the “A priori” algorithm
Index size guarantee
Algorithm
Main idea:
  If “abcde” is a minimal useful gram, then “abcd” is not useful.
  If “abcd” is not useful, then “a”, “ab”, and “abc” are not useful.
  Minimal useful gram identification is equivalent to useless gram identification.
A priori algorithm
Useless gram identification
  Find all sequences of characters that occur in more than k pages
A priori algorithm
  Find all sets of items that occur in more than k baskets
Fewer than 4 scans of the corpus to find all minimal useful grams.
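The reduction above can be illustrated with a level-wise search in the spirit of the A priori algorithm: a gram of length k is only counted if its length k-1 prefix was found useless, and it becomes a minimal useful gram as soon as it turns out to be useful. This is a simplified sketch, not the authors' implementation. It makes one pass over the corpus per gram length, so it does not reproduce the bound on the number of scans stated on the slide, and the function name, the `max_len` cap, and the sample pages are assumptions.

```python
def minimal_useful_grams(pages, c, max_len):
    """Level-wise search: extend only useless grams, keep the first useful ones."""
    n = len(pages)
    useless = {""}       # the empty gram occurs on every page
    minimal = set()
    for k in range(1, max_len + 1):
        counts = {}      # gram -> number of pages containing it
        for page in pages:
            seen = set()
            for i in range(len(page) - k + 1):
                gram = page[i:i + k]
                if gram[:-1] in useless:   # prune: prefix must be useless
                    seen.add(gram)
            for gram in seen:
                counts[gram] = counts.get(gram, 0) + 1
        next_useless = set()
        for gram, m in counts.items():
            if m / n < c:
                minimal.add(gram)          # useful, with a useless prefix
            else:
                next_useless.add(gram)     # still useless: extend next round
        if not next_useless:
            break
        useless = next_useless
    return minimal


# Hypothetical corpus of four tiny pages.
pages = ["abc", "abd", "abe", "xbc"]
print(sorted(minimal_useful_grams(pages, c=0.5, max_len=3)))
```

In this toy run “ab” and “bc” are useless (each appears in at least half of the pages), so only their extensions are examined at length 3.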
Prefix free set
A set of grams X is prefix free if no x ∈ X is a prefix of any other x’ ∈ X
  e.g., X = {ab, ac, abc} is not prefix free.
A set of minimal useful grams is a prefix free set.
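A prefix free check is easy to sketch. The version below sorts the grams first, which works because any gram that has another gram as a prefix sorts directly after a run of strings sharing that prefix, so it suffices to compare neighbours. The function name `is_prefix_free` is hypothetical.

```python
def is_prefix_free(grams):
    """True if no gram in the set is a prefix of a different gram."""
    ordered = sorted(set(grams))
    # In sorted order, a gram that prefixes anything also prefixes
    # its immediate successor, so adjacent pairs are enough.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


print(is_prefix_free({"ab", "ac", "abc"}))  # the slide's counterexample
print(is_prefix_free({"ab", "ac"}))
```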
Size of a prefix free set
Let X be a prefix free set of grams extracted from corpus D. Then
  |X| ≤ |D|
  |X|: number of grams in X
  |D|: number of characters in D
The size of an index with minimal useful grams does not exceed the size of the corpus!
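One way to see the bound: if two different grams of X started at the same position of D, the shorter would be a prefix of the longer, which a prefix free set rules out. So each character position starts at most one gram, and the number of gram occurrences (and therefore of distinct grams) cannot exceed |D|. The sketch below checks this on a toy string; the function name and the sample data are hypothetical.

```python
def gram_starts(corpus, grams):
    """Return (total occurrences, max number of grams starting at one position)."""
    total, worst = 0, 0
    for i in range(len(corpus)):
        here = [g for g in grams if corpus.startswith(g, i)]
        total += len(here)
        worst = max(worst, len(here))
    return total, worst


corpus = "abcabd"
prefix_free = {"abc", "abd", "b"}
total, worst = gram_starts(corpus, prefix_free)
print(total, worst, len(corpus))
```

For a set that is not prefix free, such as {ab, abc}, two grams can start at the same position and the argument no longer applies.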
Shortest suffix gram
<a href="k
  If ="k is useful, then <a href="k, a href="k, href="k, etc. are all useful.
  ="k: shortest suffix gram
We index only the shortest suffix gram.
Pre-suf shell
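The suffix rule can be sketched as a filter over a set of useful grams: drop every gram that has a shorter useful gram as a proper suffix. This is an illustration of the idea on the slide only; the function name is hypothetical, and how the authors combine it with the prefix rule to build the pre-suf shell is not shown here.

```python
def shortest_suffix_grams(grams):
    """Keep only grams that have no proper suffix in the same set."""
    grams = set(grams)
    return {
        g for g in grams
        if not any(g[i:] in grams for i in range(1, len(g)))
    }


useful = {'<a href="k', 'a href="k', 'href="k', '="k'}
print(shortest_suffix_grams(useful))
```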
Other issues
Given a regular expression, how do we find the index entries to look up?
Optimization?
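The slides leave the lookup procedure open, so the following is only one plausible sketch, not the authors' method. It assumes the literal strings that every match must contain have already been extracted from the regular expression by hand, that the index is a plain dictionary from gram to a set of page ids, and that candidate pages are then verified with Python's `re` module. All names and the sample data are hypothetical.

```python
import re


def candidate_pages(index, required_literals, all_page_ids):
    """Intersect postings of every indexed gram found inside a required literal."""
    candidates = set(all_page_ids)
    for literal in required_literals:
        for gram, postings in index.items():
            if gram in literal:          # any match must contain this gram
                candidates &= postings
    return candidates                    # unchanged if no gram applies (full scan)


def match(pattern, pages, index, required_literals):
    """Run the regex only on candidate pages; return matching page ids."""
    regex = re.compile(pattern)
    hits = candidate_pages(index, required_literals, pages.keys())
    return sorted(pid for pid in hits if regex.search(pages[pid]))


pages = {
    1: '<a href="song.mp3">',
    2: '<a href="index.html">',
    3: 'mp3 players',
}
index = {"mp3": {1, 3}}                  # only the selective gram is indexed
pattern = r'<a href=("|\')?.*\.mp3("|\')?>'
print(match(pattern, pages, index, [".mp3"]))
```

When no indexed gram occurs in any required literal, the candidate set stays equal to the whole corpus and the query degrades to raw scanning.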
Experiments
Half a million Web documents
Comparison
  Raw scanning
  Multigram index
  Complete: k-grams for k = 1, 2, …, 10
Benchmark queries
  No standard benchmark
  Collected from IBM Almaden researchers
Example queries (simplified)
MP3 URLs: <a href=.*\.mp3>
Invalid HTML: <[^>]*<
Phone numbers: (\d\d\d) \d\d\d-\d\d\d\d
PowerPC chip number: (xpc|mpc)[0-9]+[0-9a-z]+
Middle name of Clinton: William [a-z]+ Clinton
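For readers who want to try the simplified queries, here they are as Python `re` patterns. Two choices below are assumptions rather than part of the slides: the parentheses in the phone pattern are escaped so that they match literal characters, and matching is case-insensitive so that `[a-z]+` accepts a capitalized middle name. The sample strings are made up.

```python
import re

# Simplified benchmark queries, written for Python's re module.
QUERIES = {
    "mp3": r"<a href=.*\.mp3>",
    "invalid_html": r"<[^>]*<",
    "phone": r"\(\d\d\d\) \d\d\d-\d\d\d\d",     # parentheses escaped (assumption)
    "powerpc": r"(xpc|mpc)[0-9]+[0-9a-z]+",
    "clinton": r"William [a-z]+ Clinton",
}

# Hypothetical sample text for each query.
SAMPLES = {
    "mp3": "<a href=song.mp3>",
    "invalid_html": "<b <i>",
    "phone": "(310) 555-0100",
    "powerpc": "mpc7400",
    "clinton": "William Jefferson Clinton",
}


def matches(name, text):
    """True if the named query matches anywhere in the text (case-insensitive)."""
    return re.search(QUERIES[name], text, flags=re.IGNORECASE) is not None


for name, text in SAMPLES.items():
    print(name, matches(name, text))
```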
Evaluation metrics
Index construction time
Index size
Matching time
  Overall throughput
  Response time for first 10 matches
Construction time & Index size
|                   | Complete       | Multigram   |
| Construction time | 63 hours       | 6 hours     |
| No. of keys       | 103,151,302    | 64,656      |
| No. of postings   | 18,193,048,399 | 820,396,717 |
An order of magnitude reduction in index size
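The reduction factors follow directly from the figures reported above. The snippet below just divides the Complete numbers by the Multigram numbers; the dictionary names are hypothetical.

```python
# Figures as reported for the two indexes.
complete = {"hours": 63, "keys": 103_151_302, "postings": 18_193_048_399}
multigram = {"hours": 6, "keys": 64_656, "postings": 820_396_717}

# Ratio of Complete to Multigram for each measure.
ratios = {name: complete[name] / multigram[name] for name in complete}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The postings ratio comes out at roughly 22, the key ratio at roughly 1,600, and the construction time ratio at 10.5.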
Matching time
On average, Complete is only 33% faster than Multigram
| Query   | Scanning | Complete | Multigram |
| mp3     | 573 sec  | 11 sec   | 15 sec    |
| PowerPC | 548 sec  | 1 sec    | 2 sec     |
| phone   | 540 sec  | 540 sec  | 540 sec   |
Result size & Improvement
[Figure: improvement from scanning (log scale, 100% to 100000%) plotted against result size (log scale, 1 to 1000000)]
Related work
Suffix tree
  Baeza-Yates et al., JACM, 1998
  Main-memory based
Disk-based string index
  Cooper et al., VLDB, 2001
  Good for exact string matching
Inverted index
  English words
Conclusion
Fast matching of regular expressions
  Multigram index
    Small size
    Significant improvement in matching time
Future work
  Optimization?