Paolo Ferragina, Università di Pisa
XML Compression and Indexing
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
The Future of Web Search, Barcelona, May 2006
Under patenting by Pisa-Rutgers Univ.
Compressed Permuterm Index
Paolo Ferragina, Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Under Y!-patenting
A basic problem

Given a dictionary D of strings of variable length, design a compressed data structure that supports:
1) string ↔ id
2) Prefix(α): find all strings in D that are prefixed by α
3) Suffix(β): find all strings in D that are suffixed by β
4) Substring(γ): find all strings in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

This is the Tolerant Retrieval Problem (wildcards) of the IR book of Manning-Raghavan-Schütze:
Prefix(α) = α*
Suffix(β) = *β
Substring(γ) = *γ*
PrefixSuffix(α,β) = α*β
Hashing
• No non-exact searches (prefix, suffix, substring)
(Compacted) Trie
• Two versions: one for D and one for D^R (the reversed strings), then intersect the answers
• No substring search (unless using a Suffix Trie)
• Need to store D for resolving edge-labels
Front-coding

Sorted URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded (length of the prefix shared with the previous string, then the remaining suffix; 0 starts a new bucket):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...

• Compressed size on the uk-2002 crawl (≈250 Mb): front coding 30-35%, bzip ≈ 10%. Be back on this, later on!
• Two versions: one for D and one for D^R, then intersect the answers
• Need some extra data structures for bucket identification
• No substring search
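The front-coding example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names front_encode and front_decode are made up for illustration, and the bucketing (restarting from 0 every so many strings) is left out.

```python
from os.path import commonprefix

def front_encode(sorted_strings):
    """Return (lcp, suffix) pairs, where lcp is the length of the prefix
    shared with the previous string in the sorted list."""
    pairs, prev = [], ""
    for s in sorted_strings:
        lcp = len(commonprefix([prev, s]))
        pairs.append((lcp, s[lcp:]))
        prev = s
    return pairs

def front_decode(pairs):
    """Rebuild the strings by copying lcp chars from the previous one."""
    strings, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings

urls = [
    "http://checkmate.com/All_Natural/",
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
    "http://checkmate.com/All_Natural/Aroma1.html",
]
encoded = front_encode(urls)
for lcp, suffix in encoded:
    print(lcp, suffix)
```

Decoding a string requires scanning from the start of its bucket, which is why the slide asks for extra data structures for bucket identification.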
Permuterm Index (Garfield, 76)
Reduce any query to a “prefix query” over a larger dictionary
Permuterm Index [Garfield, 1976]

Take a dictionary D = {yahoo, google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings

Permuterm Dictionary P[D]:
yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google

Prefix(ya) = Prefix($ya)
Suffix(oo) = Prefix(oo$)
Substring(oo) = Prefix(oo)
PrefixSuffix(y,o) = Prefix(o$y)

Any query on D reduces to a prefix-query on P[D]

Space problems: P[D] contains every rotation of every string
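The reduction to prefix queries can be tried out with an uncompressed permuterm dictionary. A rough sketch follows, assuming the dictionary fits in memory as a sorted Python list; build_permuterm and prefix_query are illustrative names, not part of any library.

```python
from bisect import bisect_left

def build_permuterm(dictionary):
    """Sorted list of (rotation, original string) over all rotations of s + '$'."""
    entries = []
    for s in dictionary:
        t = s + "$"
        entries.extend((t[i:] + t[:i], s) for i in range(len(t)))
    return sorted(entries)

def prefix_query(index, p):
    """Dictionary strings having at least one rotation prefixed by p."""
    i = bisect_left(index, (p,))  # first rotation >= p
    found = set()
    while i < len(index) and index[i][0].startswith(p):
        found.add(index[i][1])
        i += 1
    return found

index = build_permuterm(["yahoo", "google"])
print(prefix_query(index, "$ya"))   # Prefix(ya)
print(prefix_query(index, "oo$"))   # Suffix(oo)
print(prefix_query(index, "oo"))    # Substring(oo)
print(prefix_query(index, "o$y"))   # PrefixSuffix(y,o)
```

The list stores 13 rotations for 2 strings, which illustrates the space problem.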
Compressed Permuterm Index
[SIGIR '07]

It deploys two ingredients:
• Permuterm index
• Compressed full-text index

Theoretically:
• Query ops take optimal time: proportional to the pattern length
• Space occupancy is |D| H_k(D) + o(|D| log |Σ|) bits

Technically:
• A simple reduction step: Permuterm → Compressed index
• Re-use known machinery on compressed indexes
• Achieve bzip-compression at Front-coding speed
The Burrows-Wheeler Transform (1994)

Take the text T = mississippi#

All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

L = ipssm#pissii
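The construction on this slide (all rotations, sort, take first and last columns) can be sketched directly. This is the naive quadratic-space version, fine for a slide-sized example only; it relies on '#' sorting before the letters, which holds for ASCII.

```python
def bwt(text):
    """Return the first (F) and last (L) columns of the sorted rotation matrix."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    F = "".join(r[0] for r in rows)
    L = "".join(r[-1] for r in rows)
    return F, L

F, L = bwt("mississippi#")
print(F)  # first column
print(L)  # the BWT of T
```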
Compressing L is effective
Key observation: L is locally homogeneous, hence L is highly compressible
Bzip vs. Gzip: 20% vs. 33%, but bzip is slower in (de)compression!
The FM-index
The result [Ferragina-Manzini, JACM '05]:
• Count(P): O(p) time
• Locate(P): O(occ * polylog(|T|)) time
• Display(T[i,i+L]): O(L + polylog(|T|)) time
• Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits

The survey of Navarro-Mäkinen contains many other indexes.

New concept: the FM-index is an opportunistic data structure.

The main idea is to reduce substring search to some basic operations over arrays of symbols.

The Compressed Permuterm index builds upon the best two features of the FM-index.
First ingredient: L → F mapping

F  unknown   L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

How do we map L's chars onto F's chars? We need to distinguish equal chars in F.
• Take two equal chars of L
• Rotate their rows rightward
• They keep the same relative order!
First ingredient: L → F mapping (the oracle)

The oracle: Rank(s, 9) = 3, the number of occurrences of s in L[1,9].
The FM-index is actually a Rank data structure over the BWT: O(1) time and H_k space.
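A small sketch of the oracle and of the resulting LF mapping, under the assumption that a linear scan stands in for the real O(1)-time, H_k-space Rank structure. Rows are 0-based in the code, while rank takes a 1-based prefix length as on the slide.

```python
def rank(L, c, i):
    """Number of occurrences of c in L[1..i] (naive scan, not O(1))."""
    return L[:i].count(c)

def lf_mapping(L):
    """LF[i] = row of F holding the same text character as L[i] (0-based)."""
    # C[c] = number of characters in L that are smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    # equal chars keep the same relative order in L and in F
    return [C[c] + rank(L, c, i) for i, c in enumerate(L)]

L = "ipssm#pissii"
LF = lf_mapping(L)
print(rank(L, "s", 9))
print(LF)
```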
Second ingredient: Backward step

Backward step(i): return LF[i], in O(1) time

T is scanned backward by using the LF-mapping.
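Repeating the backward step recovers T from L alone. The sketch below assumes a unique end marker ('#', as in the running example) and uses naive counting where a real index would use its Rank structure.

```python
def inverse_bwt(L, end="#"):
    """Rebuild T from its BWT by scanning T backward with the LF mapping."""
    # C[c] = number of characters in L that are smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    LF = [C[c] + L[:i].count(c) for i, c in enumerate(L)]

    row = L.index(end)        # the row equal to T itself ends with the marker
    out = []
    for _ in range(len(L)):
        out.append(L[row])    # L[row] is the char preceding F[row] in T
        row = LF[row]         # backward step
    return "".join(reversed(out))

print(inverse_bwt("ipssm#pissii"))
```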
Third ingredient: substring search

P = si

       unknown       L
       #mississipp   i
       i#mississip   p
       ippi#missis   s
       issippi#mis   s
       ississippi#   m
       mississippi   #
       pi#mississi   p
       ppi#mississ   i
fr →   sippi#missi   s
lr →   sissippi#mi   s
       ssippi#miss   i
       ssissippi#m   i

occ = lr - fr + 1 = 2

Count(P[1,p]):
Finds <fr,lr> in O(p) time
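Count can be sketched as a backward search over L: the pattern is consumed right to left, and each step narrows the row range with two Rank queries. In this sketch the ranks are naive scans, so the O(p) bound does not apply; the function name backward_search is illustrative.

```python
def backward_search(L, P):
    """Return (fr, lr), the 1-based range of rows prefixed by P, or None."""
    # C[c] = number of characters in L that are smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    lo, hi = 0, len(L)              # 0-based half-open range of rows
    for c in reversed(P):
        if c not in C:
            return None
        lo = C[c] + L[:lo].count(c)
        hi = C[c] + L[:hi].count(c)
        if lo >= hi:
            return None
    return lo + 1, hi               # convert to 1-based inclusive <fr, lr>

fr, lr = backward_search("ipssm#pissii", "si")
print(fr, lr, "occ =", lr - fr + 1)
```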
The Compressed Permuterm

Build the FM-index of Z = $hat$hip$hop$hot$# to support substring searches. The dictionary strings in Z are lexicographically sorted.

Some queries are trivial...
Prefix(α) = Substring search($α) within Z
Suffix(β) = Substring search(β$) within Z
Substr(γ) = Substring search(γ) within Z
PrefixSuffix search
Key property: the last char of s_i is at L[i+1]

Cyclic-LF[i]:
  If (i > #D) return LF[i]
  else return LF[i+1]

[Figure: BWT matrix of Z (middle columns unknown), marking row i=3 and the rows LF[3] and CLF[3]]
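PrefixSuffix(P):
Search the FM-index of Z using Cyclic-LF instead of LF.
No change in the time/space bounds of compressed indexes.

[Figure: PrefixSuffix(ho,p) on the BWT matrix of Z (middle columns unknown), showing the rows of $ho and the LF vs. CLF steps]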
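One way to picture the Cyclic-LF search is the counting sketch below. It rests on assumptions that are not taken from the slides: the end marker is written '~' instead of '#', so that plain Python string comparison puts '$' before the letters and the end marker after them; under that ordering the rows of $s_1, ..., $s_m come first and the last char of s_i sits at L[i+1], as the key property states. The original work may set up the alphabet order differently. Ranks are naive scans, and prefix_suffix_count is an illustrative name.

```python
SEP, END = "$", "~"   # assumption: '~' plays the role of the slide's '#'

def build(strings):
    """Return (sorted words, BWT L of Z, C array) for Z = $s1$s2...$sm$~."""
    words = sorted(set(strings))
    z = SEP + SEP.join(words) + SEP + END
    order = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    L = "".join(z[i - 1] for i in order)       # char preceding each rotation
    F = sorted(z)
    C = {c: F.index(c) for c in set(z)}        # chars smaller than c
    return words, L, C

def step(L, C, c, lo, hi):
    """One backward-search step on the half-open row range [lo, hi)."""
    if c not in C:
        return 0, 0
    return C[c] + L[:lo].count(c), C[c] + L[:hi].count(c)

def prefix_suffix_count(words, L, C, alpha, beta):
    """Count strings with prefix alpha and suffix beta (search beta$alpha cyclically)."""
    lo, hi = 0, len(L)
    for c in reversed(SEP + alpha):            # rows prefixed by $alpha
        lo, hi = step(L, C, c, lo, hi)
    hi = min(hi, len(words))                   # skip the "$~" row (empty alpha)
    if lo >= hi:
        return 0
    lo, hi = lo + 1, hi + 1                    # cyclic jump: L[i+1] ends s_i
    for c in reversed(beta):                   # continue from the end of s_i
        lo, hi = step(L, C, c, lo, hi)
    return max(0, hi - lo)

words, L, C = build(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(words, L, C, "ho", "p"))
```

The only change with respect to a plain backward search is the one-row shift applied after matching $α, which matches the remark that the time/space bounds are unaffected.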
Rank and Select of strings
Other queries on Z = $hat$hip$hop$hot$#:
Rank(s) = row of $s$
Select(i) = scan backward from L[i+1]
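Rank and Select of strings can be sketched on the same BWT. As a labeled assumption, the end marker is written '~' rather than '#' so that default Python ordering places '$' first and the marker last, which makes row i (1-based) start with $s_i; the original may order symbols differently. rank_of and select are illustrative names, and the ranks over L are naive scans.

```python
SEP, END = "$", "~"   # assumption: '~' plays the role of the slide's '#'

def build(strings):
    """Return (BWT L of Z, C array) for Z = $s1$s2...$sm$~ with sorted s_i."""
    z = SEP + SEP.join(sorted(set(strings))) + SEP + END
    order = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    L = "".join(z[i - 1] for i in order)
    F = sorted(z)
    return L, {c: F.index(c) for c in set(z)}

def rank_of(L, C, s):
    """1-based id of s, found as the row of $s$; None if s is not in D."""
    lo, hi = 0, len(L)
    for c in reversed(SEP + s + SEP):
        if c not in C:
            return None
        lo, hi = C[c] + L[:lo].count(c), C[c] + L[:hi].count(c)
        if lo >= hi:
            return None
    return lo + 1

def select(L, C, i):
    """The i-th string (1-based): scan backward from L[i+1] until a '$'."""
    row, chars = i, []            # 1-based row i+1 is 0-based row i
    while L[row] != SEP:
        c = L[row]
        chars.append(c)
        row = C[c] + L[:row].count(c)   # LF step
    return "".join(reversed(chars))

L, C = build(["hat", "hip", "hop", "hot"])
print(rank_of(L, C, "hop"), select(L, C, 3))
```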
Experiments
Three dictionaries:
• Term dictionary: Trec WT10G
• Host dictionary (reversed): UK-2005
• Url dictionary (host reversed): first 190Mb of UK-2005

             Term      Host     Url
size         118 Mb    34 Mb    190 Mb
# strings    10 Mil    2 Mil    3 Mil
FC           40%       45%      30%
bzip         33%       25%      10%

PrefixSuffix search needs *2
A test on URLs
• Time of 20-60 μsec/char, and space close to bzip
• Time close to Front-Coding (4 μsec/char), but <50% of its space
MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”.
Now, they mention CPI.
Choose your trade-off
[Figure: trade-off plot; axis label: % dict-size]
We proposed an approach for dictionary storage:
+ Theory: optimal time and entropy-bounds for space
+ Practice: trades time vs space, thus fitting user needs