Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di...

Paolo Ferragina, Università di Pisa

XML Compression and Indexing

Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

The Future of Web Search, Barcelona, May 2006

Under patenting by Pisa-Rutgers Univ.


Compressed Permuterm Index

Paolo Ferragina, Rossano Venturini, Dipartimento di Informatica, Università di Pisa

Under Y!-patenting


A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

1) string ↔ id

2) Prefix(α): find all strings in D that are prefixed by α

3) Suffix(β): find all strings in D that are suffixed by β

4) Substring(γ): find all strings in D that contain γ

5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

IR book of Manning-Raghavan-Schütze

Tolerant Retrieval Problem (wildcards)

Prefix(α) = α*

Suffix(β) = *β

Substring(γ) = *γ*

PrefixSuffix(α,β) = α*β




1) string ↔ id

2) Prefix(α): find all s in D that are prefixed by α

3) Suffix(β): find all s in D that are suffixed by β

4) Substring(γ): find all s in D that contain γ


Hashing

Not exact searches




1) string ↔ id





(Compacted) Trie

Two versions: for D and for D^R (the reversed strings) + Intersect answers

No substring search (unless using Suffix Trie)

Need to store D for resolving edge-labels




1) string ↔ id





Front coding...


Two versions: for D and for D^R (the reversed strings) + Intersect answers

Need some extra data structures for bucket identification

No substring search

Front-coding

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html...

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
24 /Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html

0 http://checkmate.com/All/Natural/Washcloth.html...
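The front-coding scheme shown above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the function names `front_encode` and `front_decode` are hypothetical, and bucketing (restarting with a full string every so often, as the "0 ..." entries suggest) is left out.

```python
def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with the previous one, rest)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by copying the shared prefix from the previous one."""
    decoded, prev = [], ""
    for lcp, rest in encoded:
        prev = prev[:lcp] + rest
        decoded.append(prev)
    return decoded


urls = [
    "http://checkmate.com/All_Natural/",
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
    "http://checkmate.com/All_Natural/Aroma1.html",
]
codes = front_encode(urls)
for lcp, rest in codes:
    print(lcp, rest)
```

Decoding a string requires scanning from the start of its bucket, which is why extra structures for bucket identification are needed.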

30-35%

bzip ≈ 10%

We will come back to this later on!

uk-2002 crawl ≈ 250Mb


A basic problem

Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support

1) string ↔ id



4) Substring(γ): find all s in D that contain γ


Permuterm Index (Garfield, 76)

Reduce any query to a “prefix query” over a larger dictionary


Permuterm Index [Garfield, 1976]

Take a dictionary D={yahoo,google}

1. Append a special char $ to the end of each string
2. Generate all rotations of these strings

yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google

Prefix(ya) = Prefix($ya)

Suffix(oo) = Prefix(oo$)

Substring(oo) = Prefix(oo)

PrefixSuffix(y,o)= Prefix(o$y)

Any query on D reduces to a prefix-query on P[D]

Permuterm Dictionary

Space problems
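The reduction to prefix queries can be tried out with a plain, uncompressed permuterm dictionary. The sketch below is only an illustration under simple assumptions: rotations are kept in a sorted Python list, and the helper names (`build_permuterm`, `prefix_query`, and the four query wrappers) are hypothetical.

```python
from bisect import bisect_left


def build_permuterm(words):
    """Sorted list of (rotation, word) for every rotation of word + '$'."""
    entries = []
    for w in words:
        t = w + "$"
        for k in range(len(t)):
            entries.append((t[k:] + t[:k], w))
    entries.sort()
    return entries


def prefix_query(entries, q):
    """All words having a rotation that starts with q (a prefix query on P[D])."""
    keys = [rot for rot, _ in entries]  # rebuilt per call: fine for a sketch
    i = bisect_left(keys, q)
    found = set()
    while i < len(entries) and entries[i][0].startswith(q):
        found.add(entries[i][1])
        i += 1
    return found


P = build_permuterm(["yahoo", "google"])


def prefix(a):
    return prefix_query(P, "$" + a)


def suffix(b):
    return prefix_query(P, b + "$")


def substring(g):
    return prefix_query(P, g)


def prefix_suffix(a, b):
    return prefix_query(P, b + "$" + a)


print(prefix_suffix("y", "o"))
```

Every string of length n contributes n+1 rotations, which is the source of the space problem noted above.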


Compressed Permuterm Index

It deploys two ingredients:

• Permuterm index
• Compressed full-text index

Theoretically:

• Query ops take optimal time: proportional to pattern length
• Space occupancy is |D| Hk(D) + o(|D| log |Σ|) bits

Technically:

• A simple reduction step: Permuterm → Compressed index
• Re-use known machinery on compressed indexes

Achieve bzip-compression at Front-coding speed

SIGIR ‘07


The Burrows-Wheeler Transform (1994)

Take the text T = mississippi#

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
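The construction above (list all rotations, sort them, read off the first and last columns) can be reproduced directly. This is a naive sketch for illustration only; the function name `bwt` is hypothetical, and it relies on '#' sorting before the letters, which holds for ASCII.

```python
def bwt(text):
    """Return the first (F) and last (L) columns of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
print(F)
print(L)
```

Sorting full rotations takes quadratic space; practical constructions go through a suffix array instead.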


Compressing L is effective

Key observation: L is locally homogeneous, hence L is highly compressible

Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!


The FM-index

The result:

• Count(P): O(p) time
• Locate(P): O(occ * polylog(|T|)) time
• Display( T[i,i+L] ): O( L + polylog(|T|) ) time
• Space occupancy: |T| Hk(T) + o(|T| log |Σ|) bits

[Ferragina-Manzini, JACM ‘05]

The survey of Navarro-Mäkinen contains many other indexes

New concept: The FM-index is an opportunistic data structure

The main idea is to reduce substring search to some basic operations over arrays of symbols

The Compressed Permuterm index builds upon the best two features of the FM-index


First ingredient: L → F mapping

F  unknown   L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

How do we map L’s chars onto F’s chars?

... Need to distinguish equal chars in F...

Take two equal L’s chars

Rotate rightward their rows

Same relative order !!



First ingredient: L → F mapping

The oracle: Rank( s , 9 ) = 3, i.e. the number of occurrences of s in L[1,9]

The FM-index is actually a Rank data structure over the BWT

O(1) time and Hk-space
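The Rank oracle and the L → F mapping built on it can be written out naively. The sketch below uses 1-based positions to match the slide, counts by scanning (so it is linear time per call, not O(1)), and the names `rank`, `smaller_counts` and `lf` are hypothetical. It uses the standard identity LF[i] = C[L[i]] + Rank(L[i], i), where C[c] is the number of symbols smaller than c.

```python
def rank(L, c, i):
    """Occurrences of symbol c in L[1..i] (1-based, inclusive)."""
    return L[:i].count(c)


def smaller_counts(L):
    """C[c] = number of symbols in L that are strictly smaller than c."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return C


def lf(L, C, i):
    """Row of F holding the same text symbol as L[i] (1-based rows)."""
    c = L[i - 1]
    return C[c] + rank(L, c, i)


L = "ipssm#pissii"  # last column of the sorted rotations of mississippi#
C = smaller_counts(L)
print(rank(L, "s", 9))
print(lf(L, C, 9))
```

A real FM-index replaces the scanning `rank` with a compressed structure answering in constant time.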



Second ingredient: Backward step

Backward step(i): return LF[i], in O(1) time

T is scanned backward by repeatedly applying the LF-mapping
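Scanning T backward with repeated LF steps is exactly how the text is recovered from L. The sketch below is a minimal illustration, assuming the end marker '#' occurs once and is the smallest symbol, so that row 0 is the rotation starting with it. Rows are 0-based here, and the function name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(L, end_marker="#"):
    """Rebuild T from its BWT by walking the LF-mapping backward."""
    n = len(L)
    # C[c] = number of symbols strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    # LF[i] = C[L[i]] + (occurrences of L[i] before position i)
    seen, LF = {}, [0] * n
    for i, c in enumerate(L):
        LF[i] = C[c] + seen.get(c, 0)
        seen[c] = seen.get(c, 0) + 1
    # Row 0 starts with the end marker, so L[0] is the symbol just before it.
    out, row = [], 0
    for _ in range(n - 1):
        out.append(L[row])
        row = LF[row]
    return "".join(reversed(out)) + end_marker


print(inverse_bwt("ipssm#pissii"))
```

Each loop iteration is one backward step, so the whole text is produced from right to left.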


Third ingredient: substring search

P = si

      F  unknown   L
      # mississipp i
      i #mississip p
      i ppi#missis s
      i ssippi#mis s
      i ssissippi# m
      m ississippi #
      p i#mississi p
      p pi#mississ i
fr →  s ippi#missi s
lr →  s issippi#mi s
      s sippi#miss i
      s sissippi#m i

occ = 2 [lr-fr+1]

Count(P[1,p]): finds <fr,lr> in O(p) time
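The search for the row range <fr,lr> can be sketched as a backward scan of P, one Rank pair per symbol. This is an illustrative version with scanning-based counting (so each step is linear, not constant time). The names `backward_search` and `count_occurrences` are hypothetical, and the returned rows are 1-based and inclusive to match the slide.

```python
def backward_search(L, pattern):
    """Return (fr, lr), the 1-based inclusive range of rows prefixed by pattern, or None."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    lo, hi = 0, len(L)  # 0-based half-open range: initially all rows
    for c in reversed(pattern):
        if c not in C:
            return None
        lo = C[c] + L[:lo].count(c)
        hi = C[c] + L[:hi].count(c)
        if lo >= hi:
            return None
    return lo + 1, hi


def count_occurrences(L, pattern):
    rng = backward_search(L, pattern)
    return 0 if rng is None else rng[1] - rng[0] + 1


L = "ipssm#pissii"  # BWT of mississippi#
print(backward_search(L, "si"))
print(count_occurrences(L, "si"))
```

The pattern is consumed right to left, so after processing a suffix of P the range covers exactly the rows prefixed by that suffix.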


The Compressed Permuterm

Z = $hat$hip$hop$hot$# (the dictionary strings are lexicographically sorted)

Build FM-index to support substring searches

Some queries are trivial...

Prefix(α) = Substring search($α) within Z

Suffix(β) = Substring search(β$) within Z

Substr(γ) = Substring search(γ) within Z


PrefixSuffix search

Key property:

Last char of s_i is at L[i+1]

Cyclic-LF[i]

If (i > #D) return LF[i]

else return LF[i+1]

[Figure: LF[3] versus CLF[3] for row i=3]


PrefixSuffix(ho,p)

PrefixSuffix(P):

Search FM-index of Z using Cyclic-LF instead of LF

No change in time/space bounds of compressed indexes

[Figure: backward search for p$ho; $ho is matched with LF steps, then a CLF step is taken]
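The effect of Cyclic-LF can be checked on the running example Z = $hat$hip$hop$hot$#. The sketch below is an illustration under an explicit assumption: '$' sorts before every letter and '#' after every letter, so that rows 1..#D are the rows prefixed by $s_i and the key property "last char of s_i is at L[i+1]" holds. The names `symbol_order`, `build_index`, `count_rows` and `prefix_suffix` are hypothetical. Rows are 0-based in the code, and counting is done by scanning rather than with a compressed Rank structure.

```python
def symbol_order(ch):
    # Assumed alphabet order: '$' < letters < '#'
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000
    return ord(ch)


def build_index(words):
    """Return (L, C, m) for Z = $s1$s2...$sm$# with the strings sorted."""
    words = sorted(words)
    z = "$" + "$".join(words) + "$#"
    n = len(z)
    starts = sorted(range(n), key=lambda i: [symbol_order(c) for c in z[i:] + z[:i]])
    L = [z[i - 1] for i in starts]  # symbol cyclically preceding each rotation
    C, total = {}, 0
    for c in sorted(set(z), key=symbol_order):
        C[c] = total
        total += z.count(c)
    return L, C, len(words)


def count_rows(L, C, m, pattern, cyclic=True):
    """Backward search; with cyclic=True rows below m jump one row down before each step."""
    fr, lr = 0, len(L) - 1  # 0-based inclusive range
    for c in reversed(pattern):
        if c not in C:
            return 0
        if cyclic:
            fr = fr + 1 if fr < m else fr
            lr = lr + 1 if lr < m else lr
        fr = C[c] + L[:fr].count(c)
        lr = C[c] + L[:lr + 1].count(c) - 1
        if fr > lr:
            return 0
    return lr - fr + 1


def prefix_suffix(index, alpha, beta):
    L, C, m = index
    return count_rows(L, C, m, beta + "$" + alpha)


index = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix(index, "ho", "p"))
```

With `cyclic=False` the same search counts plain occurrences of p$ho in Z, which also matches across the boundary between hip and hop; the jump is what ties the suffix back to the string whose prefix was matched.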


Rank and Select of strings

Z = $hat$hip$hop$hot$#

Other queries...

Rank(s) = row of $s$

Select(i) = backward scan from L[i+1], up to the preceding $
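Both operations can be sketched on top of the same BWT of Z. As before, this is an illustration assuming '$' sorts before every letter and '#' after every letter, so that the row prefixed by $s_i is row i (1-based). The names `build`, `rank_of` and `select` are hypothetical, and rows are 0-based inside the code.

```python
def symbol_order(ch):
    # Assumed alphabet order: '$' < letters < '#'
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000
    return ord(ch)


def build(words):
    """Return (L, C, LF) for Z = $s1$s2...$sm$# with the strings sorted."""
    z = "$" + "$".join(sorted(words)) + "$#"
    n = len(z)
    starts = sorted(range(n), key=lambda i: [symbol_order(c) for c in z[i:] + z[:i]])
    L = [z[i - 1] for i in starts]
    C, total = {}, 0
    for c in sorted(set(z), key=symbol_order):
        C[c] = total
        total += z.count(c)
    seen, LF = {}, [0] * n
    for i, c in enumerate(L):
        LF[i] = C[c] + seen.get(c, 0)
        seen[c] = seen.get(c, 0) + 1
    return L, C, LF


def rank_of(L, C, s):
    """1-based id of string s: the row prefixed by $s$, or None if s is absent."""
    fr, lr = 0, len(L) - 1
    for c in reversed("$" + s + "$"):
        if c not in C:
            return None
        fr = C[c] + L[:fr].count(c)
        lr = C[c] + L[:lr + 1].count(c) - 1
        if fr > lr:
            return None
    return fr + 1


def select(L, LF, i):
    """The i-th string (1-based): walk backward from L[i+1] until a '$' is met."""
    row = i  # 0-based row i is the slide's 1-based row i+1
    chars = []
    while L[row] != "$":
        chars.append(L[row])
        row = LF[row]
    return "".join(reversed(chars))


L, C, LF = build(["hat", "hip", "hop", "hot"])
print(rank_of(L, C, "hop"), select(L, LF, 3))
```

Select costs one backward step per symbol of the returned string, and Rank is a plain backward search for $s$.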


Experiments

Three dictionaries:

• Term dictionary: Trec WT10G
• Host dictionary (reversed): UK-2005
• Url dictionary (host reversed): first 190Mb of UK-2005

             Term      Host      Url
size         118 Mb    34 Mb     190 Mb
# strings    10 Mil    2 Mil     3 Mil
FC           40%       45%       30%
bzip         33%       25%       10%

PrefixSuffix search needs *2


A test on URLs

• Time of 20-60 μsec/char, and space close to bzip

• Time close to Front-Coding (4 μsec/char), but <50% of its space

MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”.

Now, they mention CPI

Choose your trade-off

[Figure: time/space trade-off plot; axis label: % dict-size]


We proposed an approach for dictionary storage:

+ Theory: optimal time and entropy-bounds for space

+ Practice: trades time vs space, thus fitting user needs
