Paolo Ferragina, Università di Pisa. XML Compression and Indexing. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. The Future of Web Search, Barcelona, May 2006. Under patenting by Pisa-Rutgers Univ.


Page 1

XML Compression and Indexing

Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

The Future of Web Search, Barcelona, May 2006

Under patenting by Pisa-Rutgers Univ.

Page 2

Compressed Permuterm Index

Paolo Ferragina, Rossano Venturini, Dipartimento di Informatica, Università di Pisa

Under Y!-patenting

Page 3

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

1) string → id
2) Prefix(α): find all strings in D that are prefixed by α
3) Suffix(β): find all strings in D that are suffixed by β
4) Substring(γ): find all strings in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

IR book of Manning-Raghavan-Schutze

Tolerant Retrieval Problem (wildcards)
Prefix(α) = α*
Suffix(β) = *β
Substring(γ) = *γ*
PrefixSuffix(α,β) = α*β
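As a reference point, the non-exact queries can be read as plain filters over D. The sketch below is a brute-force definition only, not a compressed data structure; the function and argument names are chosen here for illustration.

```python
def prefix(D, alpha):
    """All strings of D that start with alpha."""
    return sorted(s for s in D if s.startswith(alpha))


def suffix(D, beta):
    """All strings of D that end with beta."""
    return sorted(s for s in D if s.endswith(beta))


def substring(D, gamma):
    """All strings of D that contain gamma."""
    return sorted(s for s in D if gamma in s)


def prefix_suffix(D, alpha, beta):
    """Intersection of the prefix and suffix answers."""
    return sorted(set(prefix(D, alpha)) & set(suffix(D, beta)))
```

Every query scans the whole dictionary, which is exactly the cost the indexes on the following slides try to avoid.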

Page 4

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

1) string → id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Hashing: exact lookup only, no non-exact searches (2-5)

Page 5

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

1) string → id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

(Compacted) Trie

• Two versions, one for D and one for D^R (the reversed strings), then intersect the answers
• No substring search (unless using a Suffix Trie)
• Need to store D for resolving edge-labels

Page 6

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

1) string → id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Front coding...

Page 7

Front-coding

• Two versions, one for D and one for D^R, then intersect the answers
• Need some extra data structures for bucket identification
• No substring search

Input strings:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded strings (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...

Front-coding: 30-35% of the original size; bzip ≈ 10%. Be back on this, later on!

uk-2002 crawl ≈ 250Mb
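A minimal sketch of the front-coding step shown above, assuming a single bucket (a real layout restarts with an uncompressed string at the start of every bucket, which is where the extra bucket-identification structures come in). The helper names are hypothetical; the URLs are the first five of the slide.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(strings):
    """Encode each string as (shared prefix length with previous, remaining suffix)."""
    out, prev = [], ""
    for s in strings:
        k = lcp_len(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def front_decode(pairs):
    """Rebuild the strings by copying k chars from the previously decoded string."""
    out, prev = [], ""
    for k, tail in pairs:
        prev = prev[:k] + tail
        out.append(prev)
    return out


BASE = "http://checkmate.com/All_Natural/"
URLS = [BASE, BASE + "Applied.html", BASE + "Aroma.html",
        BASE + "Aroma1.html", BASE + "Aromatic_Art.html"]
CODED = front_encode(URLS)
# CODED[2] is (34, "roma.html"), matching the third row of the slide
```

Decoding a string requires decoding its predecessors in the bucket first, which is why bucket size trades space against access time.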

Page 8

A basic problem

Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support

1) string → id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Permuterm Index (Garfield, 76)

Reduce any query to a “prefix query” over a larger dictionary

Page 9

Permuterm Index [Garfield, 1976]

Take a dictionary D = {yahoo, google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings

Permuterm dictionary P[D]:
yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google

Prefix(ya) = Prefix($ya)
Suffix(oo) = Prefix(oo$)
Substring(oo) = Prefix(oo)
PrefixSuffix(y,o) = Prefix(o$y)

Any query on D reduces to a prefix-query on P[D]

Space problems
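The reduction can be tried out with an uncompressed permuterm dictionary. The sketch below stores every rotation explicitly in a sorted list and answers prefix queries by binary search; the function names are hypothetical and strings are assumed not to contain $.

```python
from bisect import bisect_left


def build_permuterm(D):
    """Sorted list of (rotation of s$, s) for every s in D."""
    entries = []
    for s in D:
        t = s + "$"
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], s))
    entries.sort()
    return entries


def prefix_query(index, q):
    """Strings owning at least one rotation that starts with q."""
    i = bisect_left(index, (q,))  # first entry whose rotation is >= q
    found = set()
    while i < len(index) and index[i][0].startswith(q):
        found.add(index[i][1])
        i += 1
    return sorted(found)


def prefix(index, a):
    return prefix_query(index, "$" + a)


def suffix(index, b):
    return prefix_query(index, b + "$")


def substring(index, g):
    return prefix_query(index, g)


def prefix_suffix(index, a, b):
    return prefix_query(index, b + "$" + a)
```

The index holds one rotation per character of every string, so its size is quadratic in the string lengths. This is the space problem the slide points at.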

Page 10

Compressed Permuterm Index

It deploys two ingredients:
• Permuterm index
• Compressed full-text index

Theoretically:
• Query ops take optimal time: proportional to pattern length
• Space occupancy is |D| Hk(D) + o(|D| log |Σ|) bits

Technically:
• A simple reduction step: Permuterm → Compressed index
• Re-use known machinery on compressed indexes
• Achieve bzip-compression at Front-coding speed

SIGIR ‘07

Page 11

The Burrows-Wheeler Transform (1994)

Take the text T = mississippi#

All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (F = first column, L = last column):
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

The row m ississippi # is T itself.
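The construction on this slide can be reproduced directly. This is a naive sketch that materialises every rotation (quadratic space), assuming the end marker # sorts before every other character, as it does in ASCII for lowercase text.

```python
def bwt(T):
    """Return column L of the sorted rotation matrix of T."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)


def first_column(T):
    """Return column F, which is just the characters of T in sorted order."""
    return "".join(sorted(T))


L = bwt("mississippi#")
F = first_column("mississippi#")
```

Practical implementations build L from a suffix array instead of sorting whole rotations.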

Page 12

Compressing L is effective

Key observation: L is locally homogeneous, hence L is highly compressible

Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

Page 13

The FM-index

The result:
• Count(P): O(p) time
• Locate(P): O(occ * polylog(|T|)) time
• Display(T[i,i+L]): O(L + polylog(|T|)) time
• Space occupancy: |T| Hk(T) + o(|T| log |Σ|) bits

[Ferragina-Manzini, JACM ‘05]

Survey of Navarro-Makinen contains many other indexes

New concept: The FM-index is an opportunistic data structure

The main idea is to reduce substring search to some basic operations over arrays of symbols

Compressed Permuterm index builds upon the best two features of the FM-index

Page 14

First ingredient: L → F mapping

[Figure: sorted rotation matrix of T = mississippi#; only columns F = #iiiimppssss and L = ipssm#pissii are shown, the middle columns are marked unknown]

How do we map L’s chars onto F’s chars? ... Need to distinguish equal chars in F ...

Take two equal L’s chars

Rotate rightward their rows

Same relative order !!

Page 15

First ingredient: L → F mapping

[Figure: sorted rotation matrix of T = mississippi#; only columns F and L are shown, the middle columns are marked unknown]

The oracle: Rank(s, 9) = 3

FM-index is actually a Rank data structure over the BWT

O(1) time and Hk-space
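The LF mapping follows from two counts: how many characters of L are smaller than L[i], and how many copies of L[i] occur before position i. The sketch below uses a linear-time `str.count` as a stand-in for the constant-time compressed rank structure mentioned on the slide; rows are 0-based here, and the helper names are hypothetical.

```python
def rank(L, c, i):
    """Occurrences of c among the first i characters of L."""
    return L.count(c, 0, i)


def smaller_counts(L):
    """C[c] = number of characters in L that are strictly smaller than c."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return C


def lf(L, C, i):
    """0-based row whose F character is the same text character as L[i]."""
    c = L[i]
    return C[c] + rank(L, c, i)


L = "ipssm#pissii"  # BWT of mississippi#
C = smaller_counts(L)
# rank(L, "s", 9) counts the s's in positions 1..9 of the slide (1-based): 3
```

Equal characters keep their relative order between L and F, which is why adding the rank to C[c] lands on the right row.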

Page 16

Second ingredient: Backward step

[Figure: sorted rotation matrix of T = mississippi#; only columns F and L are shown, the middle columns are marked unknown]

Backward step(i): return LF[i], in O(1) time

T is scanned backward by repeatedly applying the LF-mapping
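Repeating the backward step walks through T from its end to its beginning, which is enough to recover T from L alone. A sketch under the assumption that the end marker is unique and is the smallest character (so row 0 starts with it); the LF table is precomputed in one pass.

```python
def inverse_bwt(L, end="#"):
    """Rebuild T from its BWT by following LF from row 0."""
    # C[c] = number of characters in L smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    # lf[i] = C[L[i]] + (occurrences of L[i] before position i)
    seen, lf = {}, []
    for c in L:
        lf.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Row 0 starts with the end marker, so L[0] is the character preceding it in T
    out, i = [], 0
    for _ in range(len(L) - 1):
        out.append(L[i])
        i = lf[i]
    return "".join(reversed(out)) + end
```

Each step costs one LF evaluation, so the whole text is recovered in time linear in its length once rank is constant-time.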

Page 17

Third ingredient: substring search

P = si

[Figure: sorted rotation matrix of T = mississippi# with L = ipssm#pissii and the middle columns marked unknown; fr and lr delimit the rows prefixed by P]

occ = 2 [lr-fr+1]

Count(P[1,p]): finds <fr,lr> in O(p) time
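A sketch of the backward search behind Count(P). It processes P from its last character to its first and narrows the row range with two rank calls per character. The range is kept half-open here, so the count is lr - fr rather than the slide's lr - fr + 1; `str.count` again stands in for a constant-time rank structure.

```python
def backward_search(L, P):
    """Half-open 0-based range [fr, lr) of sorted rows prefixed by P."""
    # C[c] = number of characters in L smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    fr, lr = 0, len(L)
    for c in reversed(P):
        if c not in C:
            return (0, 0)
        fr = C[c] + L.count(c, 0, fr)
        lr = C[c] + L.count(c, 0, lr)
        if fr >= lr:
            return (0, 0)
    return (fr, lr)


def count(L, P):
    fr, lr = backward_search(L, P)
    return lr - fr


L = "ipssm#pissii"  # BWT of mississippi#
```

For P = si the search ends on rows 8 and 9 (0-based), the two rows starting with "sippi" and "sissippi".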

Page 18

The Compressed Permuterm

Some queries are trivial...

Prefix(α) = Substring search($α) within Z
Suffix(β) = Substring search(β$) within Z
Substr(γ) = Substring search(γ) within Z

Z = $hat$hip$hop$hot$#

Build FM-index to support substring searches

The strings in Z are lexicographically sorted
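The reduction itself does not depend on the FM-index. In the sketch below a plain left-to-right scan of Z plays the role of the substring-search engine, purely to show which pattern each query turns into; the function names are hypothetical.

```python
def build_z(D):
    """Concatenate the sorted dictionary as $s1$s2$...$sm$#."""
    return "$" + "$".join(sorted(D)) + "$#"


def occurrences(Z, q):
    """Number of (possibly overlapping) occurrences of q in Z."""
    return sum(1 for i in range(len(Z)) if Z.startswith(q, i))


def prefix_count(Z, alpha):
    return occurrences(Z, "$" + alpha)


def suffix_count(Z, beta):
    return occurrences(Z, beta + "$")


def substr_count(Z, gamma):
    return occurrences(Z, gamma)


Z = build_z(["hot", "hip", "hat", "hop"])
```

Note that the substring query counts occurrences, so a string containing the pattern twice contributes twice; the prefix and suffix queries count each string once.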

Page 19

PrefixSuffix search

Key property: the last char of s_i is at L[i+1]

Cyclic-LF[i]:
If (i > #D) return LF[i]
else return LF[i+1]

[Figure: example with i=3, comparing LF[3] and CLF[3]]

Page 20

PrefixSuffix(ho,p)

PrefixSuffix(P): Search FM-index of Z using Cyclic-LF instead of LF

No change in time/space bounds of compressed indexes

[Figure: rows prefixed by $ho, with the LF and CLF steps marked]
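A sketch of how the cyclic step can be wired into backward search to count PrefixSuffix(α,β) by searching for β$α. It rests on assumptions that are not in the slides: rows are 0-based, words contain only letters, and the end marker is written as ~ instead of # so that, under Python's default ordering, the rows starting with $ come first (row i is then prefixed by $s_i and L[i+1] holds the last char of s_i, mirroring the slide's key property). Rank is again a linear scan.

```python
def build_bwt(D, end="~"):
    """BWT of Z = $s1$...$sm$~ plus the table of smaller-character counts."""
    words = sorted(D)
    Z = "$" + "$".join(words) + "$" + end
    rows = sorted(Z[i:] + Z[:i] for i in range(len(Z)))
    L = "".join(r[-1] for r in rows)
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return L, C, len(words)


def prefix_suffix_count(D, alpha, beta):
    """Number of strings in D that start with alpha and end with beta."""
    L, C, m = build_bwt(D)
    fr, lr = 0, len(L) - 1  # inclusive row range
    for ch in reversed(beta + "$" + alpha):
        if ch not in C:
            return 0
        fr = C[ch] + L.count(ch, 0, fr)
        lr = C[ch] + L.count(ch, 0, lr + 1) - 1
        if fr > lr:
            return 0
        if ch == "$":
            # Cyclic step: from the rows $s_i move to the rows whose L char
            # is the last char of s_i, so the search wraps around each string.
            fr = fr + 1 if fr < m else fr
            lr = lr + 1 if lr < m else lr
    return lr - fr + 1
```

Because the wrap happens inside each string, a string may match even when α and β overlap in it, which agrees with the definition of PrefixSuffix as an intersection.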

Page 21

Rank and Select of strings

Other queries...

Z = $hat$hip$hop$hot$#

Rank(s) = row of $s$

Select(i) = scan backward from L[i+1]
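Both operations can be sketched with the same machinery. The conventions are assumptions made for this example only: 0-based rows and ranks, letter-only words, and ~ as the end marker in place of # so that row i is prefixed by $s_i and L[i+1] holds the last char of s_i.

```python
def build_bwt(D, end="~"):
    """BWT of Z = $s1$...$sm$~ plus the table of smaller-character counts."""
    Z = "$" + "$".join(sorted(D)) + "$" + end
    rows = sorted(Z[i:] + Z[:i] for i in range(len(Z)))
    L = "".join(r[-1] for r in rows)
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return L, C


def rank_string(L, C, s):
    """0-based lexicographic rank of s in D, or None if s is not in D."""
    fr, lr = 0, len(L)  # half-open row range
    for ch in reversed("$" + s + "$"):
        if ch not in C:
            return None
        fr = C[ch] + L.count(ch, 0, fr)
        lr = C[ch] + L.count(ch, 0, lr)
        if fr >= lr:
            return None
    return fr  # the row prefixed by $s$


def select_string(L, C, i):
    """The i-th string of D (0-based), read backward starting at L[i+1]."""
    row, out = i + 1, []
    while L[row] != "$":
        out.append(L[row])
        row = C[L[row]] + L.count(L[row], 0, row)  # one LF step
    return "".join(reversed(out))


L, C = build_bwt(["hat", "hip", "hop", "hot"])
```

Rank costs one backward search over |s|+2 characters, and Select costs one LF step per character of the returned string.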

Page 22

Experiments

Three dictionaries:
• Term dictionary: Trec WT10G
• Host dictionary (reversed): UK-2005
• Url dictionary (host reversed): first 190Mb of UK-2005

            Term      Host     Url
size        118 Mb    34 Mb    190 Mb
# strings   10 Mil    2 Mil    3 Mil
FC          40%       45%      30%
bzip        33%       25%      10%

PrefixSuffix search needs *2

Page 24

A test on URLs

[Figure: trade-off plot, time vs. % dict-size]

• Time of 20-60 μsec/char, and space close to bzip

• Time close to Front-Coding (4 μsec/char), but <50% of its space

Choose your trade-off

MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”.

Now, they mention CPI

Page 25

We proposed an approach for dictionary storage:

+ Theory: optimal time and entropy-bounds for space

+ Practice: trades time vs space, thus fitting user needs