Compressing and Searching XML Data Via Two Zips

Post on 12-Jan-2016

30 views 0 download

description

Compressing and Searching XML Data Via Two Zips. Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Opportunistic Data Structures with Applications P. Ferragina, G. Manzini. Survey by Navarro-Makinen cites more - PowerPoint PPT Presentation

Transcript of Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

Compressing and Searching XML Data Via

Two Zips

Paolo FerraginaDipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

Paolo Ferragina, Università di Pisa

Six years ago... [now, J. ACM 05]

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

Survey by Navarro-Makinen cites more than 50 papers on the subject !!

Paolo Ferragina, Università di Pisa

An XML excerpt

<dblp> <book>

<author> Donald E. Knuth </author><title> The TeXbook </title><publisher> Addison-Wesley </publisher><year> 1986 </year>

</book> <article>

<author> Donald E. Knuth </author><author> Ronald W. Moore </author><title> An Analysis of Alpha-Beta Pruning </title><pages> 293-326 </pages><year> 1975 </year><volume> 6 </volume><journal> Artificial Intelligence </journal>

</article>

...</dblp>

It is verbose !

Paolo Ferragina, Università di Pisa

A tree interpretation...

XML document exploration Tree navigation XML document search Labeled subpath

searches

Subset of XPath [W3C]

Paolo Ferragina, Università di Pisa

The Problem

Summary indexes (like Dataguide, 1-index or 2-index) large space and do not support “content” searches

XML-aware compressors (like XMill, XmlPpm, ScmPpm,...) need the whole decompression

We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:

Navigational operations: parent(u), child(u, i), child(u, i, c) Subpath searches: given a sequence of k labels Content searches: subpath + substring search Visualization operation: given a node, visualize its descending subtree

XML-queriable compressors (like XPress, XGrind, XQzip,...) poor compression and scan of the whole (compressed) file

XML-native search engines

might exploit this tool as a core block for

query optimization and (compressed) storage

Theoretically do exist many solutions, starting from [Jacobson, IEEE

Focs ’89] no subpath/content searches, and poor performance on labeled

trees

Paolo Ferragina, Università di Pisa

A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]

We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-and-Wheeler Trasform on strings (do you know bzip !?).

The XBW linearizes the tree T in 2 arrays s.t.:

the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over these two arrays

the indexing of T reduces to implement simple rank/select query operations over these two arrays

Paolo Ferragina, Università di Pisa

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CBDcacAb aDcBDba

S

CB CD B CD B CB CCA CA CA CD A CCB CD B CB C

S

upward labeled paths

Permutationof tree nodes

Step 1.Visit the tree in pre-order. For each node, write down its label and the labels on its upward path

Paolo Ferragina, Università di Pisa

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

S

upward labeled paths

Step 2.Stably sort according to S

Paolo Ferragina, Università di Pisa

XBW takes optimal t log || + 2t bits

1001010 10011011

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

S

Step 3.Add a binary array Slast marking the

rows corresponding to last children

Slast

XBW

XBW can be built and inverted

in optimal O(t) time

Key fact

Nodes correspond to items in <Slast,S>

Paolo Ferragina, Università di Pisa

XBzip – a simple XML compressor

Pcdata

Tags, Attributes and symbol =

XBW is compressible:

S and Spcdata are locally homogeneous

Slast has some structure

Paolo Ferragina, Università di Pisa

XBzip = XBW + PPMd

String compressors are not so bad: within 5%

0%

5%

10%

15%

20%

25%

DBLP Pathways News

gzip bzip2 ppmdi xmill + ppmdi scmppm XBzip

Paolo Ferragina, Università di Pisa

1001010 10011011

Some structural properties

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW

C

B A B

D c

c a

b a D

c

D a

b

C

B A B

D c

c a

b a D

c

D a

b

Two useful properties:

• Children are contiguous and delimited by 1s

• Children reflect the order of their parents

B

Paolo Ferragina, Università di Pisa

1001010 10011011

XBW is navigational

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW

C

B A B

D c

c a

b a D

c

D a

b

C

A B

D c

c a

b a D

c

D a

b

XBW is navigational:

• Rank-Select data structures on Slast and S

• The array C of || integers

B

Get_children

Rank(B,S)=2

Select in Slast the 2° item 1from here...

A 2B 5C 9D 12

C

Paolo Ferragina, Università di Pisa

1001010 10011011

XBW is searchable (count subpaths)

C

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW-index

Inductive step:

Pick the next char in [i+1]i.e. ‘D’

Search for the first and last ‘D’ in S[fr,lr]

Jump to their children

fr

lr

= B D

[i+1]

Rows whoseS starts with ‘B’

Their childrenhave upwardpath = ‘D B’

A 2B 5C 9D 12

lr

fr

XBW is searchable:

• Rank-Select data structures on Slast and S

• Array C of || integers

C

2 occurrences of

because of two 1s

Paolo Ferragina, Università di Pisa

XBzipIndex: XBW + FM-index

Upto 36% improvement in compression ratioQuery (counting) time 8 ms, Navigation time 3 ms

0%

10%

20%

30%

40%

50%

60%

DBLP Pathways News

Huffword XPress XQzip XBzipIndex XBzip

DBLP: 1.75 bytes/node, Pathways: 0.31 bytes/node, News: 3.91 bytes/node

Paolo Ferragina, Università di Pisa

Indexing[Kosaraju, Focs ‘89]

[IEEE Focs ’05][WWW ’06]

The overall picture on Compressed Indexing...

Data type

CompressedIndexing Strong connection

[IEEE Focs ’00][J. ACM ’05]

This is a powerful paradigm to design compressed indexes:

1. Transform the input in few arrays (via BWT or XBW)

2. Index (+ Compress) the arrays to support rank/select ops

Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)Experimental: Wea ’06 (2)

http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl