TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Max-Planck...

TopX 2.0
A (Very) Fast Object-Store for Top-k XPath Query Processing

Martin Theobald, Max-Planck Institute
Ralf Schenkel, Max-Planck Institute
Mohammed AbuJarour, Hasso-Plattner Institute

[Figure: two example article trees (article, sec, par, title, bib, item and url elements with sample text) from the INEX '03-'05 IEEE Collection, matched against the query below. The figure is annotated with the three themes RANKING, VAGUENESS and EARLY PRUNING.]

//article[about(.//bib//item, "W3C")]//sec[about(.//, "XML retrieval")]//par[about(.//, "native XML databases")]

[Figure: TopX architecture. Components shown:]

• Frontends: Web Interface, Web Service, API
• TopX 1.0 Query Processor: scan threads performing sequential access (SA), random access (RA), top-k queue, candidate queue
• Non-conjunctive top-k XPath query processing, probabilistic candidate pruning, probabilistic index access scheduling, dynamic query expansion
• Expensive predicates: path conditions, phrases & proximity, other full-text op's
• Index metadata: selectivities, histograms, correlations
• Ontology/large thesaurus: WordNet, OpenCyc, etc.
• Indexer/Crawler
• Relational DBMS backend with a unified text & XML schema, accessed via JDBC

The diagram also carries a "2.0" label.

Data Model

• XML trees (no XLink/ID/IDRef)
• Pre-/postorder ranges for the structural index
• Redundant full-content text nodes

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

[Figure: element tree of the example document with pre-/postorder labels and the stemmed full-content text of each element.]

Element  Pre  Post  Full-content text
article  1    6     "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
title    2    1     "xml data manage"
abs      3    2     "xml manage system vary wide expressive power"
sec      4    5     "native xml data base native xml data base system store schemaless data"
title    5    3     "native xml data base"
par      6    4     "native xml data base system store schemaless data"

ftf("xml", article1) = 4
ftf("xml", sec4) = 2
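The pre-/postorder labels and the full-content term frequencies of the example document can be reproduced with a short script. This is a minimal sketch, not TopX code: the function names `label_elements`, `ftf` and `is_descendant` are made up for illustration, and tokens are only lowercased rather than stemmed, which does not change the counts for "xml".

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""


def label_elements(root):
    """Return one row per element (in preorder) with pre/post ranks and terms."""
    rows = []
    counter = {"pre": 0, "post": 0}

    def visit(elem):
        counter["pre"] += 1
        row = {"tag": elem.tag, "pre": counter["pre"]}
        rows.append(row)
        for child in elem:
            visit(child)
        counter["post"] += 1
        row["post"] = counter["post"]
        # Full content: text of the element itself and of all its descendants.
        text = "".join(elem.itertext()).lower()
        row["terms"] = re.findall(r"[a-z0-9]+", text)

    visit(root)
    return rows


def ftf(term, row):
    """Full-content term frequency of `term` in one element."""
    return row["terms"].count(term)


def is_descendant(d, a):
    """True if element row d lies strictly inside element row a."""
    return a["pre"] < d["pre"] and d["post"] < a["post"]


rows = label_elements(ET.fromstring(DOC))
for r in rows:
    print(r["tag"], r["pre"], r["post"], ftf("xml", r))
```

The descendant test is the reason for storing both ranks: an element is a descendant of another exactly when its pre rank is larger and its post rank is smaller, so the structural check needs no tree traversal.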

Scoring Model [INEX ‘05/’06/’07/’08]

XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)

Ingredients of the score: content index (tag-term pairs), element frequency, element statistics

Example: author["gates"] vs. section["gates"]
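The slide's formula is not part of this transcript, so the following is only a sketch of the textbook Okapi BM25 shape with statistics kept per tag rather than per collection. The function name, the parameter defaults `k1=1.2` and `b=0.75`, and all sample numbers are assumptions for illustration and are not taken from TopX.

```python
import math


def bm25_tag_term_score(ftf, elem_len, avg_len, n_tag, ef_tag_term,
                        k1=1.2, b=0.75):
    """Textbook BM25 weight for one tag-term pair in one element.

    ftf          full-content term frequency of the term in the element
    elem_len     length of the element's full content (in terms)
    avg_len      average full-content length of elements with this tag
    n_tag        number of elements with this tag
    ef_tag_term  number of those elements that contain the term
    """
    k = k1 * ((1 - b) + b * elem_len / avg_len)
    tf_part = (k1 + 1) * ftf / (k + ftf)
    idf_part = math.log((n_tag - ef_tag_term + 0.5) / (ef_tag_term + 0.5))
    return tf_part * idf_part


# Hypothetical statistics: "gates" is rare among author elements
# but common among section elements.
author_score = bm25_tag_term_score(1, 10, 10, n_tag=1000, ef_tag_term=5)
section_score = bm25_tag_term_score(1, 10, 10, n_tag=1000, ef_tag_term=400)
print(author_score, section_score)
```

With per-tag statistics the same term can weigh very differently under different tags, which is the point of the author["gates"] vs. section["gates"] example.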

TopX 1.0: Relational Schema

• Precompute & materialize the scoring model into a combined inverted index over tag-term pairs
• Supports sorted access (by descending MaxScore) and random access (by DocID)
• Typically two B+trees in a DBMS

Sorted access (SA) to sec["xml"]:
Select DocID, Pre, Post, Score From TagTermIndex Where tag='sec' and term='xml' Order by MaxScore desc, DocID desc, Pre asc, Post desc

Random access (RA) to sec["xml"] within document 3:
Select Pre, Post, Score From TagTermIndex Where DocID=3 and tag='sec' and term='xml' Order by Pre asc, Post desc
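The two access paths can be tried out against an in-memory SQLite database. This is a sketch under assumptions: the column types, the two index definitions (standing in for the two B+trees) and all sample rows are invented for illustration, and SQLite is used only because it ships with Python, not because TopX 1.0 used it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TagTermIndex (
        tag TEXT, term TEXT, DocID INTEGER,
        Pre INTEGER, Post INTEGER, Score REAL, MaxScore REAL
    )""")
# One index per access path (assumed layout).
conn.execute("""CREATE INDEX sa_idx ON TagTermIndex
                (tag, term, MaxScore DESC, DocID DESC, Pre, Post DESC)""")
conn.execute("""CREATE INDEX ra_idx ON TagTermIndex
                (DocID, tag, term, Pre, Post DESC)""")

# Invented sample data; MaxScore is the best Score of the pair per document.
conn.executemany(
    "INSERT INTO TagTermIndex VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("sec", "xml", 3, 4, 9, 1.0, 1.0),
        ("sec", "xml", 3, 5, 3, 0.2, 1.0),
        ("sec", "xml", 2, 2, 15, 0.9, 0.9),
        ("sec", "xml", 2, 10, 8, 0.5, 0.9),
        ("sec", "xml", 17, 23, 48, 0.8, 0.8),
        ("par", "xml", 3, 6, 4, 0.7, 0.7),
    ],
)

# Sorted access: whole element blocks in descending MaxScore order.
sa_rows = conn.execute(
    """SELECT DocID, Pre, Post, Score FROM TagTermIndex
       WHERE tag = 'sec' AND term = 'xml'
       ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC""").fetchall()

# Random access: all sec["xml"] elements of one document.
ra_rows = conn.execute(
    """SELECT Pre, Post, Score FROM TagTermIndex
       WHERE DocID = 3 AND tag = 'sec' AND term = 'xml'
       ORDER BY Pre ASC, Post DESC""").fetchall()

print(sa_rows)
print(ra_rows)
```

Ordering by MaxScore first keeps all elements of a document together, so a scan always delivers complete element blocks.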

Top-k XPath over a Relational Schema [TopX, VLDB '05 & VLDB-J(1) '08]

• Content-only (CO) & "structure enriched" queries: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]
  Index lists read by sequential access (SA): sec["xml"], title["native"], par["retrieval"]
  • Sequentially scan each index list in descending order of MaxScore
  • Hash-join element blocks by DocID in-memory
  • Do "some" incremental XPath evaluation using Pre/Post indices
  • Aggregate Score along connected path fragments
  • Use a variant of Fagin's threshold algorithm for top-k-style early termination

• Non-conjunctive XPath evaluations
  • Dynamically relax content- & structure-related query conditions (top-k results entirely driven by score aggregations for content & structure cond.'s)

• Content-and-structure (CAS) queries: //article//sec[about(.//, "XML")]
  • Expensive predicate probes (RA) to the structure index (3rd B+tree), e.g. for the article tag:
    Select Pre, Post From TagIndex Where DocID=2123 and Tag='article' Order by Pre asc, Post desc
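The early-termination idea can be illustrated with a small threshold-style scan. This is a sketch of the no-random-access (NRA) flavor of Fagin's family of algorithms at document level, not TopX's actual variant: it ignores XPath structure, reads the lists round-robin, and uses invented scores. The name `nra_top_k` is hypothetical.

```python
def nra_top_k(lists, k):
    """Scan score-sorted lists of (doc_id, score) until the top-k set is safe.

    Returns (top-k list of (doc_id, worstscore), number of sorted accesses).
    Missing list entries count as 0, i.e. the evaluation is non-conjunctive.
    The returned worstscores are lower bounds, not necessarily final scores.
    """
    m = len(lists)
    pos = [0] * m
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    seen = {}  # doc_id -> {list index: score}
    while True:
        progressed = False
        for i, lst in enumerate(lists):
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = score  # upper bound for everything still unread
                progressed = True
            else:
                high[i] = 0.0
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top = ranked[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            unseen_bound = sum(high)
            rest_bound = max((best[d] for d in ranked[k:]), default=0.0)
            if min_k >= max(unseen_bound, rest_bound):
                return [(d, worst[d]) for d in top], sum(pos)
        if not progressed:  # all lists exhausted
            return [(d, worst[d]) for d in top], sum(pos)


# Invented per-document scores for the three index lists of the CO query.
sec_xml = [(3, 1.0), (2, 0.9), (17, 0.8), (5, 0.2)]
title_native = [(3, 0.9), (17, 0.5), (2, 0.1)]
par_retrieval = [(2, 0.8), (3, 0.7), (9, 0.1)]

result, accesses = nra_top_k([sec_xml, title_native, par_retrieval], k=1)
print(result, accesses)
```

In this example the scan stops after two rounds, because no other candidate and no unseen document can still overtake the current top result.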

Relational Schema (ct'd)

• 20,810,942 distinct tag-term pairs (e.g. sec["xml"]) for the 4.38 GB Wikipedia collection
• 1,107 distinct tags (e.g. article)
• No shredding into a DTD-specific relational schema!
• No DTD at all for INEX Wikipedia!

TopX 1.0: Top-k XPath over a Relational Schema

• 2-dimensional source of redundancy
  • Full-content scoring model (redundancy factor ≈ avg. depth of a text node, 6.7 for INEX-Wiki)
  • De-normalized relational schema, many redundant attributes
• High overhead in the architecture (Java -> JDBC -> DBMS & back)
• Element-block sizes are data-driven; the layout on disk is not easy to control
• Hashing is too slow compared to very efficient in-memory merge-joins

Content index: (4+4+4+4+4+4+4) bytes x 567,262,445 tag-term pairs ≈ 16 GB
Structure index: (4+4+4+4) bytes x 52,561,559 tags ≈ 0.85 GB

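TopX 2.0: Object-Oriented Storage

[Figure: an element block carries a DocID and a MaxScore header followed by the (Pre, Post, Score) entries of that document's elements, e.g. (2, 15, 0.9), (10, 8, 0.5), (23, 48, 0.8), (45, 87, 0.2). An offset index maps each tag-term pair to the start of its index list: sec["xml"] -> 0, title["xml"] -> 122,564, …, par["xml"] -> 432,534.]

Relational: (4+4+4+4+4+4+4) x 567,262,445 ≈ 16 GB
Object-oriented: 4 x 456,466,649 + (4+4+4) x 567,262,445 ≈ 8.6 GB (still uncompressed)
(+ (4+4) x 20,810,942 = 166 MB for the offset index)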
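The size estimates follow directly from the counts on the slides. The short calculation below reproduces them; the comments on which fields the 4-byte terms stand for are a reading of the slides (seven relational attributes, one shared field per element block, three fields per element), not something the slides spell out.

```python
TAG_TERM_PAIRS = 567_262_445      # element entries in the content index
ELEMENT_BLOCKS = 456_466_649      # one per (tag-term pair, document)
DISTINCT_TAG_TERMS = 20_810_942   # one index list each

# Relational: seven 4-byte attributes per row.
relational_bytes = (4 + 4 + 4 + 4 + 4 + 4 + 4) * TAG_TERM_PAIRS

# Object-oriented: 4 bytes stored once per element block,
# plus three 4-byte fields per element entry.
object_bytes = 4 * ELEMENT_BLOCKS + (4 + 4 + 4) * TAG_TERM_PAIRS

# Offset index: two 4-byte fields per distinct tag-term pair.
offset_index_bytes = (4 + 4) * DISTINCT_TAG_TERMS

for name, size in [("relational", relational_bytes),
                   ("object-oriented", object_bytes),
                   ("offset index", offset_index_bytes)]:
    print(f"{name}: {size:,} bytes = {size / 1e9:.2f} GB")
```

Storing the shared attributes once per block instead of once per row roughly halves the footprint before any compression is applied.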

[Figure: binary file layout of the index lists.]
B – Element block separator
L – Index list separator

Object-Oriented Storage w/ Block-Merging

• Group element blocks with similar MaxScore into document blocks of bounded length (e.g. < 256 KB)
• Sort element blocks within each document block by DocID
• Supports sorted access by MaxScore, merge-joins by DocID, and raw disk access

[Figure: the index lists sec["xml"] (offset 0) and title["xml"] (offset 122,564) in the binary file. Each list is a sequence of document blocks (< 256 KB) in descending MaxScore order, read by sequential access (SA); each document block contains element blocks separated by B, and the list ends with L.]
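The grouping step can be sketched as a greedy packing pass. This is an illustration under assumptions, not the TopX implementation: the function name, the dictionary layout of an element block, the size estimate (4 bytes per block plus 12 bytes per element) and the tiny 64-byte limit used in the demo are all made up so that the effect is visible on five blocks.

```python
def build_document_blocks(element_blocks, max_bytes=256 * 1024):
    """Pack element blocks into size-bounded document blocks.

    Element blocks are taken in descending MaxScore order, so each document
    block holds blocks of similar MaxScore; inside a document block the
    element blocks are then re-sorted by DocID to allow merge-joins.
    """
    ordered = sorted(element_blocks, key=lambda b: -b["max_score"])
    doc_blocks, current, used = [], [], 0
    for block in ordered:
        size = 4 + 12 * len(block["elements"])  # assumed uncompressed size
        if current and used + size > max_bytes:
            doc_blocks.append(sorted(current, key=lambda b: b["doc_id"]))
            current, used = [], 0
        current.append(block)
        used += size
    if current:
        doc_blocks.append(sorted(current, key=lambda b: b["doc_id"]))
    return doc_blocks


# Invented element blocks: elements are (pre, post, score) triples.
blocks = [
    {"doc_id": 2, "max_score": 0.9, "elements": [(2, 15, 0.9), (10, 8, 0.5)]},
    {"doc_id": 1, "max_score": 1.0, "elements": [(4, 9, 1.0)]},
    {"doc_id": 5, "max_score": 0.8, "elements": [(23, 48, 0.8)]},
    {"doc_id": 3, "max_score": 0.5, "elements": [(7, 2, 0.5)]},
    {"doc_id": 6, "max_score": 0.4, "elements": [(3, 1, 0.4)]},
]

doc_blocks = build_document_blocks(blocks, max_bytes=64)
print([[b["doc_id"] for b in db] for db in doc_blocks])
```

Sorted access by MaxScore then only holds at the granularity of document blocks, which is the price paid for being able to merge-join by DocID inside each block.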

Merging Document Blocks Incrementally

• Sequential access and efficient merge-joins on top of large document blocks

Example query: //sec[about(.//, "XML")]//par[about(.//, "retrieval")]

[Figure: document blocks of the index lists sec["xml"] and par["retrieval"] are read by sequential access (SA) and merge-joined by DocID. The figure is annotated with the values 1.0, 0.8 and 0.6 and with "Max(MaxScore): 0.9".]
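The incremental merging can be sketched as follows: every newly read document block is already sorted by DocID, so it can be merged into the entries read so far and then merge-joined with the other list. This is a simplified illustration with invented data; it assumes one entry per document and list, performs an inner join only, and the names `merge_join` and `incremental_join` are hypothetical.

```python
from heapq import merge


def merge_join(left, right):
    """Join two DocID-sorted lists of (doc_id, score); scores are summed."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            out.append((left[i][0], left[i][1] + right[j][1]))
            i += 1
            j += 1
    return out


def incremental_join(blocks_a, blocks_b):
    """Read one document block per list and step, and re-join after each step."""
    seen_a, seen_b = [], []
    for block_a, block_b in zip(blocks_a, blocks_b):
        # Each block is sorted by DocID, so a linear merge keeps the order.
        seen_a = list(merge(seen_a, block_a))
        seen_b = list(merge(seen_b, block_b))
        yield merge_join(seen_a, seen_b)


# Invented document blocks, each sorted by DocID.
sec_xml_blocks = [[(1, 1.0), (2, 0.9), (5, 0.8)], [(3, 0.7), (6, 0.6)]]
par_retrieval_blocks = [[(2, 0.9), (5, 0.8), (7, 0.7)], [(6, 0.6), (9, 0.5)]]

steps = list(incremental_join(sec_xml_blocks, par_retrieval_blocks))
for step in steps:
    print(step)
```

A non-conjunctive engine would additionally keep the unmatched entries as partial candidates; the inner join is used here only to keep the example short.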

Compressed Number Encoding

• Multi-attribute (=4) double-nested block-index structure
  • Delta encoding only works for DocID (and to some extent for Pre)
  • No specific assumptions on distributions of Pre/Post or Score
• No Unary or Huffman coding (prefix-free, but needs an additional coding table)
• Sophisticated compression schemes may be expensive to decode
  • No Zip, etc.; not even PFor-Delta (needs a second pass for each attribute type)
• But have known number ranges
  • DocID [1, 659,388] -> 3 bytes (254^3 = 16,387,064, lossless)
  • Pre/Post [1, 43,114] -> 2 bytes (254^2 = 64,516, lossless)
  • Score [0,1] -> rounded to 1 byte (256 buckets, lossy)
• Variable-length byte encoding w/ leading length-indicator byte (examples in the figure: 4+1 = 5 bytes and 9+1 = 10 bytes)
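The fixed-width part of the scheme can be sketched with base-254 digits, which matches the capacities quoted on the slide (254^3 and 254^2). The slides do not say why 254 rather than 256 values per byte are used; the sketch simply takes the base as given. The function names and the bucket rule for scores are assumptions for illustration.

```python
BASE = 254  # values per byte, taken from the capacities on the slide


def encode_uint(value, width):
    """Encode a non-negative integer as `width` base-254 digits, big-endian."""
    if not 0 <= value < BASE ** width:
        raise ValueError(f"{value} does not fit into {width} base-{BASE} digits")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    return bytes(reversed(digits))


def decode_uint(data):
    """Inverse of encode_uint."""
    value = 0
    for byte in data:
        value = value * BASE + byte
    return value


def quantize_score(score, buckets=256):
    """Lossy mapping of a score in [0, 1] to a one-byte bucket index."""
    return min(buckets - 1, int(score * buckets))


doc_id_bytes = encode_uint(659_388, 3)  # largest DocID on the slide
pre_bytes = encode_uint(43_114, 2)      # largest Pre/Post on the slide
print(list(doc_id_bytes), list(pre_bytes), quantize_score(0.5))
```

Fixed widths keep decoding to a handful of multiplications per value, in line with the slide's concern that more sophisticated schemes are expensive to decode.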

Some more tricks…

• Dump leading histogram blocks into index list headers
  • Histograms only for index lists that exceed one document block (<5% of all lists)
  • Supports probabilistic pruning and cost-based index access scheduling [IO-Top-K, VLDB '06]
• Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks

[Figure: index list sec["xml"] with a leading score/frequency histogram (labels "36 bytes" and "10"), followed by the document blocks DB1, DB2, …, DBl of 256 KB each.]

Block Access Scheduling [IO-Top-K, VLDB '06]

• SA Scheduling
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of Score Reduction
• RA Scheduling
  • 2-phase probing: schedule RAs "late & last", i.e., cleanup the queue if […]
• Extended probabilistic cost model for integrated SA & RA scheduling

[Figure: inverted block-index (256 KB doc-blocks) with three index lists read by SA and probed by RA. The block scores shown are 1.0, 0.9, 0.8, 0.8; 1.0, 0.9, 0.9, 0.2; and 1.0, 0.9, 0.7, 0.6, with the look-ahead values Δ1,3 = 0.8 and Δ3,3 = 0.2.]
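The SA scheduling idea can be sketched as a tiny knapsack: given a budget of block reads, distribute them over the lists so that the total score reduction is largest. The block scores below are the ones listed for the figure, but their assignment to lists 1 to 3 is an assumption chosen so that Δ1,3 = 0.8 and Δ3,3 = 0.2 come out as shown. The brute-force search and the names `delta` and `schedule` are illustrative only; the real scheduler works on histogram estimates rather than exact block scores.

```python
from itertools import product


def delta(block_scores, b):
    """Score reduction of one list after reading b further blocks."""
    b = min(b, len(block_scores) - 1)
    return block_scores[0] - block_scores[b]


def schedule(lists, budget):
    """Choose how many blocks to read per list (sum = budget) to maximize
    the total score reduction. Exhaustive search, fine for tiny inputs."""
    best_alloc, best_gain = None, -1.0
    for alloc in product(range(budget + 1), repeat=len(lists)):
        if sum(alloc) != budget:
            continue
        gain = sum(delta(scores, b) for scores, b in zip(lists, alloc))
        if gain > best_gain:
            best_alloc, best_gain = alloc, gain
    return best_alloc, best_gain


# Scores at the current and the next three block positions of each list.
lists = [
    [1.0, 0.9, 0.9, 0.2],  # list 1: delta(., 3) = 0.8
    [1.0, 0.9, 0.7, 0.6],  # list 2
    [1.0, 0.9, 0.8, 0.8],  # list 3: delta(., 3) = 0.2
]

allocation, gain = schedule(lists, budget=3)
print(allocation, round(gain, 2))
```

With a budget of three block reads, all of them go to list 1, because only a deep scan of that list lowers its upper bound substantially; an even round-robin would reduce the bounds by just 0.1 per list.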

Object Storage Summary

From 4.38 GB Wikipedia XML sources

Content index (incl. histograms)
• 567,262,445 tag-term pairs
• 20,810,942 distinct tag-term pairs
• 20,815,884 document blocks (<256 KB)
• 456,466,649 element blocks
• 3,729,714,594 total bytes (3.47 GB; 6.57 bytes/tag-term pair on avg.)

Structure index
• 52,561,559 tags (elements)
• 1,107 distinct tags
• 2,323 document blocks (<256 KB)
• 8,999,193 element blocks
• 205,021,938 total bytes (195 MB; 3.9 bytes/tag on avg.)

Efficiency Track Results – Focused, All

566/568 efficiency topics (CO & CAS)

Run       iP[0.0]  iP[0.01]  iP[0.05]  iP[0.10]  MAiP  AVG MS  SUM SEC
CO-15     0.48     0.41      0.28      0.20      0.07   49.79    28.18
CO-150    0.50     0.45      0.37      0.31      0.12   85.96    48.65
CO-1500   0.50     0.46      0.37      0.33      0.14  239.73   135.69
CAS-15    0.46     0.39      0.26      0.19      0.07   90.99    51.50
CAS-150   0.47     0.43      0.35      0.29      0.11  112.32    63.57
CAS-1500  0.48     0.44      0.36      0.31      0.12  253.42   143.43

All experiments: AMD Opteron quad-core 2.6 GHz, 16 GB RAM, RAID 5, Windows Server 2003

Efficiency Track Results – Focused, Type (A)

538/540 type (A) efficiency topics (CO & CAS)

Run       MAiP  AVG MS  SUM SEC
CO-15     0.07   18.88    11.16
CO-150    0.12   49.12    26.43
CO-1500   0.14  191.27   102.90
CAS-15    0.06   48.84    26.28
CAS-150   0.11   61.25    32.95
CAS-1500  0.12  165.53    89.06

Efficiency Track Results – Focused, Type (B)

21/21 type (B) efficiency topics (CO & CAS)

Run       MAiP   AVG MS  SUM SEC
CO-15     0.09   844.67    17.74
CO-150    0.11  1038.90    21.82
CO-1500   0.11  1468.67    30.84
CAS-15    0.09  1044.71    21.94
CAS-150   0.11  1074.66    22.57
CAS-1500  0.11  1479.33    31.07

Efficiency Track Results – Focused, Type (C)

7/7 type (C) efficiency topics (CO & CAS)

Run       MAiP   AVG MS  SUM SEC
CO-15     n/a     41.00     0.29
CO-150    n/a     58.86     0.41
CO-1500   n/a    277.57     1.94
CAS-15    n/a    469.42     3.29
CAS-150   n/a   1150.14     8.05
CAS-1500  n/a   3330.71    23.36

Efficiency Track Results – Thorough, All

566/568 efficiency topics (CO & CAS)

Run     MAP    AVG MS  SUM SEC
CO-15   0.006   70.91    40.13
CAS-15  0.005   89.31    50.55

Note: top-15 only!

Conclusions & Outlook

• Scalable and efficient XML-IR with vague search
• TopX 1.0: our mature system, default engine for INEX topic development & interactive tracks [VLDB-J Special Issue on DB&IR Integration '08]
• Brand-new TopX 2.0 prototype
  • Efficient reimplementation in C++, object-oriented XML storage, moderate compression rates
  • 20 to 30 times better sequential throughput than relational
  • Can do CAS in 0.05 sec avg. & CO in 0.02 sec avg. (classic ad-hoc topics), and CAS in 0.09 sec avg. & CO in 0.05 sec avg. (incl. difficult topics)
• More features
  • Generalized proximity search, graph top-k
  • Updates (gaps within document blocks)
  • XQuery Full-Text (top-k-style bounds over IF, For-Let)
  • …

http://www.inex.otago.ac.nz/efficiency/efficiency.asp
