
Page 1: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Benchmarking Holistic Approaches to XML TPQ Processing

Jiaheng Lu

Renmin University of China

BenchmarX 2010

Page 2: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

BenchmarX'10 Keynote: Jiaheng Lu, Benchmarking Holistic Approaches to TPQ Processing

A little bit of history

Database world
  1970 relational databases
  1990 object-oriented databases
  1995 semi-structured databases

Document world
  1974 SGML (Standard Generalized Markup Language)
  1990 HTML (Hypertext Markup Language)
  1992 URL (Universal Resource Locator)

1996 XML (eXtensible Markup Language)

Page 3: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

What is XML

The eXtensible Markup Language (XML) is the universal format for structured documents and data on the Web.

Advantages of XML:
  Human- and machine-readable format
  More flexible than HTML, not as complicated as SGML
  Unlike relational tables, XML can describe tree- and graph-structured data

Page 4: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

What is XML

Basic Specification: XML 1.0, W3C Recommendation Feb’98

<book year="1967">
  <title>The politics of experience</title>
  <author>
    <firstname>Ronald</firstname>
    <lastname>Laing</lastname>
  </author>
</book>

Page 5: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

XML Tree

An XML document is commonly modeled as a rooted, ordered tree.

  book
    @year: "1967"
    title: "The politics…"
    author
      firstname: "Ronald"
      lastname: "Laing"

"year" is an attribute.

Page 6: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

XML query language

Major standards for querying XML data: XPath and XQuery

“XPath is a language for addressing parts of an XML document” (XPath 1.0, W3C, Nov 1999)
  E.g. paper[title="XML"]/author

“XQuery is an XML query language which provides features for retrieving and interpreting information from XML documents.” (XQuery 1.0, Nov 2005)

Page 7: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

An XQuery example

XQuery:
<results> {
  for $b in doc("bib.xml")/bib//book,
      $t in $b/title,
      $a in $b/author
  return <result> { $t } { $a } </result>
} </results>

Create a flat list of all the title-author pairs for every book in bibliography.

Page 8: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

XML Twig Pattern

XML Twig Pattern Query (TPQ) is a core operation in XPath and XQuery.

Definition of XML twig pattern: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child (P-C) or ancestor-descendant (A-D) relationships.

Page 9: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

An XML twig pattern example

XQuery:
<results> {
  for $b in doc("bib.xml")/bib//book,
      $t in $b/title,
      $a in $b/author
  return <result> { $t } { $a } </result>
} </results>

To answer the XQuery, we first need to match the following XML twig pattern:

  bib
    //book      ($b)
      /title    ($t)
      /author   ($a)

Create a flat list of all the title-author pairs for every book in bibliography.

Page 10: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Research Problem

Given an XML twig pattern Q, and an XML database D, we need to find ALL the matches of Q on D efficiently.

E.g. Consider the following twig pattern and document:

Twig pattern:
  section
    //title
    //figure

An XML tree:
  s1
    t1
    s2
      t2
      p1
        f1

Query solutions:
  (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
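The problem statement can be illustrated with a brute-force matcher. This is only a sketch of the naive baseline, not any algorithm from the talk: the `PARENT` and `TAG` dictionaries are a hypothetical encoding of the example tree, both query edges are assumed to be ancestor-descendant, and the tag name of p1 is a placeholder.

```python
# Hypothetical encoding of the example tree: child -> parent (None for the root).
PARENT = {"s1": None, "t1": "s1", "s2": "s1", "t2": "s2", "p1": "s2", "f1": "p1"}
# Tag of each element; the tag of p1 is not named on the slide, "p" is a placeholder.
TAG = {"s1": "section", "s2": "section", "t1": "title", "t2": "title",
       "p1": "p", "f1": "figure"}


def is_ancestor(anc, node):
    """Walk up from node and report whether anc is met on the way to the root."""
    node = PARENT[node]
    while node is not None:
        if node == anc:
            return True
        node = PARENT[node]
    return False


def elements(tag):
    """All element names carrying the given tag, in insertion order."""
    return [name for name, t in TAG.items() if t == tag]


def match_twig():
    """Enumerate every (section, title, figure) triple and keep the valid ones."""
    return [(s, t, f)
            for s in elements("section")
            for t in elements("title")
            for f in elements("figure")
            if is_ancestor(s, t) and is_ancestor(s, f)]


print(match_twig())
# [('s1', 't1', 'f1'), ('s1', 't2', 'f1'), ('s2', 't2', 'f1')]
```

The triple loop makes the cost of the naive approach visible: it touches every combination of candidate elements and walks the tree for each test, which is what labeling schemes and holistic joins are meant to avoid.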

Page 11: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Why research XML twig pattern match

An XML query includes two parts: value match and twig match.

XPath: paper[title="XML"]/author

Value (content) match: title = "XML"

Twig match (new challenge!):
  paper
    title
    author

Page 12: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Approach Overview

(1) Labeling: Assign each element in the XML document tree an integer label to capture the structural information of documents

(2) Computing: Use labels to answer the twig pattern without traversing the original document

Page 13: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Related work graph

XML TPQ Algorithms

Labeling schemes:
  Containment scheme [SIGMOD'01]
  Dewey scheme [SIGMOD'02]
  Dynamic Dewey scheme [SIGMOD'09]

Computing algorithms:
  Stack-merge [ICDE'02]
  TwigStack [SIGMOD'02]
  XPath-SQL [SIGMOD'02]
  TJFast [VLDB'05]
  Twig2Stack [VLDB'06]
  TreeMatch [TKDE 2010]

Page 14: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Approach Overview

(1) Labeling
Region encoding (also called containment) labeling scheme: (start, end, level)

An example XML tree with region encoding labels:
  s1 (1,12,1)
    t1 (2,3,2)
    s2 (4,11,2)
      t2 (5,6,3)
      p1 (7,10,3)
        f1 (8,9,4)
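As a rough sketch of how such labels can be produced and used, the following assigns (start, end, level) with one depth-first traversal and tests A-D and P-C relationships from the labels alone. The `TREE` dictionary and the function names are illustrative assumptions, not part of the talk.

```python
# Hypothetical encoding of the example tree: node -> children in document order.
TREE = {"s1": ["t1", "s2"], "s2": ["t2", "p1"], "p1": ["f1"]}


def region_labels(tree, root):
    """Assign (start, end, level) labels with a single depth-first traversal."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1              # position where the element opens
        start = counter
        for child in tree.get(node, []):
            visit(child, level + 1)
        counter += 1              # position where the element closes
        labels[node] = (start, counter, level)

    visit(root, 1)
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d iff a's region strictly contains d's region."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a, d):
    """a is the parent of d iff it is an ancestor exactly one level above."""
    return is_ancestor(a, d) and d[2] == a[2] + 1


labels = region_labels(TREE, "s1")
print(labels["s1"], labels["f1"])                 # (1, 12, 1) (8, 9, 4)
print(is_ancestor(labels["s1"], labels["f1"]))    # True
print(is_parent(labels["s1"], labels["f1"]))      # False
```

Both tests are constant-time integer comparisons, which is why the computing step can work on labels without traversing the original document.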

Page 15: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Approach Overview

(1) Labeling
Dewey (also called prefix) labeling scheme: integer sequence

An example XML tree with Dewey labels:
  s1 (ε)
    t1 (0)
    s2 (1)
      t2 (1.0)
      p1 (1.1)
        f1 (1.1.0)
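A minimal sketch of the prefix idea, assuming labels are stored as integer tuples and children are numbered from 0 as in the example above. The `TREE` dictionary and the helper names are hypothetical.

```python
# Hypothetical encoding of the example tree: node -> children in document order.
TREE = {"s1": ["t1", "s2"], "s2": ["t2", "p1"], "p1": ["f1"]}


def dewey_labels(tree, root):
    """Label the root with the empty tuple and each child with parent label + index."""
    labels = {root: ()}
    pending = [root]
    while pending:
        node = pending.pop()
        for index, child in enumerate(tree.get(node, [])):
            labels[child] = labels[node] + (index,)
            pending.append(child)
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d iff a's label is a proper prefix of d's label."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(a, d):
    """a is the parent of d iff dropping d's last component yields a."""
    return len(d) == len(a) + 1 and d[:-1] == a


def show(label):
    """Render a label in dotted form, with ε for the root."""
    return ".".join(str(x) for x in label) or "ε"


labels = dewey_labels(TREE, "s1")
print({name: show(label) for name, label in labels.items()})
print(is_ancestor(labels["s2"], labels["f1"]))   # True: 1 is a prefix of 1.1.0
```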

Page 16: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Approach Overview

(2) Computing
Inverted data lists: each data list contains all labels of elements with the same tag name.

Query:
  s
    //t
    //f

An XML tree:
  s1 (1,12,1)
    t1 (2,3,2)
    s2 (4,11,2)
      t2 (5,6,3)
      p1 (7,10,3)
        f1 (8,9,4)

Data lists:
  s: (1,12,1), (4,11,2)
  t: (2,3,2), (5,6,3)
  f: (8,9,4)
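Building the data lists can be sketched as a simple grouping step. The input format (a list of tag and label pairs) and the function name are assumptions made for illustration; sorting by the start position is used here to obtain document order.

```python
from collections import defaultdict

# Hypothetical input: (tag, region label) pairs in arbitrary order.
ELEMENTS = [
    ("t", (5, 6, 3)), ("s", (4, 11, 2)), ("f", (8, 9, 4)),
    ("p", (7, 10, 3)), ("s", (1, 12, 1)), ("t", (2, 3, 2)),
]


def build_data_lists(elements):
    """Group labels by tag name; sort each list by start, i.e. document order."""
    lists = defaultdict(list)
    for tag, label in elements:
        lists[tag].append(label)
    for labels in lists.values():
        labels.sort()
    return dict(lists)


data_lists = build_data_lists(ELEMENTS)
for tag in ("s", "t", "f"):
    print(tag, data_lists[tag])
```

Only the lists for tags that occur in the query (here s, t and f) need to be read; the list for p is never touched.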

Page 17: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Previous work: TwigStack [1]

(2) Computing
TwigStack [1] is a holistic algorithm for XML twig matching on the containment labeling scheme. TwigStack has two steps:

(1) intermediate path solutions are output to match each query root-to-leaf path; and

(2) these intermediate path solutions are merged to get the final results.

[1] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Page 18: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Running example: TwigStack algorithm

Query:
  s
    //t
    //f

Data streams:
  s: (1,12,1), (4,11,2)
  t: (2,3,2), (5,6,3)
  f: (8,9,4)

State of stacks (elements pushed during the run):
  Ss: (1,12,1), (4,11,2)
  St: (2,3,2), (5,6,3)
  Sf: (8,9,4)

Output path intermediate solutions:
  s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
  s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)

Final results:
  (1,12,1) (2,3,2) (8,9,4)
  (1,12,1) (5,6,3) (8,9,4)
  (4,11,2) (5,6,3) (8,9,4)
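The second step, merging path solutions on the shared branching node, can be sketched as follows. This is a simplified hash join written for illustration only; it is not the merge procedure of the TwigStack paper, and the function name is hypothetical.

```python
# Path solutions of the running example, as (s, t) and (s, f) pairs of labels.
ST_PATHS = [((1, 12, 1), (2, 3, 2)), ((1, 12, 1), (5, 6, 3)), ((4, 11, 2), (5, 6, 3))]
SF_PATHS = [((1, 12, 1), (8, 9, 4)), ((4, 11, 2), (8, 9, 4))]


def merge_path_solutions(st_paths, sf_paths):
    """Join the two path lists on the common query root s."""
    figures_by_root = {}
    for s, f in sf_paths:
        figures_by_root.setdefault(s, []).append(f)
    return [(s, t, f)
            for s, t in st_paths
            for f in figures_by_root.get(s, [])]


for match in merge_path_solutions(ST_PATHS, SF_PATHS):
    print(match)
```

On this input the join yields the three final results listed above. Any path solution whose root finds no partner in the other list is discarded at this point, which is exactly the wasted work that useless intermediate results cause.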

Page 19: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Limitations of TwigStack

(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.

(2) TwigStack cannot process XML twig queries with ordered predicates, such as the “preceding” and “following” axes in XPath.

(3) TwigStack cannot answer queries with wildcards in branching nodes.
  E.g. a wildcard branching node * with branches B and C: the parent of B should be an ancestor of C.

Page 20: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Outline

Introduction
Holistic algorithms:
  TwigStackList (CIKM2005)
  OrderedTJ (DEXA2006)
  iTwigJoin (SIGMOD2005)
  TJFast (VLDB2005)
  Twig2Stack (VLDB2006)
  TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

Page 21: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Inefficiency of TwigStack

TwigStack is inefficient at answering twig queries with parent-child edges.

More than 99% of intermediate results are useless: TwigStack wastes too much time outputting useless intermediate results!

[Figure: bar chart of the number of intermediate paths (useful vs. useless, axis from 0 to 80000) for Q1, Q2 and Q3]

Q1=VP[/DT]//PRP DOLLAR, Q2=S[/JJ]/NP, Q3=S[//VP/IN]//NP in Tree Bank data

Page 22: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Example to illustrate the inefficiency of TwigStack for queries with P-C edge

Twig pattern: rooted at A, with a branch through B to C and a branch to D

[Figure: XML tree with nodes A1, E1, D1, B1, B2, …, Bn-1, Bn and C1, C2, …, Cn-1, Cn]

TwigStack outputs the useless root-to-leaf intermediate path solutions:

(A1, B1, C1), (A1, B2, C1) …… (A1, Bn, Cn)

The reason for the inefficiency of TwigStack: TwigStack assumes that all edges are A-D relationships in the first step and does not consider level information.

Page 23: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Naïve improvement is incorrect

Twig pattern: rooted at A, with a branch through B to C and a branch to D

[Figure: XML tree with nodes A1, E1, D1, B1, B2, …, Bn-1, Bn and C1, C2, …, Cn-1, Cn]

Naïve improvement:

Because A1 is not the parent of D1, by considering level information we do not output the following path solutions:

(A1, B1, C1), (A1, B2, C1) …… (A1, Bn, Cn)

But this naïve approach is NOT correct for some cases!

Page 24: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Problem of naïve approach

The naïve approach may make a wrong decision about whether the current element contributes to final results.

Example:

Twig pattern: A with branches B and C, where C has child D

[Figure: XML tree with nodes A1, B1, C1, C2, …, Cn and D1, D2, …, Dm]

When we read A1, B1, C1 and D1, since C1 is not the parent of D1, according to the naïve approach, we decide that C1 and D1 do not belong to query answers.

But it is wrong!

Page 25: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Our solution: Look-ahead

New technique used in our new algorithm, called TwigStackList: look-ahead

Twig pattern: A with branches B and C, where C has child D

[Figure: XML tree with nodes A1, B1, C1, C2, …, Cn and D1, …, Dm, Dm+1]

When we read A1, B1, C1 and D1, we do not hurriedly decide whether C1 or D1 belongs to final solutions, but buffer C1 to Cn in a main-memory list structure.

Since Cn is the parent of D1, we are sure that (A1, B1, Cn, D1) is a real match.

Why not buffer D1 to Dm? Too many!

Page 26: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Running example: TwigStackList algorithm

Query:
  A
    //B
    //C
      /D

XML tree:
  A1 (1,11,1)
    B1 (2,2,2)
    C1 (3,10,2)
      C2 (4,8,3)
        C3 (5,7,4)
          D1 (6,6,5)
      D2 (9,9,3)

Data streams:
  A: (1,11,1)
  B: (2,2,2)
  C: (3,10,2), (4,8,3), (5,7,4)
  D: (6,6,5), (9,9,3)

Stacks: SA, SB, SC, SD
List LC: (5,7,4)

Output path solutions:
  A//B: (1,11,1) (2,2,2)
  A//C/D: (1,11,1) (5,7,4) (6,6,5); (1,11,1) (3,10,2) (9,9,3)

Page 27: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Features of TwigStackList

Main-memory efficient
  Size of stack and list is no more than |Depth(Tree)|
  TwigStackList can process very large documents with small main-memory cost

I/O efficient
  Each element is scanned once
  For a large query class, TwigStackList guarantees that each output path solution is useful to final answers.

Page 28: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Optimal query classes

If an algorithm does not output any useless intermediate path solution for a query Q on any given document, we say the algorithm is optimal with respect to Q.

If an algorithm has a larger optimal query class, it is better able to control the size of intermediate results.

Page 29: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Optimal query classes

Optimal class of TwigStack: only A-D in all edges

Optimal class of TwigStackList: only A-D in branching edges

[Figure: example twig patterns over nodes A, B, C, D for the two classes]

Page 30: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Outline

Introduction
Holistic algorithms:
  TwigStackList (CIKM2005)
  OrderedTJ (DEXA2006)
  iTwigJoin (SIGMOD2005)
  TJFast (VLDB2005)
  Twig2Stack (VLDB2006)
  TreeMatch (TKDE2010)
Benchmark experiments
Conclusions

Page 31: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Motivation

TwigStack and TwigStackList cannot handle order-based twig queries. XPath and XQuery include ordered axes such as following, preceding, following-sibling and preceding-sibling.

XPath expression: A/B[following-sibling::C]

Twig pattern:
  A
    B  <  C

The symbol "<" shows that B and C are ordered.

Page 32: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Ordered twig query pattern

Ordered XML twig pattern: sibling query nodes should be matched according to their order in the twig query.

Example:

[Figure: ordered twig pattern with nodes A, B, C, D and the order constraint "<"; document tree with nodes A1, B1, C1, D1, D2, D3]

Only D2 and D3 contribute to final results.

Page 33: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

OrderedTJ

OrderedTJ is a new algorithm proposed for evaluating ordered twig query patterns. OrderedTJ, which extends TwigStackList, also uses stack and list data structures.

What is the main modification of OrderedTJ over TwigStackList?

OrderedTJ additionally checks the order conditions of elements before outputting intermediate paths.

Page 34: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

OrderedTJ

Before any element is pushed onto the stack, OrderedTJ checks the order condition.

Query (ordered, B < D):
  A
    /B
      /C
    //D

Data:
  A1 (1,9,1)
    D1 (2,2,2)
    B1 (3,5,2)
      C1 (4,4,3)
    D2 (6,8,2)
      D3 (7,7,3)

Data streams:
  A: (1,9,1)
  B: (3,5,2)
  C: (4,4,3)
  D: (2,2,2), (6,8,2), (7,7,3)

Stacks: SA, SB, SC, SD

Output intermediate path solutions:
  A/B/C: (1,9,1) (3,5,2) (4,4,3)
  A//D: (1,9,1) (6,8,2); (1,9,1) (7,7,3)
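The order condition can be sketched with region labels. The check below assumes that "B before D" means the B element must end before the D element starts; this reading and the function names are assumptions for illustration, not OrderedTJ's actual code.

```python
# Region labels (start, end, level) from the running example.
B1 = (3, 5, 2)
D_STREAM = [(2, 2, 2), (6, 8, 2), (7, 7, 3)]   # D1, D2, D3


def precedes(left, right):
    """True if element `left` ends before element `right` starts."""
    return left[1] < right[0]


def filter_ordered(left, candidates):
    """Keep only the candidates that satisfy the order condition left < candidate."""
    return [c for c in candidates if precedes(left, c)]


print(filter_ordered(B1, D_STREAM))   # [(6, 8, 2), (7, 7, 3)]
```

D1 is dropped because it starts before B1 ends, so only D2 and D3 remain, in line with the example.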

Page 35: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

The optimal query classes of OrderedTJ

OrderedTJ can guarantee optimality for ordered queries with A-D relationships from the second branching edge onward. In other words, OrderedTJ remains optimal for queries with a P-C relationship in the first branching edge.

Q1: unordered twig, A with branches B and C. TwigStackList is non-optimal for Q1.

Q2: ordered twig, A with branches B < C. OrderedTJ is optimal for Q2.

Page 36: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Outline

Introduction
Holistic algorithms:
  TwigStackList (CIKM2005)
  OrderedTJ (DEXA2006)
  iTwigJoin (SIGMOD2005)
  TJFast (VLDB2005)
  Twig2Stack (VLDB2006)
  TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

Page 37: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

iTwigJoin algorithm

TwigStack and OrderedTJ partition data into streams according to their tag names alone.

We propose two new data partition schemes:
  (1) Tag+level scheme
  (2) Prefix path scheme

Potential benefits:
  Enlarge the optimal query classes
  Reduce I/O cost

Page 38: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Data partition scheme

XML tree:
  A1
    B1
      C1
    C2
      C3

Tag partition:
  TA: A1
  TB: B1
  TC: C1, C2, C3

Tag+Level partition (tag partition refined by level):
  T1A: A1
  T2B: B1
  T2C: C2
  T3C: C1, C3

Prefix Path partition (tag+level partition refined by path):
  TA: A1
  TAB: B1
  TAC: C2
  TABC: C1
  TACC: C3
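The three partition schemes differ only in the key used to group elements, which the following sketch makes explicit. The node list, the scheme names passed as strings, and the `partition` function are hypothetical; tags are single letters here, so concatenating them into a prefix-path key is unambiguous.

```python
from collections import defaultdict

# Hypothetical encoding of the example tree in document order: (name, tag, parent).
NODES = [("A1", "A", None), ("B1", "B", "A1"), ("C1", "C", "B1"),
         ("C2", "C", "A1"), ("C3", "C", "C2")]


def partition(nodes, scheme):
    """Group element names into streams keyed by tag, (level, tag) or prefix path."""
    tag = {name: t for name, t, _ in nodes}
    parent = {name: p for name, _, p in nodes}

    def prefix_path(name):
        tags = []
        while name is not None:
            tags.append(tag[name])
            name = parent[name]
        return tuple(reversed(tags))      # root-to-element tag sequence

    streams = defaultdict(list)
    for name, t, _ in nodes:
        path = prefix_path(name)
        if scheme == "tag":
            key = t
        elif scheme == "tag+level":
            key = (len(path), t)          # level of the root is 1
        elif scheme == "prefix-path":
            key = "".join(path)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        streams[key].append(name)
    return dict(streams)


for scheme in ("tag", "tag+level", "prefix-path"):
    print(scheme, partition(NODES, scheme))
```

Each refinement splits existing streams further (3, then 4, then 5 streams on this tree), so a query can skip more elements at the price of managing more lists.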

Page 39: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Property of three schemes

1. the number of inverted lists: increasing (CPU cost increases correspondingly)

2. the optimal query classes: enlarging (output cost decreases correspondingly)

3. the number of elements scanned: decreasing (input cost decreases correspondingly)

Tag scheme → (refined by level) → Tag+level scheme → (refined by path) → Prefix path scheme

Page 40: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

The number of inverted lists : increasing

XML tree:
  A1
    B1
      C1
    C2
      C3

Tag partition (3 lists):
  TA: A1
  TB: B1
  TC: C1, C2, C3

Tag+Level partition (4 lists):
  T1A: A1
  T2B: B1
  T2C: C2
  T3C: C1, C3

Prefix Path partition (5 lists):
  TA: A1
  TAB: B1
  TAC: C2
  TABC: C1
  TACC: C3

Page 41: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

The optimal query classes : enlarging

Optimal class of tag scheme:
  Only A-D in branching edges

Optimal class of tag+level scheme:
  Only A-D in branching edges, and only P-C in all edges

Optimal class of prefix path scheme:
  Only A-D in branching edges, and only P-C in all edges, and only 1-branching

[Figure: example twig patterns over nodes A, B, C, D, E for each class]

Page 42: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

The number of elements scanned: decreasing

Query:
  A
    B
    C

Data:
  D1              (level 1)
    A1            (level 2)
      B1          (level 3)
    C1            (level 2)
      C2          (level 3)

Tag scheme:
  TA: A1
  TB: B1
  TC: C1, C2

Tag+Level scheme:
  T2A: A1
  T3B: B1
  T2C: C1
  T3C: C2

Prefix Path scheme:
  TDA: A1
  TDAB: B1
  TDC: C1
  TDCC: C2

Page 43: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

iTwigJoin algorithm

A general algorithm which can be applied to all three schemes.

For different schemes, iTwigJoin achieves different performance.

The main technical difficulty in designing iTwigJoin is handling many current nodes for one tag name.

We classify the currently visited elements into three categories: current-match, current-useless and current-blocked.

Page 44: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Three kinds of elements

Current-match: the element is guaranteed to contribute to final answers with current elements.

Current-useless: the element is guaranteed not to contribute to final answers with current and remaining elements.

Current-blocked: the element is neither current-match nor current-useless.

State changes of a current-blocked element:
  Current-blocked → Match: matching data appears
  Current-blocked → Useless: cannot get any matching data

Page 45: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Example on three kinds of elements

Query:
  A
    B
    C

Document (by level):
  1: A1
  2: A2, A3, B1, C2
  3: B2, C1

Tag+level scheme:
  T1A: A1
  T2A: A2, A3
  T2B: B1
  T3B: B2
  T2C: C2
  T3C: C1

Current-match: A1, B1, C2
Current-useless: A2
Current-blocked: B2, C1

Page 46: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Example on three kinds of elements

Query:
  A
    B
    C

Document (by level):
  1: A1
  2: A2, A3, B1, C2
  3: B2, C1

Tag+level scheme:
  T1A: A1
  T2A: A2, A3
  T2B: B1
  T3B: B2
  T2C: C2
  T3C: C1

B2 and C1 are converted from current-blocked to current-match due to the appearance of A3.

Page 47: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Main flowchart of iTwigJoin

[Flowchart]

Decisions (each with Y/N branches):
  Are all elements scanned? (if so: end of the algorithm)
  Is there any current-useless element?
  Is there any current-match element?

Actions:
  See whether it contributes to previous match, and advance to the next element
  Output intermediate path solutions, and advance to the next element
  Choose the smallest current-blocked element and output intermediate path solutions, then advance to the next element

Page 48: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Outline

Introduction
Holistic algorithms:
  TwigStackList (CIKM2005)
  OrderedTJ (DEXA2006)
  iTwigJoin (SIGMOD2005)
  TJFast (VLDB2005)
  Twig2Stack (VLDB2006)
  TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

Page 49: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Motivation: new labeling scheme

TwigStackList, OrderedTJ and iTwigJoin are all based on the containment labeling scheme.

Why not try the Dewey labeling scheme for XML twig pattern queries?

Oh, it is really a novel idea!

Page 50: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Original Dewey Labeling Scheme

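In the Dewey labeling scheme, each element is represented by an integer sequence:
  (i) the root is labeled by the empty string ε;
  (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.

For example:

[Figure: tree rooted at s1 (label ε) with nodes s2, t1, t2, f1, f2; the children of the root are labeled 1, 2, 3 and the children of node 2 are labeled 2.1, 2.2]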

Page 51: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Main problem of the original Dewey

If we use the original Dewey labeling scheme to answer the twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.

Extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.

Page 52: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Modular function

We need to know some schema information: DTD (Document Type Definition) or XML Schema.

Given DTD information: book → author, title, chapter*

Our solution: using a modular function, we create a mapping between an element tag and an integer. We define
  X_author mod 3 = 0
  X_title mod 3 = 1
  X_chapter mod 3 = 2
where X_t is the last integer of the label of an element with tag t, and 3 is the number of distinct tags under book.

  book (ε)
    author (0)
    title (1)
    chapter (2)
    chapter (5)

Why is the second chapter not labeled 3, as in the original Dewey? Because 3 mod 3 = 0 would denote author; 5 is the next integer with 5 mod 3 = 2.
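One way to picture the label assignment is the sketch below. It assumes each child receives the smallest integer that is larger than its left sibling's and has the residue reserved for its tag; this rule is inferred from the examples in the talk, and the function and parameter names are hypothetical.

```python
def assign_last_components(child_tags, tag_order):
    """Return the last label component for each child, left to right.

    child_tags: tags of the children in document order.
    tag_order:  distinct child tags from the DTD; the tag at position k
                gets residue k modulo len(tag_order).
    """
    n = len(tag_order)
    residue = {tag: k for k, tag in enumerate(tag_order)}
    components = []
    previous = -1
    for tag in child_tags:
        x = previous + 1
        while x % n != residue[tag]:   # skip integers reserved for other tags
            x += 1
        components.append(x)
        previous = x
    return components


# book → author, title, chapter*
print(assign_last_components(["author", "title", "chapter", "chapter"],
                             ["author", "title", "chapter"]))   # [0, 1, 2, 5]
```

With residues b = 0, c = 1, d = 2, the same function applied to children b, d, c, c gives 0, 2, 4, 7.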

Page 53: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Derive element tag

From a label, we can derive its tag name.

book → author, title, chapter*

Recall that we define:
  X_author mod 3 = 0
  X_title mod 3 = 1
  X_chapter mod 3 = 2

  book (ε)
    0 → author
    1 → title
    2 → chapter
    5 → chapter
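Decoding a tag from the last label component is a single modulo operation, as this small sketch shows. The function name and the list-based encoding of the DTD rule are assumptions for illustration.

```python
def tag_of(last_component, child_tags):
    """Map the last integer of a label back to a tag name.

    child_tags: distinct child tags of the parent, listed so that the tag
                at position k has residue k modulo len(child_tags).
    """
    return child_tags[last_component % len(child_tags)]


BOOK_CHILDREN = ["author", "title", "chapter"]   # book → author, title, chapter*
print([tag_of(x, BOOK_CHILDREN) for x in (0, 1, 2, 5)])
# ['author', 'title', 'chapter', 'chapter']
```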

Page 54: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

More examples for assigning labels

Let us consider a more complicated DTD: a → (b | c)*, d?, c+

We define:
  X_b mod 3 = 0
  X_c mod 3 = 1
  X_d mod 3 = 2

(Why do we use mod 3 instead of 4?)

  a
    b (0)
    d (2)
    c (4)
    c (7)

Page 55: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.

For example:

DTD:
  book → author, title, chapter*
  chapter → (paragraph | section)*
  section → (paragraph | section)*

Document:
[Figure: document tree with book, author, title, chapter, section and paragraph elements]

FST:
  book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
  chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
  section: mod 2 = 0 → paragraph; mod 2 = 1 → section

Question: Given a label 5.1.0, what is the corresponding path?

Page 56: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.

For example:

DTD:
  book → author, title, chapter*
  chapter → (paragraph | section)*
  section → (paragraph | section)*

Document:
[Figure: document tree with book, author, title, chapter, section and paragraph elements]

FST:
  book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
  chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
  section: mod 2 = 0 → paragraph; mod 2 = 1 → section

Following the highlighted path in the FST, we get that 5.1.0 denotes:

  book/chapter/section/paragraph
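The transducer walk can be sketched as a table lookup per label component. The `FST` dictionary below is a hypothetical encoding of the transitions listed above (the tag at position k of each list has residue k), and `derive_path` is an illustrative name.

```python
# Hypothetical encoding of the FST: state -> child tags indexed by residue.
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book", fst=FST):
    """Translate a dotted extended Dewey label into a root-to-element tag path."""
    path = [root]
    if label:                                  # the empty label denotes the root
        for part in label.split("."):
            children = fst[path[-1]]           # transitions of the current state
            path.append(children[int(part) % len(children)])
    return "/".join(path)


print(derive_path("5.1.0"))   # book/chapter/section/paragraph
```

Each component is decoded relative to the tag reached so far, which is why the modulus changes from 3 under book to 2 under chapter and section.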

Page 57: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

Two properties of extended Dewey

Find Ancestor Label
  From the label of any element, we can derive the labels of all its ancestors.

Find Ancestor Name
  From the label of any element, we can derive the tag names of all its ancestors.

These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
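The first property follows directly from the prefix structure of the labels, as in this sketch. Labels are assumed to be dotted strings with ε (or the empty string) for the root; the function name is hypothetical.

```python
def ancestor_labels(label):
    """Return the labels of all ancestors, nearest first, ending with the root ε."""
    if label in ("", "ε"):
        return []                              # the root has no ancestors
    parts = label.split(".")
    prefixes = [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]
    return prefixes + ["ε"]


print(ancestor_labels("5.1.0"))   # ['5.1', '5', 'ε']
```

Combining each prefix with the tag derivation gives the second property: the tag name of every ancestor is recoverable without reading any other label.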

Page 58: Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.

A new algorithm: TJFast

For each leaf node n in the query, there exists a corresponding input stream Tn.

Tn contains the extended Dewey labels of elements of tag n. Those labels are arranged in document order.

For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may be involved in query answers. (What is the difference compared to TwigStackList?)

At any point during the computation, the size of set Sb is bounded by the depth of the XML document.


An example for TJFast algorithm

Query: root A with two branches, A//D and A/B//C

Document (element: extended Dewey label):

a1: 0
    a2: 0.0
        d1: 0.0.1
    a3: 0.3
        d2: 0.3.1
        b1: 0.3.2
            c1: 0.3.2.1
    b2: 0.5
        d3: 0.5.0
            c2: 0.5.0.0

DTD:

a -> a*, d*, b*
b -> d*, c*
d -> c*

TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0

Set for the branching node A: { }

Why are there only two streams?


An example for TJFast algorithm

By the finite state transducer of the extended Dewey labeling scheme, the first label of each stream is decoded into its full path:

0.0.1 derives a1/a2/d1
0.3.2.1 derives a1/a3/b1/c1

TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0

Set for the branching node A: { }
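The derivation on this slide can be reproduced with a small sketch. It assumes the example DTD (a -> a*, d*, b*; b -> d*, c*; d -> c*) with children numbered in that order, and that the first label component stands for the root element a1; CHILDREN and derive_tags are hypothetical names.

```python
# Child tag lists from the example DTD (assumed order).
CHILDREN = {"a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}


def derive_tags(label, root_tag="a"):
    """Decode an extended Dewey label into the tag names along its path."""
    components = [int(c) for c in label.split(".")]
    tags = [root_tag]  # the first component is the root element itself
    for component in components[1:]:
        children = CHILDREN[tags[-1]]
        tags.append(children[component % len(children)])
    return tags


for label in ["0.0.1", "0.3.2.1", "0.5.0.0"]:
    print(label, "/".join(derive_tags(label)))
```

Only tag names are recovered this way; the individual elements (a1, a2, d1, and so on) are identified by the label prefixes 0, 0.0, 0.0.1.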


An example for TJFast algorithm

Both a1 and a3 are possibly involved in query answers. (Why not a2?)

Set for the branching node A: { }


An example for TJFast algorithm

Then we insert a1 and a3 into the set.

Set for the branching node A: {a1, a3}

Output path solutions:

A//D: (a1, d1)
A/B//C: (a3, b1, c1)


An example for TJFast algorithm

Move the cursor of TD from d1 to d2.

Set for the branching node A: {a1, a3}

Output path solutions:

A//D: (a1, d1), (a1, d2), (a3, d2)
A/B//C: (a3, b1, c1)


An example for TJFast algorithm

Move the cursor of stream TD from d2 to d3.

Set for the branching node A: {a1, a3}

Output path solutions:

A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1)


An example for TJFast algorithm

Move the cursor of stream TC from c1 to c2.

Set for the branching node A: {a1, a3}

Output path solutions:

A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1), (a1, b2, c2)


Sort and merge-join in TJFast

Phase 1. Intermediate paths

A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>

Phase 2. Final solutions (join of the intermediate paths on A)

<A, D, B, C>: <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
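The second phase can be sketched as a sort followed by a merge-join on the branching element. The code below is a simplified illustration with hypothetical names (group_by_root, merge_join); it sorts on element names, whereas an actual implementation would presumably sort on labels.

```python
from itertools import groupby


def group_by_root(paths):
    """Sort path solutions on the branching element (position 0) and group them."""
    paths = sorted(paths, key=lambda p: p[0])
    return [(k, list(g)) for k, g in groupby(paths, key=lambda p: p[0])]


def merge_join(left, right):
    """Merge-join two lists of path solutions that share their first element."""
    lg, rg = group_by_root(left), group_by_root(right)
    i = j = 0
    out = []
    while i < len(lg) and j < len(rg):
        lk, lpaths = lg[i]
        rk, rpaths = rg[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Same branching element: combine every pair of paths.
            out.extend(lp + rp[1:] for lp in lpaths for rp in rpaths)
            i += 1
            j += 1
    return out


a_d = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
a_b_c = [("a1", "b2", "c2"), ("a3", "b1", "c1")]
for solution in merge_join(a_d, a_b_c):
    print(solution)
```

On the intermediate paths of this slide, the join yields the four final <A, D, B, C> solutions listed above.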


TJFast+L

By applying the extended Dewey labeling scheme to the tag+level streaming scheme, we obtain the TJFast+L algorithm, an extension of TJFast.

Two benefits of TJFast+L over TJFast:
    reduces I/O cost by reading fewer elements
    enlarges the optimal query classes

Q: Why not apply extended Dewey to the prefix-path scheme?

A: Because the finite state transducer already gives us the path information…


Optimal query classes

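[Figure: two example twig queries over nodes A, B, C and D, illustrating the optimal class of TJFast and the optimal class of TJFast+L. Captions in the figure: "Only A-D in branching edges" and "Only P-C in all edges".]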


Outline

Introduction

Holistic algorithms:
    TwigStackList (CIKM 2004)
    OrderedTJ (DEXA 2005)
    iTwigJoin (SIGMOD 2005)
    TJFast (VLDB 2005)
    Twig2Stack (VLDB 2006)
    TreeMatch (TKDE 2010)

Benchmark experiments

Conclusions and future work


State-of-the-art: XML Query Processing

[Figure: holistic approaches by query type. Path queries: PathStack [Bruno et al.]. Tree queries: TwigStack [Bruno et al.]. Generalized Tree Pattern (GTP) queries: "?", answered by Twig2Stack.]


Processing Generalized Tree Pattern (GTP) Queries

XQuery:

    for $b in //A[E]//B, $d in $b/D
    let $c := $b/C
    return $b, $c, $d

[Figure: the GTP for this query over nodes A, B, C, D and E. Legend: mandatory axis, optional axis, return node, group return node.]


Motivation: PathStack [Bruno et al.]

Query: //A//B

[Figure: a document with elements a1, a2, b1 and b2, and the stacks S[A] and S[B], where the entries b1 and b2 in S[B] each point to a2 in S[A].]

Key observation: minimize intermediate results through a compact representation of path matches:
    Inter-node: record the A-D relationship between elements of different query nodes, e.g., b1→a2, b2→a2
    Intra-node: record the A-D relationship between elements of the same query node, e.g., b1, b2

TwigStack [Bruno et al.] minimizes intermediate results by outputting only those path matches that appear in final twig results.
    However, such optimality cannot be guaranteed [Choi et al.]
    It does not help with processing GTP queries

Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
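The stack encoding can be illustrated with a simplified sketch for the query //A//B. It is not the published PathStack code: the function path_stack_ab is a hypothetical name, elements are given as (name, start, end) region codes, and the intervals in the usage example are invented so that a1 contains a2, which contains b1, which contains b2.

```python
def path_stack_ab(a_elems, b_elems):
    """Answer //A//B over elements given as (name, start, end) region codes."""
    events = sorted(
        [(s, e, n, "A") for n, s, e in a_elems]
        + [(s, e, n, "B") for n, s, e in b_elems]
    )
    stack_a = []  # (name, end); each entry is an ancestor of the one above it
    stack_b = []  # (name, end, index of the top of stack_a when pushed)
    results = []
    for start, end, name, tag in events:
        # Pop elements that closed before the current element opens.
        while stack_a and stack_a[-1][1] < start:
            stack_a.pop()
        while stack_b and stack_b[-1][1] < start:
            stack_b.pop()
        if tag == "A":
            stack_a.append((name, end))
        elif stack_a:
            top = len(stack_a) - 1
            stack_b.append((name, end, top))  # inter-node pointer, e.g. b1 -> a2
            # B is the leaf: every entry of stack_a up to the pointer is an ancestor.
            results.extend((stack_a[i][0], name) for i in range(top + 1))
    return results


a_elems = [("a1", 1, 8), ("a2", 2, 7)]  # hypothetical intervals
b_elems = [("b1", 3, 6), ("b2", 4, 5)]
print(path_stack_ab(a_elems, b_elems))
```

A single pointer from b1 to a2 stands for both matches (a1, b1) and (a2, b1), because everything below a2 on the stack is also an ancestor. This is the compactness the slide refers to.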


Hierarchical Stack Encoding

Inter-node (//A//B): can still use explicit edges

Intra-node (A): matching elements form a tree structure as well

Associate each query node with a hierarchical stack
    Push element e into the hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
    Matching can be determined once the entire subtree of e has been seen
    Requires a post-order document traversal

[Figure: a document with elements a1, a2, a3 and a4, and the corresponding hierarchical stack HS[A].]


Twig2Stack: Running Example

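[Figure: a twig query over nodes A, B, C and D; a document with elements a1, a2, b1, b2, b3, c1, c2, d1, d2 and d3, labeled with region codes of the form [start, end], level (for example a1 = [1,20], 1 and d2 = [6,7], 6; the other labels shown are [2,15], 2; [3,14], 3; [4,11], 4; [5,10], 5; [8,9], 6; [12,13], 4; [16,19], 2; [17,18], 3); and the hierarchical stacks HS[A], HS[B], HS[C] and HS[D] after merging stacks.]

TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.

Twig2Stack requires neither path joins nor path enumeration!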
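The region codes in the running example support constant-time structural checks. A minimal sketch, assuming the usual containment semantics for ([start, end], level) labels; the Region type and both helper names are hypothetical, and only the numeric labels are taken from the slide.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's interval."""
    return a.start < d.start and d.end < a.end


def is_parent(p, c):
    """A parent is an ancestor exactly one level above."""
    return is_ancestor(p, c) and c.level == p.level + 1


a1 = Region(1, 20, 1)
d2 = Region(6, 7, 6)
print(is_ancestor(a1, d2), is_parent(a1, d2))  # True False
```

These two checks are what a stack-based algorithm evaluates when it decides whether an element can be linked to an entry of another query node's stack.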


Not yet done: Memory Usage

Hierarchical stack encoding could hold the entire document in memory in the worst case
    Unlike the DOM approach, only matches need to be stored
        Tag match
        (Partial) twig match
        Predicate evaluation

Early result enumeration dramatically reduces the memory usage
    Enumerate query results before the end of the document and release the buffer
    Main idea: a hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches


TreeMatch (TKDE 2010)

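Twig pattern: A with child query nodes B and C

[Figure: an XML tree with elements A1, A2, B1, B2, C1 and C2, and the matching cross formed between B1, B2 and C1, C2.]

Matching cross: it is the real reason for sub-optimality!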


Bounded and Unbounded Matching Cross

Twig pattern: A with child query nodes B and C

[Figure: XML trees with elements A1 … An, B1 … B2n and C1 … C2n, illustrating an unbounded matching cross (B1 … B2n against C1 … C2n) and a bounded matching cross (A1 … An against C1 … C2n).]


BMC and UMC

Bounded Matching Cross (BMC): optimal class
    Only a limited number of nodes need to be stored in main memory

Unbounded Matching Cross (UMC): sub-optimal class, but not entirely
    There is no guarantee that only a limited number of nodes are stored in main memory, but a sub-class of UMC is still optimal


Unbounded Matching Cross with Mediator

Twig pattern: A with child query nodes B and C (output: node C)

[Figure: an XML tree with elements A1 … An, B1 … B2n and C1 … Cn, and the unbounded matching cross between B1 … Bn+1 and C1 … Cn.]

Node A is a mediator node, so we do not need to store all Bi in main memory!


Optimal query classes

[Figure: nested optimal query classes, each with an example twig query over nodes A, B, C and D.]

Optimal class of TwigStack: only A-D in all edges
Optimal class of TwigStackList: only A-D in branching edges
Optimal class of TreeMatch: only A-D in non-output branching edges
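The three class definitions can be turned into a small membership check. The sketch below rests on stated assumptions rather than on the papers' formal definitions: a "branching edge" is taken to be an edge leaving a query node with more than one child, and a "non-output branching edge" a branching edge whose child is not an output node. QNode, edges and optimal_for are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    tag: str
    axis: str = "//"      # edge from the parent: "/" (P-C) or "//" (A-D)
    output: bool = False  # True if this query node is an output node
    children: list = field(default_factory=list)


def edges(node):
    """Yield (parent, child) for every edge of the twig."""
    for child in node.children:
        yield node, child
        yield from edges(child)


def optimal_for(root):
    """Report which optimal classes (as read from the slide) contain the query."""
    all_edges = list(edges(root))
    branching = [(p, c) for p, c in all_edges if len(p.children) > 1]
    non_output_branching = [(p, c) for p, c in branching if not c.output]

    def only_ad(es):
        return all(c.axis == "//" for _, c in es)

    return {
        "TwigStack": only_ad(all_edges),
        "TwigStackList": only_ad(branching),
        "TreeMatch": only_ad(non_output_branching),
    }


# A[.//B]/C with C as the output node
q = QNode("A", children=[QNode("B", "//"), QNode("C", "/", output=True)])
print(optimal_for(q))
```

Under these assumptions the example query falls outside the TwigStack and TwigStackList classes because of the P-C edge A/C, but inside the TreeMatch class because that edge leads to the output node.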


Experiment Setup

Implementation (seven algorithms):
    TwigStack (SIGMOD 2002)
    TwigStackList (CIKM 2004)
    OrderedTJ (DEXA 2005)
    iTwigJoin (SIGMOD 2005)
    TJFast (VLDB 2005)
    Twig2Stack (VLDB 2006)
    TreeMatch (TKDE 2010)

Datasets: XMark, DBLP, TreeBank

Metrics: query processing time, I/O time


Experiments

Benchmarks:
    XMark: synthetic data
    DBLP: real data from the DBLP database
    Treebank: real data from the Wall Street Journal

                   XMark     DBLP      Treebank
Data size (MB)     582       130       82
Nodes (million)    8         3.3       2.4
Max/Avg depth      12/5      6/2.9     36/7.8


Tested queries

Some tested queries:

      Source      Twig query
Q1    DBLP        //proceedings//title[.//i]//sup
Q2    DBLP        //article[.//sup]//title//sub
Q3    Treebank    /S[.//VP/IN]//NP
Q4    Treebank    /S/VP/PP[IN]/NP/VBN
Q5    Treebank    //VP[DT]//PRP_DOLLAR_


Tested queries (Cont.)

Q1, Q2 and Q3 are based on XMark data; Q4, Q5 and Q6 are based on TreeBank data.


TwigStackList vs. TwigStack

Experiment data: TreeBank

Compared to TwigStack, TwigStackList significantly reduces the amount of useless intermediate output.

[Figure: bar chart of the number of intermediate paths (0 to 80000) for Q1, Q2 and Q3; series: Useful, TwigStack, TwigStackList.]

Q1 = VP[/DT]//PRP_DOLLAR_, Q2 = S[/JJ]/NP, Q3 = S[//VP/IN]//NP


TwigStackList vs. OrderedTJ

STW: Straightforward-TwigStack
STWL: Straightforward-TwigStackList

[Figure: execution time (s) of STW, STWL and OrderedTJ for queries Q1, Q2 and Q3 on XMark.]

[Figure: execution time (seconds) of STW, STWL and OrderedTJ for queries Q4, Q5 and Q6 on TreeBank.]

OrderedTJ is significantly better than the two straightforward methods on both XMark and TreeBank data.


iTwigJoin

The decrease in the number of elements scanned

[Figure: bytes scanned (M) under the tag, tag+level and prefix-path streaming schemes for queries Q1, Q2 and Q3 on XMark data.]

[Figure: bytes scanned (M) under the tag, tag+level and prefix-path streaming schemes for queries Q4, Q5 and Q6 on Treebank data.]

More refined schemes scan fewer elements to answer a query.


iTwigJoin

Performance of queries for three streaming schemes

[Figure: execution time under the tag, tag+level and prefix-path schemes for queries Q1, Q2 and Q3 on XMark.]

[Figure: execution time (s) under the tag, tag+level and prefix-path schemes for queries Q4, Q5 and Q6 on TreeBank.]

The prefix-path scheme is suitable for large but shallow documents, and the tag+level scheme generally works well even for complicated recursive documents.


TwigStackList vs. iTwigJoin

Observation: iTwigJoin scans far fewer elements than TwigStack and TwigStackList in two twig queries.

[Figure: number of elements read (0 to 1200000) by TwigStack, TwigStackList and iTwigJoin for queries Q3, Q4 and Q5 on TreeBank data.]


TwigStackList vs. iTwigJoin

[Figure: execution time (seconds) of TwigStack, TwigStackList and iTwigJoin for queries Q3, Q4 and Q5 on TreeBank data.]

[Figure: execution time (seconds) of TwigStack, TwigStackList and iTwigJoin for queries Q1 and Q2 on DBLP data.]

Observation: iTwigJoin performs much better than TwigStack and TwigStackList.

Explanation: iTwigJoin reduces I/O cost by reading fewer elements.


iTwigJoin, TJFast, Twig2Stack

[Figure: two bar charts of execution time (s) for iTwigJoin, TJFast and Twig2Stack, one over two queries and one over three queries; panel titles: TreeBank data, DBLP data.]

Observation: iTwigJoin and TJFast perform better than Twig2Stack.

Reason: iTwigJoin and TJFast reduce I/O cost by reading fewer elements.


Experiments: TJFast+L and iTwigJoin

Observation: both algorithms are based on the tag+level scheme. TJFast+L performs much better than iTwigJoin on the tag+level scheme.

Explanation: TJFast+L reduces I/O cost by reading fewer elements.

[Figure: execution time (seconds) of iTwigJoin and TJFast+L for queries Q1 and Q2 on DBLP data.]

[Figure: execution time (seconds) of iTwigJoin and TJFast+L for queries Q3, Q4 and Q5 on TreeBank data.]


TJFast and TreeMatch

Observation: TreeMatch performs much better than TJFast.

Explanation: TreeMatch reduces I/O cost compared to TJFast.

[Figure: execution time (seconds) of TJFast and TreeMatch for queries Q1 and Q2 on DBLP data.]

[Figure: execution time (seconds) of TJFast and TreeMatch for queries Q3, Q4 and Q5 on TreeBank data.]


Conclusions

Efficient processing of twig queries is a core operation in XPath and XQuery

We reviewed and compared seven holistic algorithms:
    TwigStack (SIGMOD 2002)
    TwigStackList (CIKM 2004)
    OrderedTJ (DEXA 2005)
    iTwigJoin (SIGMOD 2005)
    TJFast (VLDB 2005)
    Twig2Stack (VLDB 2006)
    TreeMatch (TKDE 2010)

Comprehensive benchmark experiments show the correctness and efficiency of holistic algorithms


Conclusions (Cont.)

In holistic TPQ processing, I/O cost takes most of the time

TJFast reduces input data size

Twig2Stack reduces output size

TreeMatch reduces both input and output data size


Reference works

[1] J. Lu, T. W. Ling, Z. Bao, and C. Wang. Extended XML tree pattern matching: theories and algorithms. IEEE TKDE, 2010 (to appear).
    Proposes the TreeMatch algorithm

[2] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
    Proposes the TwigStackList algorithm

[3] J. Lu and T. W. Ling. Labeling and querying dynamic XML trees. In Proceedings of the Sixth Asia Pacific Web Conference, 2004, pp. 180–189.
    Proposes a new labeling scheme for dynamic XML documents

[4] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005.
    Proposes two new data streaming techniques

[5] J. Lu, T. W. Ling, C. Chan, and T. Chen. From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In Proceedings of VLDB, 2005, pp. 193–204.
    Proposes the TJFast algorithm


Reference works (Cont.)

[6] J. Lu, T. W. Ling, T. Yu, C. Li, and W. Ni. Efficient processing of ordered XML twig pattern matching. In Proceedings of DEXA, 2005, pp. 300–309.
    Proposes the OrderedTJ algorithm

[7] J. Lu, T. W. Ling, and T. Chen. TJFast: effective processing of XML twig pattern matching. In Proceedings of WWW, 2005, pp. 1118–1119.
    Proposes the extended Dewey labeling scheme

[8] T. Yu, T. W. Ling, and J. Lu. TwigStackListNot: a holistic twig join algorithm for twig query with not-predicates on XML data. In DASFAA, pp. 249–263.
    Proposes an algorithm for twig queries with NOT predicates

[9] J. Lu, R. Yang, W. Ling, and A. K. H. Tung. Efficient XML tree pattern matching: theory and algorithm. Submitted to IEEE TKDE.
    Proposes a theory and algorithm for extended XML tree patterns


Reference works (Cont.)

[10] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: a primitive for efficient XML query pattern matching. In ICDE, 2002, pp. 141–152.
    Proposes the StackTree algorithm

[11] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
    Proposes the TwigStack algorithm

[12] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2001, pp. 425–436.
    Proposes the containment labeling scheme


Reference works (Cont.)

[13] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. In VLDB, 2003.
    Proposes the TSGeneric algorithm

[14] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002, pp. 204–215.
    Proposes the Dewey labeling scheme

[15] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In SIGMOD, 2003.
    Proposes the ViST system

[16] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins. In CIKM, pages 523-532, 2004.
    Proposes the virtual cursor algorithm