
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods

Attila Barta Mariano P. Consens Alberto O. Mendelzon

University of Toronto


VLDB 2005 Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods 2

Motivation

• Growing importance of XML query processing
• Plethora of implementations:
  • native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
  • XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
  • XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
  • publish/subscribe systems (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
  • twig query processors (e.g. TwigStack, PRIX, TurboXPath)
• Our contribution: apply novel cost-based optimization techniques that exploit path summaries to XML query processing


Example XQuery and Pattern Tree

for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
where $x/part_no = $y/part_no
  and $z/supplier_no = $x/supplier_no
  and $z/city = "Toronto"
  and $z/province = "Ontario"
return <result> {$x/part_no} {$x/price} {$y/description} </result>

[Figure: the corresponding Pattern Tree (PT), or Twig Query]


Example XQuery Processing

for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
where $x/part_no = $y/part_no
  and $z/supplier_no = $x/supplier_no
  and $z/city = "Toronto"
  and $z/province = "Ontario"
return <result> {$x/part_no} {$x/price} {$y/description} </result>

[Figure: processing of the query, with the joins $x = $y and $z = $x]


Contributions

• Holistic Path Summary Pruning
• Access Order Selection
• Path Summaries as Catalogs


Outline

• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
  • Experimental Evaluation
• Access Order Selection
  • Experimental Evaluation
• Future Work


ToXin Path Summary

• For each distinct path in the document there is a path in ToXin
  - it is an exact path summary: it reflects the structure of the document [RM01]
• Initially proposed as a back-end
  - can answer any pattern query

<suppliers>
  <supplier>
    <supplier_no> 1001 </supplier_no>
    <name> Magna </name>
    <city> Toronto </city>
    <province> ON </province>
  </supplier>
  <supplier>
    <supplier_no> 1002 </supplier_no>
    <name> MEC </name>
    <city> Vancouver </city>
    <province> BC </province>
  </supplier>
</suppliers>

[Figure: the ToXin tree (TT) and the node instances (TI) for this document]
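As a rough illustration of what an exact path summary records, the sketch below collects every distinct root-to-element label path of the suppliers document in a single pass and keeps the element instances reached by each path. It is not the ToXin implementation; the function name `build_path_summary` and the use of a plain dictionary are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SUPPLIERS_XML = """<suppliers>
  <supplier><supplier_no> 1001 </supplier_no><name> Magna </name>
            <city> Toronto </city><province> ON </province></supplier>
  <supplier><supplier_no> 1002 </supplier_no><name> MEC </name>
            <city> Vancouver </city><province> BC </province></supplier>
</suppliers>"""

def build_path_summary(xml_text):
    """Map each distinct label path (a TT node) to its element instances (TI)."""
    summary = defaultdict(list)

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        summary[path].append(elem)      # one more instance of this path
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")  # single pass over the document
    return dict(summary)

summary = build_path_summary(SUPPLIERS_XML)
for path, instances in summary.items():
    print(path, len(instances))
```

The document has twelve elements but only six distinct paths, which is why the summary stays small even when the document grows.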


Augmented ToXin Trees

• System catalog: schema + data statistics
• DTD and XML Schema are used for validation; they do not describe the actual schema of the instances
• ToXin is an exact path summary, so it gives the actual schema
• ToXin augmented with statistics gives the system catalog
• ToXop statistics:
  • NCARD: number of instances of an element
  • ICARD: number of distinct values of an element
  • Fan-out: average number of sub-element instances for each sub-element
• Augmented ToXin Tree: existing schema (TT) + statistics + node instances (TI)
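The three statistics can be gathered in the same pass that builds the path summary. The sketch below is one possible reading, not the ToXop code: it keeps the statistics per label path, and it interprets fan-out as NCARD of a path divided by NCARD of its parent path. The sample document (with a third, made-up supplier) and all function and key names are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical variant of the suppliers document with a repeated city value.
SAMPLE_XML = """<suppliers>
  <supplier><supplier_no>1001</supplier_no><city>Toronto</city></supplier>
  <supplier><supplier_no>1002</supplier_no><city>Vancouver</city></supplier>
  <supplier><supplier_no>1003</supplier_no><city>Toronto</city></supplier>
</suppliers>"""

def collect_statistics(xml_text):
    """Return NCARD, ICARD and fan-out for every distinct label path."""
    ncard = defaultdict(int)    # path -> number of element instances
    values = defaultdict(set)   # path -> distinct text values seen
    parent_of = {}              # path -> parent path

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        ncard[path] += 1
        text = (elem.text or "").strip()
        if text:
            values[path].add(text)
        if prefix:
            parent_of[path] = prefix
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    stats = {}
    for path, n in ncard.items():
        parent = parent_of.get(path)
        stats[path] = {
            "NCARD": n,
            "ICARD": len(values[path]),
            # Assumed reading of fan-out: instances per parent instance.
            "fan_out": n / ncard[parent] if parent else None,
        }
    return stats

stats = collect_statistics(SAMPLE_XML)
print(stats["/suppliers/supplier/city"])
```

With these numbers an optimizer can estimate, for instance, that an equality predicate on city matches about NCARD / ICARD = 1.5 elements under a uniformity assumption.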


Outline

• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
  • Experimental Evaluation
• Access Order Selection
  • Experimental Evaluation
• Future Work


Holistic Path Summary Pruning

• All path summary based query processors perform some path summary pruning that is specific to the processor
• Idea: separate path pruning from the processor and the encoding
• Holistic Path Summary Pruning (HPSP):
  • evaluate the pattern tree on the actual schema (TT tree)
  • compute the twig query using an appropriate algorithm for the particular element encoding
• TwigStackScan is one possible HPSP-based access method

[Figure: TT and TI]
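The first HPSP step, evaluating the pattern on the TT tree, can be pictured with a deliberately simplified case: a linear path pattern (only `/` and `//` steps, no branches or predicates) matched against the label paths of the summary. The sketch below translates the pattern into a regular expression; this is a stand-in technique chosen for brevity, not the algorithm from the talk, and the TT node ids and paths are made up.

```python
import re

def pattern_to_regex(pattern):
    """Translate a linear path pattern with '/' and '//' into a regex over label paths."""
    regex = ""
    for axis, label in re.findall(r"(//|/)([^/]+)", pattern):
        # '//' may skip any number of intermediate labels, '/' skips none.
        regex += r"(?:/[^/]+)*/" if axis == "//" else "/"
        regex += re.escape(label)
    return re.compile(regex + r"\Z")

def prune(tt_nodes, pattern):
    """Return the ids of the TT nodes whose label path matches the pattern."""
    rx = pattern_to_regex(pattern)
    return {node_id for path, node_id in tt_nodes.items() if rx.match(path)}

# Hypothetical fragment of a path summary: label path -> TT node id.
TT = {
    "/site/people/person/name": 4,
    "/site/regions/samerica/item/name": 7,
    "/site/categories/category/name": 9,
}

print(prune(TT, "//site/people/person/name"))
print(prune(TT, "//name"))
```

Only the surviving TT node ids are handed to the second step, so the twig algorithm never looks at name elements that sit under item or category.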


Stack Algorithms

• Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02]
• Use region algebra encoding:
  • T_element: [DocID, Term, StartPos, EndPos, LevelNum] - elements
  • T_text: [DocID, Term, TextValue, StartPos, LevelNum] - string values
• Build a stream (denoted T) for all elements having the same label, e.g. T_author encompasses all author elements from the document
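To make the region encoding concrete, the sketch below numbers the elements of a small document and groups them into one stream per label. It is a simplification under stated assumptions: positions are counted per element open and close only (text is not assigned positions), and the names `Region`, `encode`, `is_ancestor` and `is_parent` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, namedtuple

Region = namedtuple("Region", "doc_id term start end level")

def encode(xml_text, doc_id=1):
    """Assign (StartPos, EndPos, LevelNum) to every element; return one stream per label."""
    streams = defaultdict(list)
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter                  # position of the opening tag
        for child in elem:
            visit(child, level + 1)
        counter += 1                     # position of the closing tag
        streams[elem.tag].append(Region(doc_id, elem.tag, start, counter, level))

    visit(ET.fromstring(xml_text), 1)
    for stream in streams.values():
        stream.sort(key=lambda r: r.start)   # streams are kept in document order
    return dict(streams)

def is_ancestor(a, d):
    """a contains d when d's region lies strictly inside a's region."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end

def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1

streams = encode("<a><b><c/></b><b/></a>")
print(streams["b"])
```

Structural relationships reduce to integer comparisons on these tuples, which is what lets the stack algorithms process the streams in one sequential scan.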


TwigStackScan Access Method

Extended region algebra encoding:
• T_element: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] - elements
• T_text: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] - string values

TwigStackScan = HPSP + TwigStack
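The extra TTnodeID field is what connects the two halves of the equation above: once pruning has selected a set of TT nodes, each label stream can be reduced to the entries that carry one of those ids before the twig algorithm runs. The sketch below shows only that filtering step with made-up tuples; the TwigStack join itself is not reproduced, and `Element` and `prune_stream` are hypothetical names.

```python
from collections import namedtuple

# Extended element tuple: the region encoding plus the TT node it belongs to.
Element = namedtuple("Element", "doc_id term start end level tt_node_id")

def prune_stream(stream, allowed_tt_nodes):
    """Keep only the stream entries whose TT node survived path summary pruning."""
    return [e for e in stream if e.tt_node_id in allowed_tt_nodes]

# Hypothetical T_name stream: name elements under item (TT node 7) and person (TT node 4).
t_name = [
    Element(1, "name", 10, 11, 5, 7),
    Element(1, "name", 40, 41, 4, 4),
    Element(1, "name", 52, 53, 4, 4),
]

pruned = prune_stream(t_name, {4})
print(pruned)
```

The shorter the pruned streams are relative to the full ones, the larger the expected benefit over running the twig algorithm on unpruned streams.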


Experimental Datasets

Dataset Name | Size (MB) | # of Elements | # of Attributes | # of Text | Total # of Nodes | # of TT nodes | Max-depth
DBLP | 130.726 | 3,332,130 | 404,276 | 3,005,848 | 6,742,254 | 224 | 6
SWISSPROT | 112.130 | 2,977,031 | 2,189,859 | 2,013,844 | 7,180,734 | 303 | 5
XMARK (1.9) | 112.486 | 2,769,710 | 726,783 | 1,478,252 | 4,974,745 | 358 | 10

• DBLP, SWISSPROT: University of Washington XML Repository
  • Both are large (millions of nodes) and shallow
  • DBLP: regular in structure (5 structures that repeat)
  • SWISSPROT: irregular in structure (many one-of-a-kind structures)
• XMARK:
  • simulates an on-line auction site
  • generated with xmlgen, factors from 0.01 (0.6 MB) to 2.8 (165.9 MB)
  • the content of 'Text' elements was removed, giving a 30% reduction in size


TwigStackScan Scale-Up

[Figure: Q7 scale-up with XMARK file size. x-axis: XML file size (MB), 5.3 to 112.5; y-axis: time (ms), 0 to 800; series: TwigStackScan, TwigStack]

[Figure: TwigStackScan speedup with XMARK file size. x-axis: XML file size (MB), 0.6 to 112.5; y-axis: speedup, 0 to 7; series: Q7, Q8]

• Q7: //site/people/person[@id = "person0"]/name - 1 twig match
  - @id occurs in person, category, item, open_auction
• Q8: //site/people/person/name - 38,760 twig matches
• When applicable, TwigStackScan yields improvements of one order of magnitude


TwigStackScan vs. TwigStack

Query | Dataset | TwigStack (ms) | TwigStackScan (ms) | Speedup
Q1 //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 7,108 | 4,779 | 1.49
Q2 //www[./editor]/url | DBLP | 3,015 | 40 | 75.38
Q3 //book/author[text() = "C.J. Date"] | DBLP | 430 | 48 | 8.96
Q4 //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT | 188 | 183 | 1.03
Q5 //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 6,430 | 752 | 8.55
Q6 //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,687 | 6,891 | 0.97
Q7 //site/people/person[@id = "person0"]/name | XMARK | 699 | 119 | 5.87
Q8 //site/people/person/name | XMARK | 5,442 | 3,804 | 1.43
Q9 //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name | XMARK | 8,326 | 470 | 17.71
Q10 //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,167 | 124 | 9.41
Q11 //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,493 | 2,520 | 1.78

• High selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
• Low selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
• Scattered twig matches (Q2, Q3, Q5, Q9), grouped twig matches (Q10): speedup 8.55 to 75.38


Outline

• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
  • Experimental Evaluation
• Access Order Selection
  • Experimental Evaluation
• Future Work


Order Selection in Pattern Trees

1. Order Selection: decide the order in which to evaluate the branches
2. Direction Selection: decide how to evaluate a branch, top-down or bottom-up

• Choosing between top-down and bottom-up is extremely expensive computationally: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans


ToXinScan Access Method

• Relational optimizers compute a GOOD plan, not THE BEST plan
• Similarly, we use data statistics and heuristics to compute a good plan
• The access-order selection strategy:
  1. Sort the children according to parent selectivity
  2. Evaluate the path with the highest selectivity using a bottom-up evaluation
  3. Evaluate all other paths, in selectivity order, using a top-down evaluation
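The three-step strategy can be sketched as a small planning function. Several points are assumptions made for this example rather than details from the talk: "highest selectivity" is read as "fewest estimated matches", the estimate for an equality predicate is NCARD / ICARD under a uniformity assumption, and the branch names and statistics are invented.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    path: str
    estimated_matches: float

def estimate_matches(ncard, icard, has_value_predicate):
    """Assumed estimate: NCARD / ICARD for an equality predicate, NCARD otherwise."""
    return ncard / icard if has_value_predicate and icard else float(ncard)

def access_order(branches):
    """Most selective branch first and bottom-up; the rest in selectivity order, top-down."""
    ordered = sorted(branches, key=lambda b: b.estimated_matches)
    return [(b.path, "bottom-up" if i == 0 else "top-down")
            for i, b in enumerate(ordered)]

# Hypothetical statistics for the supplier branches of the example query.
branches = [
    Branch("supplier/supplier_no", estimate_matches(1000, 1000, False)),
    Branch("supplier/province", estimate_matches(1000, 10, True)),   # = "Ontario"
    Branch("supplier/city", estimate_matches(1000, 50, True)),       # = "Toronto"
]

plan = access_order(branches)
for step in plan:
    print(step)
```

Starting bottom-up from the most selective branch means the few matching leaves determine the candidate parents, and the remaining branches only need to be checked top-down under those parents.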


ToXinScan Scale-Up

[Figure: Speedup of ToXinScan vs. TwigStack with XMARK file size. x-axis: XML file size (MB), 23.6 to 112.5; y-axis: speedup, 0 to 140; series: Q8, Q9, Q10]

• Q8: //site/people/person[@id = "person0"]/name - 1 twig match
• Q9: //site/people/person/name - 38,760 twig matches
• Q10: //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name - 8 twig matches
• Two orders of magnitude improvement over TwigStack


ToXinScan vs. TwigStack

Query | Dataset | TwigStack (ms) | ToXinScan (ms) | TwigStack/ToXinScan
Q1 //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 7,108 | 130 | 54.68
Q2 //www[./editor]/url | DBLP | 3,015 | 39 | 77.31
Q3 //book/author[text() = "C.J. Date"] | DBLP | 386 | 90 | 4.29
Q4 //inproceedings[./title/text() = "Semantic Analysis Patterns."]/author | DBLP | 430 | 46 | 9.35
Q5 //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT | 188 | 87 | 2.16
Q6 //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 6,430 | 80 | 80.37
Q7 //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,687 | 131 | 51.05
Q8 //site/people/person[@id = "person0"]/name | XMARK | 699 | 75 | 9.32
Q9 //site/people/person/name | XMARK | 5,442 | 95 | 57.28
Q10 //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name | XMARK | 8,326 | 68 | 122.44
Q11 //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,167 | 90 | 12.97
Q12 //person[@id = "person20125" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,816 | 92 | 19.74
Q13 //person[@id = "person48027" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 2,746 | 95 | 28.80
Q14 //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,493 | 93 | 48.31

• High selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.35
• Grouped twig matches (Q11, Q12, Q13): speedup 12.97 to 28.80
• Low selectivity (Q2, Q9, Q10, Q14), scattered twig matches (Q1, Q6, Q7): speedup 48.31 to 122.44


ToXinScan vs. Heavier Indexes

• Pattern indexes (such as PRIX [RM04], ViST [WPF+03]) are the best twig-query processors
• Indexes are expensive to build (three passes over the document) and require extensive space
  • ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree
• Indexes outperform TwigStack by two orders of magnitude
• Good news:
  • using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes
  • path summaries are inexpensive to build (one pass over the document)


Outline

• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
  • Experimental Evaluation
• Access Order Selection
  • Experimental Evaluation
• Future Work


Future Work

• Generalize based on the strategy derived from the TwigStackScan access method
  • Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
  • Can be used with path summaries other than ToXin
• ToXinScan
  • Add a generalized cost model for access methods
  • Enhance the XML statistics used
• Propose benchmarks for XML access methods


Thank you for your attention!

Attila Barta Mariano P. Consens Alberto O. Mendelzon

{ atibarta, consens, mendel }@cs.toronto.edu

University of Toronto


ToXinScan vs. PRIX

Query | Dataset | TwigStack/ToXinScan | TwigStack/PRIX [RMo03, RMo04]
Q1 //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 54.68 | 14.01
Q2 //www[./editor]/url | DBLP | 77.31 | 145.00
Q6 //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 80.37 | 43.15

Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document)

[RMo03] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prufer Sequences", Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003
[RMo04] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prufer Sequences", Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004