Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
University of Toronto
VLDB 2005 Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods 2
Motivation
• Growing importance of XML query processing
• Plethora of implementations:
    • native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
    • XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
    • XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
    • publish/subscribe systems (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
    • twig query processors (e.g. TwigStack, PRIX, TurboXPath)
• Our contribution: apply novel cost-based optimization techniques that exploit path summaries to XML query processing
Example XQuery and Pattern Tree
for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
where $x/part_no = $y/part_no
  and $z/supplier_no = $x/supplier_no
  and $z/city = "Toronto"
  and $z/province = "Ontario"
return <result> {$x/part_no} {$x/price} {$y/description} </result>
Pattern Tree (PT) or Twig Query
Example XQuery Processing
The where clause of the query above contains two join conditions:
• $x = $y (on part_no)
• $z = $x (on supplier_no)
Contributions
• Holistic Path Summary Pruning
• Access Order Selection
• Path Summaries as Catalogs
Outline
• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
    • Experimental Evaluation
• Access Order Selection
    • Experimental Evaluation
• Future Work
ToXin Path Summary
• For each distinct path in the document there is a corresponding path in ToXin
• ToXin is an exact path summary: it reflects the structure of the document [RM01]
• Initially proposed as a back-end; it can answer any pattern query
Example document:
<suppliers>
  <supplier>
    <supplier_no> 1001 </supplier_no>
    <name> Magna </name>
    <city> Toronto </city>
    <province> ON </province>
  </supplier>
  <supplier>
    <supplier_no> 1002 </supplier_no>
    <name> MEC </name>
    <city> Vancouver </city>
    <province> BC </province>
  </supplier>
</suppliers>
[Figure: ToXin path summary of the example document, with its two parts labeled TT and TI]
Augmented ToXin Trees
• System catalog: schema + data statistics
• DTD and XML Schema are used for validation; they do not describe the actual schema of the instances
• ToXin is an exact path summary, so it provides the actual schema
• ToXin augmented with statistics serves as the system catalog
• ToXop statistics:
    • NCARD: number of instances of an element
    • ICARD: number of distinct values of an element
    • Fan-out: average number of sub-element instances for each sub-element
• Augmented ToXin Tree: existing schema (TT) + statistics + node instances (TI)
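As a rough illustration of how such a catalog could be gathered in one pass, the sketch below builds a path summary for the supplier document and attaches NCARD, ICARD and fan-out to each distinct path. It is not the ToXin implementation: the function name build_summary, the dictionary layout, and the reading of fan-out as NCARD(path) / NCARD(parent path) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Supplier document from the ToXin Path Summary slide (whitespace trimmed).
DOC = """<suppliers>
  <supplier><supplier_no>1001</supplier_no><name>Magna</name>
            <city>Toronto</city><province>ON</province></supplier>
  <supplier><supplier_no>1002</supplier_no><name>MEC</name>
            <city>Vancouver</city><province>BC</province></supplier>
</suppliers>"""

def build_summary(xml_text):
    """Visit every element once; return {path: statistics} per distinct path."""
    ncard = defaultdict(int)    # NCARD: number of instances per path
    values = defaultdict(set)   # distinct text values per path (for ICARD)

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        ncard[path] += 1
        text = (elem.text or "").strip()
        if text:
            values[path].add(text)
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")

    summary = {}
    for path, n in ncard.items():
        parent = path.rsplit("/", 1)[0]   # "" for the root path
        summary[path] = {
            "NCARD": n,
            "ICARD": len(values[path]),
            # Assumed definition: average instances per parent instance.
            "fanout": n / ncard[parent] if parent else None,
        }
    return summary

SUMMARY = build_summary(DOC)
```

For the two-supplier document, SUMMARY holds six distinct paths; /suppliers/supplier has NCARD 2 and fan-out 2.0, and /suppliers/supplier/city has ICARD 2.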
Holistic Path Summary Pruning
• All path summary based query processors perform some path summary pruning that is specific to the processor
• Idea: separate path pruning from the processor and from the encoding
• Holistic Path Summary Pruning (HPSP):
    • evaluate the pattern tree on the actual schema (TT tree)
    • compute the twig query using an appropriate algorithm for the particular element encoding
• TwigStackScan is one possible HPSP-based access method
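The first HPSP step, evaluating the pattern tree on the TT tree, can be sketched as a small tree-embedding check. This is an illustrative sketch only: the classes TTNode and QNode, the function prune, the node ids, and the tiny XMARK-like summary are all hypothetical, and the code assumes that every label occurs at most once in the query.

```python
class TTNode:
    """A node of the path summary (TT): one node per distinct document path."""
    def __init__(self, node_id, label, children=()):
        self.id, self.label, self.children = node_id, label, list(children)

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()

class QNode:
    """A pattern tree node; axis is '/' (child) or '//' (descendant)."""
    def __init__(self, label, axis="/", children=()):
        self.label, self.axis, self.children = label, axis, list(children)

def candidates(t, q):
    """TT nodes reachable from t along the axis of query node q."""
    return t.children if q.axis == "/" else t.descendants()

def embeds(t, q):
    """True if the query subtree rooted at q matches with its root at t."""
    if t.label != q.label:
        return False
    return all(any(embeds(c, qc) for c in candidates(t, qc))
               for qc in q.children)

def prune(tt_root, q_root):
    """Map each query label to the ids of TT nodes that take part in a match."""
    kept = {}

    def collect(t, q):
        kept.setdefault(q.label, set()).add(t.id)
        for qc in q.children:
            for c in candidates(t, qc):
                if embeds(c, qc):
                    collect(c, qc)

    starts = [tt_root] if q_root.axis == "/" else [tt_root, *tt_root.descendants()]
    for t in starts:
        if embeds(t, q_root):
            collect(t, q_root)
    return kept

# Hypothetical summary: @id and name occur under both person and item.
TT = TTNode(1, "site", [
    TTNode(2, "people", [
        TTNode(3, "person", [TTNode(4, "@id"), TTNode(5, "name")])]),
    TTNode(6, "regions", [
        TTNode(7, "samerica", [
            TTNode(8, "item", [TTNode(9, "@id"), TTNode(10, "name")])])]),
])
QUERY = QNode("person", "//", [QNode("@id"), QNode("name")])  # //person[@id]/name
KEPT = prune(TT, QUERY)
```

Here KEPT maps person to {3}, @id to {4} and name to {5}; the @id and name nodes under item are discarded before any element data is read.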
Stack Algorithms
• Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02]
• Use region algebra encoding:
    • Telement: [DocID, Term, StartPos, EndPos, LevelNum] for elements
    • Ttext: [DocID, Term, TextValue, StartPos, LevelNum] for string values
• Build a stream (denoted T) for all elements having the same label, e.g. Tauthor encompasses all author elements in the document
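With a region encoding, structural relationships reduce to comparisons on positions. The sketch below shows the usual containment tests for element tuples; the Element type, the function names and the position values are made up for illustration.

```python
from typing import NamedTuple

class Element(NamedTuple):
    """Region encoding of an element: [DocID, Term, StartPos, EndPos, LevelNum]."""
    doc_id: int
    term: str
    start: int
    end: int
    level: int

def is_ancestor(a, d):
    """a contains d: same document and d's region lies strictly inside a's."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end

def is_parent(a, d):
    """a is the parent of d: containment plus a level difference of one."""
    return is_ancestor(a, d) and a.level + 1 == d.level

# Hypothetical positions for part of the supplier document.
SUPPLIERS = Element(1, "suppliers", 1, 24, 1)
SUPPLIER = Element(1, "supplier", 2, 11, 2)
CITY = Element(1, "city", 7, 8, 3)
```

In this example is_parent(SUPPLIER, CITY) and is_ancestor(SUPPLIERS, CITY) hold, while is_parent(SUPPLIERS, CITY) does not, because the levels differ by two.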
TwigStackScan Access Method
• Extended region algebra encoding:
    • Telement: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] for elements
    • Ttext: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] for string values
• TwigStackScan = HPSP + TwigStack
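One way to read "HPSP + TwigStack" is that the TTnodeID field lets a stream be filtered down to the summary nodes that survived pruning before the stack algorithm sees it. The sketch below shows only that filtering step, under this assumption; TwigStack itself is not implemented, and the names Element and pruned_stream as well as the sample tuples are hypothetical.

```python
from typing import NamedTuple

class Element(NamedTuple):
    """Extended encoding: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID]."""
    doc_id: int
    term: str
    start: int
    end: int
    level: int
    tt_node_id: int

def pruned_stream(stream, allowed_tt_ids):
    """Keep document order; drop elements whose TT node cannot join a match."""
    return [e for e in stream if e.tt_node_id in allowed_tt_ids]

# Hypothetical T_name stream: TT node 5 is person/name, TT node 10 is item/name.
T_NAME = [
    Element(1, "name", 12, 13, 4, 5),
    Element(1, "name", 40, 41, 5, 10),
    Element(1, "name", 60, 61, 4, 5),
]

# For //site/people/person/name only TT node 5 is relevant.
PRUNED = pruned_stream(T_NAME, {5})
```

The pruned stream keeps the two person/name elements in their original order, which is the property a stack algorithm that merges sorted streams relies on.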
Experimental Datasets
Dataset Name   Size (MB)   # of Elements   # of Attributes   # of Text   Total # of Nodes   # of TT nodes   Max-depth
DBLP           130.726     3,332,130       404,276           3,005,848   6,742,254          224             6
SWISSPROT      112.130     2,977,031       2,189,859         2,013,844   7,180,734          303             5
XMARK (1.9)    112.486     2,769,710       726,783           1,478,252   4,974,745          358             10
• DBLP, SWISSPROT: University of Washington XML Repository
    • Both are large (millions of nodes) and shallow
    • DBLP: regular in structure (5 structures that repeat)
    • SWISSPROT: irregular in structure (many one-of-a-kind structures)
• XMARK:
    • simulates an on-line auction site
    • generated with xmlgen, scale factors from 0.01 (0.6 MB) to 2.8 (165.9 MB)
    • removed the content of 'Text' elements, giving a 30% reduction in size
TwigStackScan Scale-Up
[Figure: Q7 scale-up with (XMARK) file size. x-axis: XML file size (MB), 5.3 to 112.5; y-axis: time (ms), 0 to 800; series: TwigStackScan, TwigStack]
[Figure: TwigStackScan speedup with (XMARK) file size. x-axis: XML file size (MB), 0.6 to 112.5; y-axis: speedup, 0 to 7; series: Q7, Q8]
• Q7: //site/people/person[@id = "person0"]/name (1 twig match); @id occurs in person, category, item, open_auction
• Q8: //site/people/person/name (38,760 twig matches)
• When applicable, TwigStackScan yields improvements of one order of magnitude
TwigStackScan vs. TwigStack
Query  Dataset    TwigStack (ms)  TwigStackScan (ms)  Speedup  Expression
Q1     DBLP       7,108           4,779               1.49     //inproceedings[./author="Jim Gray"][./year="1990"]/@key
Q2     DBLP       3,015           40                  75.38    //www[./editor]/url
Q3     DBLP       430             48                  8.96     //book/author[text() = "C.J. Date"]
Q4     SWISSPROT  188             183                 1.03     //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"]
Q5     SWISSPROT  6,430           752                 8.55     //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr]
Q6     SWISSPROT  6,687           6,891               0.97     //Entry[./Org="Piroplasmida"]//Author
Q7     XMARK      699             119                 5.87     //site/people/person[@id = "person0"]/name
Q8     XMARK      5,442           3,804               1.43     //site/people/person/name
Q9     XMARK      8,326           470                 17.71    //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name
Q10    XMARK      1,167           124                 9.41     //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
Q11    XMARK      4,493           2,520               1.78     //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
• High selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
• Low selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
• Scattered twig matches (Q2, Q3, Q5, Q9), grouped twig matches (Q10): speedup 8.96 to 75.38
Order Selection in Pattern Trees
1. Order Selection: decide the order in which to evaluate the branches
2. Direction Selection: decide how to evaluate a branch, top-down or bottom-up
• Choosing between top-down and bottom-up is computationally very expensive: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans
ToXinScan Access Method
• Relational optimizers compute a GOOD plan, not THE BEST plan
• Similarly, we use data statistics and heuristics to compute a good plan
• The access-order selection strategy:
    1. Sort the children according to parent selectivity
    2. Evaluate the path with the highest selectivity using a bottom-up evaluation
    3. Evaluate all other paths, in selectivity order, using a top-down evaluation
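The three-step strategy above can be sketched as a sort followed by a direction assignment. This is a simplified reading, not the ToXop cost model: the functions estimated_matches and access_order, the uniform-distribution estimate NCARD / ICARD for equality predicates, and all statistics in the example are assumptions, and "highest selectivity" is taken to mean the fewest estimated matches.

```python
def estimated_matches(ncard, icard=None):
    """Estimated matching instances for one branch of the pattern tree.

    Assumption: an equality predicate selects NCARD / ICARD instances
    (values uniformly distributed); a branch without predicate selects NCARD.
    """
    return ncard if icard is None else ncard / icard

def access_order(branches):
    """Order branches by estimate; most selective goes bottom-up, rest top-down."""
    ordered = sorted(branches, key=lambda branch: branch[1])
    return [(name, "bottom-up" if i == 0 else "top-down")
            for i, (name, _) in enumerate(ordered)]

# Hypothetical statistics for the branches of a query on person elements.
BRANCHES = [
    ("name", estimated_matches(25500)),
    ("address/city = 'Lubbock'", estimated_matches(12000, 200)),
    ("@id = 'person217'", estimated_matches(25500, 25500)),
]
PLAN = access_order(BRANCHES)
```

With these made-up numbers the @id branch (one estimated match) is evaluated bottom-up first, followed by the city branch and then the name branch, both top-down.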
ToXinScan Scale-Up
[Figure: Speedup of ToXinScan vs. TwigStack with (XMARK) file size. x-axis: XML file size (MB), 23.6 to 112.5; y-axis: speedup, 0 to 140; series: Q8, Q9, Q10]
• Q8: //site/people/person[@id = "person0"]/name (1 twig match)
• Q9: //site/people/person/name (38,760 twig matches)
• Q10: //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name (8 twig matches)
• Two orders of magnitude improvement over TwigStack
ToXinScan vs. TwigStack
Query  Dataset    TwigStack (ms)  ToXinScan (ms)  TwigStack/ToXinScan  Expression
Q1     DBLP       7,108           130             54.68                //inproceedings[./author="Jim Gray"][./year="1990"]/@key
Q2     DBLP       3,015           39              77.31                //www[./editor]/url
Q3     DBLP       386             90              4.29                 //book/author[text() = "C.J. Date"]
Q4     DBLP       430             46              9.35                 //inproceedings[./title/text() = "Semantic Analysis Patterns."]/author
Q5     SWISSPROT  188             87              2.16                 //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"]
Q6     SWISSPROT  6,430           80              80.37                //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr]
Q7     SWISSPROT  6,687           131             51.05                //Entry[./Org="Piroplasmida"]//Author
Q8     XMARK      699             75              9.32                 //site/people/person[@id = "person0"]/name
Q9     XMARK      5,442           95              57.28                //site/people/person/name
Q10    XMARK      8,326           68              122.44               //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name
Q11    XMARK      1,167           90              12.97                //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
Q12    XMARK      1,816           92              19.74                //person[@id = "person20125" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
Q13    XMARK      2,746           95              28.80                //person[@id = "person48027" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
Q14    XMARK      4,493           93              48.31                //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name
• High selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.32
• Grouped twig matches (Q11, Q12, Q13): speedup 12.97 to 28.80
• Low selectivity (Q2, Q9, Q10, Q14), scattered twig matches (Q1, Q6, Q7): speedup 48.31 to 122.44
ToXinScan vs. Heavier Indexes
• Pattern indexes (such as PRIX [RMo04], ViST [WPF+03]) are the best twig-query processors
• Indexes are expensive to build (three passes over the document) and require extensive space
    • ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree
• Indexes outperform TwigStack by two orders of magnitude
• Good news:
    • using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes
    • path summaries are inexpensive to build (one pass over the document)
Future Work
• Generalize based on the strategy derived from the TwigStackScan access method
    • Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
    • Can be used with path summaries other than ToXin
• ToXinScan
    • Add a generalized cost model for access methods
    • Enhance the XML statistics used
• Propose benchmarks for XML access methods
Thank you for your attention!
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
{ atibarta, consens, mendel }@cs.toronto.edu
University of Toronto
ToXinScan vs. PRIX
Query  Dataset    TwigStack/ToXinScan  TwigStack/PRIX [RMo03, RMo04]  Expression
Q1     DBLP       54.68                14.01                          //inproceedings[./author="Jim Gray"][./year="1990"]/@key
Q2     DBLP       77.31                145.00                         //www[./editor]/url
Q6     SWISSPROT  80.37                43.15                          //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr]
• Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document)
[RMo03] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prufer Sequences", Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003
[RMo04] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prufer Sequences", Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004