Post on 27-Mar-2015
On Boosting Holism in XML Twig Pattern Matching Using
Two Data Streaming Techniques
Presenter: Lu Jiaheng
Supervisor: Prof. Ling Tok Wang
Joint work: Chen Ting, Ling Tok Wang
On Boosting Holism in XML Twig Pattern Matching using Structural Indexing
Outline
  Background
    Define our problem: XML twig pattern matching
    Previous two algorithms: TwigStack and TwigStackList
  Our holistic twig pattern matching algorithms
    Two refined indexing schemes: Tag+Level and PPS
    A generalized holistic matching algorithm: iTwigJoin
  Experiments
  Conclusion
XML Twig Pattern Matching
An XML document is commonly modeled as a rooted, ordered and tagged tree.
book
  preface
    “Intro”
  chapter
    section
      title
        “Data”
      section
        title
          “XML”
        paragraph
          figure
      section
        paragraph
          figure
      paragraph
        figure
  chapter
  ………….
Regional Coding
Node label [1]: (startPos: endPos, LevelNum)
E.g.
book (0:32, 1)
  preface (1:3, 2)
    “Intro” (2:2, 3)
  chapter (4:29, 2)
    section (5:28, 3)
      title (6:8, 4)
        “Data” (7:7, 5)
      section (9:17, 4)
        title (10:12, 5)
          “XML” (11:11, 6)
        paragraph (13:16, 5)
          figure (14:15, 6)
      section (18:23, 4)
        paragraph (19:22, 5)
          figure (20:21, 6)
      paragraph (24:27, 4)
        figure (25:26, 5)
  chapter (30:31, 2)
1. M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
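Region labels let structural relationships be tested without walking the tree. The sketch below is a minimal illustration, assuming that an ancestor's (startPos, endPos) interval strictly contains the descendant's and that a parent is an ancestor exactly one level above. The names Region, is_ancestor and is_parent are hypothetical helpers, not from the talk; the sample labels are taken from the book example.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int   # startPos
    end: int     # endPos
    level: int   # LevelNum


def is_ancestor(a: Region, d: Region) -> bool:
    """True if the interval of a strictly contains the interval of d."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """True if p is an ancestor of c and c lies exactly one level below p."""
    return is_ancestor(p, c) and c.level == p.level + 1


# Labels copied from the book example.
book = Region(0, 32, 1)
preface = Region(1, 3, 2)
chapter = Region(4, 29, 2)
section = Region(5, 28, 3)
figure = Region(14, 15, 6)
```

With these labels, is_ancestor(book, figure) holds, is_parent(chapter, section) holds, and is_parent(book, section) does not, because section is two levels below book.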
What is a Twig Pattern?
A twig pattern is a small tree whose nodes are tags, attributes, or text values, and whose edges are either Parent-Child (P-C) or Ancestor-Descendant (A-D) edges.
E.g., select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.
XPath: Section[Title]/Paragraph//Figure
Twig pattern:
Section
  /  Title        (P-C)
  /  Paragraph    (P-C)
    //  Figure    (A-D)
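A twig pattern can be held in memory as a small tree whose nodes carry the type of the edge to their parent. The sketch below is one possible representation, not the one used in the talk: TwigNode and to_xpath are hypothetical names, and the rule that the last child continues the main path while earlier children become predicates is a convention assumed here only so the example can be printed back as XPath.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwigNode:
    tag: str
    edge: str = "/"   # edge from the parent: "/" is P-C, "//" is A-D
    children: List["TwigNode"] = field(default_factory=list)


def to_xpath(node: TwigNode) -> str:
    """Render a twig as XPath (assumed convention: last child is the main path)."""
    if not node.children:
        return node.tag
    *predicates, main = node.children
    text = node.tag
    for p in predicates:
        lead = "" if p.edge == "/" else ".//"
        text += "[" + lead + to_xpath(p) + "]"
    return text + main.edge + to_xpath(main)


# The twig from the slide: Section with child Title and child Paragraph,
# where Paragraph has a descendant Figure.
twig = TwigNode("Section", children=[
    TwigNode("Title", "/"),
    TwigNode("Paragraph", "/", [TwigNode("Figure", "//")]),
])
```

Calling to_xpath(twig) reproduces the expression on the slide, Section[Title]/Paragraph//Figure.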
XML Twig Pattern Matching Problem Statement
Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
E.g., consider the following query and document:
Query: Section with branches title and figure
Document: [Figure: document tree with elements s1, s2, p1, t1, t2, f1]
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
Previous work: TwigStack
TwigStack [2]: a holistic approach
Each element in the document is labeled using the region encoding scheme.
The input is the labels of all elements whose tags occur in the query twig. The output is the set of matching solutions, each an n-tuple, where n is the number of nodes in the query.
For each node in the query, there is a corresponding input stream.
Each label in a stream is scanned only once; that is, the cursor of a stream never moves backward.
2. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Previous work: TwigStack
TwigStack is a two-phase algorithm:
Phase 1 (TwigJoin): intermediate root-to-leaf path solutions are output.
Phase 2 (Merge): the intermediate paths are merged to produce the final results.
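The merge phase can be pictured as a join of the path solutions on their shared root. The sketch below is a simplified illustration for a query with one branching root and two leaf branches, not the merge procedure of TwigStack itself; merge_paths is a hypothetical name. The input path solutions are the ones listed in the worked TwigStack example later in these slides.

```python
from collections import defaultdict


def merge_paths(title_paths, figure_paths):
    """Join (root, leaf) path solutions of two branches on the common root."""
    figures_by_root = defaultdict(list)
    for root, fig in figure_paths:
        figures_by_root[root].append(fig)
    merged = []
    for root, title in title_paths:
        for fig in figures_by_root[root]:
            merged.append((root, title, fig))
    return sorted(merged)


# Path solutions from the worked example.
title_paths = [("s1", "t1"), ("s1", "t2"), ("s2", "t2")]
figure_paths = [("s1", "f1"), ("s2", "f1"), ("s1", "f2")]
final = merge_paths(title_paths, figure_paths)
```

Here final holds five tuples, the same five that the Merge step of the worked example lists.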
Previous work: TwigStack
A node q in a twig pattern Q is associated with a stack Sq.
Insertion into and deletion from a stack Sq:
Insertion: an element eq from stream Tq is pushed onto its stack Sq if and only if
  eq has a descendant eqi in each stream Tqi, where qi is a child of q, and
  each such element eqi recursively satisfies the same property.
Deletion: an element eq is popped from its stack once all matches involving it have been output.
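The push condition is recursive, which a few lines of code can make concrete. The sketch below checks the condition by scanning whole streams, which is an assumption made for brevity: TwigStack itself works from the current stream cursors and never moves them backward. The function names and the dictionary layout are hypothetical, and the labels come from the worked example later in these slides.

```python
def contains(a, d):
    """a and d are (startPos, endPos) pairs; True if a strictly contains d."""
    return a[0] < d[0] and d[1] < a[1]


def has_extension(query, streams, tag, elem):
    """True if elem has, for every child of tag in the query, a descendant in
    that child's stream which recursively satisfies the same property."""
    return all(
        any(contains(elem, d) and has_extension(query, streams, child, d)
            for d in streams[child])
        for child in query.get(tag, [])
    )


# Query: section with branches title and figure (both treated as A-D here).
query = {"section": ["title", "figure"]}
streams = {
    "section": [(1, 12), (4, 9)],
    "title": [(2, 3), (5, 6)],
    "figure": [(7, 8), (10, 11)],
}
no_figures = dict(streams, figure=[])
```

Both section elements satisfy the condition with the full streams; with an empty figure stream neither does, so neither would be pushed.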
XML Twig Pattern Matching
Example run of TwigStack.
Query: Section with descendant branches title and figure
Document (region labels):
s1 (1:12, 1)
  t1 (2:3, 2)
  s2 (4:9, 2)
    t2 (5:6, 3)
    f1 (7:8, 3)
  f2 (10:11, 2)
Input streams:
Section: (1:12,1), (4:9,2)
title: (2:3,2), (5:6,3)
figure: (7:8,3), (10:11,2)
Steps:
1. Push s1 (1:12,1).
2. Push t1 (2:3,2).
3. Output path solution <s1,t1>.
4. Push s2 (4:9,2).
5. Push t2 (5:6,3).
6. Output path solutions <s1,t2>, <s2,t2>.
7. Push f1 (7:8,3).
8. Output path solutions <s1,f1>, <s2,f1>.
9. Pop s2 and push f2 (10:11,2).
10. Output path solution <s1,f2>.
Output path solutions:
<s1,t1>, <s1,t2>, <s2,t2>,
<s1,f1>, <s2,f1>, <s1,f2>
Merge:
<s1,t1,f1>, <s1,t1,f2>, <s1,t2,f1>, <s1,t2,f2>, <s2,t2,f1>
Sub-optimality of TwigStack
If the query contains any parent-child relationship, TwigStack may output intermediate path solutions that cannot contribute to any final result.
We say that TwigStack is sub-optimal for queries with parent-child relationships.
Example: sub-optimality of TwigStack
Query: Section with branches title and figure (the Section-figure edge is parent-child)
Document (region labels):
s1 (1:12, 1)
  t1 (2:3, 2)
  s2 (4:9, 2)
    t2 (5:6, 3)
    f1 (7:8, 3)
Input streams:
Section: (1:12,1), (4:9,2)
title: (2:3,2), (5:6,3)
figure: (7:8,3)
Steps:
1. Because f1 and t1 are descendants of s1, s1 (1:12,1) is pushed onto the stack. Note that f1 is not a child of s1.
2. Push t1 (2:3,2).
3. Output path solution <s1,t1>.
But <s1,t1> is a useless intermediate solution: it does not contribute to any final solution.
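The effect can be reproduced with a brute-force comparison. The sketch below is not TwigStack; it enumerates the Section-title path solutions that arise when every edge is treated as ancestor-descendant, and then checks which of them survive once the figure branch is required to be a child. It assumes the query edges are Section//title and Section/figure, which the extracted slide text does not state explicitly, and all names are hypothetical.

```python
# name: (tag, startPos, endPos, level), labels copied from the example.
ELEMS = {
    "s1": ("section", 1, 12, 1),
    "s2": ("section", 4, 9, 2),
    "t1": ("title", 2, 3, 2),
    "t2": ("title", 5, 6, 3),
    "f1": ("figure", 7, 8, 3),
}


def ancestor(a, d):
    return ELEMS[a][1] < ELEMS[d][1] and ELEMS[d][2] < ELEMS[a][2]


def child(p, c):
    return ancestor(p, c) and ELEMS[c][3] == ELEMS[p][3] + 1


def by_tag(tag):
    return [n for n, e in ELEMS.items() if e[0] == tag]


# Path solutions when the title branch is matched as ancestor-descendant.
title_paths = [(s, t) for s in by_tag("section") for t in by_tag("title")
               if ancestor(s, t)]
# Final matches: the figure must be a child of the same section (assumed edge).
final = [(s, t, f) for s, t in title_paths for f in by_tag("figure")
         if child(s, f)]
# Path solutions that appear in no final match.
useless = [p for p in title_paths if not any(m[:2] == p for m in final)]
```

Under these assumptions the only final match is (s2, t2, f1), and <s1,t1> is among the useless path solutions, as the slide states.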
TwigStackList
The main problem of TwigStack is that it treats all edges as ancestor-descendant relationships in the first phase, so it is inefficient for queries with parent-child relationships.
Alternative: TwigStackList [3] [CIKM 2004]
TwigStackList improves on TwigStack by taking parent-child relationships into account in the first phase; it is optimal for a larger class of queries than TwigStack.
3. J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
Optimal class of TwigStack and TwigStackList
Algorithm        Optimal query class
TwigStack        All edges are ancestor-descendant relationships
TwigStackList    All edges connecting branching nodes and their children are ancestor-descendant relationships

[Figure: three example twig queries over nodes a, b, c (and d)]
                 Query 1   Query 2   Query 3
TwigStack           O         S         S
TwigStackList       O         O         S
O: optimal
S: sub-optimal
Challenges (1)
Although TwigStackList enlarges the optimal query class of TwigStack, it is still sub-optimal for a large class of twig queries.
For example, two twig queries for which TwigStackList is sub-optimal:
[Figure: two twig queries, each with Section as the root and title and figure as branches]
Challenges (2)
To answer a twig query, TwigStack and TwigStackList need to read the labels of all elements whose tags occur in the query.
Can we accelerate query processing by reading only some of them?
Query: Section with branches title and figure
Document: [Figure: s1 at Level 1, t1 at Level 2, and f1, f2, ..., fn at Level 3]
There is no answer in the document, since there are no figure elements in Level 2. But the previous algorithms still need to read all figure elements in Level 3.
Our solution
We propose two data streaming schemes: Tag+Level and Prefix Path Streaming (PPS).
Basic idea: separate the elements with the same tag name into different streams.
Tag+Level: elements with the same tag and level are grouped together.
Prefix path: elements with the same root-to-node path are grouped together.
Two Refined Streaming Schemes (1)
Tag+Level: elements with the same tag and level are grouped together.
Document:
a1              (Level 1)
  a2            (Level 2)
    d1          (Level 3)
  a3            (Level 2)
    d2          (Level 3)
    b1          (Level 3)
      c1        (Level 4)
  b2            (Level 2)
    d3          (Level 3)
      c2        (Level 4)
Tag+Level streams:
a, Level 1: a1
a, Level 2: a2, a3
b, Level 2: b2
b, Level 3: b1
c, Level 4: c1, c2
d, Level 3: d1, d2, d3
Two Refined Streaming Schemes (2)
Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together.
Document:
a1              (Level 1)
  a2            (Level 2)
    d1          (Level 3)
  a3            (Level 2)
    d2          (Level 3)
    b1          (Level 3)
      c1        (Level 4)
  b2            (Level 2)
    d3          (Level 3)
      c2        (Level 4)
Prefix-path streams:
a: a1
a/a: a2, a3
a/b: b2
a/a/b: b1
a/a/d: d1, d2
a/b/d: d3
a/a/b/c: c1
a/b/d/c: c2
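Both schemes amount to choosing a different grouping key for the same elements. The sketch below partitions the example document once by (tag, level) and once by root-to-node path. The parent links are read off the example tree, and DOC and build_streams are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# (element, tag, parent) in document order, read off the example tree.
DOC = [
    ("a1", "a", None), ("a2", "a", "a1"), ("d1", "d", "a2"),
    ("a3", "a", "a1"), ("d2", "d", "a3"), ("b1", "b", "a3"),
    ("c1", "c", "b1"), ("b2", "b", "a1"), ("d3", "d", "b2"),
    ("c2", "c", "d3"),
]


def build_streams(doc):
    """Return (tag+level streams, prefix-path streams) for a document."""
    level, path = {}, {}
    tag_level, prefix = defaultdict(list), defaultdict(list)
    for name, tag, parent in doc:
        level[name] = 1 if parent is None else level[parent] + 1
        path[name] = tag if parent is None else path[parent] + "/" + tag
        tag_level[(tag, level[name])].append(name)   # key: tag and level
        prefix[path[name]].append(name)              # key: root-to-node path
    return dict(tag_level), dict(prefix)


tag_level, prefix = build_streams(DOC)
```

For this document the tag+level scheme yields 6 streams and PPS yields 8, so PPS is the finer partition: for example, d1, d2 and d3 share one tag+level stream but are split between a/a/d and a/b/d.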
Two benefits of refined streaming schemes (1)
(1) Enlarge the optimal query classes
For example, given the document and query below, the previous algorithms TwigStack and TwigStackList output one useless solution, <s1,t1>.
With Tag+Level streaming, <s1,t1> is not output, since we know there are no figure elements in Level 2.
Query: Section with branches title and figure
Document:
s1          (Level 1)
  t1        (Level 2)
  s2        (Level 2)
    t2      (Level 3)
    f1      (Level 3)
Tag+Level streams: one stream per tag and level, for Section, title, and figure
Two benefits of refined streaming schemes (2)
(2) Skip irrelevant elements
For the document and query below, since there are no title elements in Level 3, we can skip reading all figure elements in Level 3.
Query: Section with branches title and figure
Document: [Figure: s1 at Level 1, t1 at Level 2, and f1, f2, ..., fn at Level 3]
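Stream pruning of this kind can be sketched for the simple case where every query edge is parent-child and every query tag is distinct; both are assumptions made here, since the extracted slide text does not show the edge types. Under them, a query node at depth k below the root can only match elements k levels below the root element, so a root level is feasible only if every required (tag, level) stream is non-empty. The function name and the three sample figure elements are hypothetical.

```python
def useful_streams(query_depth, streams):
    """query_depth: tag -> depth below the query root (root is 0), for a
    query whose edges are all parent-child.
    streams: (tag, level) -> list of elements.
    Returns the set of (tag, level) streams that can contribute to a match."""
    useful = set()
    for root_level in sorted({lvl for (_, lvl) in streams}):
        needed = {(tag, root_level + depth)
                  for tag, depth in query_depth.items()}
        if all(streams.get(key) for key in needed):
            useful |= needed
    return useful


query_depth = {"section": 0, "title": 1, "figure": 1}
streams = {
    ("section", 1): ["s1"],
    ("title", 2): ["t1"],
    ("figure", 3): ["f1", "f2", "f3"],
}
pruned = useful_streams(query_depth, streams)
```

Here pruned is empty. A figure at Level 3 could only match under a Section at Level 2, which would in turn need a title at Level 3, so the figure stream at Level 3 never has to be read.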
A general algorithm: iTwigJoin
We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes.
Our key idea is to classify each current head element into one of three classes:
  Subtree-matching
  Useless
  Blocked
Classifying Head Elements
Subtree-Matching Element
An element e of tag E is called a subtree-matching element for query Q if
  e is in a match to QE (QE is the sub-tree of Q rooted at E), and
  e is NOT in any future match to QP, where P is the parent of E in Q.
Useless Element
An element e is called a useless element if e is not in any future match to QE.
Blocked Element
An element that is neither subtree-matching nor useless.
Example 1: Classifying Head Elements (Tag+Level)
Query Q1: A with branches D and B; B with branch C
Document D:
a1
  a2
    d1
  a3
    d2
    b1
      c1
  b2
    d3
      c2
Tag+Level streams (the first element of each stream is its head element):
a, Level 1: a1
a, Level 2: a2, a3
b, Level 2: b2
b, Level 3: b1
c, Level 4: c1, c2
d, Level 3: d1, d2, d3
Classification of the head elements:
Subtree-matching: (none)
Useless: a2
Blocked: d1, a1, b1, b2, c1

Example 2: Classifying Head Elements (Tag+Level)
[Figure: query over nodes A, D, B, C, with the same document and streams as Example 1]
Classification of the head elements:
Subtree-matching: d1
Useless: a1, a2, b2
Blocked: c1, b1
Classifying Head Elements
• A useless element can be discarded safely.
• A subtree-matching element is pushed onto the corresponding stack.
• A blocked element causes problems:
  • It CANNOT be discarded, because that may cause loss of results.
  • It CANNOT be pushed onto the stack, because that may cause useless results.
• When all head elements are blocked, optimal holistic matching CANNOT be guaranteed.
• In that case we push a blocked element onto its stack, which may produce useless intermediate results in some cases.
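The handling rules above can be read as a small decision procedure over the current head elements. The sketch below is only an illustration of that reading, not the iTwigJoin algorithm itself: it takes the classification as given, and both the priority order (discard first, then push subtree-matching, then fall back to a blocked element) and the tie-break on the smallest start position are assumptions made here. The sample classification is the one from Example 1, with start positions from the iTwigJoin example.

```python
from enum import Enum


class HeadKind(Enum):
    SUBTREE_MATCHING = "subtree-matching"
    USELESS = "useless"
    BLOCKED = "blocked"


def next_action(heads):
    """heads maps an element name to (start position, HeadKind).

    Returns (action, element name) for one step of the loop.
    """
    priority = [
        (HeadKind.USELESS, "discard"),
        (HeadKind.SUBTREE_MATCHING, "push"),
        (HeadKind.BLOCKED, "push-blocked"),  # may yield useless paths
    ]
    for kind, action in priority:
        names = [n for n, (_, k) in heads.items() if k is kind]
        if names:
            # Assumed tie-break: the element that starts first in the document.
            return action, min(names, key=lambda n: heads[n][0])
    return "done", None


B = HeadKind.BLOCKED
heads = {"a1": (1, B), "a2": (2, HeadKind.USELESS), "d1": (3, B),
         "b1": (9, B), "c1": (10, B), "b2": (14, B)}
first = next_action(heads)

# After a2 is discarded, a3 becomes the head of its stream and is blocked.
heads_after = {k: v for k, v in heads.items() if k != "a2"}
heads_after["a3"] = (6, B)
second = next_action(heads_after)
```

With this input the first step discards a2 and the second pushes a1, which agrees with the first two steps of the iTwigJoin example in these slides.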
An example of the iTwigJoin algorithm
Query: A with branches D and B; B with branch C
Document (region labels):
a1 (1:20, 1)
  a2 (2:5, 2)
    d1 (3:4, 3)
  a3 (6:13, 2)
    d2 (7:8, 3)
    b1 (9:12, 3)
      c1 (10:11, 4)
  b2 (14:19, 2)
    d3 (15:18, 3)
      c2 (16:17, 4)
Tag+Level streams:
a, Level 1: a1
a, Level 2: a2, a3
b, Level 2: b2
b, Level 3: b1
c, Level 4: c1, c2
d, Level 3: d1, d2, d3
Steps:
1. Since a2 is a useless element, we discard a2 and scan a3.
2. Now all head elements are blocked. We push a1 onto its stack.
3. Push d1.
4. Output intermediate path solution <a1,d1>.
5. Since a3 is a subtree-matching element, we push a3 onto its stack.
6. Push d2.
7. Output intermediate path solutions <a1,d2>, <a3,d2>.
8. Push b1.
9. Push c1.
10. Output intermediate path solution <a3,b1,c1>.
11. Pop a3 and b1; push b2.
12. Push c2.
13. Output intermediate path solution <a1,b2,c2>.
14. Push d3.
15. Output intermediate path solution <a1,d3>.
Output intermediate path solutions:
<a1,d1>, <a1,d2>, <a1,d3>, <a3,d2>, <a3,b1,c1>, <a1,b2,c2>
Final solutions after merging:
1st: <a1,d1,b2,c2>
2nd: <a1,d2,b2,c2>
3rd: <a1,d3,b2,c2>
4th: <a3,d2,b1,c1>
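The final solutions of this example can be cross-checked by brute force over the region labels. The sketch below is a naive enumeration, not iTwigJoin. The edge types A//D, A/B and B//C are an assumption: they are not visible in the extracted slide text, but they are consistent with the four solutions the slides report. The element-to-label mapping follows document order, and all function names are hypothetical.

```python
# name: (tag, startPos, endPos, level), in document order.
ELEMS = {
    "a1": ("a", 1, 20, 1), "a2": ("a", 2, 5, 2), "d1": ("d", 3, 4, 3),
    "a3": ("a", 6, 13, 2), "d2": ("d", 7, 8, 3), "b1": ("b", 9, 12, 3),
    "c1": ("c", 10, 11, 4), "b2": ("b", 14, 19, 2), "d3": ("d", 15, 18, 3),
    "c2": ("c", 16, 17, 4),
}


def ancestor(a, d):
    return ELEMS[a][1] < ELEMS[d][1] and ELEMS[d][2] < ELEMS[a][2]


def parent(p, c):
    return ancestor(p, c) and ELEMS[c][3] == ELEMS[p][3] + 1


def by_tag(tag):
    return [n for n, e in ELEMS.items() if e[0] == tag]


def matches():
    """Enumerate (a, d, b, c) with a//d, a/b and b//c (assumed edge types)."""
    found = []
    for a in by_tag("a"):
        for d in by_tag("d"):
            if not ancestor(a, d):
                continue
            for b in by_tag("b"):
                if not parent(a, b):
                    continue
                for c in by_tag("c"):
                    if ancestor(b, c):
                        found.append((a, d, b, c))
    return found


solutions = matches()
```

Under the assumed edge types the enumeration returns exactly the four final solutions listed above; a2 contributes none because it has no b child.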
Optimal classes of iTwigJoin for three streaming schemes
Streaming scheme         Optimal class
Tag Streaming            A-D only patterns
Tag+Level Streaming      A-D only or P-C only patterns
Prefix Path Streaming    A-D only, P-C only, or 1-branch-node patterns
[Figure: example twig patterns over nodes A, B, C for each class]
The more refined the streaming scheme, the larger the optimal class.
Experiments
Benchmarks
XMark: synthetic data
Treebank: real data from the Wall Street Journal
Experiments: I/O Performance
[Figure: bar chart of the number of elements scanned (axis from 0 to 14,000,000) for queries Tree1 to Tree5 under TwigStack, TwigStackList, Tag+Level, and Prefix]
Tree1: A-D only
Tree2: P-C only
Tree3: P-C only
Tree4: 1-branchnode
Tree5: 1-branchnode
By pruning irrelevant streams, PPS usually scans the fewest elements.
Experiments: Number of Intermediate Paths
[Figure: bar chart of the number of intermediate paths output (logarithmic axis from 1 to 100,000) for queries Tree1 to Tree5 under TwigStack, TwigStackList, Tag+Level, and Prefix]
Tree1: A-D only
Tree2: P-C only
Tree3: P-C only
Tree4: 1-branchnode
Tree5: 1-branchnode
1. Tag+Level and PPS output fewer intermediate results than TwigStack and TwigStackList on the TreeBank data.
2. For Tree5 there are no matching results, so Tag+Level and PPS do not output any intermediate results.
Experiments: Running Time
[Figure: bar chart of execution time in seconds (axis from 0 to 14) for queries XMark1 to XMark5 under TwigStack, TwigStackList, Tag+Level, and Prefix]
XMark1: path pattern
XMark2: A-D only
XMark3: P-C only
XMark4: 1-branchnode
XMark5: non-optimal
Tag+Level and PPS perform better than TwigStack and TwigStackList on the XMark data.
Conclusions
We develop a general algorithm to perform holistic twig joins on the Tag+Level and PPS streaming schemes.
We identify two I/O-optimal query classes for the Tag+Level and PPS streaming schemes.
Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend the Tag+Level scheme for efficient XML twig pattern matching.
END
Thank you!
Q & A