An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML
-
Upload
wade-durham -
Category
Documents
-
view
19 -
download
2
description
Transcript of An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML
An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic
Ordered XML
Changqing Li Tok Wang Ling
Department of Computer Science
School of Computing
National University of Singapore
2
Overview
Related work and motivation Labeling with our ImprovedBinary scheme Order-sensitive Update and Query Experimental results Conclustion
3
Related work – Containment scheme
Containment scheme [1] Node u is an ancestor of v iff label(u.start) <
label(v.start) < label(v.end) < label(u.end) Or interval of v is contained in interval of u
[1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989: 253-262
1,20
2,3 4,5 6,11 12,13 14,19
7,8 9,10 15,16 17,18
4
2,3
1,20
2,3 4,5 6,11 12,13 14,19
7,8 9,10 15,16 17,18
Related work – Containment scheme
Update is bad Need to re-label all the ancestor nodes and all the nodes after the
inserted node in document order
1,20
2,3 4,5 6,11 12,13 14,19
7,8 9,10 15,16 17,18
Insert a node1,22
4,5 6,7 8,13 14,15 16,21
9,10 11,12 17,18 19,20
Original labels
5
Related work – Containment scheme
How to decrease the update cost Increase the interval size
Small interval size is easy to lead to re-labeling large interval size wastes a lot of values which causes the increase
of storage Use real (float-point) values [2]
float-point is represented in computer with a fixed number of bits which is similar to the representation of integer
As a result, there are a finite number of values between any two real values
[2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003: 705-707
6
Related work – Prefix scheme
Prefix scheme Node u is an ancestor of v iff label(u) is a prefix of
label(v)
1 2 3 4 5
3.1 3.2 5.1 5.2
DeweyID [6] Integer based
Binary [4] Binary string based
0 10 110 1110 11110
110.0 110.10 11110.0 11110.10
[4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002: 271-281
[6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002: 204-215
7
Related work – Prefix scheme
Update is better DeweyID Need to re-label all the sibling nodes after the inserted node and all the
descendants of these siblings
Insert a node
1
1 2 3 4 5
3.1 3.2 5.1 5.2
1 2 3 4 5
3.1 3.2 5.1 5.2
2 3 4 5 6
4.1 4.2 6.1 6.2
Original labels
8
Related work – Prefix scheme
Update is better Binary Same number of nodes to re-label as DeweyID
Insert a node
Original labels
0 10 110 1110 11110
110.0 110.10 11110.0 11110.10
0 0 10 110 1110 11110
110.0 110.10 11110.0 11110.10
10 110 1110 11110 111110
1110.0 1110.10 111110.0 111110.10
9
Related work – Prime scheme
Prime scheme [7] Node u is an ancestor of v iff label(v) mod label(u) = 0 The number above each node is the document order The right number is the label The numbers below each label is its parent_label and self_label
(prime number)
(11*23)(11*19)(5*17)(5*13)
9854(1*7)(1*2) (1*3) (1*5) (1*11)
7632
10
12 3 5 7 11
65 85 209 253
[7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78
10
Related work – Prime scheme
Update Based on Chinese Remainder Theorem (CRT) [3] And Simultaneous Congruence (SC) value The SC value is 56135603 such that
56135603 mod 2 = 1 (2 is the self_label, 1 is the document order) 56135603 mod 3 = 2, 56135603 mod 5 = 3 , 56135603 mod 13 = 4, ···, and 56135603 mod 23 = 9
Need not re-label the existing nodes, need to re-calculate the SC value
(1*29)
11
Use more SC values to prevent the single SC value become a very large number.
[3] James A. Anderson and James M. Bell, Number Theory with Application, Prentice-Hall, New Jersey, 1997.
(11*23)(11*19)(5*17)(5*13)
9854(1*7)(1*2) (1*3) (1*5) (1*11)
7632
10
12 3 5 7 11
65 85 209 253
Original labels
Insert a node
(11*23)(11*19)(5*17)(5*13)
9854(1*7)(1*2) (1*3) (1*5) (1*11)
7632
10
12 3 5 7 11
65 85 209 253(11*23) 253
(11*19)(5*17)(5*13)
10965(1*7)(1*2) (1*3) (1*5) (1*11)
8743
1
22 3 5 7 11
65 85 209
After insertion, the new SC value is 4071807264, such that
4071807264 mod 29 = 1,
4071807264 mod 2 = 2,
4071807264 mod 3 = 3,
4071807264 mod 5 = 4,
4071807264 mod 13 = 5,
···,
4071807264 mod 23 = 10.
11
Motivation Containment scheme is very bad in updating Prefix and prime schemes are better in updating which are called
dynamic labeling schemes, and we only compare the performance of dynamic labeling schemes in this paper.
Prefix schemes DeweyID and Binary both need to re-label the existing nodes
Prime need not re-label existing nodes, but need to re-calculate the SC values SC value are very large numbers Re-calculation is costly
Our objective Need not re-label existing nodes in updating Need not re-calculate any values in updating
12
Our ImprovedBinary scheme How to label the XML tree based on our ImprovedBinary scheme
Root is empty First sibling self_label is “01” Last sibling self_label is “011” Case (a) left self_label size is less than or equal to right
self_label size The middle self_label is that we change the last character of the right
self_label to “0” and concatenate one more “1”. Case (b) left self_label size is larger than right self_label size
The middle self_label is that we directly concatenate one more “1” after the left self_label.
01 01001 0101 01011 011
0101.01 0101.011 011.01 011.011
13
Properties of our ImprovedBinary labels
Self_labels of siblings are lexically ordered 01 < 01001 < 0101 < 01011 < 011
Labels are lexically ordered when comparing the labels component by component 01 < 01001 < 0101 < 0101.01 < 0101.011 < 01011 < 011 <
011.01 < 011.011
01 01001 0101 01011 011
0101.01 0101.011 011.01 011.011
The component is the labels between two consecutive delimiters.
14
The formal labeling algorithm for our ImprovedBinary
Left self_label is smaller than or equal to the size of the right self_label, the self_label of the middle node is that we change the last character of the right self_label to “0” and concatenate one more “1”.
Otherwise, the self_label of the middle node is the left self_label concatenates “1”.
Algorithm 1: AssignMiddleSelfLabelInput: left self label self_label_L, and right self label self_label_ROutput: middle self label self_label_M, such that self_label_Lself_label_Mself_label_R lexically.begin1: calculate the size of self_label_L and the size of self_label_R2: if size(self_label_L) size(self_label_R)
then self_label_M = change the last character of self_label_R to “0”, and concatenate () one more “1”
3: else if size(self_label_L) > size(self_label_R) then self_label_M = self_label_L “1”
end
AssignMiddleSelfLabel algorithm.
15
The formal labeling algorithm for our ImprovedBinary
SubLabeling is a recursive function
Self_label is an array
Labeling algorithm.
Algorithm 2: LabelingInput: XML documentOutput: Label of each nodebegin1: for all the sibling child nodes of each node of the XML document 2: for the first sibling node, self_label[1]=“01” //self_label is an array3: if the Number of Sibling nodes SN > 1 then self_label[SN]=“011” self_label=SubLabeling(self_label, 1, SN)4: label = prefix_label concatenates delimiter and each element of self_label array end
SubLabelingInput: self_label array, left element index of self_label array L, and right element index of self_label array ROutput: self_label arraybegin1: M = floor((L+R)/2) //M refers to the Mth element of self_label array2: if L+1<R then self_label[M]=AssignMiddleSelfLabel(self_label[L], self_label[R]) SubLabeling(self_label, L, M) SubLabeling(self_label, M, R)end
16
Order-sensitive update for our ImprovedBinary
Case (1): Insert a node before the first sibling node. The self_label of the inserted node is that the last character of the first self_label is changed to “0” and is concatenated with one more “1”. Label of node of a will be 001
After insertion, the order is still kept. Label(a) < 01 lexically
a
c
01011
011.011
b d
011.01
01 01001 0101
0101.01 0101.011
011b d
c
17
Order-sensitive update for our ImprovedBinary
Case (2): Insert a node at any position between the first and last sibling node. We use the AssignMiddleSelfLabel algorithm to assign the self_label of the new inserted node. Label of node of b will be 010011 Label of node of c will be 0101.0101
After insertion, the order is still kept. 01001 < label(b) < 0101 lexically 0101.01 < label(c) < 0101.011 lexically
a
c
01011
011.011
b d
011.01
01 01001 0101
0101.01 0101.011
011b d
c
18
Order-sensitive update for our ImprovedBinary
Case (3): Insert a node after the last sibling node. The self_label of the inserted node is that the last self_label concatenates one more “1”. Label of node of d will be 0111
After insertion, the order is still kept. 011 < label(d) lexically
a
c
01011
011.011
b d
011.01
01 01001 0101
0101.01 0101.011
011b d
c
19
Order-sensitive update for our ImprovedBinary
For all the three cases We need not re-label the exiting nodes We need not re-calculate any values to keep the document order
20
Order-sensitive query1) position = i:
Selects the ith node within a context node set. 2) preceding-sibling or following-sibling:
Selects all the preceding (following) sibling nodes of the context node.
3) preceding or following:Selects all the nodes before (after) (in document order) the context node excluding any ancestors (descendants).
When inserting a node DeweyID and Binary need to re-label the existing nodes Prime needs to re-calculate the SC values
before they can process the order-sensitive queries.
21
Experimental result -- datasets
Datesets available at [8] The depths of real XMLs are usually not too high [5]
[5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003: 500-510
[8] The Niagara Project Experimental Data. Available at: http://www.cs.wisc.edu/niagara/data.html
Datasets Topics # of filesMax fan-out for a
fileMax depth for a
fileTotal # of nodes for
each dataset
D1 Bib 18 25 4 2111
D2 Club 12 47 3 2928
D3 Movie 490 38 4 26044
D4 Sigmod Record 988 26 6 39058
D5 Department 19 257 3 48542
D6 Actor 480 368 4 56769
D7 Company 24 529 4 161576
D8 Shakespeare’s play 37 434 5 179689
D9 NASA 1882 1188 6 370292
22
Experimental results -- storage
Storage requirement ImprovedBinary has the smallest label size in each
dataset
40.612677
0
4
8
12
16
D1 D2 D3 D4 D5 D6 D7 D8 D9
Datasets
La
be
l S
ize
(1
,00
0,0
00
bit
s) DeweyID
Binary
Prime
ImprovedBinary
23
Experimental results -- query
Query performance ImprovedBinary works the
fastest for most of the queries, both ordered and un-ordered
2342
0
500
1000
1500
2000
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
Queries
Re
sp
on
se
Tim
e (
ms
)
DeweyID
Binary
Prime
ImprovedBinary
Queries # of nodes returned
Q1 /play/act[4] 370
Q2 /play/act[5]//preceding::scene 6110
Q3 /play/act/scene/speech[2] 7300
Q4 /play/*/* 19380
Q5 /play/act//speech[3]/preceding-sibling::* 30930
Q6 /play//act[2]/following::speaker 184060
Q7 /play//scene/speech[6]/following-sibling::speech 267050
Q8 /play/act/scene/speech 309330
Q9 /play/*//line 1078330
24
Experimental results -- update
Update performance Here we study the update performance of the Hamlet XML file in
D8. The update performance of other XML files in D8 is similar. Hamlet has 5 acts, and we test the following six cases:
inserting an act before act[1] inserting an act between act[1] and act[2] inserting an act between act[2] and act[3] inserting an act between act[3] and act[4] inserting an act between act[4] and act[5] and inserting an act after act[5]
Acts are the child nodes of the root play in the Hamlet XML file
25
Experimental results -- update Update performance
Number of nodes to re-label The number of SC values to re-calculate is counted for Prime, one SC
value is for every three nodes. One SC value for 4 or more nodes can not be calculated using 64 bits,
the long type value in java Our ImprovedBinary need not re-label any existing nodes and need not
re-calculate any values
0
1000
2000
3000
4000
5000
6000
7000
1 2 3 4 5 6
Insertion Cases
Nu
mb
er
of
No
de
s f
or
Re
-la
be
lin
g
or
Re
-ca
lcu
lati
on
DeweyID/Binary
Prime
ImprovedBinary
26
Experimental results -- update Update performance
Time to re-label or re-calculate The re-calculation time for prime is more than 337 times larger
than the re-labeling time for DeweyID and Binary Our ImprovedBinary needs 0 milliseconds for the update
210036266203 53166924
0
10
20
30
40
1 2 3 4 5 6
Insertion Cases
Tim
e f
or
Re
-la
be
lin
g o
r R
e-c
alc
ula
tio
n (
ms
)
DeweyID
Binary
Prime
ImprovedBinary
The query time and re-labeling (re-calculation) time in this paper refer to the processing time only without including the I/O time.
27
Conclustion
Our contribution We proposed a node labeling scheme called
ImprovedBinary to decrease the label update cost This scheme need not re-label any existing nodes Need not re-calculate any values
28
Reference[1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of
Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989: 253-262
[2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003: 705-707
[3] James A. Anderson and James M. Bell, Number Theory with Application, Prentice-Hall, New Jersey, 1997.
[4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002: 271-281
[5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003: 500-510
[6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002: 204-215
[7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78
[8] The Niagara Project Experimental Data. Available at:http://www.cs.wisc.edu/niagara/data.html