Sequence Indexing Schemes
![Page 1: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/1.jpg)
SEQUENCE INDEXING SCHEMES
Roman Čížek (Erasmus 2687), Nelly Vouzoukidou (MET601)
![Page 2: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/2.jpg)
INTRODUCTION
- Graph indexes
  - Precise
  - Path queries (twig queries in only a few methods)
- Sequence indexing schemes
  - Top-down or bottom-up
  - XML documents and XML queries are represented as structure-encoded sequences
  - Path and twig queries
![Page 3: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/3.jpg)
TOP-DOWN SEQUENCE INDEXES: VIST
![Page 4: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/4.jpg)
VIST – VIRTUAL SUFFIX TREE
- Top-down sequence index
- Represents XML documents and XML queries as structure-encoded sequences
- Querying XML data is equivalent to finding subsequence matches
- Avoids expensive join operations
- Provides a unified index on both content and structure
- Supports dynamic index updates
- Built on B+Trees, which are supported in DBMSs
![Page 5: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/5.jpg)
DTD OF PURCHASE RECORDS
<!ELEMENT purchases (purchase*)>
<!ELEMENT purchase (seller, buyer)>
<!ATTLIST seller ID ID location CDATA name CDATA>
<!ELEMENT seller (item*)>
<!ATTLIST buyer ID ID location CDATA name CDATA>
<!ELEMENT item (item*)>
<!ATTLIST item name CDATA manufacturer CDATA>
![Page 6: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/6.jpg)
A SINGLE PURCHASE RECORD
![Page 7: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/7.jpg)
PREORDER SEQUENCE OF XML
- Use capital letters to represent the names of elements and attributes
- Use a hash function h() to encode attribute values into integers, e.g. v1 = h("dell"), v2 = h("ibm")
- Preorder sequence of the example XML purchase record: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
- Isomorphic trees may produce different preorder sequences
  - A DTD schema embodies a linear order of all elements/attributes
  - Without a DTD, use lexicographical order
![Page 8: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/8.jpg)
STRUCTURE-ENCODED SEQUENCE
Definition: A Structure-Encoded Sequence, derived from a preorder traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs:
D = (a1, p1), (a2, p2), …, (an, pn)
where ai represents a node in the XML document tree (of which a1, …, an is the preorder sequence), and pi is the path from the root node to node ai.
![Page 9: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/9.jpg)
STRUCTURE-ENCODED SEQUENCE
D= (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI),(v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN),
(L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
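The encoding can be illustrated with a short Python sketch. The tree shape below is an assumption read off the sequence D above (the original record is only shown as a figure), the nested-tuple representation and the function names are illustrative, and the empty prefix ϵ is written as an empty string.

```python
def structure_encode(tree, prefix=""):
    """Return the (symbol, prefix) pairs of `tree` in preorder.

    `tree` is a (symbol, [children]) pair; `prefix` is the path from the
    root to the current node, written as concatenated symbols.
    """
    symbol, children = tree
    pairs = [(symbol, prefix)]
    for child in children:
        pairs.extend(structure_encode(child, prefix + symbol))
    return pairs


def leaf(symbol):
    """A node without children (used here for hashed values v1..v8)."""
    return (symbol, [])


# Purchase record; shape inferred from the sequence D (assumption).
purchase = ("P", [
    ("S", [
        ("N", [leaf("v1")]),
        ("I", [("M", [leaf("v2")]),
               ("N", [leaf("v3")]),
               ("I", [("M", [leaf("v4")])])]),
        ("I", [("N", [leaf("v5")])]),
        ("L", [leaf("v6")]),
    ]),
    ("B", [("L", [leaf("v7")]), ("N", [leaf("v8")])]),
])

D = structure_encode(purchase)
print(D[:4])
```

Joining only the symbols of D gives back the preorder sequence of the record, so the structure-encoded sequence can be seen as the preorder sequence annotated with root paths.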
![Page 10: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/10.jpg)
XML QUERIES IN GRAPH FORM
![Page 11: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/11.jpg)
XML QUERIES IN PATH EXPRESSION AND SEQUENCE FORM
| Query | Path expression | Structure-encoded sequence |
|---|---|---|
| Q1 | `/Purchase/Seller/Item/Manufacturer` | `(P, ϵ)(S, P)(I, PS)(M, PSI)` |
| Q2 | `/Purchase[Seller[Loc = v5]]/Buyer[Loc = v7]` | `(P, ϵ)(S, P)(L, PS)(v5, PSL)(B, P)(L, PB)(v7, PBL)` |
| Q3 | `/Purchase/*[Loc = v5]` | `(P, ϵ)(L, P*)(v5, P*L)` |
| Q4 | `/Purchase//Item[Manufacturer = v3]` | `(P, ϵ)(I, P//)(M, P//I)(v3, P//IM)` |
![Page 12: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/12.jpg)
QUERYING XML THROUGH STRUCTURE-ENCODED SEQUENCE MATCHING
- Querying XML is equivalent to finding (non-contiguous) subsequence matches
- Most structural XML queries can be performed through direct subsequence matching
- Exception: a branch has multiple identical child nodes
  - Q5 = /A[B/C]/B/D corresponds to two different sequences:
    - (A, ϵ)(B, A)(C, AB)(B, A)(D, AB)
    - (A, ϵ)(B, A)(D, AB)(B, A)(C, AB)
  - Find matches for each sequence separately and union the results
- We may find false matches if the indexed documents contain branches with identical child nodes; in that case we ask multiple queries and compute the set difference of their results
- If the query contains a large number of identical child nodes under a branch, we can disassemble the query tree into multiple trees and use join operations to combine their results
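As a rough illustration of the matching step, the Python sketch below checks non-contiguous subsequence containment directly on lists of (symbol, prefix) pairs and unions the results of the two orderings of Q5. It ignores wildcards and the index structures; the two documents X and Y and all function names are made up for the example.

```python
def is_subsequence(query, doc):
    """True if `query` occurs in `doc` as a (non-contiguous) subsequence."""
    remaining = iter(doc)
    # `pair in remaining` advances the iterator until it finds `pair`,
    # so later pairs can only match further to the right.
    return all(pair in remaining for pair in query)


def match_any(query_variants, docs):
    """Union of the document ids matched by any ordering of the query."""
    return {doc_id
            for doc_id, doc in docs.items()
            for q in query_variants
            if is_subsequence(q, doc)}


# Two hypothetical documents that differ only in the order of the B branches.
docs = {
    "X": [("A", ""), ("B", "A"), ("C", "AB"), ("B", "A"), ("D", "AB")],
    "Y": [("A", ""), ("B", "A"), ("D", "AB"), ("B", "A"), ("C", "AB")],
}

# The two sequences for Q5 = /A[B/C]/B/D.
q5_a = [("A", ""), ("B", "A"), ("C", "AB"), ("B", "A"), ("D", "AB")]
q5_b = [("A", ""), ("B", "A"), ("D", "AB"), ("B", "A"), ("C", "AB")]

print(match_any([q5_a], docs))
print(match_any([q5_a, q5_b], docs))
```

Each ordering on its own finds only one of the two documents; the union of both recovers the full answer.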
![Page 13: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/13.jpg)
ALGORITHMS
- Naïve algorithm
- RIST – Relationships Indexed Suffix Tree
- ViST – Virtual Suffix Tree
![Page 14: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/14.jpg)
NAÏVE ALGORITHM: SUFFIX-TREE-LIKE STRUCTURE
- Doc1: (P, ϵ)(S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL)
- Doc2: (P, ϵ)(B, P)(L, PB)(v2, PBL)
- Q1: (P, ϵ)(B, P)(L, PB)(v2, PBL)
- Q2: (P, ϵ)(L, P*)(v2, P*L)
![Page 15: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/15.jpg)
D-ANCESTORSHIP AND S-ANCESTORSHIP
- D-Ancestorship: ancestor-descendant relationships in the original XML tree
  - Element (S, P) is a D-Ancestor of (L, PS)
- S-Ancestorship: ancestor-descendant relationships in the suffix tree
  - Element (v1, PSN) is an S-Ancestor of (L, PS)
![Page 16: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/16.jpg)
NAÏVE SEARCH: A NAÏVE ALGORITHM BASED ON SUFFIX TREES
![Page 17: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/17.jpg)
RIST – INDEXING CONSTRUCTION
- Checking S-Ancestorship requires additional information
- Label each suffix tree node x with a pair <nx, sizex>
  - nx is the preorder traversal number of x in the suffix tree
  - sizex is the total number of descendants of x in the suffix tree
- Given x labeled <nx, sizex> and y labeled <ny, sizey>, x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
- Construct the B+Trees:
  - Insert all suffix tree nodes into the D-Ancestorship B+Tree, using (Symbol, Prefix) as keys
  - For all nodes x inserted with the same (Symbol, Prefix), index them in an S-Ancestorship B+Tree, using the nx values of their labels as keys
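The labeling rule can be tried out on a small in-memory tree. The Python sketch below is only an illustration: it builds the suffix-tree-like structure for Doc1 and Doc2 of the naïve algorithm as nested tuples, assigns <n, size> labels in preorder, and tests the interval condition. The helper names are hypothetical, and no B+Tree is involved.

```python
def chain(names):
    """Build a single-path tree (name, [child]) from a list of node names."""
    node = (names[-1], [])
    for name in reversed(names[:-1]):
        node = (name, [node])
    return node


def label_tree(tree):
    """Map each node name to <n, size>: preorder number, number of descendants."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        name, children = node
        n = counter
        counter += 1
        # each child contributes itself (+1) plus its own descendants
        size = sum(visit(child) + 1 for child in children)
        labels[name] = (n, size)
        return size

    visit(tree)
    return labels


def is_s_ancestor(x, y, labels):
    """x is an S-Ancestor of y if n_y lies in (n_x, n_x + size_x]."""
    n_x, size_x = labels[x]
    n_y, _ = labels[y]
    return n_x < n_y <= n_x + size_x


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]

# Both documents share the root (P, ϵ) and then diverge.
suffix_tree = (("P", ""), [chain(doc1[1:]), chain(doc2[1:])])
labels = label_tree(suffix_tree)

print(labels[("v1", "PSN")], labels[("L", "PS")])
print(is_s_ancestor(("v1", "PSN"), ("L", "PS"), labels))
```

The check reproduces the earlier example: (v1, PSN) is an S-Ancestor of (L, PS) in the suffix tree even though it is not its ancestor in the XML document.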
![Page 18: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/18.jpg)
THE RIST INDEX STRUCTURE
![Page 19: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/19.jpg)
SEARCH: NON-CONTIGUOUS SUBSEQUENCE MATCHING USING B+TREE
![Page 20: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/20.jpg)
VIST – VIRTUAL SUFFIX TREE
- Dynamic virtual suffix tree labeling
  - Semantic and statistical clues
  - Dynamic scope allocation without clues
![Page 21: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/21.jpg)
DYNAMIC SCOPE ALLOCATION
- Suppose the number of child nodes of x is λ
- We allocate 1/λ of the remaining scope to x's first child, 1/λ of what then remains to the next child, and so on
- Figure: dynamic scope allocation with λ = 2
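A minimal Python sketch of this allocation rule follows. It assumes integer scopes, that a parent labeled <n, size> owns the numbers n+1 … n+size, and that each child uses one number for itself; these bookkeeping details and the function name are assumptions made for illustration, not details taken from the slides.

```python
def allocate_child_scopes(n, size, lam, num_children):
    """Carve sub-scopes for the children of a node labeled <n, size>.

    Every child receives 1/lam of whatever scope is still unallocated.
    Returns one <n_child, size_child> label per child.
    """
    labels = []
    next_free = n + 1        # first number inside the parent's scope
    remaining = size         # numbers still available to children
    for _ in range(num_children):
        chunk = remaining // lam
        if chunk == 0:
            raise ValueError("parent scope exhausted")
        # the child takes `chunk` numbers: one for itself, the rest for
        # its own descendants
        labels.append((next_free, chunk - 1))
        next_free += chunk
        remaining -= chunk
    return labels


# lam = 2: the first child gets half of the scope, the second a quarter, ...
children = allocate_child_scopes(n=0, size=16, lam=2, num_children=3)
print(children)
```

Because each child's interval stays inside the parent's, the S-Ancestorship interval test keeps working when labels are handed out this way, without knowing the final shape of the tree in advance.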
![Page 22: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/22.jpg)
DYNAMIC SCOPE OF A SUFFIX TREE NODE
![Page 23: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/23.jpg)
SUBSCOPE(PARENT, E): CREATE A SUBSCOPE WITHIN THE PARENT SCOPE FOR E
![Page 24: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/24.jpg)
INSERTION INDEX
- Doc1 = (P, ϵ)(S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL)
- Doc2 = (P, ϵ)(S, P)(L, PS)(v2, PSL)
![Page 25: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/25.jpg)
INDEX AN XML DOCUMENT
![Page 26: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/26.jpg)
EXPERIMENTS - SAMPLE QUERIES
| Query | Path expression | Dataset |
|---|---|---|
| Q1 | `/inproceedings/title` | DBLP |
| Q2 | `/book/author[text='David']` | DBLP |
| Q3 | `/*/author[text='David']` | DBLP |
| Q4 | `//author[text='David']` | DBLP |
| Q5 | `/book[key='books/bc/MaierW88']/author` | DBLP |
| Q6 | `/site//item[location='US']/mail/date[text='12/15/1999']` | XMARK |
| Q7 | `/site//person/*/city[text='Pocatello']` | XMARK |
| Q8 | `//closed_auction[*[person='person1']]/date[text='12/15/1999']` | XMARK |
![Page 27: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/27.jpg)
COMPARING INDEXING METHODS
time in seconds
![Page 28: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/28.jpg)
INDEX STRUCTURE
- DBLP (301 MB of data)
- XMARK (52 MB of data)
![Page 29: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/29.jpg)
CONCLUSION
- Structure-encoded sequences
- Sequence matching
- Avoids expensive join operations
- Top-down scope allocation method
- Index structure: B+Tree
![Page 30: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/30.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 31: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/31.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
- Rao & Moon (2006) proposed a new method for indexing XML documents using sequences
- It uses the same idea as the ViST index:
  - The XML tree is transformed into a sequence and saved in the database
  - Each query is also transformed into a sequence
  - The answer to the query is obtained by performing subsequence matching
![Page 32: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/32.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 33: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/33.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 34: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/34.jpg)
MOTIVATION: TWIG QUERIES AND WILDCARDS
Like ViST, PRIX aims to answer twig queries efficiently, as well as queries containing wildcards ('*' for any element and '//' for descendant-or-self).
- Twig query, XPath: P/Q[T]/S (figure: P has child Q, which has children T and S)
- Query with wildcards, XPath: P//Q/S (figure: Q is a descendant of P and has child S)
![Page 35: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/35.jpg)
MOTIVATION: PROBLEMS IN VIST INDEX
- Memory requirements: in the worst case, ViST requires O(N^2) space to index the document
- Example (a single chain of nested elements): <A> <B> <C> <D> <E> </E> </D> </C> </B> </A>
- D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD)
- Elements at height k appear k times
![Page 36: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/36.jpg)
MOTIVATION: PROBLEMS IN VIST INDEX
- False positives: in many cases, query processing in ViST results in false alarms
- Query, XPath: P/Q[T]/S
  - Q = (P, ε)(Q, P)(T, PQ)(S, PQ)
- Doc1 = (P, ε)(Q, P)(T, PQ)(S, PQ)(R, P)(U, PR)(T, PR)
  - A true match: T and S are children of the same Q
- Doc2 = (P, ε)(Q, P)(T, PQ)(Q, P)(S, PQ)
  - A false alarm: T and S are under different Q nodes, yet Q is still a subsequence of Doc2
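The false alarm can be reproduced with a few lines of Python. In this sketch the two documents and the query are written as nested (symbol, children) tuples, encoded as structure-encoded sequences, and compared twice: once by plain subsequence matching and once by a direct tree comparison that serves as ground truth. All function names are illustrative, and the empty prefix is written as an empty string.

```python
def encode(tree, prefix=""):
    """Structure-encoded sequence of a (symbol, [children]) tree."""
    symbol, children = tree
    out = [(symbol, prefix)]
    for child in children:
        out += encode(child, prefix + symbol)
    return out


def is_subsequence(query, doc):
    """Non-contiguous subsequence test."""
    remaining = iter(doc)
    return all(pair in remaining for pair in query)


def twig_match(q, d):
    """Ground truth: every query child must match some child of the same node."""
    (q_sym, q_children), (d_sym, d_children) = q, d
    return q_sym == d_sym and all(
        any(twig_match(qc, dc) for dc in d_children) for qc in q_children
    )


query = ("P", [("Q", [("T", []), ("S", [])])])           # P/Q[T]/S
doc1 = ("P", [("Q", [("T", []), ("S", [])]),
              ("R", [("U", []), ("T", [])])])
doc2 = ("P", [("Q", [("T", [])]), ("Q", [("S", [])])])

for name, doc in (("Doc1", doc1), ("Doc2", doc2)):
    by_sequence = is_subsequence(encode(query), encode(doc))
    by_tree = twig_match(query, doc)
    print(name, by_sequence, by_tree)
```

Doc2 passes the sequence test but fails the tree test, which is exactly the kind of false alarm described on this slide.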
![Page 37: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/37.jpg)
MOTIVATION: PROBLEMS IN VIST INDEX
- False negatives: correctly answering a twig query depends on the order in which the branches are created
- Doc = (P, ε)(F, P)(T, PF)(N, P)(G, PN)
- Query, XPath: P[N]/F
  - Q = (P, ε)(N, P)(F, P)
  - Q is not a subsequence of Doc, because F precedes N in the document, so the match is missed
![Page 38: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/38.jpg)
MOTIVATION: PROBLEMS IN VIST INDEX
- Memory requirements: in the worst case, ViST requires O(N^2) space to index the document
- False positives: in many cases, query processing in ViST results in false alarms
- False negatives: correctly answering a twig query depends on the order in which the branches are created
![Page 39: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/39.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 40: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/40.jpg)
PRIX ARCHITECTURE
![Page 41: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/41.jpg)
INDEXING AND QUERYING IN PRIX
Indexing:
- The first step takes an XML document as input and converts it into a sequence
  - This is achieved using Prufer sequences
- The sequence is saved in the database in a way equivalent to the one used in ViST
  - It is a virtual trie implemented as B+Trees
![Page 42: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/42.jpg)
INDEXING AND QUERYING IN PRIX
Querying:
- Queries are also transformed into trees and then into Prufer sequences
- The query sequence is looked up in the document sequence and all matching subsequences are retrieved
- After this initial filtering, three refinement phases follow
![Page 43: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/43.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 44: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/44.jpg)
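INDEXING XML DOCUMENTS
- The first step is to transform the XML document into the equivalent XML tree
- Note that both elements and text values are represented as nodes (the same holds for attributes)
- The tree is not saved in the database
- Example document: <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>
- Example tree: A has two B children; the second B has two C children; the first C contains the text node D, and the second C contains F and E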
![Page 45: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/45.jpg)
INDEXING XML DOCUMENTS
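- Then the Prufer sequence is created from the XML tree
- The Prufer sequence, proposed by Prufer (1918), establishes a one-to-one correspondence between a labeled tree and a sequence
- Example tree with numbered nodes: 8,A is the root with children 1,B and 7,B; 7,B has children 3,C and 6,C; 3,C has child 2,D; 6,C has children 4,F and 5,E
- Prufer sequence: 8, 3, 7, 6, 6, 7, 8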
![Page 46: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/46.jpg)
INDEXING XML DOCUMENTS
- Prufer sequences can only be created from trees with numerical labels, where each node has a unique number
- Since the XML tree contains string labels (the names of elements etc.), we add an additional numeric label to each node
- We use post-order traversal to number the nodes
  - The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has properties that make the querying process easier
![Page 47: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/47.jpg)
INDEXING XML DOCUMENTS
- Initial labeling (by level): A; B, B; C, C; D, F, E
- After post-order numbering (by level): 8,A; 1,B and 7,B; 3,C and 6,C; 2,D, 4,F and 5,E
![Page 48: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/48.jpg)
INDEXING XML DOCUMENTS
Finding the Prufer sequence. The algorithm is the following:
1. Find the leaf with the smallest number and delete it
2. Add the label of its parent to the sequence
3. Repeat until only one node is left

In the PRIX index, two sequences are kept:
- The actual Prufer sequence holding the numbers of the nodes, called the Numbered Prufer Sequence (NPS)
- The corresponding sequence holding the actual labels of the nodes of the XML tree, called the Labeled Prufer Sequence (LPS)
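The construction can be sketched in Python as follows. The tree is given as two dictionaries (node number to parent number, node number to label), which is a representation chosen for this example rather than anything prescribed by PRIX; a heap keeps track of the smallest remaining leaf.

```python
import heapq


def prufer_sequences(parent, label):
    """Return (NPS, LPS) for a rooted tree.

    `parent` maps each node number to its parent (None for the root);
    `label` maps each node number to its XML label.
    """
    child_count = {node: 0 for node in parent}
    for p in parent.values():
        if p is not None:
            child_count[p] += 1
    leaves = [node for node, count in child_count.items() if count == 0]
    heapq.heapify(leaves)

    nps, lps = [], []
    for _ in range(len(parent) - 1):        # stop when one node is left
        node = heapq.heappop(leaves)        # smallest-numbered leaf
        p = parent[node]
        nps.append(p)
        lps.append(label[p])
        child_count[p] -= 1
        if child_count[p] == 0:             # the parent has become a leaf
            heapq.heappush(leaves, p)
    return nps, lps


# Example tree from the slides, numbered in post-order.
parent = {1: 8, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 8, 8: None}
label = {1: "B", 2: "D", 3: "C", 4: "F", 5: "E", 6: "C", 7: "B", 8: "A"}

nps, lps = prufer_sequences(parent, label)
print(nps)
print(lps)
```

For this tree the sketch yields NPS = 8, 3, 7, 6, 6, 7, 8 and LPS = A, C, B, C, C, B, A.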
![Page 49: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/49.jpg)
INDEXING XML DOCUMENTS
Step 1: the smallest leaf is 1,B. Delete it and add its parent 8,A to the sequences.
- NPS: 8, …
- LPS: A, …
![Page 50: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/50.jpg)
INDEXING XML DOCUMENTS
Step 2: the smallest leaf is now 2,D. Delete it and add its parent 3,C to the sequences.
- NPS: 8, 3, …
- LPS: A, C, …
![Page 51: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/51.jpg)
INDEXING XML DOCUMENTS
Repeating these steps until only the root 8,A is left gives the complete sequences.
- NPS: 8, 3, 7, 6, 6, 7, 8
- LPS: A, C, B, C, C, B, A
![Page 52: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/52.jpg)
INDEXING XML DOCUMENTS
Properties:
- Both NPS and LPS have length N-1 (where N is the total number of nodes)
  - This is because we delete one node at a time until only one node is left
- NPS: 8, 3, 7, 6, 6, 7, 8
- LPS: A, C, B, C, C, B, A
![Page 53: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/53.jpg)
INDEXING XML DOCUMENTS
Properties:
- The i-th node deleted is always the node numbered i
  - This helps us find the edges of the tree (that is, the mapping from the NPS back to the tree)
- Deleted node: 1, 2, 3, 4, 5, 6, 7
- NPS: 8, 3, 7, 6, 6, 7, 8
- LPS: A, C, B, C, C, B, A
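Because the node deleted at step i is node i, the entry at position i of the NPS is the parent of node i. The short Python sketch below uses this to rebuild the edges of the example tree; the function names are illustrative.

```python
def edges_from_nps(nps):
    """Map each deleted node i (1-based position) to its parent nps[i-1]."""
    return {child: parent for child, parent in enumerate(nps, start=1)}


def path_to_root(node, parents):
    """Follow parent links from `node` up to the root."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


parents = edges_from_nps([8, 3, 7, 6, 6, 7, 8])
print(parents)
print(path_to_root(2, parents))
```

The root (node 8 here) is the only node without an entry, since it is never deleted.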
![Page 54: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/54.jpg)
INDEXING XML DOCUMENTS
Properties:
- The LPS does not contain any leaves
- Deleted node: 1, 2, 3, 4, 5, 6, 7
- NPS: 8, 3, 7, 6, 6, 7, 8
- LPS: A, C, B, C, C, B, A
![Page 55: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/55.jpg)
INDEXING XML DOCUMENTS
Indexes held in the database are:
- The LPS (Labeled Prufer Sequence)
- The NPS (Numbered Prufer Sequence)
- The mapping between the number and the XML label of the leaves of the tree

For the example tree:
- NPS: 8, 3, 7, 6, 6, 7, 8
- LPS: A, C, B, C, C, B, A
- Leaves mapping: 1 → B, 2 → D, 4 → F, 5 → E
![Page 56: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/56.jpg)
PRIX: PRUFER SEQUENCES FOR INDEXING XML
![Page 57: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/57.jpg)
QUERYING
- When a query arrives, it is also transformed into a Prufer sequence
- Then an initial filtering is performed
- The results of the initial filtering pass through three more refinement phases to obtain the correct answer to the query
![Page 58: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/58.jpg)
QUERYING: TRANSFORMING A QUERY TO A PRUFER SEQUENCE
- The same process as for documents is followed
- For instance, take the XPath query A[B/C]/D/E/F
- The query tree is: root A with children B and D; B has child C; D has child E, which has child F
- The NPS and LPS are:
  - NPS(Q) = 2, 6, 4, 5, 6
  - LPS(Q) = B, A, E, D, A
![Page 59: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/59.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
Suppose we have the following XML tree (T) of the document:
- NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15
- LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
![Page 60: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/60.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
- To find candidate results for the given query, we find the occurrences of LPS(Q) as a subsequence of LPS(T)
- "A subsequence is any string that can be obtained by deleting zero or more symbols from a given string"
![Page 61: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/61.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
- To find candidate results for the given query, we find the occurrences of LPS(Q) as a subsequence of LPS(T)
  - Each subsequence represents a possible solution in the tree
- LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
- LPS(Q) = B, A, E, D, A
![Page 63: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/63.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
- LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
- LPS(Q) = B, A, E, D, A
- 12 subsequences are found in total, while only 4 are correct
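The count of candidate subsequences can be checked with a brute-force enumeration. The Python sketch below lists every way LPS(Q) occurs as a subsequence of LPS(T), reporting 1-based positions; it is a plain enumeration for illustration and does not use the trie and B+Tree index that PRIX relies on.

```python
def subsequence_matches(query, text):
    """Yield every tuple of 1-based positions at which `query` occurs in
    `text` as a (non-contiguous) subsequence."""
    def extend(qi, start, chosen):
        if qi == len(query):
            yield tuple(chosen)
            return
        for pos in range(start, len(text)):
            if text[pos] == query[qi]:
                yield from extend(qi + 1, pos + 1, chosen + [pos + 1])

    yield from extend(0, 0, [])


lps_t = "A C B C C B A C A E E E D A".split()
lps_q = "B A E D A".split()

matches = list(subsequence_matches(lps_q, lps_t))
print(len(matches))
print(matches[0])
```

The enumeration finds the 12 candidates mentioned above; which of them are real answers is decided by the refinement steps.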
![Page 68: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/68.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
- To find the path in the tree that is represented by a sequence found during filtering, we use NPS(T)
  - Recall that the edges can be retrieved using the position index in NPS(T)
- NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15
- LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
![Page 69: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/69.jpg)
QUERYING: FILTERING BY SEQUENCE MATCHING
- The node deleted at position i is node i, and the NPS(T) entry at position i is its parent
- Position: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15
- LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
![Page 72: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/72.jpg)
QUERYING: REFINEMENT STEPS
- Despite the filtering, some false positives remain in the results
- To eliminate these false positives there are three refinement steps, namely:
  - Refinement by connectedness
  - Refinement by structure
  - Refinement by matching leaf nodes
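To give a feel for the first refinement step, here is a simplified Python sketch of a connectedness test. It is not claimed to be the exact procedure from the PRIX paper: it only uses the fact that position i in NPS(T) encodes the edge from node i to its parent, and checks that the edges picked by a candidate match form a single tree. The function name and the two candidate position tuples are chosen for illustration.

```python
def is_connected(positions, nps):
    """Check whether the edges selected by `positions` form one tree.

    Position i (1-based) stands for the edge from node i to nps[i-1].
    With post-order numbering the root of a connected match is the
    largest parent; every other parent must itself be a selected node.
    """
    selected = set(positions)
    parents = [nps[p - 1] for p in positions]
    root = max(parents)
    return all(par == root or par in selected for par in parents)


nps_t = [15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15]

# Two of the candidate matches of LPS(Q) = B, A, E, D, A in LPS(T).
print(is_connected((3, 7, 10, 13, 14), nps_t))
print(is_connected((3, 9, 10, 13, 14), nps_t))
```

In the second candidate the B at position 3 refers to node 7, but the A that follows was taken from position 9 rather than position 7, so node 7 is left without its edge to the root and the candidate is rejected.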
![Page 73: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/73.jpg)
QUERYING: FALSE NEGATIVES
- A false negative can appear in the same case as in the ViST index
- The subsequence filtering relies on the assumption that the query branches come in the "correct" order
- Document (figure): P has children F and N; F has child T, and N has child G
- Query (figure): P has children N and F
![Page 74: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/74.jpg)
QUERYING: FALSE NEGATIVES
- The solution proposed by Rao and Moon is to test the query in all possible permutations of the branches and then return the union as the answer to the query
  - N branches → N! permutations
- Their main argument is that queries usually have a small number of branches
![Page 75: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/75.jpg)
QUERYING: FALSE NEGATIVES
- Example (figure): a query with root P, branches N, F and D, and a further node S
- The three branches can be ordered in 3! = 6 ways; the figure shows three of them (… three more permutations)
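Enumerating the branch orderings is straightforward to sketch in Python. The query tree below assumes that S is a child of D, which the flattened figure does not make certain, and the function name is hypothetical; each ordering would then be turned into a sequence and matched, with the results unioned.

```python
from itertools import permutations, product


def branch_orderings(tree):
    """Yield every tree obtained by permuting the children of each node.

    A tree is a (symbol, children) pair with `children` a tuple of trees.
    """
    symbol, children = tree
    # all orderings of each child subtree ...
    child_variants = [list(branch_orderings(child)) for child in children]
    for combo in product(*child_variants):
        # ... combined with all orders of the children themselves
        for perm in permutations(combo):
            yield (symbol, perm)


# Root P with branches N, F and D; S placed under D (assumption).
query = ("P", (("N", ()), ("F", ()), ("D", (("S", ()),))))

orderings = list(branch_orderings(query))
print(len(orderings))
for _, children in orderings:
    print([symbol for symbol, _ in children])
```

The factorial growth is visible immediately: three branches give 6 orderings, four would give 24, which is why the approach depends on queries having few branches.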
![Page 76: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/76.jpg)
EXPERIMENTS
![Page 77: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/77.jpg)
VIST VS PRIX: EXPERIMENTS
- 1.8 GHz Pentium IV processor
- 512 MB RAM, running Solaris 8
- 40 GB EIDE disk drive (stores data and indexes)
- Compiled with the GNU g++ compiler, version 2.95.3
- Buffer pool size: 2000 pages of size 8K
![Page 78: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/78.jpg)
VIST VS PRIX: EXPERIMENTS
![Page 79: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/79.jpg)
VIST VS PRIX: EXPERIMENTS
![Page 80: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/80.jpg)
VIST VS PRIX: EXPERIMENTS
DBLP dataset
![Page 81: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/81.jpg)
VIST VS PRIX: EXPERIMENTS
SWISSPROT dataset
![Page 82: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/82.jpg)
VIST VS PRIX: EXPERIMENTS
TREEBANK dataset
![Page 83: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/83.jpg)
VIST VS PRIX
O(N^2)
![Page 84: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/84.jpg)
QUESTIONS?
![Page 85: Sequence Indexing Schemes](https://reader036.fdocuments.in/reader036/viewer/2022062408/5681444d550346895db0eb28/html5/thumbnails/85.jpg)
THANK YOU!!