LCA-Based Selection for XML Document Collections
Georgia Koloniari, joint work with Evaggelia Pitoura
Department of Computer Science, University of Ioannina, Greece
http://dmod.cs.uoi.gr
DMOD Laboratory, University of Ioannina HDMS 2010
Fundamental question:
Given a query and many available data sources with large volumes of data,
select the most relevant sources for the query/filter out the irrelevant ones
What is the topic of this talk?
More formally:
Source/Database Selection: Problem Definition
Given a query q and a set of data sources, rank the data sources according to the relevance (called goodness) of their data to q
Evaluate q against the most relevant (best) data sources
What is the topic of this talk?
Database selection for:
relational databases: Sayyadian et al. [ICDE '07], Yu et al. [SIGMOD '07], Vu et al. [SIGMOD '08]
textual document collections: Callan et al. [SIGIR '95], Gravano et al. [ACM Trans. Database Syst. '99]
Source Selection Problem: Previous research
However, many data sources contain XML documents
In this paper,
The source selection problem for XML Document Collections
Given a set of N distributed collections of XML documents and a query q, rank the collections based on their goodness (i.e., relevance) to q
Keyword queries, q = (w1, w2, …, wk)
XML Selection Problem: Definition
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
[Figure: XML tree of a conf element (cname "WWW", year "2010") with paper and demo subtrees, each holding title and author/name nodes; node values include "facet", "RDF", "Top-k", "van Zwol", "Atre", "Soliman", "Chaoij", "Sigurbjörnsson", …]
Query: Atre RDF
Keyword search for XML Documents: an example
Search for nodes that contain the keywords (in their label, in their content, or in the label or value of their attributes).
Result: the subtrees whose nodes contain all the keywords.
[Figure: XML tree of a conf element (cname "HDMS", year "2010") with paper and demo subtrees, each holding title and author/name nodes; node values include "XPath", "RDF", "Top-k", "Georgiadis", "Atre", "Soliman", "Chaoij", "Vassalos", …]
Query: Atre RDF
Keyword search for XML Documents: an example
The Lowest Common Ancestor (LCA) of a set of nodes V' = {v1, …, vk} (V' ⊆ V) is the deepest node v in a tree T that is an ancestor of all nodes in V'
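The LCA definition can be illustrated with a minimal Python sketch. It assumes the tree is stored as a hypothetical `parent` dictionary (child to parent, root mapped to `None`) and treats a node as an ancestor of itself; neither convention comes from the slides.

```python
def path_to_root(parent, v):
    """Return [v, parent(v), ..., root]."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    """Deepest node that is an ancestor of (or equal to) every node in `nodes`."""
    nodes = list(nodes)
    common = path_to_root(parent, nodes[0])  # ordered deepest first
    for v in nodes[1:]:
        on_path = set(path_to_root(parent, v))
        common = [u for u in common if u in on_path]
    return common[0]


# Hypothetical tree: conf -> paper1 -> {title1, author1}, conf -> paper2 -> title2
parent = {
    "conf": None,
    "paper1": "conf", "title1": "paper1", "author1": "paper1",
    "paper2": "conf", "title2": "paper2",
}
print(lca(parent, ["title1", "author1"]))  # paper1
print(lca(parent, ["title1", "title2"]))   # conf
```

Walking the root paths costs time proportional to the tree depth per node, which is enough for an illustration; it is not meant as an efficient LCA index.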
Keyword query q = (w1, w2, …, wk)
An unordered labeled XML tree T = (V, E) of an XML document d
An element (node) v ∈ V contains a keyword wi - contains(v, wi)
Si = {v|v ∈ V and contains(v, wi)}, 1 ≤ i ≤ k
(set of nodes that contain keyword wi)
Result(q) ⊆ lca(S1, …, Sk) (basic LCA approach)
lca(S1, …, Sk) is the set of LCA nodes v such that v = lca(v1, …, vk) for some v1 ∈ S1, …, vk ∈ Sk
(at least one occurrence of each keyword)
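As a rough illustration of the basic LCA semantics, the sketch below enumerates every combination (v1, …, vk) with vi ∈ Si and collects the LCA of each. The `parent` and `contains` dictionaries and the node names are hypothetical, and the brute-force enumeration is only meant to mirror the definition, not to be efficient.

```python
from itertools import product


def path_to_root(parent, v):
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    nodes = list(nodes)
    common = path_to_root(parent, nodes[0])
    for v in nodes[1:]:
        on_path = set(path_to_root(parent, v))
        common = [u for u in common if u in on_path]
    return common[0]


def basic_lca_result(parent, contains, query):
    """Set lca(S1, ..., Sk): one LCA per choice of a node from each S_i."""
    keyword_sets = [[v for v, words in contains.items() if w in words]
                    for w in query]
    if any(not s for s in keyword_sets):
        return set()  # some keyword does not occur in the document
    return {lca(parent, combo) for combo in product(*keyword_sets)}


# Hypothetical document: two papers and one demo under a conf node
parent = {"conf": None, "paper1": "conf", "name1": "paper1",
          "paper2": "conf", "name2": "paper2",
          "demo": "conf", "name3": "demo"}
contains = {"paper1": {"paper"}, "paper2": {"paper"},
            "name1": {"Zwol"}, "name2": {"Lin"}, "name3": {"Zwol"}}

print(basic_lca_result(parent, contains, ["paper", "Zwol"]))
```

For the query (paper, Zwol) the result contains `paper1` (keyword nodes in the same paper) and `conf` (keyword nodes in different subtrees).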
Keyword search for XML Documents: LCA semantics
Query: paper van Zwol
Keyword search for XML Documents: LCA semantics
[Figure: example XML tree for conference "WWW" (year "2010") with paper and demo subtrees holding title and author/name nodes; values include "facet", "RDF", "van Zwol", "Lin", "Yan", "Sigurbjörnsson", …; annotations: "content", "SLCA".]
Query: paper van Zwol
Keyword search for XML Documents: LCA semantics
[Figure: example XML tree for conference "WWW" (year "2010") with paper and demo subtrees holding title and author/name nodes; values include "facet", "RDF", "van Zwol", "Lin", "Yan", "Sigurbjörnsson", …; annotations: "content", "ELCA" (two nodes).]
Lowest Common Ancestor
Many variations
Only structural (Smallest LCA, Exclusive LCA, etc)
Schema of the documents (Meaningful LCA, Valuable LCA, based also on node/element types)
In addition, IR-based statistics
We do not propose yet another variation; instead, we use the basic LCA (the Result(q) set)
Most other variations can be implemented by filtering our results (details in the paper)
Experimental evaluation on ELCA
Query: paper van Zwol
Keyword search for XML Documents: Ranking
[Figure: example XML tree for conference "WWW" (year "2010") with paper and demo subtrees holding title and author/name nodes; values include "facet", "RDF", "van Zwol", "Lin", "Yan", "Sigurbjörnsson", …; annotations: "content", "ELCA" (two nodes).]
Structure is used to improve the quality of the result -> rank results based on the distance of the keywords from their LCA
Keyword search for XML Documents: Ranking
The height of the LCA node v ∈ Result(q): the maximum distance of any of the keywords of q in the XML tree to their LCA node
$$h(v) = \max_{i} dist(v, v_i)$$
Query: o, b
[Figure: example tree with single-letter node labels below a root; two LCA nodes of the query keywords are annotated "Height: 2" and "Height: 1".]
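The height of an LCA node can be computed from node depths, as in this small sketch. It again assumes a hypothetical `parent` dictionary and takes dist(v, vi) to be the number of edges from v down to vi, which is how the definition above is read here.

```python
def depth(parent, v):
    """Number of edges from the root down to v."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d


def lca_height(parent, lca_node, keyword_nodes):
    """h(v): largest distance from the LCA node to any of its keyword nodes.

    Assumes lca_node is an ancestor of (or equal to) every keyword node.
    """
    base = depth(parent, lca_node)
    return max(depth(parent, v) - base for v in keyword_nodes)


# Hypothetical tree: conf -> paper1 -> author1 -> name1, conf -> paper2
parent = {"conf": None, "paper1": "conf", "author1": "paper1",
          "name1": "author1", "paper2": "conf"}

print(lca_height(parent, "paper1", ["paper1", "name1"]))  # 2
print(lca_height(parent, "conf", ["paper2", "name1"]))    # 3
```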
Query: paper van Zwol
Keyword search for XML Documents: Ranking
[Figure: example XML tree for conference "WWW" (year "2010") with paper and demo subtrees holding title and author/name nodes; values include "facet", "RDF", "van Zwol", "Lin", "Yan", "Sigurbjörnsson", …; annotations: "content", "ELCA" (two nodes), "Height: 4", "Height: 3".]
[Figure: XML tree of a conf element (name "WWW", year "2010") with paper and demo subtrees, each holding title and author/name nodes; node values include "object", "RDF", "Top-k", "Zaragoza", "Atre", "Soliman", "Chaoij", "Pound", …]
Query: demo RDF Pound
Keyword search for XML Documents: Relevance
Not all trees that contain the keywords are relevant
Exclude some of the results as not relevant based on height
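Boolean Problem: a user is interested in d as a result for q iff the distance (height) of a result in d is lower than or equal to a threshold λ
$$sim(q, d, \lambda) = \begin{cases} 1, & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$$
Weighted Problem:
$$sim(q, d, \lambda) = \begin{cases} F\big(\min_{v \in Result(q)} h(v)\big), & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$$
F(h(v)): a function F of the height h of a result node v such that the similarity of d to q is greater when h(v) is small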
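The two relevance models can be sketched as follows, taking the heights of the nodes in Result(q) for one document as input. The threshold is written `lam`, and the weighting function F(h) = 1 / (1 + h) is a hypothetical choice made only for this example; the slides only require that F grows as h shrinks.

```python
def boolean_sim(result_heights, lam):
    """1 if the lowest result of the document is within the threshold, else 0."""
    if not result_heights:
        return 0  # the document has no result for the query
    return 1 if min(result_heights) <= lam else 0


def weighted_sim(result_heights, lam, F=lambda h: 1.0 / (1 + h)):
    """F applied to the minimum result height, or 0 beyond the threshold."""
    if not result_heights:
        return 0.0
    h = min(result_heights)
    return F(h) if h <= lam else 0.0


heights = [4, 3]  # e.g. two result nodes with heights 4 and 3
print(boolean_sim(heights, 3))   # 1
print(boolean_sim(heights, 2))   # 0
print(weighted_sim(heights, 3))  # 0.25
```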
Database Selection: Document relevance
A database D is ranked based on its goodness to q by aggregating the relevance of its documents
Database Selection
$$Goodness(q, D, \lambda) = \sum_{d \in D} sim(q, d, \lambda)$$
The goodness measure ranks highly collections that:
have a large number of documents with a relatively small similarity score
have fewer documents but with higher similarity scores
The threshold λ limits the tendency to favor large collections over more relevant ones
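A minimal sketch of goodness-based ranking, assuming the aggregation is a plain sum of per-document similarities and using the Boolean model. Each document is represented only by the minimum height of its results (`None` when it has no result); the collection names and numbers are made up for the example.

```python
def boolean_sim(min_height, lam):
    """1 if the document has a result within the threshold, else 0."""
    return 1 if min_height is not None and min_height <= lam else 0


def goodness(min_heights, lam, sim=boolean_sim):
    """Sum the similarity of every document in the collection."""
    return sum(sim(h, lam) for h in min_heights)


def rank_collections(collections, lam, sim=boolean_sim):
    """Collection names ordered by decreasing goodness."""
    scores = {name: goodness(hs, lam, sim) for name, hs in collections.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical collections: minimum result height per document
collections = {
    "A": [1, 1, 5, None],  # larger, but only two documents within lam = 2
    "B": [2, 2, 2],        # smaller, all three documents within lam = 2
}
print(rank_collections(collections, lam=2))  # ['B', 'A']
```

With the threshold set to 2, the smaller collection B ranks first because more of its documents are actually relevant.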
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
Goodness Estimation
To estimate the goodness of a collection D for a keyword query q, the straightforward approach is:
1. For each document d ∈ D:
   Evaluate q against d
   Find all the LCA nodes in d of the k keywords that appear in q (Result(q))
   Select v ∈ Result(q) with the minimum height
   If h(v) ≤ λ:
      the Boolean model returns a match
      the weighted model computes the similarity based on function F
2. Aggregate over all d ∈ D
Computing LCA online for each query is expensive
Goodness Estimation
To avoid computing LCAs at execution time: pre-compute the LCA nodes for all possible combinations of keywords that appear in each d and maintain their heights
Number of computed LCA nodes for an XML document with n keywords:
$$\sum_{i=2}^{n} \binom{n}{i} = O(2^n)$$
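A quick arithmetic check makes the blow-up concrete. The sketch counts keyword combinations of size 2 or more and, for comparison, the number of keyword pairs; the function names are made up for the example.

```python
from math import comb


def num_keyword_combinations(n):
    """Number of keyword subsets of size >= 2 among n keywords."""
    return sum(comb(n, i) for i in range(2, n + 1))


def num_keyword_pairs(n):
    """Number of distinct keyword pairs among n keywords."""
    return comb(n, 2)


for n in (4, 10, 20):
    print(n, num_keyword_combinations(n), num_keyword_pairs(n))
```

The first count equals 2^n - n - 1, so it doubles with every extra keyword, while the number of pairs grows only quadratically.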
Pair-Wise Goodness Estimation
OUR APPROACH
We maintain information for the height only for pairs of keywords and use this to estimate the height of the LCA for more than 2 keywords
For each distinct pair of keywords (wi, wj) in a document d, we maintain
the height hmin(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≤ h(u), ∀ u ∈ lca(Si, Sj) (the lowest LCA) and
the height hmax(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≥ h(u), ∀u ∈ lca(Si, Sj) (the highest LCA)
Proposition. Let G(V, E) be an acyclic directed graph, and V' = {v1, …, vM} any subset of M nodes in G, V' ⊆ V. Then,
$$h(lca(v_1, \ldots, v_M)) = \max_{v_i, v_j \in V'} h(lca(v_i, v_j))$$
Pairwise-based Height Estimation
If the keywords are distinct (just a single LCA), then it is easy to see that the height is equal to the maximum
Else, we get estimations
Pair-based Height Estimation
Hmin(d, q): the maximum value of the minimum LCA height values for any pair of keywords in q
Hmax(d, q): the maximum value of the maximum LCA height values for any pair of keywords in q
Pair heights (hmin-hmax): (o, b) → 1-3, (o, a) → 2-3, (b, a) → 1-3
Hmin(d, q): 2
Hmax(d, q): 3
Theorem. Given a keyword query q and a document d, the height of any v ∈ Result(q) is such that: Hmin(d, q) ≤ h(v) ≤ Hmax(d, q)
Query: o, b, a
[Figure: example tree with single-letter node labels (a, b, o, and others) used for the query o, b, a.]
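The pair-based bounds for this example can be reproduced with a short sketch. It assumes the per-document summary is a dictionary from unordered keyword pairs to (hmin, hmax) tuples and that the query has at least two distinct keywords; the function and variable names are illustrative only.

```python
from itertools import combinations


def height_bounds(pair_heights, query):
    """Return (Hmin, Hmax) for the query, or None if some pair is missing."""
    if len(query) < 2:
        raise ValueError("the pair-based estimate needs at least two keywords")
    lows, highs = [], []
    for wi, wj in combinations(query, 2):
        entry = pair_heights.get(frozenset((wi, wj)))
        if entry is None:
            return None  # the two keywords never co-occur in the document
        lows.append(entry[0])
        highs.append(entry[1])
    # Hmin: maximum of the minimum pair heights; Hmax: maximum of the maxima
    return max(lows), max(highs)


# Pair heights from the example: (o, b) -> 1-3, (o, a) -> 2-3, (b, a) -> 1-3
pair_heights = {
    frozenset(("o", "b")): (1, 3),
    frozenset(("o", "a")): (2, 3),
    frozenset(("b", "a")): (1, 3),
}
print(height_bounds(pair_heights, ["o", "b", "a"]))  # (2, 3)
```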
Boolean Goodness Estimation
If Hmin(d, q) > λ, then d is not relevant (no false negatives)
If Hmin(d, q) ≤ λ and Hmax(d, q) ≤ λ, then d is relevant
If Hmin(d, q) ≤ λ and Hmax(d, q) > λ, then d is reported as relevant, but false positives are possible
For the weighted problem and the goodness estimation bounds, see the details in the paper
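The three Boolean cases can be written as a small decision function. The threshold is called `lam`, and the returned labels are illustrative names chosen for this sketch, not terms from the paper.

```python
def classify(h_min, h_max, lam):
    """Boolean relevance of a document from its pair-based height bounds."""
    if h_min > lam:
        return "not relevant"       # every result is higher than the threshold
    if h_max <= lam:
        return "relevant"           # every result is within the threshold
    return "possibly relevant"      # reported as a match; may be a false positive


# Bounds from the earlier example: Hmin = 2, Hmax = 3
for lam in (1, 2, 3):
    print(lam, classify(2, 3, lam))
```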
Even with the optimizations, the information to be maintained may remain large =>
summaries to reduce its size
Our summaries are based on Bloom filters
Summarizing the matrices
Bloom-based Summaries
Bloom Filters
Compact data structures for a probabilistic representation of a set
Used to answer membership queries
Bit vector of m bits, initially set to 0; l hash functions, each mapping to 0 … m - 1
Insert x in the Bloom filter: apply the l hash functions and set the corresponding bits to 1
Test whether y is in the set (look up y): apply the same hash functions again
Tunable probability of a false positive: the probability of incorrectly identifying an element as a match
[Figure: bit vector v of m = 10 bits; inserting x sets the bits at positions h1(x) = 4, h2(x) = 2, h3(x) = 5, h4(x) = 8.]
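The insert and lookup operations can be sketched in a few lines of Python. Deriving the hash functions from salted SHA-256 digests is an assumption made for this example; the slides do not say which hash functions were used.

```python
import hashlib


class BloomFilter:
    """Bit vector of m bits with num_hashes hash functions into 0 .. m - 1."""

    def __init__(self, m, num_hashes):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = [0] * m

    def _positions(self, key):
        # One position per hash function, derived from a salted digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        # True for every inserted key; may also be True for others (false positive).
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter(m=996, num_hashes=4)
bf.add("x")
print("x" in bf)  # True: inserted keys are always found
```

A lookup never misses an inserted key; a larger m or a better-tuned number of hash functions lowers the chance that a key that was never inserted is reported as present.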
Bloom-based Summaries for the Boolean Problem
For each d in D maintain two Bloom filters:
BFmin(d) for the hmin(i, j) and BFmax(d) for the hmax(i, j) values
of each distinct keyword pair (wi, wj) in d
Given a similarity threshold λ, for all (wi, wj) in d
if hmin(i, j) ≤ λ, then (wi, wj) is hashed as one key and inserted into BFmin(d)
if hmax(i, j) ≤ λ, then (wi, wj) is also inserted into BFmax(d)
Bloom-based Summaries for the Boolean Problem
Similarity Evaluation of d to q:
1. every pair of keywords of q is looked up in BFmin(d); if any pair is not found, d is not relevant
2. otherwise, the pairs are also looked up in BFmax(d): if all are found, d is definitely relevant; else d is relevant but with a false positive probability
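The construction and lookup of the two Boolean summaries can be sketched as below. To keep the example short and deterministic, a plain Python `set` stands in for each Bloom filter (so no false positives arise from hashing); any object offering `add` and `in` could be passed as the hypothetical `make_filter` factory instead. The result labels are illustrative names.

```python
from itertools import combinations


def pair_key(wi, wj):
    """Hash a keyword pair as one order-independent key."""
    return "|".join(sorted((wi, wj)))


def build_summaries(pair_heights, lam, make_filter=set):
    """Build (BFmin, BFmax) for one document and threshold lam."""
    bf_min, bf_max = make_filter(), make_filter()
    for (wi, wj), (h_lo, h_hi) in pair_heights.items():
        if h_lo <= lam:
            bf_min.add(pair_key(wi, wj))
        if h_hi <= lam:
            bf_max.add(pair_key(wi, wj))
    return bf_min, bf_max


def evaluate(query, bf_min, bf_max):
    """Boolean similarity evaluation of a document from its two summaries."""
    keys = [pair_key(wi, wj) for wi, wj in combinations(query, 2)]
    if any(k not in bf_min for k in keys):
        return "not relevant"
    if all(k in bf_max for k in keys):
        return "relevant"
    return "possibly relevant"


pair_heights = {("o", "b"): (1, 3), ("o", "a"): (2, 3), ("b", "a"): (1, 3)}
for lam in (1, 2, 3):
    bf_min, bf_max = build_summaries(pair_heights, lam)
    print(lam, evaluate(["o", "b", "a"], bf_min, bf_max))
```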
Bloom-based Summaries for the Weighted Problem
Group the keyword pairs according to their hmin(i, j) (hmax(i, j) ) value and use a separate Bloom filter for each such group - distance
Compute the similarity by applying F on the number of the highest level for which there was a hit for any of the keyword pairs of the query
[Figure: example tree with single-letter node labels (a, b, o, h, and others).]
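One possible reading of the weighted scheme is sketched below: keyword pairs are grouped by their hmin value, each group gets its own filter, and F is applied to the highest level at which any query pair is found. Plain sets stand in for the Bloom filters, and F(h) = 1 / (1 + h) is a hypothetical choice; both are assumptions of this example.

```python
from itertools import combinations


def pair_key(wi, wj):
    return "|".join(sorted((wi, wj)))


def build_level_filters(pair_hmin, make_filter=set):
    """One filter per height level, holding the pairs with that hmin."""
    levels = {}
    for (wi, wj), h in pair_hmin.items():
        levels.setdefault(h, make_filter()).add(pair_key(wi, wj))
    return levels


def weighted_estimate(query, levels, lam, F=lambda h: 1.0 / (1 + h)):
    """Apply F to the highest level hit by any keyword pair of the query."""
    highest = 0
    for wi, wj in combinations(query, 2):
        key = pair_key(wi, wj)
        hits = [h for h, bf in levels.items() if key in bf]
        if not hits:
            return 0.0  # the pair does not occur in the document
        # With real Bloom filters a pair may hit several levels; take the lowest.
        highest = max(highest, min(hits))
    return F(highest) if highest <= lam else 0.0


pair_hmin = {("o", "b"): 1, ("o", "a"): 2, ("b", "a"): 1}
levels = build_level_filters(pair_hmin)
print(weighted_estimate(["o", "b", "a"], levels, lam=4))
```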
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
Experimental Evaluation
We consider four approaches for goodness evaluation:
keyword: ignores structure; based solely on the appearance of the keywords
tree: exact evaluation based on ELCA semantics
pair: pairwise estimation
bloom: pairwise estimation + Bloom-based summaries
Experiments on both synthetic and real datasets
goodness estimation of a single collection
accuracy of the ranking based on goodness
Goodness Estimation (Single Collection)
Using Bloom filters increases the estimation error but also reduces the storage overhead to 8% of the pair-based one
Due to false positives, Bloom filters derive more optimistic lower bounds
[Plots: Weighted problem, Boolean problem]
[Plots: Weighted problem, Boolean problem]
For low threshold values, the goodness estimations and the lower bounds are more accurate, while the estimation error increases as the threshold increases
When the threshold value is close to the tree depth of the documents, the accuracy of the estimations improves again
Similarity Threshold (Single Collection)
Document & Query Structure (single collection)
Absolute estimation error (distance from ELCA)
Overall acceptable estimations (below 20%)
Our approaches behave worse for queries of "medium" length (4-5 keywords) and a small number of repeating elements
Achieved ranking
Optimal Ranking (Ranking achieved through the actual ELCA computation) and Pair-wise Ranking (with and without Blooms)
Spearman Footrule distance between two ranked lists: the sum of the absolute differences between the positions of each element in the two lists, normalized by dividing by 1/2(S), where S is the number of elements in the lists
Mean Average Precision (MAP) for a set of different queries: the average of the precision value (percentage of relevant documents) attained after each query, divided by the number of queries
three different collections (same size, different size, random)
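The footrule distance can be sketched as follows. The normalization constant is an assumption: the code divides by S²/2, the usual upper bound of the footrule sum for two lists of S elements, which keeps the result between 0 and 1; the exact constant used in the experiments should be taken from the paper.

```python
def footrule(ranking_a, ranking_b):
    """Normalized Spearman footrule distance between two rankings.

    Both lists must contain the same elements, each exactly once.
    """
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    total = sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))
    s = len(ranking_a)
    return total / (s * s / 2)


optimal = ["2004", "2002", "2003", "2005"]
print(footrule(optimal, optimal))                           # 0.0
print(footrule(optimal, ["2005", "2003", "2002", "2004"]))  # 1.0
```

Identical rankings give 0, and a fully reversed list of four elements gives 1.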
Ranking (Spearman)
[Plots: Equal Size Collections, Different Size Collections, Random Collections]
The keyword-based approach ignores the document structure and ranks the collections according to their size
Our approaches behave well, with a maximum distance to the actual ranking of 0.3 in the worst case
The Bloom-based approach sometimes outperforms the pair-based one due to its more optimistic estimations
[Plots: Equal Size Collections, Different Size Collections, Random Collections]
Our approaches behave well, with a MAP of around 0.75 to 0.85
The Bloom-based approach is less precise because of the false positives
Ranking (MAP)
We split the DBLP bibliographic data collection into two sets of collections, grouped by:
1. year of publication (e.g., collections "2009", "2008", etc.)
2. conference name (e.g., collections "WWW", "VLDB", etc.)
Queries with author names as keywords
With λ equal to 1, we retrieve publications co-written by two authors

Query "Omar Benjelloun and Serge Abiteboul", collections by year
Correct order (by counting common publications): 2004, 2002, 2003, 2005
Approach         SF distance   Precision
Pair-based       0             1
Bloom-based      0.2           5/6
Keyword-based    0.46          1/2

Query "Alon Y. Halevy and Zachary G. Ives", collections by conference
Correct order (by counting common publications): SIGMOD, WebDb, WWW, ...
Approach         SF distance   Precision
Pair-based       0             1
Bloom-based      0.75          6/8
Keyword-based    0.85          1/3
Real Data
Summary
Consider the problem of source selection for XML documents: given a set of XML databases and a keyword query, rank the databases based on their goodness
(LCA-based ranking) Maintain information about the height of the LCA node between keywords
Propose a pair-wise approach: the actual height for a combination of keywords is estimated using the pair-wise heights
Introduce Bloom-based summaries for maintaining heights
Both a Boolean and a Weighted version for document similarity
Evaluation of the quality of the goodness estimation per collection and the actual ranking, as well as usefulness for real data
Future Work
Other definitions of document relevance (including schema based and IR techniques)
Alternative definitions of database goodness + user study for their evaluation
Other types of summaries
Thank you
Related Work
(for each LCA type: description and definition)

Smallest LCA (SLCA) (Yu & Papakonstantinou, SIGMOD '05)
Description: v is an SLCA if all keywords of q appear in the subtree rooted at v and none of its descendants has such a subtree containing all keywords.
Definition: v ∈ slca(S1, S2, …, Sk) if v ∈ lca(S1, …, Sk) and ∀ u ∈ lca(S1, S2, …, Sk), v is not an ancestor of u.

Exclusive LCA (ELCA) (Yu & Papakonstantinou, EDBT '08)
Description: v is an ELCA if it contains at least one occurrence of each keyword in the subtree rooted at v, excluding the occurrences of the keywords in subtrees of its descendants already containing all the keywords.
Definition: v ∈ elca(S1, S2, …, Sk) iff ∃ v1 ∈ S1, …, vk ∈ Sk: v = lca(v1, …, vk) and ∀ vi (1 ≤ i ≤ k) the child of v in the path from v to vi is not an LCA of S1, …, Sk itself or an ancestor of such an LCA.

Meaningful LCA (MLCA) (Li et al., VLDB '04)
Description: v is an MLCA if in the subtree rooted at v, the nodes containing the keywords are pairwise meaningfully related.
Definition: v is not an MLCA if all pairs of nodes (vi, vj) in the subtree rooted at v that contain the keywords of q are such that ∃ v'i, v'j containing the same keywords such that lca(vi, vj) is an ancestor of lca(v'i, v'j).

Valuable LCA (VLCA) (Li et al., CIKM '07)
Description: v is a VLCA iff for the nodes vi, vj, containing keywords (wi, wj), in the subtree rooted at v, there are no other two nodes of the same label/tag except vi, vj.
Definition: for v = lca(v1, …, vk), v is the VLCA of v1, …, vk iff ∀ vi, vj there are no other two nodes of the same label/tag.

For all the variations of the LCA, for any query q and document d the set of the LCA nodes of the keywords in q (basic LCA nodes) is a superset of any type of LCA nodes, i.e., SLCA, ELCA, MLCA, VLCA
Experimental Evaluation
Parameter                                   Range    Default
# of documents per collection (|D|)         20-200   100
# of elements per document (n)              -        50000
depth of XML tree (depth)                   4-20     12
% of repeating element names (r)            0-0.6    0.3
query elements appearing in documents       -        90%
query length (k)                            1-6      4
similarity threshold (λ)                    1-12     4
number of collections (N)                   -        12
number of Bloom filter hash functions       -        4
size of Bloom filter                        -        996 bits