LCA-Based Selection for XML Document Collections

LCA-Based Selection for XML Document Collections. Georgia Koloniari, joint work with Evaggelia Pitoura. Department of Computer Science, University of Ioannina, Greece. http://dmod.cs.uoi.gr


Page 1: LCA-Based Selection for XML Document Collections

LCA-Based Selection for XML Document Collections

Georgia Koloniari, joint work with Evaggelia Pitoura

Department of Computer Science, University of Ioannina, Greece

http://dmod.cs.uoi.gr

Page 2: LCA-Based Selection for XML Document Collections

DMOD Laboratory, University of Ioannina HDMS 2010

Fundamental question:

Given a query and many available data sources with large volumes of data,

select the most relevant sources for the query/filter out the irrelevant ones

What is the topic of this talk?

Page 3: LCA-Based Selection for XML Document Collections

More formally:

Source/Database Selection: Problem Definition

Given a query q and a set of data sources, rank the data sources according to the relevance (called goodness) of their data to q

Evaluate q against the most relevant (best) data sources

What is the topic of this talk?

Database selection for:

relational databases: Sayyadian et al. [ICDE '07], Yu et al. [SIGMOD '07], Vu et al. [SIGMOD '08]

textual document collections: Callan et al. [SIGIR '95], Gravano et al. [ACM Trans. Database Syst. '99]

Source Selection Problem: Previous research

However, many data sources contain XML documents

Page 4: LCA-Based Selection for XML Document Collections

In this paper,

The source selection problem for XML Document Collections

Given a set of N distributed collections of XML documents and a query q, rank the collections based on their goodness (i.e., relevance) to q

Keyword queries, q = (w1, w2, …, wk)

XML Selection Problem: Definition

Page 5: LCA-Based Selection for XML Document Collections

OUTLINE

In the rest of this talk,

What is different with XML?

our LCA-based approach

Define goodness for a database of XML documents

How to compute goodness for a given query

using pre-computed summaries

Experimental evaluation

Page 6: LCA-Based Selection for XML Document Collections

[Figure: XML tree of a conf document (cname "WWW", year "2010") with two paper elements and one demo element, each with a title and author/name children; one paper has title "RDF" and author "Atre".]

Query: Atre RDF

Keyword search for XML Documents: an example

Search for nodes that contain the keywords (as their label, their content, or the label or value of their attributes)

Result: the subtrees whose nodes contain all the keywords

Page 7: LCA-Based Selection for XML Document Collections

[Figure: XML tree of a conf document (cname "HDMS", year "2010") with two paper elements and one demo element, each with a title and author/name children; one paper has title "RDF" and author "Atre".]

Query: Atre RDF

Keyword search for XML Documents: an example

The Lowest Common Ancestor (LCA) of a set of nodes V' = {v1, . . . , vk} (V' ⊆ V) is the deepest node v in a tree T which is an ancestor of all nodes in V'
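The definition above can be sketched in a few lines. This is a minimal illustration, not the method from the talk: the toy tree `PARENT`, the node names, and the helper functions are hypothetical, and a node is treated as an ancestor of itself.

```python
# Hypothetical toy tree given as child -> parent; the root maps to None.
PARENT = {
    "conf": None,
    "paper": "conf", "demo": "conf",
    "title": "paper", "author": "paper",
    "RDF": "title", "name": "author", "Atre": "name",
}

def ancestors(node, parent):
    """Return the path from node up to the root, node itself included."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def lca(nodes, parent):
    """Deepest node that lies on the root path of every node in `nodes`."""
    paths = [ancestors(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first node, the first common node is the deepest one.
    return next(n for n in paths[0] if n in common)
```

For the query "Atre RDF", `lca(["RDF", "Atre"], PARENT)` walks up from both nodes and meets at the enclosing `paper` element.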

Page 8: LCA-Based Selection for XML Document Collections

Keyword query q = (w1, w2, …, wk)

An unordered labeled XML tree T = (V, E) of an XML document d

An element (node) v ∈ V contains a keyword wi - contains(v, wi)

Si = {v|v ∈ V and contains(v, wi)}, 1 ≤ i ≤ k

(set of nodes that contain keyword wi)

Result(q) ⊆ lca(S1, . . . , Sk) (basic LCA approach)

lca(S1, . . . , Sk) is the set of LCA nodes v such that v = lca(v1, . . . , vk) for some v1 ∈ S1, . . . , vk ∈ Sk

(at least one occurrence of each keyword)

Keyword search for XML Documents: LCA semantics
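A brute-force sketch of the basic LCA semantics may help fix the notation. It enumerates every combination (v1, . . . , vk) with vi ∈ Si and collects the LCAs. The document `DOC`, its node ids, and the substring-based `contains` test are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical document: node id -> (parent id, text contained in the node).
DOC = {
    0: (None, "conf"),
    1: (0, "paper"), 2: (1, "RDF"), 3: (1, "Atre"),
    4: (0, "paper"), 5: (4, "facet"), 6: (4, "van Zwol"),
}

def path_to_root(v, doc):
    path = []
    while v is not None:
        path.append(v)
        v = doc[v][0]
    return path

def lca(nodes, doc):
    paths = [path_to_root(v, doc) for v in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    return next(v for v in paths[0] if v in common)

def basic_lca(query, doc):
    """lca(S1, ..., Sk): the LCA of every combination (v1, ..., vk), vi in Si."""
    # Si = nodes whose text contains keyword wi (a simplified contains()).
    sets = [[v for v, (_, text) in doc.items() if w.lower() in text.lower()]
            for w in query]
    return {lca(combo, doc) for combo in product(*sets)}
```

If some keyword does not occur in the document, one Si is empty and the result set is empty, which matches the requirement of at least one occurrence of each keyword.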

Page 9: LCA-Based Selection for XML Document Collections

Query: paper van Zwol

Keyword search for XML Documents: LCA semantics

[Figure: XML tree of a conf document (cname "WWW", year "2010") with two paper elements and one demo element; the SLCA node for the query is marked.]

Page 10: LCA-Based Selection for XML Document Collections

Query: paper van Zwol

Keyword search for XML Documents: LCA semantics

[Figure: the same XML tree of a conf document (cname "WWW", year "2010"); two ELCA nodes for the query are marked.]

Page 11: LCA-Based Selection for XML Document Collections

Lowest Common Ancestor

Many variations

Only structural (Smallest LCA, Exclusive LCA, etc)

Schema of the documents (Meaningful LCA, Valuable LCA, based also on node/element types)

in addition IR-based statistics

We do not propose yet another one; instead, we use the basic LCA (the Result(q) set)

Most others can be implemented by filtering our results (details in the paper)

Experimental evaluation on ELCA

Page 12: LCA-Based Selection for XML Document Collections

Query: paper van Zwol

Keyword search for XML Documents: Ranking

[Figure: the same XML tree of a conf document (cname "WWW", year "2010"); two ELCA nodes for the query are marked.]

Structure is used to improve the quality of the result -> rank results based on the distance of the keywords from their LCA

Page 13: LCA-Based Selection for XML Document Collections

Keyword search for XML Documents: Ranking

$h(v) = \max_{i} dist(v, v_i)$

h(v): the height of the LCA node v ∈ Result(q), i.e., the maximum distance of any of the keywords of q in the XML tree to their LCA node

[Figure: example tree for the query (o, b), with LCA nodes annotated "Height: 2" and "Height: 1".]
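The height of an LCA node can be computed from node depths, since each keyword node lies in the subtree of its LCA. The sketch below uses a hypothetical tree (`PARENT`) that reproduces heights of 2 and 1 for the query (o, b); it is not the tree drawn on the slide.

```python
# Hypothetical tree as child -> parent; "o1"/"o2" contain o, "b1"/"b2" contain b.
PARENT = {
    "root": None,
    "x": "root", "b1": "x", "f": "x", "o1": "f",
    "y": "root", "o2": "y", "b2": "y",
}

def depth(v, parent):
    """Number of edges from v up to the root."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d

def height(lca_node, keyword_nodes, parent):
    """h(v) = max_i dist(v, v_i), for keyword nodes v_i in the subtree of v."""
    return max(depth(v, parent) - depth(lca_node, parent) for v in keyword_nodes)
```

Here `height("x", ["o1", "b1"], PARENT)` is 2 because o1 sits two edges below x, while `height("y", ["o2", "b2"], PARENT)` is 1.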

Page 14: LCA-Based Selection for XML Document Collections

Query: paper van Zwol

Keyword search for XML Documents: Ranking

[Figure: the same XML tree of a conf document (cname "WWW", year "2010"); the two ELCA nodes are annotated with heights 4 and 3.]

Page 15: LCA-Based Selection for XML Document Collections

[Figure: XML tree of a conf document (name "WWW", year "2010") with two paper elements and one demo element, each with a title and author/name children; the authors include "Atre" and "Pound".]

Query: demo RDF Pound

Keyword search for XML Documents: Relevance

Not all trees that contain the keywords are relevant

Exclude some of the results as not relevant based on height

Page 16: LCA-Based Selection for XML Document Collections

Database Selection: Document relevance

A user is interested in d as a result for q iff the distance (height) of a result in d is lower than or equal to a threshold λ

Boolean Problem:

$sim(q, d, \lambda) = \begin{cases} 1, & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$

Weighted Problem:

$sim(q, d, \lambda) = \begin{cases} F(\min_{v \in Result(q)} h(v)), & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$

F(h(v)): a function F of the height h of a result node v such that the similarity of d to q is greater when h(v) is small
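Both similarity variants can be sketched as small functions over the heights of the result nodes of one document. The function names and the default choice F(h) = 1 / (1 + h) are assumptions for illustration; the talk only requires F to give larger values for smaller heights.

```python
def boolean_sim(result_heights, lam):
    """1 if the lowest result height is within the threshold lam, else 0."""
    if not result_heights:          # no result node: the document cannot match
        return 0
    return 1 if min(result_heights) <= lam else 0

def weighted_sim(result_heights, lam, F=lambda h: 1.0 / (1 + h)):
    """F(min height) if within the threshold, else 0. F is an assumed choice."""
    if not result_heights:
        return 0.0
    h = min(result_heights)
    return F(h) if h <= lam else 0.0
```

With result heights 4 and 3 and a threshold of 3, the Boolean variant returns 1 and the weighted variant returns F(3).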

Page 17: LCA-Based Selection for XML Document Collections

A database D is ranked based on its goodness to q by aggregating the relevance of its documents

Database Selection

$Goodness(q, D, \lambda) = \sum_{d \in D} sim(q, d, \lambda)$

The goodness measure ranks highly collections that:

have a large number of documents with a relatively small similarity score

have fewer documents but with higher similarity scores

The threshold limits the tendency to favor large collections in contrast to more relevant ones
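As a small illustration of the aggregation, goodness can be computed as a plain sum of per-document similarity scores. The two collections and their scores below are made-up values, chosen only to show that a small collection of highly similar documents can outrank a larger one.

```python
def goodness(similarities):
    """Goodness of a collection: the sum of sim(q, d, lambda) over its documents."""
    return sum(similarities)

# Hypothetical per-document similarity scores for two collections.
large_weak = [0.2] * 10          # many documents, small scores
small_strong = [0.9, 0.8, 0.7]   # fewer documents, higher scores

# Rank collections by decreasing goodness.
ranking = sorted(
    {"large_weak": large_weak, "small_strong": small_strong}.items(),
    key=lambda kv: goodness(kv[1]),
    reverse=True,
)
```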

Page 18: LCA-Based Selection for XML Document Collections

OUTLINE

In the rest of this talk,

What is different with XML?

our LCA-based approach

Define goodness for a database of XML documents

How to compute goodness for a given query

using pre-computed summaries

Experimental evaluation

Page 19: LCA-Based Selection for XML Document Collections

Goodness Estimation

To estimate the goodness of a collection D for a keyword query, the straightforward approach is:

1. For each document d ∈ D:
   Evaluate q against d
   Find all the LCA nodes in d of the k keywords that appear in q (Result(q))
   Select v ∈ Result(q) with the minimum height
   If h(v) ≤ λ: the Boolean model returns a match; the weighted model computes the similarity based on function F

2. Aggregate over all d ∈ D

Computing LCA online for each query is expensive

Page 20: LCA-Based Selection for XML Document Collections

Goodness Estimation

To avoid this cost at execution time: pre-compute the LCA nodes for all possible combinations of keywords that appear in each d and maintain their heights

Number of computed LCA nodes for an XML document with n keywords:

$\sum_{i=2}^{n} \binom{n}{i} = O(2^n)$
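The growth of this count is easy to check numerically. The helper below (a hypothetical name) sums the number of keyword subsets of size 2 or more, which equals 2^n - n - 1.

```python
from math import comb

def lca_combinations(n):
    """Keyword subsets of size >= 2 among n keywords: sum_{i=2}^{n} C(n, i)."""
    return sum(comb(n, i) for i in range(2, n + 1))
```

For example, a document with only 10 distinct keywords already has 1013 such combinations, and 20 keywords give more than a million.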

Page 21: LCA-Based Selection for XML Document Collections

Pair-Wise Goodness Estimation

OUR APPROACH

We maintain information for the height only for pairs of keywords and use this to estimate the height of the LCA for more than 2 keywords

For each distinct pair of keywords (wi, wj) in a document d, we maintain

the height hmin(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≤ h(u), ∀ u ∈ lca(Si, Sj) (the lowest LCA) and

the height hmax(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≥ h(u), ∀u ∈ lca(Si, Sj) (the highest LCA)
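The per-pair information can be sketched as a table mapping each keyword pair to (hmin, hmax). The document `DOC` is hypothetical, each node holds a single keyword, and the height is evaluated per combination of one node from Si and one from Sj, which is an assumption about how the heights of lca(Si, Sj) are obtained.

```python
from itertools import combinations, product

# Hypothetical document: node id -> (parent id, keyword contained in the node).
DOC = {0: (None, "r"), 1: (0, "x"), 2: (1, "o"), 3: (1, "b"),
       4: (0, "y"), 5: (4, "f"), 6: (5, "o"), 7: (4, "b")}

def path_to_root(v):
    path = []
    while v is not None:
        path.append(v)
        v = DOC[v][0]
    return path

def lca_height(u, v):
    """Height of lca(u, v): the larger of the two distances to the LCA."""
    pu, pv = path_to_root(u), path_to_root(v)
    top = next(a for a in pu if a in pv)      # deepest common ancestor
    return max(pu.index(top), pv.index(top))

def pair_heights():
    """Map each keyword pair (wi, wj) to (hmin, hmax) over lca(Si, Sj)."""
    nodes = {}
    for v, (_, w) in DOC.items():
        nodes.setdefault(w, []).append(v)
    table = {}
    for wi, wj in combinations(sorted(nodes), 2):
        hs = [lca_height(u, v) for u, v in product(nodes[wi], nodes[wj])]
        table[(wi, wj)] = (min(hs), max(hs))
    return table
```

In this toy document the pair (b, o) gets hmin = 1 (the two siblings under x) and hmax = 3 (nodes that only meet at the root).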

Page 22: LCA-Based Selection for XML Document Collections

Proposition. Let G(V, E) be an acyclic directed graph, and V' = {v1, . . . , vM} any subset of M nodes in G, V' ⊆ V. Then,

$h(lca(v_1, \ldots, v_M)) = \max_{v_i, v_j \in V'} h(lca(v_i, v_j))$

Pairwise-based Height Estimation

If the keywords are distinct (just a single LCA), then it is easy to see that the height is equal to the maximum

Else, we get estimations

Page 23: LCA-Based Selection for XML Document Collections

Pair-based Height Estimation

Hmin(d, q): the maximum value of the minimum LCA height values for any pair of keywords in q

Hmax(d, q): the maximum value of the maximum LCA height values for any pair of keywords in q

(o, b) → 1-3
(o, a) → 2-3
(b, a) → 1-3

Hmin(d, q): 2
Hmax(d, q): 3

Theorem. Given a keyword query q and a document d, the height of any v ∈ Result(q) is such that: Hmin(d, q) ≤ h(v) ≤ Hmax(d, q)

Query: o, b, a

[Figure: example tree for the query (o, b, a), with keyword nodes o, b and a.]
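The bounds Hmin(d, q) and Hmax(d, q) follow directly from the per-pair table. The sketch below hard-codes the pair heights of the example above; the dictionary layout and the function name are assumptions.

```python
from itertools import combinations

# (hmin, hmax) per keyword pair, taken from the example on the slide.
PAIR_HEIGHTS = {
    frozenset({"o", "b"}): (1, 3),
    frozenset({"o", "a"}): (2, 3),
    frozenset({"b", "a"}): (1, 3),
}

def height_bounds(query, pair_heights):
    """Return (Hmin, Hmax): the maxima of hmin and hmax over all query pairs."""
    pairs = [frozenset(p) for p in combinations(query, 2)]
    h_min = max(pair_heights[p][0] for p in pairs)
    h_max = max(pair_heights[p][1] for p in pairs)
    return h_min, h_max
```

For the query (o, b, a) this yields Hmin = 2 and Hmax = 3, the values stated in the example.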

Page 24: LCA-Based Selection for XML Document Collections

Boolean Goodness Estimation

If Hmin(d, q) > λ: not relevant (no false negatives)

If Hmin(d, q) ≤ λ and Hmax(d, q) ≤ λ: relevant

If Hmin(d, q) ≤ λ and Hmax(d, q) > λ: relevant, but false positives are possible

For the weighted problem and the goodness estimation: bounds, details in the paper
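The three Boolean cases translate into a short decision function. The function name and the returned labels are hypothetical; the logic relies only on the bound Hmin(d, q) ≤ h(v) ≤ Hmax(d, q).

```python
def classify(h_min, h_max, lam):
    """Three-way Boolean decision from the bounds Hmin(d, q) and Hmax(d, q)."""
    if h_min > lam:
        return "not relevant"        # every result is higher than lam
    if h_max <= lam:
        return "relevant"            # every result is within lam
    return "possibly relevant"       # bounds straddle lam: may be a false positive
```

With Hmin = 2 and Hmax = 3, a threshold of 1 rejects the document, a threshold of 3 accepts it, and a threshold of 2 leaves it as a candidate that may turn out to be a false positive.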

Page 25: LCA-Based Selection for XML Document Collections

Even with the optimizations, the information to be maintained may remain large =>

summaries to reduce its size

Our summaries are based on Bloom filters

Summarizing the matrices

Page 26: LCA-Based Selection for XML Document Collections

Bloom-based Summaries

Bloom Filters

Compact data structures for a probabilistic representation of a set

Used to answer membership queries

Bit vector of m bits, initially set to 0, and l hash functions, each mapping an element to a position in 0..m-1

Insert x in the Bloom filter: apply the l hash functions and set the corresponding bits to 1

Test whether y is in the set (look up y): apply the same hash functions again and check the corresponding bits

Tunable probability of false positives: the probability of incorrectly identifying an element as a match

[Figure: bit vector v with m = 10 bits; inserting x sets four bits via h1(x) = 4, h2(x) = 2, h3(x) = 5, h4(x) = 8.]
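A minimal Bloom filter sketch follows. The class and method names are hypothetical, and deriving the hash functions by salting SHA-256 is an illustrative choice, not the hash family used in the talk.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=10, num_hashes=4):
        self.m = m                      # number of bits
        self.num_hashes = num_hashes    # number of hash functions
        self.bits = [0] * m             # bit vector, initially all 0

    def _positions(self, key):
        # Derive num_hashes positions in 0..m-1 by salting one SHA-256 hash
        # (an illustrative choice of hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))
```

After `bf.insert("x")`, the test `"x" in bf` always succeeds, while a key that was never inserted is reported as present only when all of its positions happen to be set by other keys.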

Page 27: LCA-Based Selection for XML Document Collections

Bloom-based Summaries for the Boolean Problem

For each d in D maintain two Bloom filters:

BFmin(d) for the hmin(i, j) and BFmax(d) for the hmax(i, j) values of each distinct keyword pair (wi, wj) in d

Given a similarity threshold λ, for all (wi, wj) in d:

if hmin(i, j) ≤ λ, then (wi, wj) is hashed as one key and inserted into BFmin(d)

if hmax(i, j) ≤ λ, then (wi, wj) is also inserted into BFmax(d)

Page 28: LCA-Based Selection for XML Document Collections

Bloom-based Summaries for the Boolean Problem

Similarity Evaluation of d to q:

1. every pair of keywords of q is looked up in BFmin(d); if any pair is not found, d is not relevant

2. otherwise, the pairs are also looked up in BFmax(d): if found, d is definitely relevant; else d is relevant but with a false positive probability
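The construction and lookup of the two filters can be sketched end to end. Everything named here (`Bloom`, `pair_key`, `summarize`, `evaluate`, the returned labels) is hypothetical; the filter size and hash count default to the 996 bits and 4 hash functions listed among the experimental parameters, and the pair heights reuse the (o, b, a) example.

```python
import hashlib
from itertools import combinations

class Bloom:
    """Tiny Bloom filter with salted SHA-256 hashes (an illustrative choice)."""
    def __init__(self, m=996, hashes=4):
        self.m, self.hashes, self.bits = m, hashes, [0] * m

    def _pos(self, key):
        return [int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8],
                               "big") % self.m for i in range(self.hashes)]

    def add(self, key):
        for p in self._pos(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._pos(key))

def pair_key(wi, wj):
    return "|".join(sorted((wi, wj)))     # one key per unordered keyword pair

def summarize(pair_heights, lam):
    """Build BFmin(d) and BFmax(d) for the threshold lam."""
    bf_min, bf_max = Bloom(), Bloom()
    for (wi, wj), (hmin, hmax) in pair_heights.items():
        if hmin <= lam:
            bf_min.add(pair_key(wi, wj))
        if hmax <= lam:
            bf_max.add(pair_key(wi, wj))
    return bf_min, bf_max

def evaluate(query, bf_min, bf_max):
    keys = [pair_key(wi, wj) for wi, wj in combinations(query, 2)]
    if not all(k in bf_min for k in keys):
        return "not relevant"             # Bloom filters have no false negatives
    if all(k in bf_max for k in keys):
        return "relevant"
    return "possibly relevant"

PAIRS = {("o", "b"): (1, 3), ("o", "a"): (2, 3), ("b", "a"): (1, 3)}
```

Calling `evaluate(("o", "b", "a"), *summarize(PAIRS, 3))` reports the document as relevant, while a threshold of 2 leaves it as a candidate because no pair satisfies hmax ≤ 2.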

Page 29: LCA-Based Selection for XML Document Collections

Bloom-based Summaries for the Weighted Problem

Group the keyword pairs according to their hmin(i, j) (hmax(i, j)) value and use a separate Bloom filter for each such group (one per distance)

Compute the similarity by applying F on the number of the highest level for which there was a hit for any of the keyword pairs of the query

[Figure: example XML tree illustrating the grouping of keyword pairs by distance.]

Page 30: LCA-Based Selection for XML Document Collections

OUTLINE

In the rest of this talk,

What is different with XML?

our LCA-based approach

Define goodness for a database of XML documents

How to compute goodness for a given query

using pre-computed summaries

Experimental evaluation

Page 31: LCA-Based Selection for XML Document Collections

Experimental Evaluation

We consider four approaches for goodness evaluation:

keyword: ignores structure, based solely on the appearance of the keywords

tree: exact evaluation based on ELCA semantics

pair: pairwise estimation

bloom: pairwise + Bloom-based summaries

Experiments on both synthetic and real datasets

goodness estimation of a single collection

accuracy of the ranking based on goodness

Page 32: LCA-Based Selection for XML Document Collections

Goodness Estimation (Single Collection)

Using Bloom filters increases the estimation error but also reduces the storage overhead to 8% of the pair-based one

Due to false positives, Bloom filters derive more optimistic lower bounds

[Figure: two panels, Weighted and Boolean.]

Page 33: LCA-Based Selection for XML Document Collections

[Figure: two panels, Weighted and Boolean.]

For low threshold values, the goodness estimations and the lower bounds are more accurate, while the estimation error increases as the threshold increases

When the threshold value is close to the tree depth of the documents, the accuracy of the estimations improves again

Similarity Threshold (Single Collection)

Page 34: LCA-Based Selection for XML Document Collections

Document & Query Structure (single collection)

Absolute estimation error (distance from ELCA)

Overall acceptable estimations (below 20%)

Our approaches behave worse for queries of "medium" length (4-5) and a small number of repeating elements

Page 35: LCA-Based Selection for XML Document Collections

Achieved ranking

Optimal Ranking (Ranking achieved through the actual ELCA computation) and Pair-wise Ranking (with and without Blooms)

Spearman Footrule distance between two ranked lists: the sum of the absolute differences between the positions of each element in the two lists, normalized by dividing by 1/2(S), where S is the number of elements in the lists

Mean Average Precision (MAP) for a set of different queries: the average of the precision value (percentage of relevant documents) attained after each query, divided by the number of queries

three different collections (same size, different size, random)
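The Spearman Footrule distance can be sketched as follows. The function name is hypothetical, and the normalizer used here, floor(S²/2), is the common choice (the largest possible total displacement) and an assumption about the normalization intended on the slide.

```python
def spearman_footrule(ranking_a, ranking_b):
    """Normalized footrule distance between two rankings of the same items."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    # Total displacement: how far each item moves between the two rankings.
    total = sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))
    s = len(ranking_a)
    max_total = (s * s) // 2      # assumed normalizer: largest possible total
    return total / max_total if max_total else 0.0
```

Identical rankings give 0, a fully reversed ranking gives 1, and swapping two adjacent collections in a list of four gives 0.25.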

Page 36: LCA-Based Selection for XML Document Collections

Ranking (Spearman)

[Figure: three panels: Equal Size Collections, Different Size Collections, Random Collections.]

The keyword-based approach ignores the document structure and ranks the collections according to their size

Our approaches behave well, with maximum distance to the actual ranking at 0.3 in the worst case

The Bloom-based approach sometimes outperforms the pair-based one due to the more optimistic estimations

Page 37: LCA-Based Selection for XML Document Collections

[Figure: three panels: Equal Size Collections, Different Size Collections, Random Collections.]

Our approaches behave well, with a MAP around 0.75 to 0.85

The Bloom-based approach is less precise because of the false positives

Ranking (MAP)

Page 38: LCA-Based Selection for XML Document Collections

Real Data

We split the DBLP bibliographic data collection into two sets of collections, grouped by:

1. year of publication (e.g., collections "2009", "2008", etc.)
2. conference name (e.g., collections "WWW", "VLDB", etc.)

Queries with author names as keywords

With λ equal to 1, we retrieve publications co-written by two authors

Query: "Omar Benjelloun and Serge Abiteboul", collections by year
Correct order (by counting common publications): 2004 2002 2003 2005

Approach        SF distance   Precision
Pair-based      0             1
Bloom-based     0.2           5/6
Keyword-based   0.46          1/2

Query: "Alon Y. Halevy and Zachary G. Ives", collections by conference
Correct order (by counting common publications): SIGMOD, WebDb, WWW, ...

Approach        SF distance   Precision
Pair-based      0             1
Bloom-based     0.75          6/8
Keyword-based   0.85          1/3

Page 39: LCA-Based Selection for XML Document Collections

Summary

Consider the problem of source selection for XML documents: given a set of XML databases and a keyword query, rank the databases based on their goodness

(LCA-based ranking) Maintain information about the height of the LCA node between keywords

Propose a pair-wise approach: the actual height for a combination of keywords is estimated using the pair-wise heights

Introduce Bloom-based summaries for maintaining heights

Both a Boolean and a Weighted version for document similarity

Evaluation of the quality of the goodness estimation per collection and the actual ranking, as well as usefulness for real data

Page 40: LCA-Based Selection for XML Document Collections

Future Work

Other definitions of document relevance (including schema based and IR techniques)

Alternative definitions of database goodness + user study for their evaluation

Other types of summaries

Page 41: LCA-Based Selection for XML Document Collections

Thank you

Page 42: LCA-Based Selection for XML Document Collections

Related Work

Smallest LCA (SLCA) (Xu & Papakonstantinou, SIGMOD '05)
Description: v is an SLCA if all keywords of q appear in the subtree rooted at v and none of its descendants has such a subtree containing all keywords.
Definition: v ∈ slca(S1, S2, . . ., Sk) if v ∈ lca(S1, . . ., Sk) and ∀ u ∈ lca(S1, S2, . . ., Sk), v is not an ancestor of u.

Exclusive LCA (ELCA) (Xu & Papakonstantinou, EDBT '08)
Description: v is an ELCA if it contains at least one occurrence of each keyword in the subtree rooted at v, excluding the occurrences of the keywords in subtrees of its descendants already containing all the keywords.
Definition: v ∈ elca(S1, S2, . . ., Sk) iff ∃ v1 ∈ S1, . . ., vk ∈ Sk: v = lca(v1, . . ., vk) and ∀ vi (1 ≤ i ≤ k) the child of v in the path from v to vi is not an LCA of S1, . . ., Sk itself or an ancestor of such an LCA.

Meaningful LCA (MLCA) (Li et al., VLDB '04)
Description: v is an MLCA if in the subtree rooted at v, the nodes containing the keywords are pairwise meaningfully related.
Definition: v is not an MLCA if all pairs of nodes (vi, vj) in the subtree rooted at v that contain the keywords of q are such that ∃ v'i, v'j containing the same keywords such that lca(vi, vj) is an ancestor of lca(v'i, v'j).

Valuable LCA (VLCA) (Li et al., CIKM '07)
Description: v is a VLCA iff for the nodes vi, vj, containing keywords (wi, wj), in the subtree rooted at v, there are no other two nodes of the same label/tag except vi, vj.
Definition: For v = lca(v1, . . ., vk), v is the VLCA of v1, . . ., vk iff ∀ vi, vj there are no other two nodes of the same label/tag.

For all the variations of the LCA, for any query q and document d, the set of the LCA nodes of the keywords in q (basic LCA nodes) is a superset of any type of LCA nodes, i.e., SLCA, ELCA, MLCA, VLCA

Page 43: LCA-Based Selection for XML Document Collections

Experimental Evaluation

Parameter                                    Range    Default
# of documents per collection (|D|)          20-200   100
# of elements per document (n)               -        50000
depth of XML tree (depth)                    4-20     12
% of repeating element names (r)             0-0.6    0.3
query elements appearing in documents        -        90%
query length (k)                             1-6      4
similarity threshold (λ)                     1-12     4
number of collections (N)                    -        12
number of Bloom filter hash functions        -        4
size of Bloom filter                         -        996 bits