Approximate XML Query Answers


Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

Motivation

XML: de facto standard for data exchange
Development of the “XML Warehouse”
Conflict between “on-line” use and query execution cost
  Increased query response times
  Users might wait for uninteresting results

[Figure: a query Q is evaluated over the XML data warehouse and returns the XML result R]

Approximate Query Answers

Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result

Use approximate result as timely feedback
  User can assess the “value” of the query

Goal: reduce number of evaluated queries

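[Figure: the query Q is evaluated over a synopsis of the XML data warehouse and returns the approximate XML result R’, alongside the true XML result R]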

Contributions

TreeSketch Synopses
  Structural summaries for XML data
  Approximate answers for complex twig queries
  Summarization model: structural clustering of elements
  Efficient processing and construction
Element Simulation Distance
  Novel distance metric for XML data
  Captures “approximate” similarity between two XML trees
Experimental Results
  Accurate approximate answers for low space budgets
  Low-error selectivity estimates
  Efficient construction algorithm

Outline

Preliminaries
TreeSketches
  Synopsis model
  Computing approximate answers
  Summary construction
Element Simulation Distance
Experimental Study
Conclusions

Data and Query Model

[Figure: an example XML document tree, a twig query over it, and the resulting nesting tree]
Twig query: nodes q0 to q3, with path steps //section, .//equation and ./figure
Binding tuples (q3, q2, q1, q0): (e10, f5, s2, r), (e8, f5, s2, r), (e10, f4, s2, r), (e8, f4, s2, r)

Problem Definition

Process twig query over a synopsis
Compute approximation of nesting tree

[Figure: the twig query evaluated over the XML data gives the true nesting tree; evaluated over the synopsis it gives an approximate nesting tree]

TreeSketch Model

Graph Synopsis

Synopsis node <--> set of elements of the same tag
Synopsis edge <--> document edge(s)
[Figure: the example XML document next to its graph synopsis, with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]

TreeSketch Synopsis

Augment graph synopsis with edge counts
  count[u,v]: mean number of children in v per element in u

[Figure: the example XML document next to its TreeSketch; synopsis nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1), with an edge count on each synopsis edge]
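The edge-count definition above can be illustrated with a minimal Python sketch. It assumes the coarsest possible grouping (one synopsis node per tag), whereas the slides allow several nodes for the same tag; the tuple encoding `(tag, children)` and the function name `tag_synopsis` are assumptions made for this example.

```python
from collections import defaultdict

# Assumption: a document element is encoded as (tag, list_of_child_elements).
def tag_synopsis(root):
    """Group elements by tag; return node sizes and mean child counts."""
    size = defaultdict(int)    # tag -> number of elements in that synopsis node
    edges = defaultdict(int)   # (parent tag, child tag) -> number of document edges
    stack = [root]
    while stack:
        tag, children = stack.pop()
        size[tag] += 1
        for child in children:
            edges[(tag, child[0])] += 1
            stack.append(child)
    # count[u, v] = mean number of children in v per element in u
    count = {(u, v): n / size[u] for (u, v), n in edges.items()}
    return dict(size), count

# Hypothetical document: two sections, with two figures and one figure.
doc = ("r", [("s", [("f", []), ("f", [])]),
             ("s", [("f", [])])])
size, count = tag_synopsis(doc)
```

Here the two `s` elements have 2 and 1 figure children, so the synopsis stores the mean 1.5 on the edge from `s` to `f`, and the per-element counts are no longer recoverable.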


TreeSketch Synopsis

Is there a lossless synopsis?
What is the quality of a lossy synopsis?



Count Stability

Edge (u,v) is count-stable: all elements in u have the same number of children in v

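As a small illustration of the count-stability condition, the following Python sketch lists the synopsis edges that violate it for a given assignment of elements to synopsis nodes. The dictionary encodings and the name `unstable_edges` are assumptions of this example, not part of the original slides.

```python
from collections import Counter, defaultdict

def unstable_edges(children, cluster):
    """Return the synopsis edges (u, v) that are not count-stable.

    children: element id -> list of child element ids (assumed encoding)
    cluster:  element id -> synopsis node id (assumed encoding)
    """
    by_node = defaultdict(list)
    for elem, u in cluster.items():
        # Per-element profile: how many children fall in each synopsis node.
        profile = Counter(cluster[c] for c in children.get(elem, []))
        by_node[u].append(profile)
    bad = set()
    for u, profiles in by_node.items():
        targets = set().union(*profiles)   # nodes reached from u by some element
        for v in targets:
            # Stable only if every element of u has the same count (0 included).
            if len({p[v] for p in profiles}) > 1:
                bad.add((u, v))
    return bad

# Hypothetical data: s2 has two figures, s3 has one.
children = {"s2": ["f4", "f5"], "s3": ["f6"]}
by_tag = {"s2": "S", "s3": "S", "f4": "F", "f5": "F", "f6": "F"}
split = {"s2": "S1", "s3": "S2", "f4": "F", "f5": "F", "f6": "F"}
```

With one node per tag the edge (S, F) is unstable because the sections have 2 and 1 figures; splitting the sections into separate nodes makes every edge stable.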


Count-Stable TreeSketch

A count-stable synopsis can recover the input tree
Efficient one-pass construction
Stable summary can be too large for practical use!

[Figure: count-stable TreeSketch of the example document; the node S(2) is replaced by two nodes S(1), each with its own edge counts]
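One way to picture a one-pass construction of a count-stable partition is a post-order traversal that gives two elements the same class exactly when they share a tag and have the same number of children in every class. The sketch below follows that idea; it is an illustrative assumption, not necessarily the construction used by the authors, and the names and the `(tag, children)` encoding are hypothetical.

```python
from collections import Counter

def count_stable_classes(root):
    """Assign a class id to every element in a single post-order pass."""
    ids = {}            # signature -> class id
    sizes = Counter()   # class id -> number of elements in the class

    def visit(node):
        tag, children = node
        # Children are classified first, so their class ids are known here.
        child_classes = Counter(visit(c) for c in children)
        signature = (tag, frozenset(child_classes.items()))
        cid = ids.setdefault(signature, len(ids))
        sizes[cid] += 1
        return cid

    visit(root)
    return ids, sizes

# Hypothetical document: three sections with 2, 1 and 2 figures.
doc = ("r", [("s", [("f", []), ("f", [])]),
             ("s", [("f", [])]),
             ("s", [("f", []), ("f", [])])])
ids, sizes = count_stable_classes(doc)
```

The first and third sections end up in the same class, the second gets a class of its own, and all figures share one class. The recursion is kept for brevity; a very deep document would need an explicit stack.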


Lossy TreeSketch

[Figure: a lossy TreeSketch of the example document, with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1), annotated with the number of F children (#F)]

TreeSketches and Clustering

TreeSketch <--> element clustering
  All elements in a node are mapped to a “centroid”
  Tight clusters => accurate synopsis
Synopsis quality <--> clustering error
  Options: Manhattan distance, squared error, …
  Quality can be measured independently of a workload
  Key for effective construction
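The squared-error option can be sketched as follows, assuming each element of a synopsis node is described by a vector of child counts (one entry per child node) and the centroid is the vector of means, which corresponds to the stored edge counts. The function name and the vector encoding are assumptions for illustration.

```python
def squared_error(vectors):
    """Return (centroid, sum of squared distances of each vector to it)."""
    n = len(vectors)
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dims)]
    error = sum((v[d] - centroid[d]) ** 2 for v in vectors for d in range(dims))
    return centroid, error

# Hypothetical node with two section elements, described by (#F, #E) child counts.
centroid, error = squared_error([(2, 0), (1, 0)])
```

A count-stable node has error 0 because every element coincides with the centroid; the further the elements spread, the larger the error and the less accurate the synopsis.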

Computing Approximate Answers

Compute TreeSketch of approximate answer
Accuracy depends on quality of clustering
[Figure: a twig query with path steps //section, .//equation and .//caption is evaluated over the example TreeSketch, giving an approximate nesting tree over the nodes R, S, E and C]
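To make the evaluation step concrete, here is a minimal sketch of how a descendant step could be estimated over a synopsis: the expected number of matching descendants per element is obtained by multiplying edge counts along each synopsis path and summing over paths. It assumes an acyclic synopsis, and the edge counts, node names and function name are hypothetical rather than taken from the slide figure.

```python
def expected_descendants(edges, tag, u, target):
    """Expected number of descendants with tag `target` per element of node u.

    edges: node -> {child node: mean child count}; tag: node -> element tag.
    Assumes the synopsis graph is acyclic.
    """
    total = 0.0
    for v, c in edges.get(u, {}).items():
        hit = 1.0 if tag[v] == target else 0.0
        total += c * (hit + expected_descendants(edges, tag, v, target))
    return total

# Hypothetical TreeSketch: sections reach captions through two figure nodes.
tag = {"R": "root", "P": "paper", "S": "section", "F1": "figure",
       "F2": "figure", "E": "equation", "C": "caption"}
edges = {"R": {"P": 1}, "P": {"S": 2}, "S": {"F1": 1, "F2": 1},
         "F1": {"E": 1, "C": 1}, "F2": {"C": 1}}

equations = expected_descendants(edges, tag, "S", "equation")
captions = expected_descendants(edges, tag, "S", "caption")
sections = expected_descendants(edges, tag, "R", "section")
```

In this example each section is estimated to have 1 equation and 1·1 + 1·1 = 2 captions, so the approximate nesting tree would contain 2 section nodes, each with one equation child and two caption children.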

TreeSketch Construction

Given an XML tree T, build a TreeSketch of size B
Difficult clustering problem
  Space dimensionality depends on the clustering itself
Construction based on bottom-up clustering
  Compress perfect synopsis by merging clusters
  Best merge determined by marginal gains
  Heuristic to reduce number of candidate merges
[Figure: successive merges compress the perfect synopsis until it fits the space budget]
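The bottom-up idea can be sketched as a greedy loop that repeatedly applies the cheapest merge until the budget is met. This is a deliberately simplified illustration: it assumes fixed-length child-count vectors (the slides note that the real dimensionality changes with the clustering), measures the budget in number of clusters, scores a merge only by its increase in squared error, and examines all pairs instead of a pruned candidate pool. All names are hypothetical.

```python
from itertools import combinations

def sse(vectors):
    """Sum of squared distances of the vectors to their centroid."""
    n = len(vectors)
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum((x - m) ** 2 for v in vectors for x, m in zip(v, centroid))

def greedy_merge(clusters, budget):
    """Merge clusters pairwise until at most `budget` clusters remain."""
    clusters = [list(c) for c in clusters]
    budget = max(budget, 1)
    while len(clusters) > budget:
        def cost(pair):
            i, j = pair
            merged = clusters[i] + clusters[j]
            return sse(merged) - sse(clusters[i]) - sse(clusters[j])
        # combinations yields i < j, so deleting index j keeps index i valid.
        i, j = min(combinations(range(len(clusters)), 2), key=cost)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical perfect synopsis: three clusters of identical vectors.
start = [[(2, 0), (2, 0)], [(1, 0)], [(0, 5)]]
result = greedy_merge(start, budget=2)
```

The two clusters with similar child counts are merged first, because that merge adds the least error; the cluster with a very different profile is left alone.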

Element Simulation Distance

Error of Approximation

Error <--> distance between R’ and R
Popular metric: tree-edit distance
  Min-cost sequence of operations that transform R’ to R
  Measures syntactic differences between R and R’
  Not intuitive for approximate answers!

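[Figure: three trees T1, T and T2 rooted at r, each with two s elements that have e and f children in different proportions; the labels “different counts, similar trait” and “same counts, opposite trait” contrast T with the other two trees]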

Element Simulation Distance

Capture approximate similarity between R and R’
u simulates v: u and v have identical structure
ESD(u,v): “degree” of simulation between u and v
  How well the structure of u matches the structure of v
  Modeled as the distance between multi-sets
  Efficient computation using perfect summaries
[Figure: trees T and T2 compared by recursive application of ESD to the multi-sets of e and f children under their s elements]

Experimental Results

Methodology

Data sets: XMark, DBLP, IMDB, SwissProt
Workload: 1000 random twig queries
Evaluation metrics:
  Average ESD for approximate answers
  Mean absolute relative error for selectivity estimation

\[ \frac{1}{|W|} \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)} \]
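The selectivity error metric is straightforward to compute. The sketch below assumes the estimates and true counts are given as parallel lists and that every true count is positive; the function name is an assumption of this example.

```python
def mean_abs_relative_error(estimates, counts):
    """Average of |estim(q) - count(q)| / count(q) over the workload."""
    if len(estimates) != len(counts) or not counts:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(e - c) / c for e, c in zip(estimates, counts)) / len(counts)

# Hypothetical workload of two queries, each estimate off by 10%.
error = mean_abs_relative_error([90, 220], [100, 200])
```

Because each term is normalized by the true count, a query with a small result weighs as much as one with a large result.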

Approximate Answers - IMDB

[Figure: mean ESD (y-axis, 0 to 20000) versus summary size in KB (x-axis, 10 to 50) for TreeSketch and XSketch. IMDB (~102K elements), avg. result size: 3,477 tuples]

Selectivity Estimation - SwissProt

[Figure: estimation error in % (y-axis, 0 to 160) versus summary size in KB (x-axis, 10 to 50) for TreeSketch and XSketch. SwissProt (~182K elements), avg. result size: 104,592 tuples]

Selectivity Estimation - ALL

[Figure: error in % (y-axis, 0 to 30) versus summary storage in KB (x-axis, 10 to 50) for DBLP, IMDB, SwissProt and XMark]

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Conclusions

Approximate query answering for XML databases
TreeSketch Synopses
  Structural summaries for tree-structured XML
  Approximate answers for twig queries
  Model: graph synopsis + edge counts
  Efficient processing and construction
Element Simulation Distance
  Capture approximate similarity between XML trees
Experimental Results
  High accuracy for low space budgets
  Efficient construction


Questions?

[Figure: an XML document and its TreeSketch, with nodes P(1), S(2), F(2), C(4), F(2), E(2) and R]

TreeSketch Model (2/2)

Average number of children <--> Edge count

[Figure: an XML document with the per-element child counts #E and #C]
Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation


TreeSketch Synopsis

Augment graph synopsis with edge counts
  count[u,v]: mean number of children in v per element in u
[Figure: TreeSketch of the example document in which all figure elements share one node: P(1), S(2), C(4), F(4), E(2), R(1)]

Depth-Guided Merging

Key observation: two elements have similar structure if their children have similar structure
Bottom-up merging, based on depth
  Depth: distance from the leaves of the tree
  Build a pool of candidate merges by increasing depth
  Replenish the pool when it falls below a given threshold
Reduced construction time, accurate synopses
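The depth used to order candidate merges can be computed in one traversal. The sketch below assumes "distance from the leaves" means the length of the longest downward path (leaves have depth 0) and groups element tags by that depth; the encoding `(tag, children)` and the function name are assumptions for illustration.

```python
from collections import defaultdict

def group_by_depth(root):
    """Map each depth (longest distance to a leaf) to the tags found there."""
    levels = defaultdict(list)

    def depth(node):
        tag, children = node
        # A leaf has no children, so max() falls back to -1 and its depth is 0.
        d = 1 + max((depth(c) for c in children), default=-1)
        levels[d].append(tag)
        return d

    depth(root)
    return dict(levels)

# Hypothetical document: r > s > (f > c, e)
doc = ("r", [("s", [("f", [("c", [])]), ("e", [])])])
levels = group_by_depth(doc)
```

A candidate pool built by increasing depth would first consider merges among the depth-0 elements (`c` and `e` here), then move up one level at a time.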

Depth-Guided Merging

Observation: two elements have similar structure if their children have similar structure
Heuristic: if a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
Bottom-up merging strategy
Savings in construction time, accurate synopses
