Approximate XML Query Answers

Approximate XML Query Answers Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas)


Page 1: Approximate XML Query Answers

Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

Page 2: Approximate XML Query Answers

Motivation

XML: de facto standard for data exchange
Development of the “XML Warehouse”
Conflict between “on-line” and query execution cost

Increased query response times
Users might wait for uninteresting results

[Figure: a query Q is evaluated over the XML data warehouse and returns an XML result R.]

Page 3: Approximate XML Query Answers

Approximate Query Answers

Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result

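Use approximate result as timely feedback
User can assess the “value” of the query

Goal: reduce number of evaluated queries

[Figure: query Q is evaluated over the XML data warehouse to produce the XML result R, and over a synopsis of the warehouse to produce the approximate XML result R’.]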

Page 4: Approximate XML Query Answers

Contributions

TreeSketch Synopses: structural summaries for XML data; approximate answers for complex twig queries; summarization model based on structural clustering of elements; efficient processing and construction

Element Simulation Distance: novel distance metric for XML data; captures “approximate” similarity between two XML trees

Experimental Results: accurate approximate answers for low space budgets; low-error selectivity estimates; efficient construction algorithm

Page 5: Approximate XML Query Answers

Outline

Preliminaries
TreeSketches: synopsis model, computing approximate answers, summary construction
Element Simulation Distance
Experimental Study
Conclusions

Page 6: Approximate XML Query Answers

Data and Query Model

[Figure: an example XML document tree, a twig query over it, and the resulting nesting tree and binding tuples.]

Twig query nodes: q0; q1 = //section; q2, q3 = .//equation, ./figure

Binding tuples:

q0   q1   q2   q3
r    s2   f5   e10
r    s2   f5   e8
r    s2   f4   e10
r    s2   f4   e8

Page 7: Approximate XML Query Answers

Problem Definition

Process twig query over a synopsis
Compute approximation of nesting tree

[Figure: the twig query (q0; q1 = //section; q2, q3 = .//equation, ./figure) evaluated over the XML data yields the true nesting tree; evaluated over the synopsis, it yields an approximate nesting tree.]

Page 8: Approximate XML Query Answers

TreeSketch Model

Page 9: Approximate XML Query Answers

Graph Synopsis

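Synopsis node: set of elements of the same tag
Synopsis edge: document edge(s)

[Figure: an XML document tree (elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13) beside its graph synopsis, with nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2); the number in parentheses is the number of elements in the node.]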

Page 10: Approximate XML Query Answers

TreeSketch Synopsis

Augment graph synopsis with edge counts
count[u,v]: mean number of children in v per element in u

[Figure: the example XML document beside its TreeSketch, with nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2); each synopsis edge is annotated with its count.]
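The definition of count[u,v] can be made concrete with a minimal sketch. It assumes the document is given as a parent-to-children map and the synopsis as an element-to-node assignment; the function name `edge_counts`, both input layouts, and the toy document are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def edge_counts(children, cluster):
    """Return count[(u, v)]: mean number of children in node v per element in node u.

    children: element -> list of child elements (the document tree)
    cluster:  element -> synopsis node the element belongs to
    """
    extent = defaultdict(int)   # synopsis node -> number of elements in it
    totals = defaultdict(int)   # (u, v) -> number of document edges from u to v
    for elem, node in cluster.items():
        extent[node] += 1
        for child in children.get(elem, []):
            totals[(node, cluster[child])] += 1
    return {(u, v): n / extent[u] for (u, v), n in totals.items()}

# Hypothetical toy document (not the one drawn on the slide).
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"]}
cluster = {"r": "R", "s1": "S", "s2": "S", "f1": "F", "f2": "F", "f3": "F"}
counts = edge_counts(children, cluster)
print(counts)
```

In the toy document the two section elements have two and one figure children, so the single S node stores the mean.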

Page 11: Approximate XML Query Answers

TreeSketch Synopsis

Is there a lossless synopsis?
What is the quality of a lossy synopsis?

[Figure: the example XML document beside its TreeSketch, with nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2) and edge counts.]

Page 12: Approximate XML Query Answers

Count Stability

(u,v) is count-stable: all elements in u have the same child count in v

[Figure: the example XML document beside its TreeSketch, with nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2) and edge counts.]
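A count-stability check for a single synopsis edge follows directly from the definition. The sketch below uses the same assumed inputs as before (a parent-to-children map and an element-to-node assignment); `is_count_stable` is a hypothetical helper name and the toy document is made up.

```python
def is_count_stable(children, cluster, u, v):
    """True if every element of synopsis node u has the same number of children in node v."""
    child_counts = {
        sum(1 for c in children.get(elem, []) if cluster[c] == v)
        for elem, node in cluster.items()
        if node == u
    }
    return len(child_counts) <= 1

# Hypothetical toy document: s1 has two figures, s2 has one.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"]}
cluster = {"r": "R", "s1": "S", "s2": "S", "f1": "F", "f2": "F", "f3": "F"}

stable_rs = is_count_stable(children, cluster, "R", "S")
stable_sf = is_count_stable(children, cluster, "S", "F")
print(stable_rs, stable_sf)
```

Here (R, S) is count-stable because R holds a single element, while (S, F) is not, since its two sections differ in their number of figures.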

Page 13: Approximate XML Query Answers

Count-Stable TreeSketch

A count-stable synopsis can recover the input tree
Efficient one-pass construction
Stable summary can be too large for practical use!

[Figure: the example XML document beside a count-stable TreeSketch in which the section elements are split into two nodes, S(1) and S(1); the other nodes are R(1), P(1), F(2), F(2), C(4), E(2).]
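One possible way to obtain a count-stable partition in a single post-order pass is to group elements by a signature made of their tag and their per-class child counts. This is a sketch under that assumption, not necessarily the construction the authors use; `count_stable_partition` and the toy document are hypothetical.

```python
def count_stable_partition(tag, children, root):
    """Assign each element a class id so that every synopsis edge is count-stable.

    Two elements share a class only if they have the same tag and the same
    number of children in every class.
    """
    class_of_signature = {}   # signature -> class id
    cluster = {}              # element -> class id

    def visit(elem):
        per_class = {}
        for child in children.get(elem, []):   # children are classified first
            k = visit(child)
            per_class[k] = per_class.get(k, 0) + 1
        signature = (tag[elem], tuple(sorted(per_class.items())))
        cluster[elem] = class_of_signature.setdefault(signature, len(class_of_signature))
        return cluster[elem]

    visit(root)
    return cluster

# Hypothetical toy document: s1 has two figures, s2 has one.
tag = {"r": "r", "s1": "s", "s2": "s", "f1": "f", "f2": "f", "f3": "f"}
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"]}
stable = count_stable_partition(tag, children, "r")
print(stable)
```

The two sections land in different classes because their figure counts differ, which mirrors the split of S(2) into two S(1) nodes on the slide. The recursion is fine for small examples; a deep document would need an explicit stack.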

Page 14: Approximate XML Query Answers

Lossy TreeSketch

[Figure: the example XML document beside a lossy TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2) and edge counts; two callouts are labeled #F.]

Page 15: Approximate XML Query Answers

TreeSketches and Clustering

TreeSketch = element clustering
All elements in a node are mapped to a “centroid”
Tight clusters give an accurate synopsis

Synopsis quality = clustering error
Options: Manhattan distance, squared error, …
Quality can be measured independently of a workload
Key for effective construction
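To illustrate the clustering view, the sketch below treats each element of a synopsis node as a vector of child counts, takes the mean vector as the centroid, and sums squared distances to it. This is one plausible reading of the "squared error" option, not the exact measure defined by the authors; `squared_error` and the toy document are assumptions.

```python
def squared_error(children, cluster, u):
    """Sum of squared distances between each element of node u and the node's centroid.

    Each element is a vector of child counts, one entry per target synopsis node;
    the centroid is the per-entry mean, which equals the stored edge counts.
    """
    members = [e for e, node in cluster.items() if node == u]
    targets = sorted({cluster[c] for e in members for c in children.get(e, [])})
    vectors = [
        [sum(1 for c in children.get(e, []) if cluster[c] == v) for v in targets]
        for e in members
    ]
    centroid = [sum(column) / len(members) for column in zip(*vectors)]
    return sum((x - m) ** 2 for vec in vectors for x, m in zip(vec, centroid))

# Hypothetical toy document: s1 has two figures, s2 has one.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"]}
cluster = {"r": "R", "s1": "S", "s2": "S", "f1": "F", "f2": "F", "f3": "F"}

error_s = squared_error(children, cluster, "S")
error_r = squared_error(children, cluster, "R")
print(error_s, error_r)
```

A count-stable node has zero error under this measure, and the error needs only the document and the clustering, with no query workload involved.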

Page 16: Approximate XML Query Answers

Computing Approximate Answers

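Compute TreeSketch of approximate answer
Accuracy depends on quality of clustering

[Figure: a twig query (q0; q1 = //section; q2, q3 = .//equation, .//caption) is evaluated over the TreeSketch (nodes R(1), P(1), S(2), F(2), F(2), C(4), E(2)) to produce an approximate nesting tree with nodes R, S, E, C; one of its edge counts is computed as 1+1=2.]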
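The slide's procedure builds a TreeSketch of the approximate nesting tree. As a much simpler stand-in, the sketch below estimates only the number of binding tuples of a twig query, supports only child-axis steps, and assumes that edge counts are uniform within a node and independent across branches. The query encoding (a tag plus a list of sub-queries), `estimate_selectivity`, and the toy synopsis are all hypothetical.

```python
def tuples_per_element(query, u, tag, counts):
    """Estimated binding tuples of the query subtree, per element of synopsis node u."""
    _, subqueries = query
    estimate = 1.0
    for sub in subqueries:
        branch = 0.0
        for (src, dst), mean in counts.items():
            if src == u and tag[dst] == sub[0]:
                branch += mean * tuples_per_element(sub, dst, tag, counts)
        estimate *= branch   # branches are assumed independent
    return estimate

def estimate_selectivity(query, tag, size, counts):
    """Sum the per-element estimates over every synopsis node matching the query root."""
    return sum(
        size[u] * tuples_per_element(query, u, tag, counts)
        for u in tag
        if tag[u] == query[0]
    )

# Hypothetical toy synopsis: node id -> tag, node id -> extent size, edge counts.
tag = {"R": "r", "S": "s", "F": "f", "E": "e"}
size = {"R": 1, "S": 2, "F": 3, "E": 2}
counts = {("R", "S"): 2.0, ("S", "F"): 1.5, ("S", "E"): 1.0}

# Twig: a section with a figure child and an equation child.
twig = ("s", [("f", []), ("e", [])])
selectivity = estimate_selectivity(twig, tag, size, counts)
print(selectivity)
```

The estimate is exact when every edge touched by the query is count-stable and degrades as the clusters become looser, which is the dependence on clustering quality noted above.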

Page 17: Approximate XML Query Answers

TreeSketch Construction

Given an XML tree T, build a TreeSketch of size B
Difficult clustering problem
Space dimensionality depends on the clustering itself

Construction based on bottom-up clustering
Compress perfect synopsis by merging clusters
Best merge determined by marginal gains
Heuristic to reduce number of candidate merges

[Figure: the perfect synopsis is compressed until it fits the space budget.]

Page 18: Approximate XML Query Answers

Element Simulation Distance

Page 19: Approximate XML Query Answers

Error of Approximation

Error: distance between R’ and R
Popular metric: tree-edit distance
Min-cost sequence of operations that transform R’ to R
Measures syntactic differences between R and R’
Not intuitive for approximate answers!

[Figure: a tree T compared with two trees T1 and T2; each has a root r, two s nodes, and e and f children annotated with counts. One comparison is marked “different counts, similar trait”, the other “same counts, opposite trait”.]

Page 20: Approximate XML Query Answers

Element Simulation Distance

Capture approximate similarity between R and R’
u simulates v: u and v have identical structure
ESD(u,v): “degree” of simulation between u and v
How well the structure of u matches the structure of v
Modeled as the distance between multi-sets
Efficient computation using perfect summaries

[Figure: trees T and T2, with the children of matching nodes compared as multi-sets (for example, eeee and f against eeeeee and ff); ESD is applied recursively.]

Page 21: Approximate XML Query Answers

Experimental Results

Page 22: Approximate XML Query Answers

Methodology

Data sets: XMark, DBLP, IMDB, SwissProt
Workload: 1000 random twig queries
Evaluation metrics:
Average ESD for approximate answers
Mean absolute relative error for selectivity estimation:

$$\frac{1}{|W|} \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)}$$
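The selectivity metric is the mean, over the workload W, of the absolute relative error of each estimate. A minimal sketch, assuming estimates and true counts are held in dictionaries keyed by query and that every query has a nonzero true count; the function name and sample numbers are illustrative.

```python
def mean_abs_relative_error(estimates, counts):
    """Average of |estim(q) - count(q)| / count(q) over all queries q in the workload."""
    return sum(abs(estimates[q] - counts[q]) / counts[q] for q in counts) / len(counts)

# Hypothetical two-query workload.
true_counts = {"q1": 100, "q2": 40}
estimates = {"q1": 75, "q2": 50}

error = mean_abs_relative_error(estimates, true_counts)
print(error)
```

Both queries are off by 25% in this example, so the workload error is 0.25.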

Page 23: Approximate XML Query Answers

Approximate Answers - IMDB

[Figure: mean ESD (y-axis, 0 to 20000) versus summary size in KB (x-axis, 10 to 50) for TreeSketch and XSketch. Data set: IMDB (~102K elements); average result size: 3,477 tuples.]

Page 24: Approximate XML Query Answers

Selectivity Estimation - SwissProt

[Figure: estimation error in % (y-axis, 0 to 160) versus summary size in KB (x-axis, 10 to 50) for TreeSketch and XSketch. Data set: SwissProt (~182K elements); average result size: 104,592 tuples.]

Page 25: Approximate XML Query Answers

Selectivity Estimation - ALL

[Figure: error in % (y-axis, 0 to 30) versus summary storage in KB (x-axis, 10 to 50) for DBLP, IMDB, SwissProt, and XMark.]

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Page 26: Approximate XML Query Answers

Conclusions

Approximate query answering for XML databases

TreeSketch Synopses: structural summaries for tree-structured XML; approximate answers for twig queries; model: graph synopsis + edge counts; efficient processing and construction

Element Simulation Distance: captures approximate similarity between XML trees

Experimental Results: high accuracy for low space budgets; efficient construction

Page 27: Approximate XML Query Answers


Questions?

Page 28: Approximate XML Query Answers

TreeSketch Model (2/2)

Average number of children <--> Edge count

[Figure: an example XML document tree beside its TreeSketch, with nodes R, P(1), S(2), F(2), F(2), C(4), E(2) and edge counts; two callouts are labeled #E and #C.]

Page 29: Approximate XML Query Answers

XML

[Figure: an example XML document tree. Legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation.]

Page 30: Approximate XML Query Answers

TreeSketch Synopsis

Augment graph synopsis with edge counts
count[u,v]: mean number of children in v per element in u

[Figure: the example XML document beside a smaller TreeSketch with nodes R(1), P(1), S(2), F(4), C(4), E(2), in which all four figure elements share one node; a callout is labeled #F.]

Page 31: Approximate XML Query Answers

Depth-Guided Merging

Key observation: two elements have similar structure if their children have similar structure

Bottom-up merging, based on depth
Depth: distance from the leaves of the tree
Build a pool of candidate merges by increasing depth
Replenish the pool when it falls below a given threshold

Reduced construction time, accurate synopses
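The candidate ordering behind depth-guided merging can be sketched as follows. The sketch assumes an acyclic synopsis, takes a node's depth to be its longest distance to a leaf, and ranks each same-tag pair by the larger of its two depths; that ranking rule, the function names, and the toy synopsis are assumptions, and the pool threshold from the slide is left out.

```python
from itertools import combinations

def node_depths(nodes, edges):
    """Depth of each synopsis node: 0 for leaves, else 1 + the deepest child's depth."""
    kids = {u: [] for u in nodes}
    for u, v in edges:
        kids[u].append(v)
    memo = {}

    def depth(u):
        if u not in memo:
            memo[u] = 1 + max((depth(v) for v in kids[u]), default=-1)
        return memo[u]

    return {u: depth(u) for u in nodes}

def candidate_merges(tag, edges):
    """Same-tag node pairs, ordered by increasing depth (shallowest candidates first)."""
    depth = node_depths(tag, edges)
    pairs = [
        (max(depth[a], depth[b]), a, b)
        for a, b in combinations(sorted(tag), 2)
        if tag[a] == tag[b]
    ]
    return sorted(pairs)

# Hypothetical count-stable synopsis with two section nodes and two figure nodes.
tag = {"R": "r", "S1": "s", "S2": "s", "F1": "f", "F2": "f"}
edges = [("R", "S1"), ("R", "S2"), ("S1", "F1"), ("S2", "F2")]
pool = candidate_merges(tag, edges)
print(pool)
```

A construction loop would draw merges from the front of this list and refill its pool with the next depth level once the pool runs low.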

Page 32: Approximate XML Query Answers

Depth-Guided Merging

Observation: two elements have similar structure if their children have similar structure

Heuristic: if a merge of two clusters is good, then merges of the child clusters are likely to have been good as well

Bottom-up merging strategy
Savings in construction time, accurate synopses