Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München A Cost Model and Index Architecture for the Similarity Join



Feature Based Similarity


Simple Similarity Queries

Specify a query object and:
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
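The two query types can be illustrated with a brute-force sketch. It assumes feature vectors are tuples of floats compared by Euclidean distance; the function names and the sample data are made up for illustration and are not taken from the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within distance eps of the query object."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-d feature vectors.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(features, (0.0, 0.0), 1.5)
nearest = knn_query(features, (0.0, 0.0), 3)
```

Both functions scan every point; an index is what avoids this full scan in practice.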


Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomical catalogues

[Figure: two catalogues, R and S]


Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join
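A minimal sketch of how a similarity self-join can feed a DBSCAN-style step: the join reports all pairs within eps, and the pair counts identify core points. The brute-force join, the helper names, and the sample points are assumptions for illustration; the slides do not give an implementation.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Report every unordered pair of distinct points at distance <= eps."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(points), 2)
            if math.dist(p, q) <= eps]

def core_points(points, eps, min_pts):
    """A point is core if its eps-neighborhood (itself included) holds >= min_pts points."""
    neighbors = [1] * len(points)          # every point is its own neighbor
    for i, j in similarity_self_join(points, eps):
        neighbors[i] += 1
        neighbors[j] += 1
    return [i for i, n in enumerate(neighbors) if n >= min_pts]

# Hypothetical data: three close points and one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
pairs = similarity_self_join(pts, 0.2)
cores = core_points(pts, 0.2, 3)
```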


R-tree Spatial Join (RSJ)

procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p - q| ≤ ε then report (p,q);

[Figure: R and S]
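The procedure can be turned into a small runnable sketch. The Node class, the leaf helper, and the sample trees below are assumptions made for illustration; cache loading is left out, and the Euclidean distance is assumed for both the box and the point test.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # filled for directory pages
    points: list = field(default_factory=list)     # filled for data pages

    @property
    def is_dir(self):
        return bool(self.children)

def mindist(a, b):
    """Smallest Euclidean distance between two axis-parallel boxes."""
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(r_node, s_node, eps, out):
    if r_node.is_dir and s_node.is_dir:
        for r in r_node.children:
            for s in s_node.children:
                if mindist(r, s) <= eps:           # only mating page pairs are followed
                    r_tree_sim_join(r, s, eps, out)
    else:                                          # both assumed to be data pages
        for p in r_node.points:
            for q in s_node.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

def leaf(points):
    """Build a data page whose box tightly encloses its points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, points=points)

r_root = Node((0, 0), (10, 10),
              children=[leaf([(1.0, 1.0), (2.0, 2.0)]), leaf([(9.0, 9.0)])])
s_root = Node((0, 0), (10, 10),
              children=[leaf([(2.0, 2.5)]), leaf([(5.0, 5.0)])])
result = []
r_tree_sim_join(r_root, s_root, 1.0, result)
```

In this example only one of the four page pairs mates, so only two point distances are computed.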


Cost Modeling

Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum


Cost Modeling

Binomial formula:
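The formula itself is not preserved in this transcript. As a hedged illustration, a well-known closed form for the Minkowski sum of a d-dimensional cube (side a) and a Euclidean ball (radius eps) sums binomial terms over the dimensions k = 0..d. The sketch assumes cubic page regions, a uniform distribution in a unit data space, and no boundary effects; the function names are hypothetical.

```python
import math

def sphere_volume(k, r):
    """Volume of a k-dimensional Euclidean ball of radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) * r ** k / math.gamma(k / 2 + 1)

def minkowski_volume(d, a, eps):
    """Volume of the Minkowski sum of a d-dim. cube (side a) and a ball (radius eps)."""
    return sum(math.comb(d, k) * a ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))

def access_probability(d, a, eps):
    """Probability that a uniformly placed range query touches the page region."""
    return min(1.0, minkowski_volume(d, a, eps))

# 2-d check: unit square plus unit disk has area 1 + 4 + pi.
area = minkowski_volume(2, 1.0, 1.0)
prob = access_probability(8, 0.5, 0.1)
```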


Cost Modeling

Mating probability of index pages: probability that the distance between two pages is at most ε
Two-fold application of the Minkowski sum
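A sketch of the two-fold application under simplifying assumptions that are not taken from the slides: both pages are cubes of side a, page positions are uniform in a unit data space, and boundary effects are ignored. The first Minkowski sum (cube with cube) gives a cube of side 2a; the second enlarges it by a ball of radius eps.

```python
import math

def ball_volume(k, r):
    """Volume of a k-dimensional Euclidean ball of radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) * r ** k / math.gamma(k / 2 + 1)

def mating_probability(d, a, eps):
    """Chance that two cubic pages of side a have a mindist of at most eps."""
    # First sum: cube (side a) with cube (side a) yields a cube of side 2a.
    # Second sum: that cube enlarged by a ball of radius eps.
    volume = sum(math.comb(d, k) * (2 * a) ** (d - k) * ball_volume(k, eps)
                 for k in range(d + 1))
    return min(1.0, volume)

# 1-d check: two intervals of length 0.1 mate if their centers differ
# by at most 0.1 + 0.05, an interval of total length 0.3.
p1 = mating_probability(1, 0.1, 0.05)
p8 = mating_probability(8, 0.3, 0.1)
```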


Page Capacity Optimization

Cost model can determine index selectivity which depends on various parameters

Page capacity (number of stored points) is an important parameter

Known from similarity search: Page capacity optimization yields considerable improvement


Analysis of the Index Overhead

Assuming 100% selectivity (index doesn't work):
How much more expensive is index usage?

CPU:
• Distance between boxes is more expensive to compute than distance between points
• Smaller capacity → more box distance computations


Analysis of the Index Overhead

Disk I/O:
• High constant cost per page access (move disk head)
• A page access is more expensive than continuous reading of a point by a factor of 10000 / d
• Smaller capacity → more disk head movement
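A small worked example of the disk I/O overhead. It takes the factor 10000 / d from the slide and assumes, as an interpretation not stated there, that this cost is a positioning cost paid once per page on top of transferring the points. Costs are measured in units of one sequential point read; the file size and capacities are made-up values.

```python
def scan_cost(n_points, capacity, d, seek_factor=10000.0):
    """I/O cost of reading every page once, in units of one point transfer."""
    pages = -(-n_points // capacity)       # ceiling division
    page_access = seek_factor / d          # slide: factor 10000 / d per page access
    return pages * page_access + n_points

def overhead(n_points, capacity, d):
    """Cost relative to one uninterrupted sequential read of all points."""
    return scan_cost(n_points, capacity, d) / n_points

# Hypothetical file: 1,000,000 points with d = 16.
small = overhead(1_000_000, 100, 16)       # small pages: 7.25x the sequential scan
large = overhead(1_000_000, 20_000, 16)    # large pages: about 1.03x
```

Under these assumptions the positioning cost dominates for small pages and becomes negligible for pages holding several 10,000 points.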


Analysis of the Index Overhead

What selectivity is needed for the index to pay off?


Optimization

I/O cost function:

is optimized by

CPU cost function:

is optimized by:


Optimization

I/O cost:
• Large capacity optimum (several 10,000 points, typically)

CPU cost:
• Small capacity optimum (< 100 points, typically)

• No compromise achievable


Multipage Index (MuX)

CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index

Separate optimization
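One way to picture the separate optimization is a two-level page: a large unit that is read with a single disk access and that contains many small units, each with its own bounding box, which filter the point comparisons. The sketch below is an assumption-laden illustration; the class names HostingPage and Bucket, the join routine, and the data are hypothetical and not taken from the slides.

```python
import math
from dataclasses import dataclass

@dataclass
class Bucket:                      # small unit, sized for CPU cost
    points: list

    def __post_init__(self):
        self.lo = tuple(min(c) for c in zip(*self.points))
        self.hi = tuple(max(c) for c in zip(*self.points))

@dataclass
class HostingPage:                 # large unit, sized for I/O cost
    buckets: list

def mindist(a, b):
    """Smallest Euclidean distance between the boxes of two buckets."""
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def join_hosting_pages(page_r, page_s, eps):
    """Join two large pages; bucket boxes prune point distance computations."""
    pairs, dist_calcs = [], 0
    for br in page_r.buckets:
        for bs in page_s.buckets:
            if mindist(br, bs) > eps:
                continue           # buckets do not mate: skip all their point pairs
            for p in br.points:
                for q in bs.points:
                    dist_calcs += 1
                    if math.dist(p, q) <= eps:
                        pairs.append((p, q))
    return pairs, dist_calcs

page_r = HostingPage([Bucket([(0.0, 0.0), (0.1, 0.1)]),
                      Bucket([(5.0, 5.0), (5.1, 5.0)])])
page_s = HostingPage([Bucket([(0.0, 0.2)]), Bucket([(9.0, 9.0)])])
pairs, dist_calcs = join_hosting_pages(page_r, page_s, 0.25)
```

Here two disk accesses bring in all six points, while the bucket boxes cut the point distance computations from eight to two.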


Experimental Evaluation

[Figure panels: Uniform 4D, Uniform 8D]


Experimental Evaluation

[Figure panels: CAD Data 16D, Color Images 64D]


Conclusions

Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU

Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms


Consequences

Assume for I/O optimization selectivity
Page accesses in a nested-block-loop-like style:

fill cache with pages of R (1 page free) ;
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s) ;
    foreach joining R-page r in cache do
      if mindist(r,s) ≤ ε then join (r,s) ;
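The access pattern can be sketched as a runnable loop. Pages are modeled as plain lists of points, and the outer loop over cache fills, the counter s_loads, and the helper names are assumptions added for illustration. In a real index the page boxes would come from the directory, so the mating test would not require loading the S page; here they are recomputed from the points for brevity.

```python
import math

def box(points):
    """Bounding box (lo, hi) of a page given as a list of points."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def mindist(a, b):
    """Smallest Euclidean distance between two (lo, hi) boxes."""
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))
    return math.sqrt(sum(g * g for g in gaps))

def block_loop_join(r_pages, s_pages, eps, cache_size):
    """Nested-block-loop-style join; cache_size >= 2, one slot stays free for S."""
    results, s_loads = [], 0
    block = cache_size - 1
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]      # fill cache with pages of R
        for s in s_pages:
            mates = [r for r in cached if mindist(box(r), box(s)) <= eps]
            if not mates:
                continue                           # s joins no cached R page
            s_loads += 1                           # load(s)
            for r in mates:
                results += [(p, q) for p in r for q in s
                            if math.dist(p, q) <= eps]
    return results, s_loads

r_pages = [[(0.0, 0.0), (0.2, 0.0)], [(4.0, 4.0)], [(8.0, 8.0)]]
s_pages = [[(0.1, 0.1)], [(8.0, 8.3)]]
results, s_loads = block_loop_join(r_pages, s_pages, 0.5, cache_size=3)
```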
