Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München
A Cost Model and Index Architecture for the Similarity Join
Feature Based Similarity
Simple Similarity Queries
Specify a query object and
• Find similar objects: range query
• Find the k most similar objects: nearest neighbor query
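The two query types can be illustrated with a minimal brute-force sketch. It assumes that objects are feature vectors compared by Euclidean distance; the function names `range_query` and `knn_query` are hypothetical and not taken from the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within Euclidean distance eps of the query object."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Small illustrative data set (assumption: 2-dimensional feature vectors).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions scan every point; an index structure would avoid that scan, which is what the cost model later in the talk quantifies.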
Join Applications: Catalogue Matching
Catalogue matching
• E.g. astronomical catalogues R and S
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
    if IsDirpg (R) ∧ IsDirpg (S) then
        foreach r ∈ R.children do
            foreach s ∈ S.children do
                if mindist (r,s) ≤ ε then
                    CacheLoad(r); CacheLoad(s);
                    r_tree_sim_join (r,s,ε) ;
    else (* assume R,S both DataPg *)
        foreach p ∈ R.points do
            foreach q ∈ S.points do
                if |p − q| ≤ ε then report (p,q);
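A runnable sketch of the recursion is given below. It assumes Euclidean distance, a join threshold `eps`, and two trees of equal height; the `Node` class and the helpers `leaf`, `directory`, and `mindist` are hypothetical stand-ins for a real R-tree, and the `CacheLoad` calls are omitted because everything is held in memory.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # non-empty for directory pages
    points: list = field(default_factory=list)     # non-empty for data pages

def leaf(points):
    """Build a data page whose box tightly encloses its points."""
    d = len(points[0])
    lo = tuple(min(p[i] for p in points) for i in range(d))
    hi = tuple(max(p[i] for p in points) for i in range(d))
    return Node(lo, hi, points=list(points))

def directory(children):
    """Build a directory page whose box encloses all child boxes."""
    d = len(children[0].lo)
    lo = tuple(min(c.lo[i] for c in children) for i in range(d))
    hi = tuple(max(c.hi[i] for c in children) for i in range(d))
    return Node(lo, hi, children=list(children))

def mindist(a, b):
    """Smallest possible Euclidean distance between two boxes."""
    gaps = (max(a.lo[i] - b.hi[i], b.lo[i] - a.hi[i], 0.0) for i in range(len(a.lo)))
    return math.sqrt(sum(g * g for g in gaps))

def sim_join(r, s, eps, out):
    """Append every pair (p, q) with |p - q| <= eps to out and return it."""
    if r.children and s.children:
        for rc in r.children:
            for sc in s.children:
                if mindist(rc, sc) <= eps:         # prune page pairs that cannot join
                    sim_join(rc, sc, eps, out)
    else:                                          # assume both are data pages
        for p in r.points:
            for q in s.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out

R_root = directory([leaf([(0.0, 0.0), (1.0, 1.0)]), leaf([(10.0, 10.0)])])
S_root = directory([leaf([(0.5, 0.0)]), leaf([(10.0, 10.5), (20.0, 20.0)])])
pairs = sim_join(R_root, S_root, 1.0, [])
```

In this example only two of the four data-page pairs survive the `mindist` test, so the point comparisons for the other two are skipped entirely.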
Cost Modeling
Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum
Cost Modeling
Binomial formula:
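The formula itself did not survive in this transcript. As a hedged sketch, the code below uses the standard binomial expansion for the volume of the Minkowski sum of a d-dimensional cube with side length `a` and a sphere with radius `eps`. The symbols `a`, `eps`, and `d`, the unit data space, uniformly distributed data, and the neglect of boundary effects are all assumptions, not statements from the slide.

```python
import math

def sphere_volume(k, eps):
    """Volume of a k-dimensional hypersphere with radius eps."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * eps ** k

def minkowski_volume(a, eps, d):
    """Volume of the Minkowski sum of a d-dim. cube (side a) and a sphere (radius eps).

    Binomial expansion: sum over k of C(d, k) * a^(d-k) * V_sphere_k(eps).
    """
    return sum(math.comb(d, k) * a ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))

def access_probability(a, eps, d):
    """Probability that a range query of radius eps touches a page of side a.

    Assumes a unit data space and uniform data; capped at 1.
    """
    return min(1.0, minkowski_volume(a, eps, d))

# In two dimensions the expansion reduces to a^2 + 4*a*eps + pi*eps^2.
area_2d = minkowski_volume(1.0, 0.5, 2)
```

The 2D case is easy to check by hand: the square itself, four rectangular strips along its edges, and four quarter-circles at its corners.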
Cost Modeling
Mating probability of index pages: the probability that the distance between two pages is at most ε
Two-fold application of the Minkowski sum
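One way to read the two-fold application is sketched below. Two cubic pages with side lengths `a_r` and `a_s` can contain a point pair within distance `eps` exactly when their centers differ by a vector inside a cube of side `a_r + a_s` enlarged by a sphere of radius `eps`. The cubic page shape, the unit data space, uniform page placement, and the neglect of boundary effects are assumptions of this sketch; the function names are hypothetical.

```python
import math

def minkowski_volume(side, eps, d):
    """Volume of the Minkowski sum of a d-dim. cube (given side) and a sphere (radius eps)."""
    def sphere_volume(k):
        return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * eps ** k
    return sum(math.comb(d, k) * side ** (d - k) * sphere_volume(k)
               for k in range(d + 1))

def mating_probability(a_r, a_s, eps, d):
    """Probability that two cubic pages have mindist <= eps.

    First Minkowski sum: cube (side a_r) with cube (side a_s) gives a cube of side a_r + a_s.
    Second Minkowski sum: that cube with a sphere of radius eps.
    """
    return min(1.0, minkowski_volume(a_r + a_s, eps, d))

# Two pages of side 0.1 in a 2-dimensional unit data space, eps = 0.05.
p_mate = mating_probability(0.1, 0.1, 0.05, 2)
```

Multiplying this probability by the number of page pairs gives an estimate of how many page pairs the join must process under these assumptions.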
Page Capacity Optimization
Cost model can determine index selectivity which depends on various parameters
Page capacity (number of stored points) is an important parameter
Known from similarity search: Page capacity optimization yields considerable improvement
Analysis of the Index Overhead
Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage?
CPU:
• Distance between boxes is more expensive to compute than distance between points
• Smaller capacity means more box distance computations
Analysis of the Index Overhead
Disk I/O:
• High constant cost per page access (moving the disk head)
• A page access is more expensive than continuous reading of a point by a factor of 10000 / d
• Smaller capacity means more disk head movement
Analysis of the Index Overhead
What selectivity is needed for the index to pay off?
Optimization
I/O cost function:
is optimized by
CPU cost function:
is optimized by:
Optimization
I/O cost:
• Large capacity optimum (typically several tens of thousands of points)
CPU cost:
• Small capacity optimum (typically < 100 points)
• No compromise achievable
Multipage Index (MuX)
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index
Separate optimization
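The slide's figure is not preserved, so the following is only a sketch of how separate optimization could look: a large page sized for I/O is subdivided into small buckets sized for CPU, and two loaded pages are joined by filtering bucket pairs on their bounding boxes first. The two-level layout, the splitting by sort order, and all names (`make_large_page`, `join_large_pages`, `bucket_capacity`) are assumptions for illustration.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a point list as (lower corner, upper corner)."""
    d = len(points[0])
    return (tuple(min(p[i] for p in points) for i in range(d)),
            tuple(max(p[i] for p in points) for i in range(d)))

def box_mindist(a, b):
    """Smallest possible Euclidean distance between two boxes."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(alo[i] - bhi[i], blo[i] - ahi[i], 0.0) ** 2
                         for i in range(len(alo))))

def make_large_page(points, bucket_capacity):
    """Split one large (I/O-sized) page into small buckets, each with its own box."""
    pts = sorted(points)  # simplistic grouping; a real index would split spatially
    buckets = [pts[i:i + bucket_capacity] for i in range(0, len(pts), bucket_capacity)]
    return [(mbr(b), b) for b in buckets]

def join_large_pages(page_r, page_s, eps):
    """Join two loaded large pages; return the result pairs and the point comparisons made."""
    pairs, point_comparisons = [], 0
    for box_r, pts_r in page_r:
        for box_s, pts_s in page_s:
            if box_mindist(box_r, box_s) > eps:
                continue                      # cheap filter on the small buckets
            for p in pts_r:
                for q in pts_s:
                    point_comparisons += 1
                    if math.dist(p, q) <= eps:
                        pairs.append((p, q))
    return pairs, point_comparisons

r_points = [(float(i), 0.0) for i in range(8)]
s_points = [(i + 0.25, 0.0) for i in range(8)]
page_r = make_large_page(r_points, bucket_capacity=2)
page_s = make_large_page(s_points, bucket_capacity=2)
pairs, point_comparisons = join_large_pages(page_r, page_s, eps=0.3)
```

Each large page is read with a single disk access, while the bucket filter keeps the number of point comparisons well below the full cross product of the two pages.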
Experimental Evaluation
Data sets: Uniform 4D, Uniform 8D
Experimental Evaluation
Data sets: CAD Data 16D, Color Images 64D
Conclusions
Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU
Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms
Consequences
Assume for I/O optimization selectivity
Page accesses in a nested-block-loop-like style:
    fill cache with pages of R (1 page free) ;
    foreach S-page s do
        if s joins some of the cached R-pg then
            load (s) ;
            foreach joining R-page r in cache do
                if mindist(r,s) ≤ ε then join (r,s) ;
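The page access order on this slide can be sketched in runnable form as follows. The sketch assumes that pages are held as (bounding box, points) pairs, that the box of an S-page is known before the page is loaded, and that the cache is refilled with the next block of R-pages until R is exhausted. That outer loop and the names `page`, `nested_block_loop_join`, and `cache_size` are assumptions, not part of the slide.

```python
import math

def page(points):
    """Build a page as (bounding box, points); the box is (lower corner, upper corner)."""
    d = len(points[0])
    lo = tuple(min(p[i] for p in points) for i in range(d))
    hi = tuple(max(p[i] for p in points) for i in range(d))
    return ((lo, hi), list(points))

def box_mindist(a, b):
    """Smallest possible Euclidean distance between two boxes."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(alo[i] - bhi[i], blo[i] - ahi[i], 0.0) ** 2
                         for i in range(len(alo))))

def nested_block_loop_join(r_pages, s_pages, eps, cache_size):
    """Join page lists block by block; return the result pairs and the number of S-page loads."""
    if cache_size < 2:
        raise ValueError("cache must hold at least one R-page and one S-page")
    block = cache_size - 1                         # one cache slot stays free for S
    result, s_loads = [], 0
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]      # fill cache with pages of R
        for s_box, s_points in s_pages:            # foreach S-page s
            joining = [r for r in cached if box_mindist(r[0], s_box) <= eps]
            if not joining:
                continue                           # s joins no cached R-page: skip the load
            s_loads += 1                           # load(s)
            for _, r_points in joining:            # foreach joining R-page r in cache
                for p in r_points:
                    for q in s_points:
                        if math.dist(p, q) <= eps:
                            result.append((p, q))
    return result, s_loads

r_pages = [page([(0.0, 0.0), (1.0, 0.0)]), page([(5.0, 5.0)]), page([(9.0, 9.0)])]
s_pages = [page([(0.5, 0.0)]), page([(5.0, 5.5)]), page([(20.0, 20.0)])]
result, s_loads = nested_block_loop_join(r_pages, s_pages, eps=1.0, cache_size=3)
```

With a larger cache, fewer passes over S are needed, which is why this access pattern favors large pages for I/O.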