TIME 2002, Manchester, UK
Index Based Processing of Semi-Restrictive Temporal Joins
Donghui Zhang, Vassilis J. Tsotras University of California, Riverside
Contents
Background
Join problem definition
Straightforward approaches
Proposed join algorithms
Performance study
Conclusions
Background
Temporal record: (key, time interval) and some attributes.
TE-Join: two records qualify for join if their time intervals intersect and their keys are equal.
Background
Our earlier work [ICDE02] solved a general TE-Join (GTE-Join), where portions from each relation are joined.
Each portion is selected via a range-interval selection: record keys should be in range r and time intervals should intersect interval i.
This is interesting because (1) temporal relations are large; (2) TE-Join is a special case, when r and i are both (-∞, +∞).
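The range-interval selection and the special-case remark above can be illustrated with a minimal in-memory sketch. The `Record` type, the closed-interval convention, and the function names `select` and `gte_join` are assumptions made for illustration; the slides do not give an implementation, and a real system would use an index rather than nested loops.

```python
import math
from typing import NamedTuple


class Record(NamedTuple):
    key: int
    start: int  # interval start (closed interval: an assumption)
    end: int    # interval end (closed interval: an assumption)


def intersects(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end


def select(records, r, i):
    """Range-interval selection: key in range r, interval intersects i."""
    (k_lo, k_hi), (t_lo, t_hi) = r, i
    return [rec for rec in records
            if k_lo <= rec.key <= k_hi
            and intersects(rec.start, rec.end, t_lo, t_hi)]


def gte_join(xs, ys, r, i):
    """Naive GTE-Join: select from both relations, then apply the TE-Join condition."""
    xs_sel, ys_sel = select(xs, r, i), select(ys, r, i)
    return [(x, y) for x in xs_sel for y in ys_sel
            if x.key == y.key and intersects(x.start, x.end, y.start, y.end)]


EVERYTHING = (-math.inf, math.inf)

X = [Record(1, 0, 5), Record(2, 3, 9), Record(7, 1, 2)]
Y = [Record(1, 4, 8), Record(2, 10, 12), Record(7, 2, 6)]

# With r = i = (-inf, +inf) the selection keeps every record: plain TE-Join.
te_result = gte_join(X, Y, EVERYTHING, EVERYTHING)
# Restricting keys to [1, 5] and time to [0, 4] joins only a portion of each relation.
restricted = gte_join(X, Y, (1, 5), (0, 4))
```

Here `te_result` contains the pairs for keys 1 and 7, while `restricted` keeps only the key-1 pair.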
Problem Definition
Semi-restrictive joins: records join if their keys are equal (GE-Join) or if their intervals intersect (GT-Join); only one of the two conditions is required, not both.
GE-Join: select a subset from X, a subset from Y, and join records from the subsets if their keys are equal.
GT-Join: select a subset from X, a subset from Y, and join records from the subsets if their intervals intersect.
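As a reference point, the two join conditions can be written as naive nested-loop joins over already-selected subsets. This is only a sketch of the definitions, not one of the algorithms from the talk; the `Record` type, the closed-interval convention, and the sample data are hypothetical.

```python
from typing import NamedTuple


class Record(NamedTuple):
    key: str
    start: int  # closed interval bounds (an assumption)
    end: int


def intersects(a, b):
    """True if the closed time intervals of records a and b overlap."""
    return a.start <= b.end and b.start <= a.end


def gt_join(xs_sel, ys_sel):
    """GT-Join condition: intervals intersect; keys are ignored."""
    return [(x, y) for x in xs_sel for y in ys_sel if intersects(x, y)]


def ge_join(xs_sel, ys_sel):
    """GE-Join condition: keys are equal; intervals are ignored."""
    return [(x, y) for x in xs_sel for y in ys_sel if x.key == y.key]


# Hypothetical selected subsets of X and Y.
xs = [Record("Baker", 1994, 1996), Record("Brown", 1990, 1992)]
ys = [Record("Smith", 1995, 1999), Record("Baker", 2000, 2001)]

gt_pairs = gt_join(xs, ys)  # Baker (1994-1996) overlaps Smith (1995-1999)
ge_pairs = ge_join(xs, ys)  # the two Baker records share a key
```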
Problem Definition
GT-Join example: find employees whose last names start with ‘B’ and who co-worked during 1995 with the employees whose last names start with ‘S’.
GE-Join example: find the 1998 IBM employees who were UC Riverside students in 1995.
GT-Join Solutions...
Straightforward Solutions for GT-Join
1. Unsynchronized join.
2. Synchronized join using B+-trees.
3. Synchronized join using R-trees.
Straightforward Solutions for GT-Join
1. Unsynchronized join: separate the selection and join phases. Not efficient because:
– the intermediate result that must be stored can be large;
– selection in one relation ignores the data distribution of the other relation.
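A minimal sketch of the unsynchronized approach, assuming records are `(key, start, end)` tuples with closed intervals and that each relation has its own key range and time range. The function names and the data are hypothetical. It shows how both selections are materialized in full before any join condition is checked.

```python
def select(records, key_range, time_range):
    """Selection phase: keep records whose key is in key_range and whose
    interval intersects time_range (closed intervals, an assumption)."""
    (k_lo, k_hi), (t_lo, t_hi) = key_range, time_range
    return [(k, s, e) for (k, s, e) in records
            if k_lo <= k <= k_hi and s <= t_hi and t_lo <= e]


def unsynchronized_gt_join(xs, ys, sel_x, sel_y):
    """Select from each relation independently, then join on interval overlap.
    Also returns the number of materialized intermediate records."""
    x_tmp = select(xs, *sel_x)  # intermediate result for X
    y_tmp = select(ys, *sel_y)  # intermediate result for Y
    pairs = [(x, y) for x in x_tmp for y in y_tmp
             if x[1] <= y[2] and y[1] <= x[2]]
    return pairs, len(x_tmp) + len(y_tmp)


xs = [(1, 0, 10), (2, 20, 30), (3, 5, 6)]
ys = [(10, 8, 9), (11, 40, 50)]
pairs, materialized = unsynchronized_gt_join(
    xs, ys,
    sel_x=((1, 3), (0, 30)),
    sel_y=((10, 11), (0, 100)),
)
```

In this toy input all five records survive their selections and are materialized, yet only one pair joins, because neither selection knows which records of the other relation can actually match.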
Straightforward Solutions for GT-Join
2. Synchronized using B+-trees.
If clustered on start: not efficient, because y needs to be checked against every record whose start is before the end of y.
[Figure: record y drawn on a time axis running from tmin to tmax]
Clustering on end is similar.
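The cost argument can be made concrete with a sorted list standing in for a B+-tree clustered on start time. This is a sketch under assumptions: `bisect` on a Python list replaces the index, intervals are closed `(start, end)` tuples, and `probe_clustered_on_start` is a hypothetical helper name.

```python
import bisect


def probe_clustered_on_start(xs_sorted, starts, y):
    """Find records in xs_sorted (sorted by start) that intersect interval y.
    Returns the matches and the number of records that had to be checked."""
    y_start, y_end = y
    # Every record with start <= y_end is a candidate; the ordering on start
    # says nothing about where those records end.
    n_candidates = bisect.bisect_right(starts, y_end)
    matches = [x for x in xs_sorted[:n_candidates] if x[1] >= y_start]
    return matches, n_candidates


xs = sorted([(0, 1), (2, 3), (4, 5), (6, 20), (18, 19), (25, 30)])
starts = [s for s, _ in xs]

matches, scanned = probe_clustered_on_start(xs, starts, (17, 21))
```

For y = (17, 21), five records are scanned and only two of them intersect y; the three short early intervals are checked for nothing.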
Straightforward Solutions for GT-Join
3. Synchronized using R-trees.
Store each record as a two-dimensional interval in the R-tree;
Use existing R-tree join algorithms [BKS93, HJR97];
Modifications: (1) integrate the selection condition; (2) join index records as long as they intersect in time dimension and ignore key dimension.
However, not efficient since R-trees do not handle long intervals well.
Our Solutions
Synchronized join using temporal indices.
Multi-version B+-tree (MVBT) [BGO+96]: asymptotically optimal space, update, query.
We propose three synchronized, MVBT-based join algorithms (they apply to other temporal indices as well).
Review of MVBT
A “forest” of trees: different trees may overlap.
Root nodes correspond to contiguous, non-intersecting time intervals.
A record may be stored in multiple pages.
Efficient range-interval selection algorithms.
Top-down GT-Join
Idea: for each pair of trees, one from each MVBT forest, perform a synchronized tree traversal (STT).
STT for two trees:
– initially, join root nodes;
– to join two nodes, join their children;
– eventually, join elements in leaf pages.
Note that special care is needed to avoid duplicates, since a record has multiple copies.
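The traversal steps above can be sketched on a toy tree. This is not an MVBT: the `Node` class is hypothetical, each node simply carries a time interval, and a record is assumed to be copied into every leaf whose interval it intersects. Duplicates are suppressed here by collecting result pairs in a set, which is a simple stand-in and not necessarily how the talk's algorithms avoid them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    start: int
    end: int
    children: list = field(default_factory=list)  # non-empty for inner nodes
    records: list = field(default_factory=list)   # leaf entries: (rec_id, start, end)


def overlap(s1, e1, s2, e2):
    """True if closed intervals [s1, e1] and [s2, e2] intersect."""
    return s1 <= e2 and s2 <= e1


def stt_join(a, b, out):
    """Synchronized traversal: descend only into node pairs that overlap in time."""
    if not overlap(a.start, a.end, b.start, b.end):
        return
    if a.children and b.children:
        for ca in a.children:
            for cb in b.children:
                stt_join(ca, cb, out)
    elif a.children:
        for ca in a.children:
            stt_join(ca, b, out)
    elif b.children:
        for cb in b.children:
            stt_join(a, cb, out)
    else:
        # Two leaves: join their records on interval overlap.
        for xid, xs, xe in a.records:
            for yid, ys, ye in b.records:
                if overlap(xs, xe, ys, ye):
                    out.add((xid, yid))  # the set drops repeated pairs


# x2 and y1 span both leaf intervals, so each is stored twice.
tree_x = Node(0, 20, children=[
    Node(0, 9, records=[("x1", 0, 4), ("x2", 3, 12)]),
    Node(10, 20, records=[("x2", 3, 12), ("x3", 15, 18)]),
])
tree_y = Node(0, 20, children=[
    Node(0, 9, records=[("y1", 2, 11)]),
    Node(10, 20, records=[("y1", 2, 11), ("y2", 19, 20)]),
])

result = set()
stt_join(tree_x, tree_y, result)
```

The pair (x2, y1) is discovered in two different leaf pairs but appears once in `result`.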
Link-based GT-Join
In each leaf page, store a pointer to its predecessor.
[Figure: pages labeled A, B, C and D]
For GT-Join:
– find pairs of data pages that (1) intersect with the right border of the query rectangle; and (2) intersect with each other in time dimension;
– keep such pairs in priority queue;
– sweep left synchronously.
Plane Sweep GT-Join
Similar to link-based.
Maintain two priority queues, one for each MVBT.
At each step, access the leaf page with the largest end time and add records to buffer.
To add records to buffer, join with existing records from the other MVBT.
Throw away useless records.
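The sweep itself can be sketched in memory. As a simplification, the priority queues below hold individual records rather than MVBT leaf pages, records are `(id, start, end)` tuples with closed intervals, and "useless" is interpreted as "starts after the current sweep position"; all of these are assumptions for illustration.

```python
import heapq


def plane_sweep_gt_join(xs, ys):
    """Sweep from the largest end time downwards, joining each popped record
    with the buffered records of the other input."""
    # Max-heaps on end time (negated, since heapq is a min-heap).
    heaps = {
        "x": [(-end, start, rid) for rid, start, end in xs],
        "y": [(-end, start, rid) for rid, start, end in ys],
    }
    for heap in heaps.values():
        heapq.heapify(heap)
    buffers = {"x": [], "y": []}
    result = []

    while heaps["x"] or heaps["y"]:
        # Take the record with the largest end time across both queues.
        side = max((s for s in heaps if heaps[s]), key=lambda s: -heaps[s][0][0])
        other = "y" if side == "x" else "x"
        neg_end, start, rid = heapq.heappop(heaps[side])
        end = -neg_end

        # Buffered records that start after `end` cannot match this record or
        # any later one (later records end even earlier): throw them away.
        buffers[other] = [rec for rec in buffers[other] if rec[1] <= end]

        # Every surviving buffered record ends at or after `end` and starts
        # at or before it, so it intersects the current record.
        for other_id, _, _ in buffers[other]:
            result.append((rid, other_id) if side == "x" else (other_id, rid))

        buffers[side].append((rid, start, end))
    return result


xs = [("x1", 0, 4), ("x2", 3, 12), ("x3", 15, 18)]
ys = [("y1", 2, 11), ("y2", 19, 20)]
pairs = plane_sweep_gt_join(xs, ys)
```

Because records are consumed in decreasing end-time order, no explicit overlap test is needed once the buffer has been pruned.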
GE-Join Solutions...
GE-Join Solutions...
Similarly, we have:
unsynchronized
synchronized using B+-trees
synchronized using R-trees
top-down using MVBT
link-based using MVBT
Note: some of them, especially the link-based algorithm, are quite different due to the different join condition.
Implemented Algorithms
Common to both GT-Join and GE-Join:
Notation     Meaning
mvbt_df      Synchronized MVBT, depth-first
mvbt_bf      Synchronized MVBT, breadth-first
mvbt_link    Synchronized MVBT, link-based
r*_df        Synchronized R*-tree, depth-first
r*_bf        Synchronized R*-tree, breadth-first
Implemented Algorithms
Specific to GT-Join:
mvbt_ps      Synchronized MVBT, plane-sweep
spj          Spatially partitioned join [LOT94]
Specific to GE-Join:
b+           Synchronized B+-tree, index on key
mvbt_sm      Unsynchronized, sort-merge after selection
Experimental Setup
• Implemented in GNU C++.
• Sun Enterprise 250 Server machine with two UltraSPARC-II processors using Solaris 2.8.
• Page size = 8KB.
• Buffer size = 10MB; LRU buffer.
• Each data set: 10 million records.
• R/I ratio: length of query key range divided by length of query time interval. It describes the shape of the query rectangle.
GT-Join Performance
R/I ratio = 10.
[Figure: bar chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_ps, r*_df, r*_bf and spj; vertical axis ticks from 0 to 7000]
GT-Join Performance
R/I ratio = 0.1.
[Figure: bar chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_ps, r*_df, r*_bf and spj; vertical axis ticks from 0 to 1500]
GE-Join Performance
R/I ratio = 10.
[Figure: bar chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_sm, b+, r*_df and r*_bf; vertical axis ticks from 0 to 900]
GE-Join Performance
R/I ratio = 0.1.
[Figure: bar chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_sm, b+, r*_df and r*_bf; vertical axis ticks from 0 to 2500]
Conclusions
• We addressed index-based GT-Join and GE-Join.
• Joins using traditional indices (B+-tree, R-tree) are not efficient.
• We proposed various synchronized approaches based on temporal indices (MVBT).
• Experiments:
– for GT-Join, link-based and plane-sweep are the best;
– for GE-Join, link-based and sort-merge are the best;
– overall, link-based is the best: multi-fold improvement over B+-tree/R-tree joins.