Stabbing the Sky: Efficient Skyline Computation over Sliding Windows
COMP9314 Lecture Notes
COMP9314 Xuemin [email protected] 2
Outline
• Introduction
• n-of-N Queries
• (n1, n2)-of-N Queries
• Performance Evaluation
• Conclusions
COMP9314 Xuemin [email protected] 3
Skyline
Skyline Query:• Input: a set of points in d-
dimensional space.• Output: points not dominated
by another point.
• (x1, x2, …, xd) dominates (y1, y2, …, yd) iff xi<=yi (1<=i<=d) & ∃k, xk<yk.
COMP9314 Xuemin [email protected] 4
Applications
Multi-criteria decision making…
Stock Trading Example:
• What are the top deals?
COMP9314 Xuemin [email protected] 5
Skyline Query Over Sliding WindowStock Trading Example• Top deals of a stock in the last 5 mins? last 4 mins, …• Top deals of a stock in the last 10K deals? …
Queries:• n-of-N model (n <= N): the most recent n elements
• (n1, n2)-of-N model
• One-time queries• Continuous queries
COMP9314 Xuemin [email protected] 6
Challenges
Insertions & deletions (possibly high speed).
On-line information– memory requirement– processing speed
Existing techniques do not support n-of-N:[Borzsonyi et al (ICDE01), Tan et al (VLDB01), Kossman et al (VLDB 02), Papadias et al (SIGMOD03), Kapoor (SIAM J. comp00)]
– support the computation of whole dataset– O (n logd-2 n) for d >= 4 & O (n log n ) otherwise
COMP9314 Xuemin [email protected] 7
Results
n-of-N:• keep N’ (N’ N) elements where N’ = O (logd N) if data
distribution on each dimension is independent.• a novel encoding scheme, with O (N’) space, leads to n-
of-N query time O ( log N’ + s ) instead of O (n logd-2 n).• a new trigger based technique for continuously
processing an n-of-N query. – trigger update time: O ( log s).– result update time: O (logδ) where δ is a result change.
(n1, n2)-of-N: similar results.
COMP9314 Xuemin [email protected] 8
n-of-N Queries
• e is redundant Point e in PN (the most recent N elements) iff– e expires w.r.t PN, or
– ∃e’ s.t. e’ e, and e’ is younger than e
N = 6
COMP9314 Xuemin [email protected] 9
Optimality
Theorem: Non-redundant Points (RN) vs. n-of-N Skyline Query Result (Qn,N)
– (PN – RN) does not appear in any Qn,N
– Qn,N must be a subset of RN
xRN n, xQn,N
– RN = O(logd-1N) for “independent” distributions
• Only need to keep RN – the minimum number of elements to be kept.
COMP9314 Xuemin [email protected] 10
Querying RN
• critical dominance: e e’ where e is the youngest.
• dominance graph GRN: RN and the critical dominance
relationships.
Y
X
M = 7, N = 7
1 2
3
4
6
5
7
Y
X
M = 7, N = 7
1 2
3
4
6
5
7
COMP9314 Xuemin [email protected] 11
Querying RN
e Qn,N iff
• e is a root in GRN or
• e’ e in GRN & e’ has expired t(e’) < M – n + 1 <= t(e)Y
X
M = 7, N = 7
1 2
3
4
6
5
7
n RN M-n+1 Qn,N
n=5 {3,4,5,6,7 } 3 {3,4}
n=4 {4,5,6,7 } 4 {4, 7}
n=3 {5,6,7} 5 {5,6,7}
COMP9314 Xuemin [email protected] 12
Querying RN: Optimal Algorithm
To answer an n-of-N Query, encode the GRN using intervals:
• Stab the intervals by (M-n+1).• For all returned intervals (x,y), return point whose timestamp is y
• Technique: Use an interval tree index to achieve optimal O(log|RN|+s) query timeY
X
M = 7, N = 7
1 2
3
4
6
5
7
Y
X
M = 7, N = 7
1 2
3
4
6
5
7
(3,7](4,6]
(4,5](0,3]
(0,4]
e.g., n=6
(0,3](0,4](3,7](4,5](4,6]
COMP9314 Xuemin [email protected] 13
Maintaining RN
new element enew arrives:
• If the oldest eold RN expires, remove eold and update RN and GRN (interval
tree).
• find D RN dominated by enew, update RN and GRN
– Depth-first search on a R-tree of RN
• find e c enew, update GRN
– Best-first search on the R-tree of RNY
X
M = 8, N = 5
3
4
6
5
7
8
Y
X
M = 8, N = 5
3
4
6
5
7
8
enew = 8eold = 3D = {6}e = 4
COMP9314 Xuemin [email protected] 14
Continuous n-of-N QueryTrigger-based algorithm: • Deletion: Qn,N – {eold}, and Qn,N – {D}
• Insertion: Qn,N {enew} if (e’ c enew and t(e’ ) >= M-n+1)
• Maintain a min-heap of Qn,N for efficiency
Y
X
M = 8, N = 5
3
4
6
5
7
8
enew = 8M = 8eold = 3D = {6}e’ = 4
M = 7n = 4, N=5Q4,5 = {4,7}
M = 8n = 4, N=5Q4,5 = {5,7,8}
Q5,5 = {3,4} Q5,5 = {4,7}
COMP9314 Xuemin [email protected] 15
(n1,n2)-of-N Query
More complicated than n-of-N Query• PN needs to be kept!
• (Old) critical dominance: t (ae) = max { t (e’): e’ e & t (e’) < t (e) }
• backward critical dominance: t (be) = min { t (e’): e’ e & t (e’) > t (e)}
• e Q(n1,n2),N iff ae < M-n2+1 e M-n1+1 < be
• CBC dominance graph: PN & the two kinds of dominance relationships
M = 7, N = 7Y
X1
2
3
4
5
6
7
(2,4)-of-7:{4, 6}
M = 7, N = 7n1 = 4, n2 = 6
Y
X1
2
3
4
5
6
7
a5 = 3, b5 = 6(M-n2+1, M-n1+1] = (4,6]
COMP9314 Xuemin [email protected] 16
Processing (n1,n2)-of-N QueryEncode the CBC dominance graph:
– e ((ae, e], be)
– build an interval tree on (ae, e] only
Stab using M-n2+1 against the interval tree and check e <= M-n1+1 < be) on-the-fly:
– O(logN+s*), sub-optimal
M = 7, N = 7Y
X1
2
3
4
5
6
7
(2,4)-of-7: ???(M-n2+1, M-n1+1] = (4,6]
(1,6](3,4](3,5]...
(2,4)-of-7: {4, 6} Candidates: {4, 5, 6}
COMP9314 Xuemin [email protected] 17
More on (n1,n2)-of-N QueryMaintenance: Similar to that of n-of-N query, but
– Always expires the oldest element in PN, and maintain the interval tree and the R-tree on RN.
– Implementation-wise: Use two interval trees to index RN and PN-RN, respectively.
Continuous queries– More complicated
• A new skyline point might not be a skyline in the previous result,
• nor critically dominated by a skyline point in the previous result
• nor a newly arrived point
– Basic idea• Maintain additional Candidate Solutions (minimization) & triggers
– Details in the full paper
COMP9314 Xuemin [email protected] 18
Experiment Setup
• Hardware– P4 2.8G CPU, 1G Memory
• Datasets– Correlated, independent, and anti-
correlated– d = 2 to 5, N = 106
• Algorithms– KLP, nN, mnN, cnN, n12N, mn12N
• Metrics– Processing time Streaming rate
COMP9314 Xuemin [email protected] 19
n-of-N Query
• Varying dimensionality– M up to 2M, N = 1M, n uniformly from [1K,
1M], #queries = 1000
COMP9314 Xuemin [email protected] 20
n-of-N Query (cont’d)
• Varying n• for correlated, independent, and anti-correlated datasets
COMP9314 Xuemin [email protected] 21
Maintenance Costs
2d and 5d datasets, measure average and max time, N = i * 105
COMP9314 Xuemin [email protected] 22
Scalability
M (total number) = 2M, N = 1M, #queries = 2M
anti-correlatedindependent
COMP9314 Xuemin [email protected] 23
Continuous n-of-N Queries
• 2d & 5d datasets • N = 10K and 1M • 10 queries with n = i*(N/10)• measures cnN avg, cnN max, nN avg, nN max
COMP9314 Xuemin [email protected] 24
(n1,n2)-of-N Queries
Varying dimensionality– M up to 2M, N = 1M, #queries = 1000
– restricting n2 – n1 >= 500
Scalability– M = 2M, N = 1M, #queries = 2M
COMP9314 Xuemin [email protected] 25
Maintenance
• 2d and 5d datasets• measure average and max time• N = i * 105
COMP9314 Xuemin [email protected] 26
Conclusions
• Efficient algorithms for various sliding windows skyline queries– Keep only minimum number of points– Encode and index those points– Maintain all the data structures
• The proposed solutions – have theoretical guarantee on the performance, and– have demonstrated efficiency and scalability in the experiments
• Future work– Improve the current solution for (n1,n2)-of-N queries
– Approximate skyline queries
COMP9314 Xuemin [email protected] 28
Reference• [ICDE01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline
operator. ICDE, 2001.
• [VLDB01] K. Tan, P. Eng, and B. Ooi. Efficient progressive skyline computation. VLDB, 2001.
• [VLDB 02] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. VLDB, 2002.
• [SIGMOD03] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal progressive alogrithm for skyline queries. SIGMOD, 2003.
• [SIAM J. comp00] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM J. Comput., 2000.
Top Related