Continuous Processing of Preference Queries in Data Streams : a Survey
M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos
Data Engineering Lab, Department of Informatics
Aristotle University of Thessaloniki
Presentation Layout
• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary
Data Streams
• A data stream is an infinite sequence of objects.
• Each object can be one-dimensional or multi-dimensional.
• Streaming time series are finite sequences of objects.
• Streaming time series change over time.
• The arrival rate of objects usually varies.
Sliding Window Model (1)
Count-based window: the sliding window contains the W most recent tuples (“active”). Older tuples expire.
[Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
Sliding Window Model (2)
Time-based window: the sliding window contains the tuples (“active”) of the W most recent timestamps. Older tuples expire.
[Figure: timeline of tuples t1 to t8 with W=5, marking expired and active tuples]
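The two window models can be sketched in a few lines of Python. This is only an illustration: the class names and the convention that a tuple with timestamp ts is active while ts > now - W are assumptions, not part of the slides.

```python
from collections import deque


class CountBasedWindow:
    """Keeps the W most recent tuples; older tuples expire."""

    def __init__(self, w):
        self.w = w
        self.active = deque()

    def insert(self, item):
        """Add a tuple and return the tuples that expired because of it."""
        self.active.append(item)
        expired = []
        while len(self.active) > self.w:
            expired.append(self.active.popleft())
        return expired


class TimeBasedWindow:
    """Keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.active = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp, item):
        """Add a tuple and return the tuples that expired at this timestamp."""
        self.active.append((timestamp, item))
        expired = []
        # Assumed convention: a tuple is active while its timestamp > now - W.
        while self.active and self.active[0][0] <= timestamp - self.w:
            expired.append(self.active.popleft()[1])
        return expired


count_window = CountBasedWindow(5)
count_expired = [e for i in range(1, 9) for e in count_window.insert(f"t{i}")]

time_window = TimeBasedWindow(5)
time_window.insert(1, "a")
time_window.insert(1, "b")
time_expired = time_window.insert(6, "c")
```

With a count-based window the number of active tuples is fixed, while a time-based window may hold any number of tuples per timestamp.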
Continuous Evaluation in a Data Stream System
[Figure: a database system (user/application, input, query, result) compared with a data stream system (user/application, query, query processor, result)]
Motivation (1)
Numerous data stream contexts:
• Financial data analysis
• Network management
• Astronomical data analysis
• Sensor networks
• Telecommunication data management
Motivation (2)
Preference queries:
• Useful decision support tool
• Many applications in data streams
Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls.
• Continuous skyline query
Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers.
• Continuous top-k dominating query
Skyline Query
Dominant tuple: A tuple t dominates another tuple t’ if
• t is not worse than t’ in all dimensions, and
• t is better than t’ in at least one dimension.
Skyline: contains all the tuples not dominated by any other tuple.

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

[Figure: the hotels T1 to T6 plotted with axes price and distance]
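The dominance test and the skyline definition can be checked on the hotels data with a small sketch. It assumes that smaller values are better in both dimensions (cheaper and closer), which the slide does not state explicitly; the function names are illustrative.

```python
def dominates(a, b):
    """True if a is not worse than b in all dimensions and better in at
    least one (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Names of all points not dominated by any other point (quadratic scan)."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


hotels = {
    "T1": (4, 1),
    "T2": (3, 2),
    "T3": (0.5, 3),
    "T4": (2.5, 4.5),
    "T5": (1.5, 4),
    "T6": (3.5, 5),
}
hotel_skyline = skyline(hotels)
```

Under this assumption T4, T5 and T6 are all dominated by T3, so the skyline consists of T1, T2 and T3.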
Continuous Skyline Query
Problem definition: continuously evaluate a skyline query over multidimensional streaming time series.
Application example: network data
• Computers with suspicious behavior
• Network traffic, number of connections, number of destinations
Basic Idea
The skyline changes due to:
• The insertion of a new skyline tuple.
• The expiration of a skyline tuple.
LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]
• Use of a spatial index
• Advantage: simple implementation
• Disadvantage: the expiration of a skyline tuple is not handled efficiently
Event Approach (1)
An existing skyline tuple expires:
• How can we find the new skyline tuples?
• Very costly operation
Skyline influence time (SIT)
• The minimum time at which a tuple may become a skyline tuple.
• Generate events based on SIT
Event Approach (2)
[Figure: tuples A(1) to L(12) in a window with W=10]
K.SIT=19
Tuple K can be discarded due to tuple L (younger and better).
Eager [Tao, TKDE06]
• Advantage: handles skyline expiration
• Disadvantage: processing time per tuple
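The pruning rule used above (a tuple can be discarded once a younger tuple dominates it) can be sketched as follows. The buffer class, the coordinates and the time-based expiration are illustrative assumptions; the sketch does not implement SIT events or the Eager algorithm itself.

```python
def dominates(a, b):
    """Assumption: smaller values are better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class CandidateBuffer:
    """Active tuples that may still appear on the skyline at some point."""

    def __init__(self):
        self.items = []  # (arrival_time, name, point), oldest first

    def insert(self, arrival_time, name, point):
        # An older tuple dominated by the newcomer expires first and can
        # never return to the skyline, so it is dropped right away.
        self.items = [it for it in self.items if not dominates(point, it[2])]
        self.items.append((arrival_time, name, point))

    def expire(self, now, w):
        # Keep only tuples that arrived within the last w time units.
        self.items = [it for it in self.items if it[0] > now - w]

    def skyline(self):
        pts = self.items
        return [n for _, n, p in pts if not any(dominates(q, p) for _, _, q in pts)]


buf = CandidateBuffer()
buf.insert(11, "K", (5, 5))   # hypothetical coordinates
buf.insert(12, "L", (3, 4))   # L is younger and better than K
names_after_l = [n for _, n, _ in buf.items]
buf.insert(13, "M", (1, 6))
skyline_after_m = buf.skyline()
buf.expire(now=22, w=10)
names_after_expiry = [n for _, n, _ in buf.items]
```

Because every discarded tuple is dominated by a tuple that outlives it, the skyline of the buffer equals the skyline of the full window.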
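n-of-N Skyline Queries (1)
n-of-N skyline query: the skyline of the n most recent tuples, for any n ≤ N.
[Figure: n-of-N definition, with example results S6 = {a,c} and S4 = {c,g}. Source: icde05]
n-of-N Skyline Queries (2)
[Figure: n-of-N definition, with example results S6 = {c,h} and S4 = {e,h}. Source: icde05]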
Method cnN (1)
Method cnN [Lin, ICDE05] is also based on events.
[Figure: tuples A(1) to L(12) in a window with W=10]
• Tuple K is redundant because tuple L is better and younger than K.
• Tuple L is dominated by D and E.
• The dominance relation between L and E is critical because E is the youngest tuple which dominates L.
Method cnN (2)
Generate intervals:
• For the skyline tuples, e.g. C = (0,3]
• For the critical dominance relations, e.g. C -> G = (3,7]
• Use an interval tree to store them
The dominance graph contains all the critical dominance relations.
[Figure: dominance graph over tuples A(1) to G(7), marking redundant tuples and critical dominance relations]
Method cnN (3)
A tuple t is in the answer of an n-of-N skyline query iff there exists an interval for t containing the value M-n+1, where M is the total number of elements seen so far.
To answer an n-of-N query, apply a stabbing query with the value M-n+1.
Example with M = 7:
C = (0,3]
D = (0,4]
D -> E = (4,5]
D -> F = (4,6]
C -> G = (3,7]
For n = 6, M-n+1 = 2 and S6 = {C, D}.
For n = 4, M-n+1 = 4 and S4 = {D, G}.
[Figure: dominance graph over tuples A(1) to G(7) with the stabbing query]
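The stabbing query of the example can be reproduced with a short sketch. Two assumptions are made: the interval of a critical dominance relation such as C -> G is attributed to the dominated tuple (G), which is consistent with S4 = {D, G}, and a linear scan stands in for the interval tree.

```python
# (answer tuple, low, high) for half-open intervals (low, high].
# Skyline tuples C and D own their intervals; the interval of a critical
# dominance relation is attributed to the dominated tuple (assumption).
intervals = [
    ("C", 0, 3),
    ("D", 0, 4),
    ("E", 4, 5),  # D -> E
    ("F", 4, 6),  # D -> F
    ("G", 3, 7),  # C -> G
]


def n_of_n_skyline(intervals, m, n):
    """Answer an n-of-N skyline query by stabbing with the value m - n + 1.
    A linear scan is used here instead of an interval tree."""
    q = m - n + 1
    return sorted(t for t, low, high in intervals if low < q <= high)


s6 = n_of_n_skyline(intervals, m=7, n=6)
s4 = n_of_n_skyline(intervals, m=7, n=4)
```

Since one set of intervals answers every value of n, the same structure serves multiple queries at once.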
Method cnN (4)
Advantages
• Good use of skyline properties
• Multiple query processing
Disadvantages
• Processing time per tuple
• Increased memory requirements
Frequent Skyline - Motivation
• Highly dynamic environment
• The skyline results are meaningful only if the skyline tuples appear consistently
• Frequent skyline: tuples on the skyline for a minimum user-defined interval [Zhang, SIGMOD09]
Streaming Model
• Client/server architecture
• The server receives object updates from the clients.
• Each object can be represented as a d-dimensional point.
• Object update (point movement in the d-dimensional space): the value changes in at least one dimension.
• Object insertion or deletion: point movement from/to a nonexistent position.
• Minimization of communication cost
Filter
Safe region technique:
• The skyline remains unchanged if each object stays in a safe region.
• Communication happens only when the safe region is violated.
• The safe region approach leads to communication optimization.
[Figure: an object as a point and its filter (safe region). Source: sigmod09]
Sampling
• All clients report their skyline at the same sampled time.
• The clients are synchronized with the same random seed.
• Guaranteed quality if the sampling rate is high enough.
Hybrid
Hybrid solution: combines Filter and Sampling
• Small changes: apply Filter
• Larger changes: apply Sampling
Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks)
k-dominant Skyline Query - Motivation
Skyline: contains tuples not dominated by any other tuple.
Disadvantage: high dimensionality problem.
Solution: relax the notion of dominance.
k-dominant tuple: A tuple t k-dominates another tuple t’ if
• t is not worse than t’ in at least k dimensions, and
• t is better than t’ in at least one of them.
k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
k-dominant Skyline Query - Example

      D1   D2   D3   D4   D5   D6
T1    6    5    4    3    2    1
T2    5    4    3    5    4    3
T3    6    6    2    2    6    5
T4    6    6    6    1    6    6
T5    6    6    6    5    5    5

• T1 dominates T5
• T1 5-dominates T4
• T1 4-dominates T3
Conventional skyline: {T1, T2, T3, T4}
5-dominant skyline: {T1, T2, T3}
4-dominant skyline: {T1, T2}
The smaller the k, the fewer tuples in the k-dominant skyline.
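The three results of the example can be reproduced with a direct implementation of the k-dominance definition. The sketch assumes that smaller values are better, which matches the dominance relations listed on the slide; the function names are illustrative.

```python
def k_dominates(a, b, k):
    """True if a is not worse than b in at least k dimensions and better in
    at least one of them (assumption: smaller values are better)."""
    not_worse = sum(1 for x, y in zip(a, b) if x <= y)
    better = sum(1 for x, y in zip(a, b) if x < y)
    # Any strictly better dimension is also a "not worse" one, so it can
    # always be chosen among the k dimensions.
    return not_worse >= k and better >= 1


def k_dominant_skyline(points, k):
    """Names of all points not k-dominated by any other point."""
    return sorted(
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    )


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}
conventional = k_dominant_skyline(data, 6)
five_dominant = k_dominant_skyline(data, 5)
four_dominant = k_dominant_skyline(data, 4)
```

With k equal to the number of dimensions, k-dominance reduces to conventional dominance.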
Observations
• Traditional or streaming skyline methods are inappropriate
• Skyline properties do not hold, e.g. the transitive property
• k-dominance can be cyclic
• Existence of multiple users and multiple queries
Method CoSMuQ (1)
• A query on D dimensions arrives.
• Given a parameter value k, split the query into subqueries of d=k dimensions.
• Compute the conventional skyline of each subquery.
• The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
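The splitting idea can be illustrated on a static data set: the k-dominant skyline is obtained by intersecting the conventional skylines of all k-dimensional subspaces. This is only a sketch of the decomposition, assuming smaller values are better; it does not model the streaming or multi-query parts of CoSMuQ, and the function names are illustrative.

```python
from itertools import combinations


def dominates(a, b):
    """Conventional dominance (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def subspace_skyline(points, dims):
    """Conventional skyline of the points projected onto the given dimensions."""
    proj = {n: tuple(p[d] for d in dims) for n, p in points.items()}
    return {
        n
        for n, p in proj.items()
        if not any(dominates(q, p) for m, q in proj.items() if m != n)
    }


def k_dominant_by_intersection(points, k):
    """Intersect the skylines of all subqueries of k dimensions."""
    d = len(next(iter(points.values())))
    result = set(points)
    for dims in combinations(range(d), k):
        result &= subspace_skyline(points, dims)
    return sorted(result)


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}
five_dominant = k_dominant_by_intersection(data, 5)
four_dominant = k_dominant_by_intersection(data, 4)
```

The number of subqueries grows with the number of k-subsets of the D dimensions, which is one way to see why memory requirements increase in high dimensionality.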
Method CoSMuQ (2)
Advantages
• Based on conventional skyline (simple domination checks)
• Properties of conventional skylines can be used
• Exploits the overlap between different queries
Disadvantages
• Memory requirements increase in high dimensionality
Continuous Skyline Methods - Summary

Method                Query Type           Window Type   Multiple Queries
LookOut               skyline              time          no
Lazy and Eager        skyline              both          no
n-of-N                skyline              count         yes
Filter and Sampling   frequent skyline     time          no
CoSMuQ                k-dominant skyline   both          yes
Top-k Query - Example
Given a preference function, a top-k query returns the k tuples with the best scores.

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

[Figure: the hotels T1 to T6 plotted with axes price and distance, with the preference function F=price+distance and the results for k=1 and k=2]
Continuous Top-k Query
Problem definition: continuous evaluation of a top-k query over multidimensional streaming time series.
Application example: network data
• Top-100 flows with the largest individual throughput
• Common destination: DDoS attack
Basic Idea
A new tuple changes the top-k:
• It should belong to the influence region of the query
A top-k tuple expires:
• Query computation from scratch
TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]
• Advantage: simple implementation
• Disadvantage: no efficient handling of an expired top-k tuple
[Figure: influence region of the query in the (x1, x2) space, bounded by the line defined by F = score(tk) = x1 + x2. Source: sigmod06]
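The behaviour described above can be sketched as a small monitor: an arriving tuple is considered only if its score beats the current k-th score (it falls in the influence region), and the expiration of a top-k tuple triggers a recomputation from scratch. This is a simplified illustration, not the TMA implementation. It assumes that lower scores are better, a time-based window, and a plain list scan instead of a spatial index; the class and attribute names are hypothetical.

```python
import heapq


class TopKMonitor:
    def __init__(self, k, w, score):
        self.k, self.w, self.score = k, w, score
        self.window = []         # (arrival_time, name, point), oldest first
        self.topk = []           # (score, arrival_time, name), best first
        self.recomputations = 0  # how often the result was rebuilt from scratch

    def _recompute(self):
        self.recomputations += 1
        scored = [(self.score(p), t, n) for t, n, p in self.window]
        self.topk = heapq.nsmallest(self.k, scored)

    def process(self, now, name, point):
        """Handle one arrival at time `now` and return the current top-k names."""
        expired = {n for t, n, _ in self.window if t <= now - self.w}
        self.window = [it for it in self.window if it[1] not in expired]
        self.window.append((now, name, point))
        if any(n in expired for _, _, n in self.topk):
            self._recompute()  # an expired top-k tuple: start over
        else:
            s = self.score(point)
            # Influence region test: only a score better than the k-th matters.
            if len(self.topk) < self.k or s < self.topk[-1][0]:
                self.topk = sorted(self.topk + [(s, now, name)])[: self.k]
        return [n for _, _, n in self.topk]


hotels = [("T1", (4, 1)), ("T2", (3, 2)), ("T3", (0.5, 3)),
          ("T4", (2.5, 4.5)), ("T5", (1.5, 4)), ("T6", (3.5, 5))]
monitor = TopKMonitor(k=1, w=3, score=lambda p: p[0] + p[1])
history = [monitor.process(t, n, p) for t, (n, p) in enumerate(hotels, start=1)]
```

In this run the hotels arrive one per time unit (an assumed order); the only recomputation happens when T3, the current top-1, expires.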
Skyband - Example
k-skyband: contains all the tuples which are dominated by at most k-1 other tuples.
• 1-skyband: tuples not dominated by other tuples (the 1-skyband is the skyline)
• 2-skyband: tuples dominated by at most 1 other tuple
• A tuple dominated by 2 other tuples belongs to the 3-skyband
[Figure: tuples A to E and their skyband membership]
Skyband Approach (1)
Transform tuples into the (score, expiration_time) space.
Dominance counter (DC): number of tuples that are younger and better.
Rule: keep tuples with DC < k.
Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.

Scores with F=price+distance:
Hotel   score
T1      5
T2      5
T3      3.5
T4      7
T5      5.5
T6      8.5

[Figure: the hotels T1 to T6 in the original space (price, distance) and in the transformed space (score, exp_time), annotated with their DC values (0 or 1) and the top-1 tuple]
Skyband Approach (2)
SMA (Skyband Monitoring Algorithm), proposed in [Mouratidis, SIGMOD06]
• Advantage: independent of the dimensionality, since it works in the 2-dimensional space (score, exp_time)
• Disadvantage: the k-skyband may contain fewer than k tuples. In this case, a top-k tuple expiration will cause query computation from scratch.
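The dominance counter and the rule DC < k can be sketched on the hotel scores. The arrival order T1 to T6 is an assumption (the expiration times of the figure are not recoverable), lower scores are assumed to be better, and the function names are illustrative.

```python
def dominance_counters(stream, score):
    """stream: (name, point) pairs in arrival order, oldest first.
    DC of a tuple = number of younger tuples with a strictly better score."""
    scored = [(name, score(p)) for name, p in stream]
    dc = {}
    for i, (name, s) in enumerate(scored):
        # Tuples that arrived later also expire later.
        dc[name] = sum(1 for _, s2 in scored[i + 1:] if s2 < s)
    return dc


def k_skyband(stream, score, k):
    """Keep the tuples with DC < k, i.e. the k-skyband in (score, exp_time)."""
    dc = dominance_counters(stream, score)
    return [name for name, _ in stream if dc[name] < k]


hotels = [("T1", (4, 1)), ("T2", (3, 2)), ("T3", (0.5, 3)),
          ("T4", (2.5, 4.5)), ("T5", (1.5, 4)), ("T6", (3.5, 5))]
f = lambda p: p[0] + p[1]  # F = price + distance
dc = dominance_counters(hotels, f)
skyband_1 = k_skyband(hotels, f, 1)
skyband_2 = k_skyband(hotels, f, 2)
```

Under the assumed order, the 1-skyband lists the successive top-1 answers: T3 now, T5 once T3 expires, then T6.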
Distributed Top-k
• Continuously report the k largest values obtained from distributed data streams.
• The objective is to minimize communication cost.
• Proposed by [Babcock, SIGMOD03]
Streaming Model
• Nodes: N1, N2, ..., Nm; coordinator node: N0
• Set of n data objects O1, O2, ..., On associated with real values V1, V2, ..., Vn
• Value updates are represented as <Oi, Nj, Δ> tuples: Nj detects a change Δ in the value Vi of Oi. The change is not seen by the other nodes Nk (k ≠ j).
• The value Vi of an object Oi: $V_i = \sum_{j} V_{i,j}$, where $V_{i,j}$ is the value of the i-th object at the j-th node
Method (1)
• Initialize a top-k set at the coordinator node
• Set arithmetic constraints at the monitor nodes; they depend on the current top-k set
• Constraints valid: no communication
• Constraints invalidated: the client communicates with the server, possibly a new top-k set, recomputation of constraints
Method (2) - Adjustment Factors
[Figure: two nodes (Node 1, Node 2) and two objects (Object 1, Object 2), with the values V1,1 = 1, V2,1 = 9, V1,2 = 3, V2,2 = 1 and adjustment factors 0, -3, 0 and 3]
• Top-1 = {O1}
• Node 1, local top-1 = {O1}
• Node 2, local top-1 = {O2}
• Local top-k sets differ from the global top-k => unnecessary constraint violations => increased communication cost
Adjustment factors (AF)
• Node 2: V1,2 = 3+0 = 3
• Node 2: V2,1 = 1+3 = 4
• Local top-k similar to global => low communication cost
• To keep the results valid, the AF for each object sum to zero
Disadvantage: energy consumption is not uniform
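The role of the adjustment factors can be illustrated with a local constraint check. Everything in this sketch is an assumption made for illustration: the numbers are hypothetical (they are not the values of the figure), the constraint is taken to be "every adjusted top-k value is at least every adjusted non-top-k value at each node", and the coordinator's own adjustment factor is left out.

```python
def violated_nodes(values, af, topk):
    """values[node][obj] and af[node][obj]; returns the nodes whose local
    constraints do not hold for the given global top-k set."""
    bad = []
    for node, local in values.items():
        adjusted = {o: v + af[node][o] for o, v in local.items()}
        lowest_top = min(adjusted[o] for o in topk)
        highest_rest = max(
            (adjusted[o] for o in adjusted if o not in topk),
            default=float("-inf"),
        )
        if lowest_top < highest_rest:
            bad.append(node)
    return bad


# Hypothetical values: O1 is the global top-1 (9 + 1 = 10 versus 1 + 3 = 4),
# but node N2 locally ranks O2 first.
values = {"N1": {"O1": 9, "O2": 1}, "N2": {"O1": 1, "O2": 3}}
no_af = {"N1": {"O1": 0, "O2": 0}, "N2": {"O1": 0, "O2": 0}}
# Shift 3 units of O1 from N1 to N2; the factors of each object sum to zero.
with_af = {"N1": {"O1": -3, "O2": 0}, "N2": {"O1": 3, "O2": 0}}

before = violated_nodes(values, no_af, topk={"O1"})
after = violated_nodes(values, with_af, topk={"O1"})
```

Because the factors of each object sum to zero, adding up the local constraints over all nodes yields the global ordering, so a node only needs to contact the coordinator when its own constraint breaks.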
Uncertain Data

Tuples:
Score   Prob.
6       0.8
5       0.5
2       0.4
8       0.4

4 tuples, 16 possible worlds:
Tuples       Pr.
2, 5, 6, 8   .064
2, 5, 6      .096
2, 6, 8      .064
2, 6         .096
2, 5, 8      .016
2, 5         .024
2, 8         .016
2            .024
5, 6, 8      .096
5, 6         .144
5, 8         .024
5            .036
6, 8         .096
6            .144
8            .024
Empty        .036

Pk-topk query: returns the k tuples that are most likely to be in the top-k.
Top-2: {6, 5} with prob. {0.64, 0.5}
Compute the probability of 6: sum the probabilities of the possible worlds in which 6 belongs to the top-2.
Source: pvldb08
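The possible-worlds computation can be reproduced by brute-force enumeration. The sketch assumes independent tuples and that tuples are ranked by ascending score; that ordering is an assumption, chosen because it reproduces the probabilities 0.64 and 0.5 given above. The function name is illustrative, and the enumeration is exponential, so it only suits tiny examples.

```python
from itertools import product


def topk_probabilities(probs, k):
    """probs: {score: existence probability}, tuples assumed independent.
    Returns ({score: probability of being in the top-k}, number of worlds)."""
    scores = sorted(probs)  # ascending: assumed ranking order
    result = {s: 0.0 for s in scores}
    worlds = 0
    for present in product([False, True], repeat=len(scores)):
        worlds += 1
        p_world = 1.0
        for s, here in zip(scores, present):
            p_world *= probs[s] if here else 1.0 - probs[s]
        existing = [s for s, here in zip(scores, present) if here]
        for s in existing[:k]:  # the k best-ranked tuples of this world
            result[s] += p_world
    return result, worlds


probs = {6: 0.8, 5: 0.5, 2: 0.4, 8: 0.4}
in_top2, n_worlds = topk_probabilities(probs, 2)
pk_top2 = sorted(in_top2, key=lambda s: -in_top2[s])[:2]
```

Tuple 6 is in the top-2 of a world whenever it exists and at most one of the tuples 2 and 5 exists, which gives 0.8 × (1 - 0.4 × 0.5) = 0.64.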
Pk-topk Query
Solution proposed by [Jin, PVLDB08]
• Compact set based
• Space-efficient solution: discard unnecessary tuples and apply several compression schemes to compress the data
Disadvantages
• Model assumption: the probabilities of the tuples are assumed to be random and independent of each other.
Continuous Top-k Methods - Summary

Method              Query Type          Window Type   Multiple Queries
TMA and SMA         top-k               both          yes
Distributed top-k   distributed top-k   time          no
Compact set based   Pk-topk             both          no
Top-k Dominating Query - Example
Skyline: contains all the tuples not dominated by any other tuple.
• Disadvantage: high dimensionality problem.
Top-k: given a preference function, a top-k query returns the k tuples with the best scores.
• Disadvantage: user-defined preference function.
Top-k dominating: the answer contains the k tuples with the highest domination power.
• Combines the advantages of skyline and top-k queries and avoids their disadvantages.

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

[Figure: the hotels T1 to T6 plotted with axes price and distance, with the preference function F=price+distance and the results for k=1 and k=2]
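A top-k dominating query on the hotels data can be sketched as follows. The sketch assumes that the domination power of a tuple is the number of tuples it dominates and that smaller values are better; ties are broken by name, and the function names are illustrative.

```python
def dominates(a, b):
    """Assumption: smaller values are better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def domination_scores(points):
    """Number of other tuples that each tuple dominates."""
    return {
        n: sum(1 for m, q in points.items() if m != n and dominates(p, q))
        for n, p in points.items()
    }


def top_k_dominating(points, k):
    """The k tuples with the highest domination score (ties broken by name)."""
    scores = domination_scores(points)
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


hotels = {
    "T1": (4, 1),
    "T2": (3, 2),
    "T3": (0.5, 3),
    "T4": (2.5, 4.5),
    "T5": (1.5, 4),
    "T6": (3.5, 5),
}
scores = domination_scores(hotels)
top2 = top_k_dominating(hotels, 2)
```

No preference function is needed, and the size of the answer is fixed by k. Note that T5 appears in the top-2 although it is not a skyline tuple under this assumption.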
Continuous Top-k Dominating Query
Problem definition: continuous evaluation of a top-k dominating query over multidimensional streaming time series.
Application example: sensor network
• Areas with high probability of fire outbreak
• Temperature, humidity and wind speed
EVA
Objective: reduce domination checks
Safe interval of a tuple
• Ignore the tuple for this interval
• It depends on the tuple's score and the k-th score
• End of safe interval -> event
Event
• Try to compute a new safe interval, else
• Compute the score from scratch
New tuple
• Find another tuple that dominates the new one
• Estimate a lower bound of the safe interval
ADA
Advanced computation of the safe interval
• Depends on the number of tuples that dominate this tuple and expire later
Candidate tuples
• Tuples with scores close to the k-th score are updated at each time instance
EVA and ADA were proposed by [Kontaki 2009]
Summary
• Preference queries are very useful in data streams
• Presented state-of-the-art methods for continuous skyline queries, continuous top-k queries and continuous top-k dominating queries
• Examined advantages and disadvantages of the proposed methods
Research Directions
• Continuous subspace skyline queries
• Solutions appropriate for distributed environments, with uniform energy consumption
• Approximate algorithms
• Existence of multiple queries
Thank you