Continuous Processing of Preference Queries in Data Streams : a Survey


Page 1: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Processing of Preference Queries in

Data Streams : a Survey

M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos

Data Engineering Lab, Department of Informatics

Aristotle University of Thessaloniki

Page 2: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Presentation Layout

• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary


Page 4: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Data Streams
• A data stream is an infinite sequence of objects.
• Each object can be one-dimensional or multi-dimensional.
• Streaming time series are finite sequences of objects.
• A streaming time series changes over time.
• The arrival rate of objects usually varies.

Page 5: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Sliding Window Model (1)

Count-based window: the sliding window contains the W most recent tuples (“active”). Older tuples expire.

[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked as expired or active.]

Page 6: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Sliding Window Model (2)

Time-based window: the sliding window contains the tuples (“active”) of the W most recent timestamps. Older tuples expire.

[Figure: timeline of tuples t1 to t8, with several tuples sharing a timestamp, and a window of W=5; tuples are marked as expired or active.]
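The two window models can be sketched as follows. This is a minimal illustration, not code from the survey: the class names `CountWindow` and `TimeWindow` and their methods are hypothetical, and timestamps are assumed to be integers that arrive in non-decreasing order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recent tuples."""

    def __init__(self, w):
        self.items = deque(maxlen=w)

    def insert(self, item):
        # The oldest tuple expires only when the window is already full.
        full = len(self.items) == self.items.maxlen
        expired = self.items[0] if full else None
        self.items.append(item)  # deque(maxlen=w) drops the oldest itself
        return expired


class TimeWindow:
    """Time-based window: keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp, item):
        self.items.append((timestamp, item))
        expired = []
        # A tuple expires once its timestamp falls out of the last W timestamps.
        while self.items[0][0] <= timestamp - self.w:
            expired.append(self.items.popleft()[1])
        return expired
```

With W=5, inserting t1 to t8 into a `CountWindow` leaves t4 to t8 active. A `TimeWindow` may hold more or fewer than W tuples, since several tuples can share one timestamp.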

Page 7: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

[Figure: a user/application and a database system; labels: Input, Query, Result.]

Page 8: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Evaluation in a Data Stream System

[Figure: a user/application and a query processor; labels: Query, Result.]

Page 9: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Motivation (1)

Numerous data stream contexts:
• Financial data analysis
• Network management
• Astronomical data analysis
• Sensor networks
• Telecommunication data management

Page 10: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Motivation (2)

Preference queries:
• Useful decision support tool
• Many applications in data streams

Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls.
• Continuous skyline query

Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers.
• Continuous top-k dominating query

Page 11: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Presentation Layout

• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Conclusions

Page 12: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Skyline Query

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

[Figure: hotels T1 to T6 plotted by price and distance.]

Dominant tuple: A tuple t dominates another tuple t’ if
• t is not worse than t’ in all dimensions, and
• t is better than t’ in at least one dimension.

Skyline: contains all the tuples not dominated by any other tuple.
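The dominance test and the skyline definition can be checked against the hotel data on this slide. The sketch below is a brute-force illustration; the function names are hypothetical, and it assumes that smaller values are better in both dimensions (cheaper and closer hotels).

```python
def dominates(a, b):
    """a dominates b: not worse in all dimensions, better in at least one.

    Assumption: smaller values are better in every dimension.
    """
    not_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse and better


def skyline(points):
    """Names of all tuples that no other tuple dominates (quadratic scan)."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    }


# (price, distance) pairs from the hotel table.
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

print(sorted(skyline(hotels)))  # ['T1', 'T2', 'T3']
```

Under this assumption T3 dominates T4, T5 and T6, so only T1, T2 and T3 remain on the skyline.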

Page 13: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Skyline Query

Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series.

Application example: network data
• Computers with suspicious behavior.
• Network traffic, number of connections, number of destinations.

Page 14: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Basic Idea

The skyline changes due to:
• The insertion of a new skyline tuple.
• The expiration of a skyline tuple.

LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]
• Use of a spatial index
• Advantage: simple implementation
• Disadvantage: the expiration of a skyline tuple is not handled efficiently

Page 15: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Event Approach (1)

An existing skyline tuple expires:
• How can we find new skyline tuples?
• Very costly operation

Skyline influence time (SIT)
• Minimum time in which a tuple may become a skyline tuple.
• Generate events based on SIT

Page 16: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Event Approach (2)

[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) with W=10. K.SIT=19. Tuple K can be discarded due to tuple L (younger and better).]

Eager [Tao, TKDE06]
• Advantage: handles skyline expiration
• Disadvantage: processing time per tuple

Page 17: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

n-of-N Skyline Queries (1)

[Figure (source: icde05): n-of-N definition, with S6 = {a,c} and S4 = {c,g}.]

Page 18: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

n-of-N Skyline Queries (2)

[Figure (source: icde05): n-of-N definition, with S6 = {c,h} and S4 = {e,h}.]

Page 19: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method cnN (1)

Method cnN [Lin, ICDE05] is also based on events.

[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) with W=10.]

• Tuple K is redundant because tuple L is better and younger than K.
• Tuple L is dominated by D and E.
• The dominance relation between L and E is critical because E is the youngest tuple which dominates L.

Page 20: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method cnN (2)

Generate intervals
• For the skyline tuples, e.g. C = (0,3]
• For the critical dominance relations, e.g. C -> G = (3,7]
• Use an interval-tree to store them

The dominance graph contains all the critical dominance relations.

[Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7); legend: redundant tuples, critical dominance relation.]

Page 21: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method cnN (3)

A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M-n+1, where M is the total number of elements seen so far.

To answer an n-of-N query, apply a stabbing query with the value M-n+1.

Intervals (M = 7):
• C = (0,3]
• D = (0,4]
• C -> G = (3,7]
• D -> F = (4,6]
• D -> E = (4,5]

For n = 6, M-n+1 = 2: S6 = {C, D}
For n = 4, M-n+1 = 4: S4 = {D, G}

[Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7) with the stabbing query marked.]
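The stabbing query on this slide can be reproduced with a few lines of code. The sketch below is only an illustration: it replaces the interval tree with a linear scan, the function name is hypothetical, and it assumes that the interval of a critical dominance relation a -> b is credited to the dominated tuple b, which is what the slide's answers S6 and S4 imply.

```python
def n_of_n_skyline(intervals, m, n):
    """Answer an n-of-N skyline query with a stabbing query at m - n + 1.

    intervals: (tuple_id, low, high) triples for half-open intervals (low, high].
    A linear scan stands in for the interval tree used by the method.
    """
    stab = m - n + 1
    return {tuple_id for tuple_id, low, high in intervals if low < stab <= high}


# Intervals from the slide; a relation a -> b is stored under tuple b.
intervals = [
    ("C", 0, 3),  # C = (0,3]
    ("D", 0, 4),  # D = (0,4]
    ("G", 3, 7),  # C -> G = (3,7]
    ("F", 4, 6),  # D -> F = (4,6]
    ("E", 4, 5),  # D -> E = (4,5]
]
M = 7

print(sorted(n_of_n_skyline(intervals, M, 6)))  # ['C', 'D']
print(sorted(n_of_n_skyline(intervals, M, 4)))  # ['D', 'G']
```

Because the intervals are open on the left, the value 4 stabs D = (0,4] and C -> G = (3,7] but not D -> F = (4,6], which is why F is absent from S4.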

Page 22: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method cnN (4)

Advantages
• Good use of skyline properties
• Multiple query processing

Disadvantages
• Processing time per tuple
• Increased memory requirements

Page 23: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Frequent Skyline - Motivation

• Highly dynamic environment
• The skyline results are meaningful only if the skyline tuples appear consistently
• Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

Page 24: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Streaming Model

• Client/Server architecture
• The server receives object updates from the clients.
• Each object can be represented as a d-dimensional point.
• Object update (point movement in the d-dimensional space): the value in at least one dimension changes.
• Object insertion or deletion: point movement from/to a nonexistent position.
• Objective: minimization of communication cost

Page 25: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Filter

Safe region technique
• The skyline remains unchanged if each object stays in a safe region
• Communication happens only when the safe region is violated
• The safe region approach leads to communication optimization

[Figure (source: sigmod09): an object as a point and its filter (safe region).]
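The client side of the safe region idea can be sketched as below. This is a hedged illustration only: the class `FilteredClient`, the rectangular shape of the region and the message counter are assumptions made for the example, and the way [Zhang, SIGMOD09] actually constructs filters is not reproduced here.

```python
class FilteredClient:
    """A client that reports a move only when it leaves its safe region.

    Assumption: the safe region is an axis-aligned rectangle given by its
    lower and upper corners.
    """

    def __init__(self, low, high):
        self.low = low
        self.high = high
        self.messages_sent = 0

    def inside(self, point):
        return all(l <= x <= h for l, x, h in zip(self.low, point, self.high))

    def move(self, point):
        """Return True if the move had to be reported to the server."""
        if self.inside(point):
            return False  # skyline cannot change: stay silent
        self.messages_sent += 1  # safe region violated: notify the server
        return True


client = FilteredClient(low=(0, 0), high=(2, 2))
client.move((1, 1))    # inside the region, no message
client.move((1.5, 2))  # still inside, no message
client.move((3, 1))    # violation, one message
print(client.messages_sent)  # 1
```

In a full system the server would answer a violation by assigning a new safe region, which this sketch leaves out.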

Page 26: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Sampling

• All clients report their skyline at the same sampled time
• The clients are synchronized with the same random seed
• Guaranteed quality if the sampling rate is high enough
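Sharing a random seed lets the clients agree on the sampled time instances without exchanging messages. The sketch below illustrates just that point; the class `SamplingClient`, the seed value and the sampling rate are hypothetical.

```python
import random


class SamplingClient:
    """Decides locally, at each time instance, whether to report."""

    def __init__(self, seed, rate):
        self.rng = random.Random(seed)  # private generator, shared seed
        self.rate = rate

    def reports_now(self):
        # One draw per time instance; equal seeds give equal decisions.
        return self.rng.random() < self.rate


a = SamplingClient(seed=2009, rate=0.2)
b = SamplingClient(seed=2009, rate=0.2)
decisions_a = [a.reports_now() for _ in range(100)]
decisions_b = [b.reports_now() for _ in range(100)]
print(decisions_a == decisions_b)  # True
```

Each client must draw exactly once per time instance, otherwise the generators drift apart and the sampled times no longer coincide.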

Page 27: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Hybrid

Hybrid solution
• Combines Filter and Sampling
• Small changes: apply Filter
• Larger changes: apply Sampling

Disadvantage of all three methods
• Energy consumption is not uniform (critical in sensor networks)

Page 28: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

k-dominant Skyline Query - Motivation

Skyline: contains tuples not dominated by any other tuple.

Disadvantage: High dimensionality problem.

Solution: Relax the notion of dominance.

k-dominant tuple: A tuple t k-dominates another tuple t’ if
• t is not worse than t’ in at least k dimensions and
• t is better than t’ in at least one of them.

k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]

Page 29: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

k-dominant Skyline Query - Example

    D1  D2  D3  D4  D5  D6
T1  6   5   4   3   2   1
T2  5   4   3   5   4   3
T3  6   6   2   2   6   5
T4  6   6   6   1   6   6
T5  6   6   6   5   5   5

• Conventional skyline: {T1, T2, T3, T4} (T1 dominates T5)
• 5-dominant skyline: {T1, T2, T3} (T1 5-dominates T4)
• 4-dominant skyline: {T1, T2} (T1 4-dominates T3)

The smaller k is, the fewer tuples the k-dominant skyline contains.
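The example can be verified with a brute-force implementation of k-dominance. The sketch below is an illustration with hypothetical function names; it assumes that smaller values are better, which is the reading under which the slide's three results hold.

```python
def k_dominates(a, b, k):
    """a k-dominates b: not worse in at least k dimensions, better in one of them.

    Assumption: smaller values are better. A strictly better dimension is
    also a not-worse one, so counting both separately is sufficient.
    """
    not_worse = sum(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse >= k and better


def k_dominant_skyline(data, k):
    """Names of all tuples that no other tuple k-dominates."""
    return {
        name
        for name, t in data.items()
        if not any(k_dominates(u, t, k) for u in data.values())
    }


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}

for k in (6, 5, 4):
    print(k, sorted(k_dominant_skyline(data, k)))
# 6 ['T1', 'T2', 'T3', 'T4']
# 5 ['T1', 'T2', 'T3']
# 4 ['T1', 'T2']
```

With k equal to the number of dimensions, k-dominance is conventional dominance, so the first line is the conventional skyline.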

Page 30: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Observations

• Traditional or streaming skyline methods are inappropriate
• Skyline properties do not hold
  • E.g. the transitive property
  • k-dominance can be cyclic
• Existence of multiple users and multiple queries.

Page 31: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method CoSMuQ (1)

• A query on D dimensions arrives.
• Given a parameter value k, split the query into subqueries of d=k dimensions.
• Compute the conventional skyline of each subquery.
• The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
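The decomposition works because a tuple is k-dominated exactly when it is dominated in some k-dimensional subspace. The sketch below checks this by brute force on the data of the k-dominant skyline example; it is not CoSMuQ's streaming implementation, the function names are hypothetical, and smaller values are assumed to be better.

```python
from itertools import combinations


def dominates_in(a, b, dims):
    """Conventional dominance restricted to the dimensions in dims."""
    not_worse = all(a[d] <= b[d] for d in dims)
    better = any(a[d] < b[d] for d in dims)
    return not_worse and better


def subspace_skyline(data, dims):
    """Conventional skyline of one subquery (one subset of dimensions)."""
    return {
        name
        for name, t in data.items()
        if not any(dominates_in(u, t, dims) for u in data.values())
    }


def k_dominant_by_intersection(data, k):
    """Intersect the skylines of all subqueries on k dimensions."""
    num_dims = len(next(iter(data.values())))
    result = set(data)
    for dims in combinations(range(num_dims), k):
        result &= subspace_skyline(data, dims)
    return result


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}

print(sorted(k_dominant_by_intersection(data, 5)))  # ['T1', 'T2', 'T3']
print(sorted(k_dominant_by_intersection(data, 4)))  # ['T1', 'T2']
```

A query on D dimensions produces one subquery per k-subset of its dimensions, which is consistent with the memory growth in high dimensionality noted as a disadvantage of the method.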

Page 32: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method CoSMuQ (2)

Advantages
• Based on conventional skyline (simple domination checks)
• Properties of conventional skylines can be used
• Exploits the overlap between different queries.

Disadvantages
• Memory requirements increase in high dimensionality.

Page 33: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Skyline Methods - Summary

Method               Query Type          Window Type  Multiple Queries
LookOut              skyline             time         no
Lazy and Eager       skyline             both         no
n-of-N               skyline             count        yes
Filter and Sampling  frequent skyline    time         no
CoSMuQ               k-dominant skyline  both         yes

Page 34: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Presentation Layout

• Data streams - Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary

Page 35: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Top-k Query - Example

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

Given a preference function, a top-k query returns the k tuples with the best scores.

[Figure: hotels T1 to T6 plotted by price and distance, with the results for k=1 and k=2 under F=price+distance.]
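A top-k query under the slide's preference function F = price + distance can be sketched as follows. The function name is hypothetical, and the sketch assumes that a lower score is better (cheap hotels close to the point of interest).

```python
import heapq


def top_k(data, k, score):
    """Names of the k tuples with the best (lowest) scores."""
    return heapq.nsmallest(k, data, key=lambda name: score(data[name]))


def f(point):
    """Preference function from the slide: F = price + distance."""
    price, distance = point
    return price + distance


hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

print(top_k(hotels, 1, f))  # ['T3']
```

T1 and T2 both score 5, so under this assumption the second position of a top-2 answer depends on how ties are broken.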

Page 36: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Top-k Query

Problem definition: Continuous evaluation of a top-k query in multidimensional streaming time series.

Application example: network data
• Top-100 flows with the largest individual throughput
• Common destination
• DDoS attack

Page 37: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Basic Idea

• A new tuple changes the top-k
  • It should belong in the influence region of the query
• Top-k tuple expiration
  • From scratch query computation
• TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]
  • Advantage: simple implementation
  • Disadvantage: no efficient handling of an expired top-k tuple

[Figure (source: sigmod06): influence region in the (x1, x2) space, bounded by the line defined by F = score(tk) = x1 + x2.]

Page 38: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Skyband - Example

k-skyband: contains all the tuples which are dominated by at most k-1 other tuples.

• 1-skyband: tuples not dominated by other tuples (the 1-skyband is the skyline)
• 2-skyband: tuples dominated by at most 1 other tuple
• 3-skyband: tuples dominated by at most 2 other tuples

[Figure: tuples A, B, C, D, E grouped into the 1-skyband, 2-skyband and 3-skyband.]
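The k-skyband definition extends the skyline test by counting dominators. Since the slide's tuples A to E have no coordinates in this transcript, the sketch below reuses the hotel data from the skyline example instead; the function names are hypothetical, and smaller values are assumed to be better.

```python
def dominates(a, b):
    """a is not worse than b everywhere and better somewhere (smaller is better)."""
    not_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse and better


def k_skyband(points, k):
    """Names of the tuples dominated by at most k-1 other tuples."""
    return {
        name
        for name, p in points.items()
        if sum(dominates(q, p) for q in points.values()) < k
    }


hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

print(sorted(k_skyband(hotels, 1)))  # ['T1', 'T2', 'T3']
print(sorted(k_skyband(hotels, 2)))  # ['T1', 'T2', 'T3', 'T5']
print(sorted(k_skyband(hotels, 3)))  # ['T1', 'T2', 'T3', 'T4', 'T5']
```

The skybands are nested: each k-skyband contains the (k-1)-skyband, and the 1-skyband equals the skyline.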

Page 39: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Skyband Approach (1)

• Transform tuples into the (score, expiration_time) space
• Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.
• Dominance counter (DC): number of tuples that are younger and better
• Rule: Keep tuples with DC < k

Scores for F=price+distance:

Tuple  score
T1     5
T2     5
T3     3.5
T4     7
T5     5.5
T6     8.5

[Figure: tuples T1 to T6 in the original space (price, distance) and in the transformed space (exp_time, score); three tuples are annotated with DC=0 and three with DC=1, and the top-1 tuple is marked.]
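The dominance counter and the pruning rule can be sketched as below. The scores come from the slide, but the expiration times are an assumption made for illustration (tuples are taken to expire in the order T1 to T6), as is the convention that a lower score is better. The function names are hypothetical.

```python
def dominance_counters(tuples):
    """DC of each tuple: how many tuples are better and expire later.

    tuples maps a name to (score, exp_time). Assumption: lower score is better.
    """
    return {
        name: sum(
            1 for s2, e2 in tuples.values() if s2 < score and e2 > exp_time
        )
        for name, (score, exp_time) in tuples.items()
    }


def skyband_candidates(tuples, k):
    """Rule from the slide: keep only the tuples with DC < k."""
    dc = dominance_counters(tuples)
    return {name for name in tuples if dc[name] < k}


# Scores from the slide; expiration times 1..6 are hypothetical.
tuples = {
    "T1": (5, 1), "T2": (5, 2), "T3": (3.5, 3),
    "T4": (7, 4), "T5": (5.5, 5), "T6": (8.5, 6),
}

print(dominance_counters(tuples))
# {'T1': 1, 'T2': 1, 'T3': 0, 'T4': 1, 'T5': 0, 'T6': 0}
print(sorted(skyband_candidates(tuples, 1)))  # ['T3', 'T5', 'T6']
```

For k=1 the kept tuples are exactly the ones that will be the top-1 at some point: T3 now, T5 once T3 expires, and T6 once T5 expires.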

Page 40: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Skyband Approach (2)

SMA (Skyband Monitoring Algorithm), proposed in [Mouratidis, SIGMOD06]
• Advantage: independent of the dimensionality
  • 2-dimensional space (score, exp_time)
• Disadvantage: the k-skyband may contain fewer than k tuples
  • In this case, a top-k tuple expiration will cause query computation from scratch

Page 41: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Distributed Top-k

• Continuously report the k largest values obtained from distributed data streams.
• The objective is to minimize communication cost
• Proposed by [Babcock, SIGMOD03]

Page 42: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Streaming Model

• Nodes: N1, N2, …, Nm; coordinator node: N0
• Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn
• Value updates are represented as <Oi, Nj, Δ> tuples: Nj detects a change in the value Vi of Oi.
  • The change is not seen by other nodes Nk (k ≠ j)
• The value Vi for an object Oi: Vi = Σj Vi,j, where Vi,j is the value of the i-th object in the j-th node

Page 43: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method (1)

• Initialize a top-k set at the coordinator node
• Set arithmetic constraints at monitor nodes
  • They depend on the current top-k set
• Constraints valid
  • No communication
• Constraints invalidated
  • Client communicates with server
  • Possibly new top-k set
  • Recomputation of constraints

Page 44: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Method (2) - Adjustment Factors

Without adjustment factors:
• V1,1 = 1, V2,1 = 9, V1,2 = 3, V2,2 = 1
• Top-1 = {O1}
• Node 1, Local Top-1 = {O1}
• Node 2, Local Top-1 = {O2}
• Local top-ks differ from the global top-k => unnecessary constraint violations => increased communication cost

With Adjustment Factors (AF):
• Node 2: V1,2 = 3+0 = 3
• Node 2: V2,1 = 1+3 = 4
• Local top-k similar to global => low communication cost
• To keep the results valid, the AF for each object sum to zero

[Figure: values of Object 1 and Object 2 at Node 1 and Node 2, with adjustment factors 0, -3, 0 and 3.]

Disadvantage: Energy consumption is not uniform
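The role of adjustment factors can be illustrated with a small example. Everything in the sketch below is hypothetical: the object names A and B, the node names, the values and the factors are made up for illustration and are not the numbers of the slide, and a coordinator-side factor is not modelled.

```python
# Hypothetical partial values V[node][object]; larger totals rank higher.
values = {"N1": {"A": 10, "B": 2}, "N2": {"A": 2, "B": 5}}

# Hypothetical adjustment factors; per object they must sum to zero.
factors = {"N1": {"A": -4, "B": 0}, "N2": {"A": 4, "B": 0}}


def global_top1(values):
    """Object with the largest total value over all nodes."""
    totals = {}
    for per_node in values.values():
        for obj, v in per_node.items():
            totals[obj] = totals.get(obj, 0) + v
    return max(totals, key=totals.get)


def local_top1(values, factors, node):
    """Object with the largest adjusted value at one node."""
    adjusted = {obj: v + factors[node][obj] for obj, v in values[node].items()}
    return max(adjusted, key=adjusted.get)


no_factors = {node: {obj: 0 for obj in objs} for node, objs in values.items()}

print(global_top1(values))                    # A
print(local_top1(values, no_factors, "N2"))   # B  (disagrees with global)
print(local_top1(values, factors, "N2"))      # A  (agrees after adjustment)
```

Because the factors of each object sum to zero, the adjusted values still add up to the true totals, so agreement of the local rankings with the global one remains a valid constraint.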

Page 45: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Uncertain Data

Tuples:

Score  Prob
6      0.8
5      0.5
2      0.4
8      0.4

16 possible worlds:

Tuples      Pr.
2, 5, 6, 8  .064
2, 5, 6     .096
2, 6, 8     .064
2, 6        .096
2, 5, 8     .016
2, 5        .024
2, 8        .016
2           .024
5, 6, 8     .096
5, 6        .144
5, 8        .024
5           .036
6, 8        .096
6           .144
8           .024
Empty       .036

Pk-topk query: returns the k tuples with the highest probability of being in the top-k.
Top-2: {6, 5} with prob. {0.64, 0.5}

To compute the probability of 6, sum the probabilities of the worlds in which 6 is in the top-2.

source: pvldb08
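The probabilities on this slide can be reproduced by enumerating the possible worlds. The sketch below is a brute-force illustration, not the compact-set method of [Jin, PVLDB08]. It assumes independent tuples with distinct scores and that a smaller score ranks higher, which is the reading under which the slide's values 0.64 and 0.5 come out. The function name is hypothetical.

```python
from itertools import product


def topk_probabilities(tuples, k):
    """Probability of each tuple being among the top-k of a possible world.

    tuples maps a (distinct) score to its existence probability.
    Assumptions: tuples are independent; a smaller score ranks higher.
    """
    scores = sorted(tuples)
    prob = dict.fromkeys(scores, 0.0)
    for present in product([False, True], repeat=len(scores)):
        world_prob = 1.0
        world = []
        for s, here in zip(scores, present):
            world_prob *= tuples[s] if here else 1 - tuples[s]
            if here:
                world.append(s)
        for s in world[:k]:  # world is sorted, so these are its top-k
            prob[s] += world_prob
    return prob


tuples = {6: 0.8, 5: 0.5, 2: 0.4, 8: 0.4}
probs = topk_probabilities(tuples, 2)
answer = sorted(probs, key=probs.get, reverse=True)[:2]

print(answer)  # [6, 5]
print(round(probs[6], 2), round(probs[5], 2))  # 0.64 0.5
```

Tuple 6 misses the top-2 only when both 2 and 5 are present, so its probability is 0.8 × (1 - 0.4 × 0.5) = 0.64. Enumeration costs 2^n worlds, which is why a space-efficient method is needed on streams.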

Page 46: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Pk-topk Query

Solution proposed by [Jin, PVLDB08]
• Compact set based
• Space-efficient solution
  • Discard unnecessary tuples and
  • Apply several compression schemes to compress the data

Disadvantages
• Model assumption: tuple probabilities are assumed to be random and independent of each other.

Page 47: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Top-k Methods - Summary

Method             Query Type         Window Type  Multiple Queries
TMA and SMA        top-k              both         yes
Distributed top-k  Distributed top-k  time         no
Compact set based  Pk-topk            both         no

Page 48: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Presentation Layout

• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary

Page 49: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Top-k Dominating Query - Example

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

Skyline: contains all the tuples not dominated by any other tuple.
• Disadvantage: High dimensionality problem.

Top-k: Given a preference function, a top-k query returns the k tuples with the best scores.
• Disadvantage: user-defined preference function.

Top-k dominating: the answer contains the k tuples with the highest domination power.
• Combines the advantages of skyline and top-k queries and avoids their disadvantages.

[Figure: hotels T1 to T6 plotted by price and distance; labels: k=1, k=2, F=price+distance.]
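A top-k dominating query can be sketched on the hotel data of this slide. The sketch takes the domination power of a tuple to be the number of tuples it dominates, assumes that smaller values are better, and breaks ties by name; the function names are hypothetical.

```python
def dominates(a, b):
    """a is not worse than b everywhere and better somewhere (smaller is better)."""
    not_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse and better


def domination_power(points):
    """Number of tuples that each tuple dominates."""
    return {
        name: sum(dominates(p, q) for q in points.values())
        for name, p in points.items()
    }


def top_k_dominating(points, k):
    """The k tuples with the highest domination power (ties broken by name)."""
    power = domination_power(points)
    return sorted(power, key=lambda name: (-power[name], name))[:k]


hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

print(domination_power(hotels))
# {'T1': 0, 'T2': 1, 'T3': 3, 'T4': 1, 'T5': 2, 'T6': 0}
print(top_k_dominating(hotels, 2))  # ['T3', 'T5']
```

T5 is in the answer although it is not a skyline tuple, and the skyline tuple T1 dominates nothing. No preference function is needed, and the size of the answer is fixed by k.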

Page 50: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Continuous Top-k Dominating Query

Problem definition: Continuous evaluation of a top-k dominating query in multidimensional streaming time series.

Application example: sensor network
• Areas with high probability of fire outbreak
• Temperature, humidity and wind speed

Page 51: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

EVA

• Objective: reduce domination checks
• Safe interval of a tuple
  • Ignore the tuple for this interval
  • It depends on its score and the k-th score
• End of safe interval -> event
• Event
  • Try to compute a new safe interval, else
  • Compute the score from scratch
• New tuple
  • Find another tuple that dominates the new one
  • Estimate a lower bound of the safe interval

Page 52: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

ADA

• Advanced computation of the safe interval
  • Depends on the number of tuples that dominate this tuple and expire later
• Candidate tuples
  • Tuples with scores close to the k-th score are updated in each time instance

EVA and ADA were proposed by [Kontaki 2009]

Page 53: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Presentation Layout

• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary

Page 54: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Summary

• Preference queries are very useful in data streams
• Presented state-of-the-art methods
  • For continuous skyline queries
  • For continuous top-k queries
  • For continuous top-k dominating queries
• Examined advantages and disadvantages of the proposed methods

Page 55: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Research Directions

• Continuous subspace skyline queries
• Solutions appropriate for distributed environments
  • Uniform energy consumption
• Approximate algorithms
• Existence of multiple queries

Page 56: Continuous Processing  of  Preference Queries  in  Data Streams : a Survey

Thank you