A D - HOC TOP - K QUERY ANSWERING FOR DATA STREAMS GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS...

1

AD-HOC TOP-K QUERY ANSWERING FOR DATA STREAMS

GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS SARKAS

CSE 6339: DATA EXPLORATION

Anitha Josephine Royappan

2

OUTLINE

Introduction Data Stream DBMS VS DSMS

Top-k query answering in Primal Plane Primal Plane Arrangements

Top-k query answering in Dual Plane Dual Plane Operations

Tuple pruning Principles Implementation

Experimental Evaluation Conclusion

3

DATA STREAM

Traffic monitoring Applications : To monitor traffic slowdown or accidents using data sent by each car on the road every few seconds or minutes.

Sensor Monitoring Applications: Monitoring room for switching on AC, heater, light, fire alarm based on data received from sensors.

4

DATA STREAM

Needs to process data at a high input-rate. Data are processed continuously over a long period of time and data is obtained in the form of DATA STREAM.

Data Stream: Unbounded (or never-ending) sequence of DATA ITEMS that are usually ordered.

5

CHARACTERISTICS OF A DATA STREAM

Continuous arrival in multiple, rapid, high rates, possibly unpredictable and unbounded streams

Data items belonging to the same data stream are usually processed in the order they arrive.

A main memory buffer maintains incoming tuples.

Data streams are usually generated by external sources or other applications and are sent to a Data Stream management System (DSMS).

6

DATA STREAM

Large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed.

DBMS’s are not designed for rapid and continuous loading of individual data items.

One time Queries: Evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user.

7

DATA STREAM

Continuous Queries: Evaluated continuously as data streams continue

to arrive. Predefined Queries:

Queries are fixed and Data keeps changing. Ad-hoc Queries:

Both Queries and Data keeps changing Continuous queries are not supported. DSMS: Data Stream Management System

8

DATA STREAM PROCESSING USING DBMS (RFID)

Increased response time

9

DATA STREAM PROCESSING USING DSMS IN RFID

Date rate is high, Data is stored in buffer.

Incoming streams are processed directly by a DSMS.

Decrease in response time, data may also be persisted for archival

10

DBMS VS DSMS

DBMS Persistent One-time queries Random access

DSMS Transient Continuous

queries Sequential access

11

BUFFER MANAGEMENT FOR INCOMING TUPLES

Memory space is limited, so aged tuples are removed in order to free space for fresh incoming tuples.

Sliding window of a specific size W. Time based (timestamp) Tuple count-based (recent N records)

Example: K highest ranking tuples according to temperature, humidity and both.

Rank tuples in a buffer according to ad-hoc preferences towards their attributes.

12

TECHNIQUE TO SUPPORT EFFICIENT EVALUATION OF AD-HOC TOP-K QUERIES

Novel geometric representation that utilizes arrangement of geometric objects to perform indexing and ad-hoc top-k query answering.

Algorithm for updating and querying. Tuple pruning to minimize the number of

tuple need to be indexed. Using real and synthetic data set we evaluate

the performance of this technique.

13

TOP-K QUERY PROBLEM AND SOLUTION

A top-k query retrieves the k highest scoring tuples from a data set with respect to a scoring function defined on the attributes of a tuple.

This is not directly applicable to highly dynamic environments and on-line applications, like data streams. (predefined queries)

Introduce a novel geometric representation for the top-k query problem based on DCEL.

Query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries

14

TOP K ANSWERING IN THE PRIMAL PLANE

Data Set D = {t1 , t2, t3, . . . , tn )

d numeric attributes = {x1, x2, x3, . . , xd }

Domi : domain of ith attribute Consider unit interval [0,1] Tuple t (t. x1, . . . , t. xd) Top-k query: Q = ( S , k ) Scoring function S: Dom1 x … x Domd

Consider d=2

PRIMAL PLANE

0

1

1

P1P2

P3

x

y

w

w’ Additive Scoring function:

Scoring function

Scoring function

16

ARRANGEMENT OF LINES

The arrangement A(S) of a finite collection S of geometric objects is the decomposition of the d-dimensional space into connected open cells of dimensions 0, . . . , d induced by S

Geometric objects: Lines, curves, hyper planes, triangles, circles

Cell Range: 0 to d l-cell : cell of dimensionality l ( 0 ≤ l ≤ d ) Vertices, Edges and Faces d-cell: cell of maximum dimensionality (l = d)

17

EXAMPLE

3 Vertices, 9 Edges, 7 Faces

18

COMBINATORIAL COMPLEXITY

The combinatorial complexity of an arrangement is the overall number of cells of all dimensions in the arrangement.

The combinatorial complexity of an l-cell is the number of cells of the arrangement of dimension less than l that are contained in the boundary of the cell.

zone of a surface : set of d-cells intersecting the surface.

19

COMBINATORIAL COMPLEXITY

The faces are convex, but can be unbounded .

An arrangement of n lines is composed of O(n2) vertices, O(n2) edges and O(n2) faces

Theorem

Complexity results

20

REPRESENTATION (DCEL)

21

REPRESENTATION

Each halfedge maintains a pointer to its twin, e.g., e2 to e9 and vice versa.

For each vertex, a circular list of the incident halfedges is maintained, in clockwise order. For example vertex v3 maintains a list with edges (e2 , e8, e6, e4).

Each halfedge has pointers to its source and target vertices, for example edge e1 to vertices v1 and v2.

Each halfedge stores a pointer to its incident face, e.g., edges e1, e2, e3 store a link to face f1.

For each face, the halfedges forming its boundary are organized in a doubly connected circular list.

The total space complexity of the data structure : O(n2)

22

INDEXING FOR TOP-K QUERY ANSWERING IN THE DUAL PLANE

Each tuple t = (x1, x2) is mapped to a line et : y = (1 − x2)x + (1 − x1) in the dual plane.

A query Q can be represented as a point p(Q) = (w2 / w1 , 0), where w1, w2 are the weights of its scoring function.

Theorem 2

23

TOP-K QUERY ANSWERING IN THE DUAL PLANE

24

TOP-K QUERY ANSWERING IN THE DUAL PLANE

Solution for top-k query answering problem. Map the tuples of a data set to lines in the dual

plane. Maintains the rankings of all possible top-k

queries in a non-redundant, self-organizing manner.

All queries that produce the exact same tuple ranking are mapped to a continuous interval on the x axis and use the same part of the arrangement to retrieve that ranking.

An intersection signifies a change in the ranking of two tuples.

Self organizing for line insertion and deletion.

25

OPERATING ON THE ARRANGEMENT

Arrangement Representation

26

ARRANGEMENT REPRESENTATION

The points representing a query can only lie in the positive part of the x axis of the dual plane.

The domain of the tuples is the unit square, the lines that result after the mapping to the dual plane are of the form y = ax + b, where 0 ≤ a, b ≤ 1.

The selected mapping places all the elements of the top-k query answering problem in the positive quadrant of the dual plane.

Bound frame of dimensions [0,M] × [0,M + 1]

27

TOP-K RETRIEVAL

29

LINE INSERTION

31

INDEXING

The complexity of the main loop of the insertion procedure is determined by the number of the arrangement’s edges that must be traversed and the number of new edges that must be inserted.

An edge insertion is an O(1) operation. Since a line can intersect with up to n other lines, the cost for inserting the new edges is O(n).

The complexity of the line’s zone is of size O(n).

The space complexity of the solution is O(n2) and the cost of the query answering, insert and delete operations is O(n).

32

TUPLE PRUNING

Minimizes the number of tuples that need to be indexed, while maintaining the capability to correctly answer any top-k query.

Let Q: top-k query. Rk(Q) the point in the dual plane where a ray

shooting upwards from p(Q) meets the k-th line.

Let Q1 < Q2 denote the fact that p(Q1) lies left of p(Q2) and let [Q1, Q2] be the interval along the x axis of the dual plane between p(Q1) and p(Q2).

33

LEMMA 1

Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l1(Q1, Q2) be the line in the dual plane that passes through the origin and Rk(Q1). Then, any line that is located above l1(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2]

34

LEMMA 2

Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l2(Q1, Q2) be the horizontal line in the dual plane that passes through Rk(Q2). Then, any line that is located above l2(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2].

35

THEOREM 3

Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also I(Q1, Q2) be the intersection point of lines l1(Q1, Q2) (Lemma 1) and l2(Q1, Q2) (Lemma 2). We refer to this point as the pruning point. Then, any line that is located above I(Q1, Q2) cannot be in the result of any top-k query that lies inside [Q1, Q2].

36

PRUNING

Given two top-k queries Q1, Q2 and their result, we can filter out a portion of the data set D that is definitely irrelevant to any top-k query in [Q1,Q2].

In other words, only the part of the data set not pruned, denoted by D∗, needs to be stored in the arrangement.

Top-k query in [Q1,Q2] can be answered by requesting up to a number of results determined by the number of results (K) we choose to return for queries Q1, Q2.

K determines the position of the pruning point.

37

PRUNING

Consider a set B of m + 1 top-k queries B = {B1, . . . ,Bm+1}, such that Bi < Bj for i <

j and p(B1) = (0, 0), p(Bm+1) = (M, 0). We will refer to those queries as borders.

Treating each border as a query, we can compute the query result for each border top-k query.

Pruning point is computed I(Si) = I(Bi,Bi+1) for each strip and identify the part of the full data set D that we need to use in order to be able to answer any top-k query that lies inside a strip. We denote the filtered data set associated with strip Si by D∗i .

38

BORDERS

39

BORDERS

Full Arrangement Strip arrangement Each strip is responsible for answering queries

that lie between two borders Bi and Bi+1, we only need to construct and maintain the arrangement in the interval [Bi,Bi+1].

This effect greatly reduces the arrangement complexity.

The complexity of arrangement operations is reduced to O(|D∗|) instead of O(n), n being the size of the buffer. In the case of uncorrelated data, the cost of arrangement operations is only O(k ln n).

40

EXPERIMENTAL EVALUATIONS

Pruning Efficiency

41

EVALUATING THE PERFORMANCE

42

CONCLUSION

Primal Plane Dual Plane Use of arrangements Tuple Pruning Technique Borders

43

REFERENCES Models and Issues in Data Stream Systems by Brian

Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom

Stream Data Processing: A Quality of Service Perspective Modeling, Scheduling, Load Shedding, and Complex Event Processing by Sharma Chakravarthy.

http://www.vldb.org/archives/website/2007/program/slides/s183-das.pdf

44

QUESTIONS ? ? ? ?

THANK YOU

A D - HOC TOP - K QUERY ANSWERING FOR DATA STREAMS GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS...

Documents

Transcript of A D - HOC TOP - K QUERY ANSWERING FOR DATA STREAMS GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS...