A D - HOC TOP - K QUERY ANSWERING FOR DATA STREAMS GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS...
-
date post
19-Dec-2015 -
Category
Documents
-
view
212 -
download
0
Transcript of A D - HOC TOP - K QUERY ANSWERING FOR DATA STREAMS GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS...
1
AD-HOC TOP-K QUERY ANSWERING FOR DATA STREAMS
GAUTAM DAS DIMITRIOS GUNOPULOS NICK KOUDAS NIKOS SARKAS
CSE 6339: DATA EXPLORATION
Anitha Josephine Royappan
2
OUTLINE
Introduction Data Stream DBMS VS DSMS
Top-k query answering in Primal Plane Primal Plane Arrangements
Top-k query answering in Dual Plane Dual Plane Operations
Tuple pruning Principles Implementation
Experimental Evaluation Conclusion
3
DATA STREAM
Traffic monitoring Applications : To monitor traffic slowdown or accidents using data sent by each car on the road every few seconds or minutes.
Sensor Monitoring Applications: Monitoring room for switching on AC, heater, light, fire alarm based on data received from sensors.
4
DATA STREAM
Needs to process data at a high input-rate. Data are processed continuously over a long period of time and data is obtained in the form of DATA STREAM.
Data Stream: Unbounded (or never-ending) sequence of DATA ITEMS that are usually ordered.
5
CHARACTERISTICS OF A DATA STREAM
Continuous arrival in multiple, rapid, high rates, possibly unpredictable and unbounded streams
Data items belonging to the same data stream are usually processed in the order they arrive.
A main memory buffer maintains incoming tuples.
Data streams are usually generated by external sources or other applications and are sent to a Data Stream management System (DSMS).
6
DATA STREAM
Large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed.
DBMS’s are not designed for rapid and continuous loading of individual data items.
One time Queries: Evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user.
7
DATA STREAM
Continuous Queries: Evaluated continuously as data streams continue
to arrive. Predefined Queries:
Queries are fixed and Data keeps changing. Ad-hoc Queries:
Both Queries and Data keeps changing Continuous queries are not supported. DSMS: Data Stream Management System
8
DATA STREAM PROCESSING USING DBMS (RFID)
Increased response time
9
DATA STREAM PROCESSING USING DSMS IN RFID
Date rate is high, Data is stored in buffer.
Incoming streams are processed directly by a DSMS.
Decrease in response time, data may also be persisted for archival
10
DBMS VS DSMS
DBMS Persistent One-time queries Random access
DSMS Transient Continuous
queries Sequential access
11
BUFFER MANAGEMENT FOR INCOMING TUPLES
Memory space is limited, so aged tuples are removed in order to free space for fresh incoming tuples.
Sliding window of a specific size W. Time based (timestamp) Tuple count-based (recent N records)
Example: K highest ranking tuples according to temperature, humidity and both.
Rank tuples in a buffer according to ad-hoc preferences towards their attributes.
12
TECHNIQUE TO SUPPORT EFFICIENT EVALUATION OF AD-HOC TOP-K QUERIES
Novel geometric representation that utilizes arrangement of geometric objects to perform indexing and ad-hoc top-k query answering.
Algorithm for updating and querying. Tuple pruning to minimize the number of
tuple need to be indexed. Using real and synthetic data set we evaluate
the performance of this technique.
13
TOP-K QUERY PROBLEM AND SOLUTION
A top-k query retrieves the k highest scoring tuples from a data set with respect to a scoring function defined on the attributes of a tuple.
This is not directly applicable to highly dynamic environments and on-line applications, like data streams. (predefined queries)
Introduce a novel geometric representation for the top-k query problem based on DCEL.
Query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries
14
TOP K ANSWERING IN THE PRIMAL PLANE
Data Set D = {t1 , t2, t3, . . . , tn )
d numeric attributes = {x1, x2, x3, . . , xd }
Domi : domain of ith attribute Consider unit interval [0,1] Tuple t (t. x1, . . . , t. xd) Top-k query: Q = ( S , k ) Scoring function S: Dom1 x … x Domd
Consider d=2
PRIMAL PLANE
0
1
1
P1P2
P3
x
y
w
w’ Additive Scoring function:
Scoring function
Scoring function
16
ARRANGEMENT OF LINES
The arrangement A(S) of a finite collection S of geometric objects is the decomposition of the d-dimensional space into connected open cells of dimensions 0, . . . , d induced by S
Geometric objects: Lines, curves, hyper planes, triangles, circles
Cell Range: 0 to d l-cell : cell of dimensionality l ( 0 ≤ l ≤ d ) Vertices, Edges and Faces d-cell: cell of maximum dimensionality (l = d)
17
EXAMPLE
3 Vertices, 9 Edges, 7 Faces
18
COMBINATORIAL COMPLEXITY
The combinatorial complexity of an arrangement is the overall number of cells of all dimensions in the arrangement.
The combinatorial complexity of an l-cell is the number of cells of the arrangement of dimension less than l that are contained in the boundary of the cell.
zone of a surface : set of d-cells intersecting the surface.
19
COMBINATORIAL COMPLEXITY
The faces are convex, but can be unbounded .
An arrangement of n lines is composed of O(n2) vertices, O(n2) edges and O(n2) faces
Theorem
Complexity results
20
REPRESENTATION (DCEL)
21
REPRESENTATION
Each halfedge maintains a pointer to its twin, e.g., e2 to e9 and vice versa.
For each vertex, a circular list of the incident halfedges is maintained, in clockwise order. For example vertex v3 maintains a list with edges (e2 , e8, e6, e4).
Each halfedge has pointers to its source and target vertices, for example edge e1 to vertices v1 and v2.
Each halfedge stores a pointer to its incident face, e.g., edges e1, e2, e3 store a link to face f1.
For each face, the halfedges forming its boundary are organized in a doubly connected circular list.
The total space complexity of the data structure : O(n2)
22
INDEXING FOR TOP-K QUERY ANSWERING IN THE DUAL PLANE
Each tuple t = (x1, x2) is mapped to a line et : y = (1 − x2)x + (1 − x1) in the dual plane.
A query Q can be represented as a point p(Q) = (w2 / w1 , 0), where w1, w2 are the weights of its scoring function.
Theorem 2
23
TOP-K QUERY ANSWERING IN THE DUAL PLANE
24
TOP-K QUERY ANSWERING IN THE DUAL PLANE
Solution for top-k query answering problem. Map the tuples of a data set to lines in the dual
plane. Maintains the rankings of all possible top-k
queries in a non-redundant, self-organizing manner.
All queries that produce the exact same tuple ranking are mapped to a continuous interval on the x axis and use the same part of the arrangement to retrieve that ranking.
An intersection signifies a change in the ranking of two tuples.
Self organizing for line insertion and deletion.
25
OPERATING ON THE ARRANGEMENT
Arrangement Representation
26
ARRANGEMENT REPRESENTATION
The points representing a query can only lie in the positive part of the x axis of the dual plane.
The domain of the tuples is the unit square, the lines that result after the mapping to the dual plane are of the form y = ax + b, where 0 ≤ a, b ≤ 1.
The selected mapping places all the elements of the top-k query answering problem in the positive quadrant of the dual plane.
Bound frame of dimensions [0,M] × [0,M + 1]
27
TOP-K RETRIEVAL
28
29
LINE INSERTION
30
31
INDEXING
The complexity of the main loop of the insertion procedure is determined by the number of the arrangement’s edges that must be traversed and the number of new edges that must be inserted.
An edge insertion is an O(1) operation. Since a line can intersect with up to n other lines, the cost for inserting the new edges is O(n).
The complexity of the line’s zone is of size O(n).
The space complexity of the solution is O(n2) and the cost of the query answering, insert and delete operations is O(n).
32
TUPLE PRUNING
Minimizes the number of tuples that need to be indexed, while maintaining the capability to correctly answer any top-k query.
Let Q: top-k query. Rk(Q) the point in the dual plane where a ray
shooting upwards from p(Q) meets the k-th line.
Let Q1 < Q2 denote the fact that p(Q1) lies left of p(Q2) and let [Q1, Q2] be the interval along the x axis of the dual plane between p(Q1) and p(Q2).
33
LEMMA 1
Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l1(Q1, Q2) be the line in the dual plane that passes through the origin and Rk(Q1). Then, any line that is located above l1(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2]
34
LEMMA 2
Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also l2(Q1, Q2) be the horizontal line in the dual plane that passes through Rk(Q2). Then, any line that is located above l2(Q1, Q2) in the interval [Q1, Q2] cannot be in the result of any top-k query that lies inside [Q1, Q2].
35
THEOREM 3
Let Q1, Q2 be two top-k queries such that Q1 < Q2. Let also I(Q1, Q2) be the intersection point of lines l1(Q1, Q2) (Lemma 1) and l2(Q1, Q2) (Lemma 2). We refer to this point as the pruning point. Then, any line that is located above I(Q1, Q2) cannot be in the result of any top-k query that lies inside [Q1, Q2].
36
PRUNING
Given two top-k queries Q1, Q2 and their result, we can filter out a portion of the data set D that is definitely irrelevant to any top-k query in [Q1,Q2].
In other words, only the part of the data set not pruned, denoted by D∗, needs to be stored in the arrangement.
Top-k query in [Q1,Q2] can be answered by requesting up to a number of results determined by the number of results (K) we choose to return for queries Q1, Q2.
K determines the position of the pruning point.
37
PRUNING
Consider a set B of m + 1 top-k queries B = {B1, . . . ,Bm+1}, such that Bi < Bj for i <
j and p(B1) = (0, 0), p(Bm+1) = (M, 0). We will refer to those queries as borders.
Treating each border as a query, we can compute the query result for each border top-k query.
Pruning point is computed I(Si) = I(Bi,Bi+1) for each strip and identify the part of the full data set D that we need to use in order to be able to answer any top-k query that lies inside a strip. We denote the filtered data set associated with strip Si by D∗i .
38
BORDERS
39
BORDERS
Full Arrangement Strip arrangement Each strip is responsible for answering queries
that lie between two borders Bi and Bi+1, we only need to construct and maintain the arrangement in the interval [Bi,Bi+1].
This effect greatly reduces the arrangement complexity.
The complexity of arrangement operations is reduced to O(|D∗|) instead of O(n), n being the size of the buffer. In the case of uncorrelated data, the cost of arrangement operations is only O(k ln n).
40
EXPERIMENTAL EVALUATIONS
Pruning Efficiency
41
EVALUATING THE PERFORMANCE
42
CONCLUSION
Primal Plane Dual Plane Use of arrangements Tuple Pruning Technique Borders
43
REFERENCES Models and Issues in Data Stream Systems by Brian
Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom
Stream Data Processing: A Quality of Service Perspective Modeling, Scheduling, Load Shedding, and Complex Event Processing by Sharma Chakravarthy.
http://www.vldb.org/archives/website/2007/program/slides/s183-das.pdf
44
QUESTIONS ? ? ? ?
THANK YOU