Categorical Skylines for Streaming Categorical Skylines for Streaming DataData
Center for E-Business TechnologySeoul National University
Seoul, Korea
Nam, Kwang-hyun
Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea
Nikos Sarkas, Gautam Das, Nick Koudas and Anthony K. H. Tunge
Univ. of Toronto, Univ. of Texas at Arlington,and National Univ. of Singapore
SIGMOD 2008
Copyright 2008 by CEBT
ContentsContents
Introduction
Motivation
Background
Efficient skyline maintenance for categorical tuples
Experimental Evaluation
Conclusions
IDS Lab Seminar (User Log Analysis) - 2
Copyright 2008 by CEBT
IntroductionIntroduction
Current situation
It has become increasingly difficult to process abundant data in order to isolate useful and relevant information
To facilitate the exploration of a data space
One such successful tool is the skyline query
The skyline of a data set
– The subset of tuples which are not dominated on all of their attributes by any other tuple
– The tuples that have a uniquely interesting combination of attribute values that no other tuple can match.
IDS Lab Seminar (User Log Analysis) - 3
Copyright 2008 by CEBT
MotivationMotivation
Recent work
Considered the on-demand evaluation of skyline queries on tuples with partially ordered categorical attributes
But in an offline environment
– Inappropriate for a highly dynamic data streaming environment
Objective
To identify and study the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes
To realize two novel techniques
– Constitute the building blocks of STARS (STreaming ARrangement Skyline) to solve the problem.
Database Tuning - 4
Copyright 2008 by CEBT
BackgroundBackground
A limited-capacity, sliding-window buffer B in memory
To only store the n most recent tuples from the stream
The contents of the buffer constitute a data set denoted by D
The data set is comprised of n tuples with d categorical attributes
The domain Domi of each the attributes is partially ordered
Partial order (an antisymmetric preorder)
A binary relation"≤" over a set P which is reflexive, antisymmetric, and transitive, i.e., for all a, b, and c in P, we have that:
– a ≤ a (reflexivity)
– if a ≤ b and b ≤ a then a = b (antisymmetry)
– if a ≤ b and b ≤ c then a ≤ c (transitivity)
Poset (Partially ordered set)
– A set with a partial order
– Commonly represented as DAG (directed acyclic graph)
Database Tuning - 5
Copyright 2008 by CEBT
BackgroundBackground
Example 1
Value a and b -> ‘Comparable and a dominates b’
Value b and c -> ‘Incomparable’
Assume tuples comprised of d categorical attributes
Tuple t1 dominates tuple t2 ( t2 < t1 or t1 > t2 )
– If for every attribute Xi, t2.Xi ≤i t1.Xi and there is at least one
attribute Xj such that t2.Xj < t1.Xj
Tuples t1 and t2 do not dominate one another
– Call this situation ‘tied’ ( t1 ~ t2)
– Then, the skyline of data set D is the subset of all tuples that are not dominated by any other tuple
Database Tuning - 6
• Vertex : Domain value• Directed edge : For each pair of comparable values• Vertex : Domain value• Directed edge : For each pair of comparable values
Copyright 2008 by CEBT
BackgroundBackground
Example 2
Consider three tuples t1, t2 and t3 with two categorical attributes
The domain of both attributes is the poset of Figure 1
Assume t1 = (a,b), t2 = (b,d), and t3 = (b,c)
Skyline maintenance solution
Checking whether a tuple a dominated by the current skyline
Retrieving the tuples in the buffer that are dominated by the outgoing skyline tuple
– Since only these tuples are candidates for entering the skyline
Lemma 1
Let t1,t2∈B be two tuples so that t1 < t2. Then, if t2 arrived after
t1 in the stream, t1 will never be in the skyline of B
Database Tuning - 7
i) t2 < t1 ∵ b < a and d < bii) t2 < t3 ∵ b = b and d < ciii) t1 ~ t3 ∵ b < a, but b ~ c∴ The skyline of this data set consists of tuples t1 and t3.
i) t2 < t1 ∵ b < a and d < bii) t2 < t3 ∵ b = b and d < ciii) t1 ~ t3 ∵ b < a, but b ~ c∴ The skyline of this data set consists of tuples t1 and t3.
Copyright 2008 by CEBT
Efficient skyline maintenance for categorical Efficient skyline maintenance for categorical tuplestuples
Two novel techniques for realizing the building blocks of the skyline maintenance frame work
Indexing the skybuffer (the relevant part of the buffer for skyline)
– To enable efficiently identify the skybuffer tuples dominated by a query tuple
Organizing the skyline
– To enable rapidly answer whether a query tuple is dominated by the skyline
Database Tuning - 8
Copyright 2008 by CEBT
Topological sortingTopological sorting
Topological sort
A numbering of the vertices of a DAG
– Every edge from a vertex numbered i to a vertex numbered j satisfies i < j
If for two values a, b, a > b, then a will appear before b.
– For a given value x, all values dominated by x will appear after it in the linear order, while x can never dominate any value that precedes it in the ordering
Definition 1
Let v be a value of a partially-ordered domain. We denote by r(v) the integer corresponding to v’s position in a ceratin topological sort of the domain
Database Tuning - 9
Copyright 2008 by CEBT
Topological sortingTopological sorting
Lemma 2
Let v1, … , vm be the m values of a partially-ordered domain
Dom. Then vi > vj only if r(vi) < (vj)
Lemma 3
Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, …, Xd. Then t1 > t2 only if r(t1.Xi) < r(t2.Xi),
1≤i≤d
Lemma 4
Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, … Xd. If ∃i, j such that r(t1.Xi) < r(t2.Xi) and
r(t1.Xj) > r(t2.Xj), then t1 ~ t2
Database Tuning - 10
Copyright 2008 by CEBT
Organizing the skybuffer tuples in a Organizing the skybuffer tuples in a gridgrid
One of goals is indexing the skybuffer
To efficiently insert and delete tuples, as well as identify the tuples dominated by a query tuple
Build skybuffer indexing solution around a simple grid
Example 3
Consider a set of tuples with two categorical attributes, the domain of both attributes being the poset of Figures 2. Suppose that the poset values have been mapped to integers according to topological sort (i) in Figure 2. Then, we create the grid by using this topological sort as the grid scales and place the skybuffer tuples in its cell.
Each grid cell corresponds to a unique combination of attribute values and only contains tuples with these exact values
– The cell corresponding to tuples with attribute values (d, e) has been marked with an “x”.
Database Tuning - 11
Copyright 2008 by CEBT
Improving the skybuffer Improving the skybuffer organizationorganization
1. Visiting only relevant cells
One of advantage of the grid-based index
Directly identify and process precisely the cells that contain dominated tuples (directly visit the cells marked with •)
Lemma 5 (Focused Search)
Let t be a tuple with d partially-ordered categorical attributes X1
∈ Domi, …, Xd ∈ Domd. Let dom(t.Xi) be the values in Domi such
that t.Xi ≥ v, v ∈ Domi. Then, t dominates a tuple s if and only if
s ∈ dom(t.X1) x … x dom(t.Xd)
Example
– The query tuple is (d,e) in figure 3. The domain values dominated by d are {d,g}. The domain values dominated by e are {e,g,h}. Then only tuples with values in {d,g} x {e,g,h} are dominated by {d,e}.
Database Tuning - 12
Copyright 2008 by CEBT
Improving the skybuffer Improving the skybuffer organizationorganization
2. Controlling the grid granularity
A problem with the grid-based index
The lack of control over granularity of the grid
Example
– If the tuples have 4 attributes and each domain size is about 500, then the grid would be comprised of 62.5 billion cells
Definition 2
Consider a DAG and its vertices. A vertex is a source if it has no incoming edges. Then, the depth of a vertex in a DAG is the length of the longest path from a source to the vertex.
Lemma 6
Let v1, …, vm be the m values of a partially-ordered domain Dom.
Then vi > vj only if depth(vi) < depth(vj)
Database Tuning - 13
Copyright 2008 by CEBT
Improving the skybuffer Improving the skybuffer organizationorganization
Example 4
Consider a data set consisting of tuples with 2 partially-ordered categorical attributes, the domain of both of them being the poset of Figure 4(a). Then, we can group domain values that lie on the same depth and create the grid of Figure 4(b). Every cell would contain tuples with the same attribute values. Now, tuples with corresponding attributes that lie at the same depth level are placed in the same cell.
– tuple (d,e) lies in the cell marked with “x” in Figure 4(b), along with other tuples with values in {c,d,e} x {c,d,e}.
Problem
The cells can contain tuples both dominated and not dominated by the query tuple
– A dominance check is required to identify the dominated tuples
Database Tuning - 14
Copyright 2008 by CEBT
Improving the skybuffer Improving the skybuffer organizationorganization
3. Poset partitioning
Example 5
Suppose that our data set contains tuples with two partially-ordered categorical attributes, the domain of both being the poset of Figure 5(a). The values of the domain have been grouped. If we order the groups in ascending order of their depth value, breaking ties arbitrarily, we propose a valid topological sort for the poset
Figure 5(b) presents the resulting grid by using as scales the grouping of Figure 5(a)
The cells that contain candidate are marked with •
– This allows us to use focused search in order to directly access relevant cells.
Database Tuning - 15
All cells that can potentially contain tuples that are dominated by (d,e) are locatedin the rectangular area.
All cells that can potentially contain tuples that are dominated by (d,e) are locatedin the rectangular area.
Copyright 2008 by CEBT
Experimental EvaluationExperimental Evaluation
SDC (Stratification by Dominance Classification) Introduced in Stratified computation of skylines with partially-
ordered domains written by Chan et al. in 2005.
Improves over BBS+ in terms of both progressiveness and speed
Suggested in the numerical skyline maintenance solution in offline environment
Performance evaluation of STARS and SDC Goal
– To identify the impact of the buffer size, data dimensionality and domain structure on performance
Measure
– The average time required to process a buffer update
– i.e., A combined tuple arrival and expiration
Database Tuning - 23
Copyright 2008 by CEBT
Experimental EvaluationExperimental Evaluation
Performance on synthetic data
STARS outperforms SDC by an order of magnitude
The time required by STARS is in the order of a millisecond or less Rendering its use in real life applications entirely realistic
Database Tuning - 24
Effect of buffer size and dimensionality on performance
STARSSDC
Copyright 2008 by CEBT
Experimental EvaluationExperimental Evaluation
Performance on a stream of real, skewed and correlated data
Used DMV data set and three categorical attributes of the “cars” table: Maker/Model, Color, and Year
The result can be attributed to the presence of skew and correlation in the data that leads considerably larger skylines as the buffer size increases
Database Tuning - 25
Performance on skewed real data
STARSSDC
3 dimensional tuples
Copyright 2008 by CEBT
Experimental EvaluationExperimental Evaluation
Further evlauation of STARS
Pruning efficency
Measured as the fraction of skyline tuples that need to be examined on average in order to answer a dominance query
Found to be resilient to increases in data dimensionality
Database Tuning - 26
Pruning efficiency of arrangement skyline organization
Every depth level has twice the number of values from the level above
Every depth level has twice the number of values from the level above
All depth levels have the same number of valuesAll depth levels have the same number of values
Copyright 2008 by CEBT
Experimental EvaluationExperimental Evaluation
Demonstrates the potential perfomance benefits by utilizing the techniques of ‘improving the skybuffer organization’
The benefit of increasing the grid granularity is eventually offset( 상쇄 ) by the overhead of visiting many sparse cells
Knee in the curves
Database Tuning - 27
Effect of grid granularity on performance
Every depth level has twice the number of values from the level above
Every depth level has twice the number of values from the level above
All depth levels have the same number of valuesAll depth levels have the same number of values
3 dimensional tuples
Copyright 2008 by CEBT
ConclusionsConclusions
Identified and motivated the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes
Realized two novel techniques
To constitute the building blocks of an efficient solution to the problem.
Introduce a lightweight data structure for indexing the tuples in the streaming buffer
IDS Lab Seminar (User Log Analysis) - 28
SkylineSkyline
Center for E-Business TechnologySeoul National University
Seoul, Korea
Nam, Kwang-hyun
Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea
Copyright 2008 by CEBT
Introduction of Skyline QueryIntroduction of Skyline Query
Motivation
The amount of information is explosively increasing
– Users often face information overload
How can different attributes be compared?
Solution
Return objects which are not dominated by any other object in all attributes
Intuitive querying
– No overhead of formulation
IDS Lab Seminar (User Log Analysis) - 30
AABB
Low
Low
High
High
Price
Speed
Copyright 2008 by CEBT
Categorical Skylines for Categorical Skylines for Streaming DataStreaming Data
Motivation
Existing work is not proper to maintain skylines for streaming data
Skyline maintenance of streaming tuples
Incoming tuple
– Consider whether incoming tuple can become part of skyline
Outgoing tuple
– Consider whether the tuples dominated by outgoing tuple can be candidates for entering skyline
Skyline maintenance solution
– Checking whether a tuple is dominated by the current line
– Retrieving the tuples in the buffer that are dominated by the outgoing sky line tuple
Database Tuning - 31
Copyright 2008 by CEBT
Categorical Skylines for Categorical Skylines for Streaming DataStreaming Data
Proposal
Indexing the skybuffer
– To insert and delete tuples, as well as identify the skybuffer tuples dominated by a query tuple efficiently
– 1. Topological sorting
– 2. Organizaing the skybuffer tuples in a grid
– 3. Improving the skybuffer
Visiting only relevant cells, Controlling the grid granularity, Poset partitioning
Advantages
Optimize query evaluation
Applicable to a wide range of dimensionality
Adapt any size of poset
Database Tuning - 32
Top Related