Finding skyline on the fly HKU CS DB Seminar 21 July 2004 Speaker: Eric Lo.


Finding skyline on the fly

HKU CS DB Seminar, 21 July 2004

Speaker: Eric Lo


Skyline

A new operator (like “ORDER BY”) in database systems
A set of data points that are not dominated by any other data point


Example

Find some good places for us to hold the next DB Seminar
  Good: closer to HKU (Min)
  Good: larger area (Max)
Return those homes that are not worse than some other home in ALL DIMENSIONS

Dataset (Table Homes):

Home      Distance from HKU   Area (m2)
K.K Loo   1 km                10
Ben       9 km                100
Ivy       5 km                2
Nikos     8 km                250
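The dominance test behind this example can be sketched in a few lines of Python. This is only an illustration on the Homes table; the names `Home`, `dominates` and `skyline` are made up for the sketch and do not come from the talk. It uses the usual definition: one home dominates another if it is no worse in every dimension and strictly better in at least one.

```python
from typing import NamedTuple


class Home(NamedTuple):
    name: str
    distance_km: float  # smaller is better (Min)
    area_m2: float      # larger is better (Max)


def dominates(a: Home, b: Home) -> bool:
    """True if a is no worse than b in both dimensions and strictly better in one."""
    no_worse = a.distance_km <= b.distance_km and a.area_m2 >= b.area_m2
    strictly_better = a.distance_km < b.distance_km or a.area_m2 > b.area_m2
    return no_worse and strictly_better


def skyline(homes):
    """Naive quadratic skyline: keep every home that no other home dominates."""
    return [h for h in homes if not any(dominates(o, h) for o in homes)]


HOMES = [
    Home("K.K Loo", 1, 10),
    Home("Ben", 9, 100),
    Home("Ivy", 5, 2),
    Home("Nikos", 8, 250),
]

print([h.name for h in skyline(HOMES)])  # ['K.K Loo', 'Nikos']
```

Ben is dominated by Nikos (closer and larger), and Ivy is dominated by K.K Loo, so only K.K Loo and Nikos remain.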


Outline

Introduction to skyline queries
Non-progressive skylining on the Web
  Basic Distributed Skyline Algorithm (BDS)
Progressive skylining on the Web
Experimental results
Conclusion and future directions


Skylining on the Web

One distributed site holds one attribute
  Attribute “Distance from HKU” stored at HKU
  Attribute “Area (m2)” stored at Purdue

HKU site:

Home      Distance from HKU
K.K Loo   1 km
Ben       9 km
Ivy       5 km
Nikos     8 km

Purdue site:

Home      Area (m2)
K.K Loo   10
Ben       100
Ivy       2
Nikos     250

(Figure: the HKU and Purdue sites connected over the Internet)


Accessing interfaces

(Figure: the HKU and Purdue tables from the previous slide, connected over the Internet)

Interfaces of Web-accessible sites:

1. Sorted Access (SA):
   HKU.getNext() returns the 1st-ranked data tuple, “K.K Loo”
   HKU.getNext() returns the 2nd, “Ivy”; HKU.getNext() returns the 3rd, “Nikos”; ...
2. Random Access (RA):
   Purdue.getScore(“K.K Loo”) returns 10 m2
   HKU.getScore(“Nikos”) returns 8 km
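As a rough illustration of the two interfaces, the sketch below models each site as an in-memory object. The class `Source` and its snake_case methods `get_next` and `get_score` are hypothetical stand-ins for the getNext() and getScore() calls on the slide; a real site would be reached over the network.

```python
class Source:
    """Hypothetical stand-in for one Web-accessible site holding a single attribute."""

    def __init__(self, scores, maximize=False):
        self._scores = dict(scores)
        # Best value first: ascending for Min attributes, descending for Max.
        self._ranked = sorted(self._scores, key=self._scores.get, reverse=maximize)
        self._cursor = 0

    def get_next(self):
        """Sorted access (SA): next-best (name, score), or None when exhausted."""
        if self._cursor >= len(self._ranked):
            return None
        name = self._ranked[self._cursor]
        self._cursor += 1
        return name, self._scores[name]

    def get_score(self, name):
        """Random access (RA): score of the named object."""
        return self._scores[name]


hku = Source({"K.K Loo": 1, "Ben": 9, "Ivy": 5, "Nikos": 8})
purdue = Source({"K.K Loo": 10, "Ben": 100, "Ivy": 2, "Nikos": 250}, maximize=True)

print([hku.get_next() for _ in range(3)])  # [('K.K Loo', 1), ('Ivy', 5), ('Nikos', 8)]
print(purdue.get_score("K.K Loo"))         # 10
print(hku.get_score("Nikos"))              # 8
```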


Basic distributed skyline algorithm (EDBT 04)

Phase 1 – find all possible skyline objects:
  Perform sorted access on each source, one by one:
    S1.getNext(), S2.getNext(), S3.getNext(), S1.getNext(), S2.getNext(), ...
  Stop when there is an object whose attribute values are all known
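Phase 1 can be sketched as a round-robin loop over pre-sorted lists. This is a simplified illustration, not the published algorithm verbatim: the function name `bds_phase1` and the list-based inputs are assumptions, and each list stands in for repeated getNext() calls on one source.

```python
def bds_phase1(sorted_lists):
    """Round-robin sorted access until some object has been seen in every list.

    sorted_lists: one list of (object_id, score) per source, best score first.
    Returns (terminating_object, seen, accesses), where seen maps each object
    to the {source_index: score} pairs discovered so far.
    """
    n = len(sorted_lists)
    cursors = [0] * n
    seen = {}
    accesses = 0
    while any(cursors[i] < len(sorted_lists[i]) for i in range(n)):
        for i in range(n):
            if cursors[i] >= len(sorted_lists[i]):
                continue  # this source is exhausted
            obj, score = sorted_lists[i][cursors[i]]
            cursors[i] += 1
            accesses += 1
            seen.setdefault(obj, {})[i] = score
            if len(seen[obj]) == n:
                return obj, seen, accesses  # all attribute values of obj are known
    return None, seen, accesses


# Homes data: S1 sorted by distance (ascending), S2 sorted by area (descending).
S1 = [("K.K Loo", 1), ("Ivy", 5), ("Nikos", 8), ("Ben", 9)]
S2 = [("Nikos", 250), ("Ben", 100), ("K.K Loo", 10), ("Ivy", 2)]

terminating, seen, accesses = bds_phase1([S1, S2])
print(terminating, accesses)  # Nikos 5
```

On the Homes data the fifth sorted access returns Nikos from S1, and Nikos was already seen on S2, so Nikos is the terminating object.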


Phase 1: f is the terminating object


Phase 1 (15 sorted accesses)


Implication

f is the terminating object
Objects that do not appear must be dominated by f


Phase 2

Find the skyline from the candidates of phase 1
During the sequential scanning of the sources, data structures K1, K2, K3, …, Kn are created
  n is the number of dimensions
If source i's getNext() returns a data object d:
  1. create an entry in Ki
  2. update the lower_bound of source i


Phase 2: find skyline from candidates Ki

A lemma shows that “objects can only be dominated by objects in the same set Ki”
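A deliberately simplified sketch of the candidate-filtering step follows. It assumes distinct attribute values, fills in the missing scores of every phase 1 candidate by random access, and then compares all candidate pairs. It does not implement the Ki sets or lower bounds from the slides, which restrict the comparisons further; `bds_phase2`, `dominates` and `TABLES` are illustrative names only.

```python
def dominates(a, b, maximize):
    """a, b: score tuples; maximize[i] is True when larger is better in dimension i."""
    no_worse = all((x >= y) if mx else (x <= y) for x, y, mx in zip(a, b, maximize))
    better = any((x > y) if mx else (x < y) for x, y, mx in zip(a, b, maximize))
    return no_worse and better


def bds_phase2(seen, random_access, maximize):
    """Complete each candidate by random access, then drop dominated candidates.

    seen: {object: {source_index: score}} as left behind by phase 1.
    random_access(i, obj): score of obj at source i (stand-in for getScore()).
    """
    n = len(maximize)
    full = {
        obj: tuple(known[i] if i in known else random_access(i, obj) for i in range(n))
        for obj, known in seen.items()
    }
    return sorted(
        obj for obj, scores in full.items()
        if not any(dominates(other, scores, maximize) for other in full.values())
    )


# Partial knowledge after phase 1 on the Homes data (source 0 = distance, source 1 = area).
seen = {"K.K Loo": {0: 1}, "Nikos": {0: 8, 1: 250}, "Ivy": {0: 5}, "Ben": {1: 100}}
TABLES = [
    {"K.K Loo": 1, "Ben": 9, "Ivy": 5, "Nikos": 8},
    {"K.K Loo": 10, "Ben": 100, "Ivy": 2, "Nikos": 250},
]

result = bds_phase2(seen, lambda i, obj: TABLES[i][obj], maximize=(False, True))
print(result)  # ['K.K Loo', 'Nikos']
```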


Motivations

BDS returns skyline results in a batch
In practice, it would be useful to return skyline results progressively, so that users could adjust their decisions right away
Consider the “next DB seminar” skyline example:
  Minimize “Distance from HKU”, maximize “Area”
  <Nikos: 8 km, 250 m2> is returned first
  Travelling from HKU to Nikos’s home takes a $50 bus ride!
  So add the “travel-expense” attribute to the skyline query


Progressive Distributed Skylining (PDS)

Goal:
  Evaluate skyline queries progressively with minimal overhead
Overhead:
  Network/data source accesses
  Computational time


Enable progressiveness

To identify whether a data point belongs to the final skyline, we rely on the following lemma (assuming the data values are distinct):
  If a data source Di returns data objects in a strictly monotonic order, an object O retrieved from Di can only be dominated by objects that are retrieved from Di before O


If an object O is retrieved from a data source by sorted access, we only need to test whether O is dominated by any object that appears before O in the same source

2 usages:
  1. We don’t need to consider objects that appear in other data sources
  2. After the test, we can output O as a skyline object immediately: O must be in the skyline, and we do not need to worry that objects appearing later would dominate O
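The lemma suggests a simple progressive loop, sketched below under the lemma's own assumption of distinct values. The generator `progressive_skyline` and the `lookup` callback (a stand-in for random accesses to the other sources) are hypothetical names; the sketch follows a single source only and keeps earlier objects in a plain list.

```python
def dominates(a, b, maximize):
    """a, b: score tuples; maximize[i] is True when larger is better in dimension i."""
    no_worse = all((x >= y) if mx else (x <= y) for x, y, mx in zip(a, b, maximize))
    better = any((x > y) if mx else (x < y) for x, y, mx in zip(a, b, maximize))
    return no_worse and better


def progressive_skyline(sorted_ids, lookup, maximize):
    """Yield skyline objects as soon as they are confirmed.

    sorted_ids: object ids in the strictly monotonic order returned by one source Di.
    lookup(obj): full score tuple of obj.
    """
    earlier = []  # score tuples of objects already retrieved from Di
    for obj in sorted_ids:
        scores = lookup(obj)
        # By the lemma, only objects retrieved earlier from Di can dominate obj.
        if not any(dominates(prev, scores, maximize) for prev in earlier):
            yield obj  # safe to output immediately
        earlier.append(scores)


# Homes data as (distance km, area m2), scanned in the order HKU returns them.
HOMES = {"K.K Loo": (1, 10), "Ben": (9, 100), "Ivy": (5, 2), "Nikos": (8, 250)}
hku_order = ["K.K Loo", "Ivy", "Nikos", "Ben"]

for name in progressive_skyline(hku_order, HOMES.get, maximize=(False, True)):
    print(name)  # K.K Loo, then Nikos
```

K.K Loo is reported after the very first access, before any other object has been read.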


An R-tree approach

Build an R-tree Ri for each attribute/data source i involved in the skyline query
For each object O retrieved from source i, we check whether any object in Ri dominates O
  If no such object exists, O is a skyline object (output it immediately)
  If some object in Ri dominates O, O is not a skyline object (O is discarded immediately)


D3.getNext() the 1st time

SA on D3 returns e<1>
e is a skyline object (no object is better than e on D3); e(7,4) is projected into R-tree R3

(Figure: e(7,4) plotted on axes D1 and D2)


D3.getNext() the 2nd time

SA on D3 returns c<2>
Construct a query Q(origin, c) on R3
Q returns no answer, so c is a skyline object; insert c into R3

(Figure: e(7,4) and c(2,5))


D3.getNext() the 3rd time

SA on D3 returns j<3>
Construct a query Q(origin, j) on R3
Q returns c as an answer, so j is dominated by c; discard j

(Figure: e(7,4), c(2,5) and j(6,10))


D3.getNext() the 4th time

SA on D3 returns f<4>; construct a query Q(origin, f) on R3
Q returns no answer, so f is a skyline object
Delete e after the insertion of f to make the R-tree more compact and efficient

(Figure: e(7,4) and c(2,5))


The R-tree approach

The R-tree is very small in size, since it stores only the skyline objects with the highest pruning power
The containment query operation is very efficient
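The walk-through on D3 can be replayed with a small sketch. A plain dictionary scan stands in for the R-tree here, so it shows only the logic of the containment query Q(origin, O), not its efficiency. It assumes smaller values are better in both projected dimensions (D1, D2) and that values are distinct; `window_query` and `process` are illustrative names, and object f is left out because its coordinates are not given on the slides.

```python
def window_query(points, corner):
    """Stand-in for the R-tree query Q(origin, corner).

    Returns the names of stored points lying inside the box that spans from the
    origin to corner. A real R-tree would answer this without scanning all points.
    """
    return [
        name for name, p in points.items()
        if all(x <= c for x, c in zip(p, corner))
    ]


def process(stream):
    """Handle objects in the order sorted access on D3 returns them."""
    r3 = {}        # plays the role of R3: skyline objects projected onto (D1, D2)
    skyline = []
    for name, point in stream:
        if window_query(r3, point):
            continue           # some stored object dominates it: discard
        skyline.append(name)   # no answer: output immediately
        r3[name] = point       # and insert into R3
    return skyline


stream = [("e", (7, 4)), ("c", (2, 5)), ("j", (6, 10))]
print(process(stream))  # ['e', 'c']
```

The query box for j, from the origin to (6, 10), contains c(2, 5) but not e(7, 4), which matches the slide where Q returns c and j is discarded.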


A linear regression based heuristic

The R-tree approach enables progressiveness with better efficiency
We use a linear regression based heuristic to minimize the number of source accesses during the evaluation process


A rank based approach

1. We use linear regression to estimate the rank of objects along the process

2. Assume the object with lowest rank is the real terminating object and probe the sources accordingly (rather than round-robin)
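The slides do not spell out how the regression is set up, so the following is only one plausible reading, labeled as an assumption: fit a least-squares line through the (score, rank) pairs already returned by sorted access on a source, then use it to guess the rank at which an object with a known score would appear there. The helpers `fit_line` and `estimate_rank` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Needs at least two distinct x values, otherwise the slope is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_rank(seen_scores, score):
    """Estimate the rank of an object with the given score on one source.

    seen_scores: scores returned so far by sorted access, in rank order (rank 1 first).
    """
    ranks = list(range(1, len(seen_scores) + 1))
    slope, intercept = fit_line(seen_scores, ranks)
    return slope * score + intercept


# Illustrative scores that happen to be exactly linear in rank: ranks 1, 2, 3.
print(estimate_rank([2, 4, 6], 10))  # 5.0
```

Under this reading, the object whose estimated ranks are lowest would be treated as the likely terminating object, and the sources would be probed toward it instead of in round-robin order.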


Extensions

Evaluation of top-K skyline queries
Progress indicator (based on the estimated ranks)


Experimental results – Number of source accesses


Experimental results – Number of source accesses

(Figure panels: Random Distribution; Denormalized Domain)


Experimental results – progressive behavior


Experimental results – progress indicator


Conclusion and future directions

Skyline queries on the Web
Return skyline points on the fly
Future work:
  Improve the usability of PDS by allowing users to trade off between progressiveness and efficiency
  Compute skylines from real-time stream data
  Handle the case where only 1 data source supports sorted access and the rest support random access only


References

S. Börzsönyi, D. Kossmann, K. Stocker, The Skyline Operator, in ICDE 2001.
D. Kossmann, F. Ramsak, S. Rost, Shooting Stars in the Sky: An Online Algorithm for Skyline Queries, in VLDB 2002.
W.-T. Balke, U. Güntzer, J. X. Zheng, Efficient Distributed Skylining for Web Information Systems, in EDBT 2004.