
1

Aggregation Algorithms and Instance Optimality

Moni Naor

Weizmann Institute

Joint work with Ron Fagin Amnon Lotem

2

Aggregating information from several lists/sources

• Define the problem

• Ways to evaluate algorithms

• New algorithms

• Further Research

3

The problem

• Database D of N objects

• An object R has m fields - (x1, x2, …, xm)

– Each xi ∈ [0,1]

• The objects are given in m lists L1, L2, …, Lm

– List Li: all objects, sorted by xi value (highest first)

• An aggregation function t(x1,x2,…xm)

– t(x1,x2,…xm) - a monotone increasing function

Wanted: top k objects according to t
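To make the setting concrete, here is a minimal Python sketch of the data model together with a naive baseline that touches every object. The names `sorted_lists` and `naive_top_k` are hypothetical, and the sample field values are borrowed from the two example lists on the next slide; the baseline is only a point of comparison, not one of the algorithms from the talk.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical sample database: object id -> (x1, x2), every field in [0, 1].
db: Dict[str, Tuple[float, ...]] = {
    "c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
    "r": (0.5, 0.75), "a": (0.4, 0.84),
}

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build the m lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [
        sorted(((obj, fields[i]) for obj, fields in db.items()),
               key=lambda pair: -pair[1])
        for i in range(m)
    ]

def naive_top_k(db: Dict[str, Tuple[float, ...]],
                t: Callable[[Tuple[float, ...]], float],
                k: int) -> List[Tuple[float, str]]:
    """Baseline: score all N objects with t and keep the k largest."""
    return heapq.nlargest(k, ((t(fields), obj) for obj, fields in db.items()))

lists = sorted_lists(db)
top2 = naive_top_k(db, min, 2)  # t = min is monotone
print([obj for obj, _ in lists[0]])
print(top2)
```

The baseline always reads all N objects, which is exactly the cost the algorithms in the following slides try to avoid.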

4

Goal

• Touch as few objects as possible

• Access to object?

Example lists (each sorted by its own field, highest first):

List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4

List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2

5

Where?

Problem arises when combining information from several sources/criteria

Concentrate on middleware complexity without changing subsystems

6

Example: Combining Fuzzy Information

Lists are results of query: "find object with color 'red' and shape 'round'"

• Subsystems for color and for shape.

– Each returns a score in [0,1] for each object

• Aggregating function t is how the middleware system should combine the two criteria

– Example: t(R=(x1,x2)) could be min(x1,x2)

7

Example: scheduling pages

Each object - page in a data broadcast system

• 1st field - # of users requesting the page

• 2nd field - longest time a user has been waiting

Combining function t - product of the two fields (geometric mean)

Goal: find the page with the largest product

8

Example: Information Retrieval

Query T1, T2, T3: find documents with largest sum of entries

[Figure: term-document weight matrix relating terms T1, …, Tn to documents D1, …, Dk, with entries such as W12]

Aggregation function t is Σ xi

9

Modes of Access to the Lists

• Sequential/sorted access: obtain next object in list Li

– cost cS

• Random access: for object R and i ≤ m obtain xi

– cost cR

Cost of an execution:

cS · (# of seq. accesses) + cR · (# of random accesses)
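As a small worked instance of this cost formula, here is a sketch in Python. The function name `middleware_cost` and the access counts in the example are made up for illustration; only the formula itself comes from the slide.

```python
def middleware_cost(sorted_accesses: int, random_accesses: int,
                    c_s: float, c_r: float) -> float:
    """Cost of an execution: cS * (# sorted accesses) + cR * (# random accesses)."""
    return c_s * sorted_accesses + c_r * random_accesses

# Hypothetical run: 10 sorted accesses, 4 random accesses, cS = 1, cR = 5.
example_cost = middleware_cost(10, 4, c_s=1.0, c_r=5.0)
print(example_cost)  # 30.0
```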

10

Interesting Cases

• cR /cS is small

• cS cR or

• cR >> cS

Number of lists m - small

11

Fagin’s Algorithm - FA

• For all lists L1, L2, …, Lm get the next object in sorted order.

• Stop when there is a set of k objects that have appeared in all lists.

• For every object R encountered – retrieve all fields x1, x2, …, xm.

– Compute t(x1,x2,…xm)

• Return top k objects
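The steps of FA can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sorted_lists` and `fagin_algorithm` are hypothetical, random access is simulated with in-memory dictionaries, and the sample values are the two example lists shown earlier in the talk.

```python
from typing import Callable, Dict, List, Set, Tuple

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [sorted(((o, f[i]) for o, f in db.items()), key=lambda p: -p[1])
            for i in range(m)]

def fagin_algorithm(lists: List[List[Tuple[str, float]]],
                    t: Callable[[Tuple[float, ...]], float],
                    k: int) -> Tuple[List[Tuple[float, str]], int]:
    """Return (top k as (score, object) pairs, depth of sorted access)."""
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen_in: Dict[str, Set[int]] = {}       # object -> lists where it was seen
    fully_seen = 0                          # objects seen in all m lists
    depth = 0
    # Phase 1: sorted access in parallel until k objects appeared in all lists.
    while fully_seen < k and depth < n:
        for i in range(m):
            obj, _ = lists[i][depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                fully_seen += 1
        depth += 1
    # Phase 2: random access for every object encountered, then rank by t.
    scored = [(t(tuple(lookup[i][obj] for i in range(m))), obj)
              for obj in seen_in]
    scored.sort(reverse=True)
    return scored[:k], depth

db = {"c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
      "r": (0.5, 0.75), "a": (0.4, 0.84)}
fa_result = fagin_algorithm(sorted_lists(db), min, 1)
print(fa_result)
```

On this tiny database FA stops at depth 3, when object s has appeared in both lists, but by then it has already encountered all five objects.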

12

Correctness of FA...

For any monotone t and any database D of objects, FA finds the top k objects.

Proof: any object in the real top k is better in at least one field than the objects in the intersection.

13

Performance of FA

Performance: assuming that the fields are independent, O(N^((m-1)/m)).

Better performance - correlation between fields

Worse performance - negative correlation

Bad aggregating function: max

14

Goals of this work

• Improve complexity and analysis - worst case not meaningful

Instead consider Instance Optimality

• Expand the range of functions: want to handle all monotone aggregating functions

• Simplify implementation

15

Instance Optimality

𝒜 = class of algorithms,

𝒟 = class of legal inputs.

For A ∈ 𝒜 and D ∈ 𝒟, measure cost(A, D) ≥ 0.

• An algorithm A ∈ 𝒜 is instance optimal over 𝒜 and 𝒟 if there are constants c1 and c2 s.t.

for every A' ∈ 𝒜 and D ∈ 𝒟:

cost(A, D) ≤ c1 · cost(A', D) + c2.

c1 is called the optimality ratio

16

…Instance Optimality

• Common in competitive online analysis

– Compare an online decision making algorithm to the best offline one.

• Approximation Algorithms

– Compare the size that the best algorithm can find to the one the approx. algorithm finds

In our case

– "Offline" corresponds to nondeterminism

17

…Instance Optimality

• We show algorithms that are instance optimal for a variety of

– Classes of algorithms: deterministic, probabilistic, approximate

– Databases

– Access cost functions

18

Guidelines for Design of Algorithms

• Format: do sequential/sorted access (with random access on other fields) until you know that you have seen the top k.

• In general: greedy gathering of information; if a query might allow you to know the top k objects, do it.

Works in all considered scenarios

19

The Threshold Algorithm - TA

• For all lists L1, L2, …, Lm get the next object in sorted order.

• For each object R returned

– Retrieve all fields x1, x2, …, xm.

– Compute t(x1, x2, …, xm)

– If one of top k answers so far - remember it.

• For each 1 ≤ i ≤ m let x̲i be the bottom value seen in Li (so far)

– Define the threshold value τ to be t(x̲1, x̲2, …, x̲m)

• Stop when found k objects with t value ≥ τ.

– Return top k objects
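The TA steps can be sketched in Python as below. This is a minimal illustration, not the authors' code: the names `sorted_lists` and `threshold_algorithm` are hypothetical, random access is simulated with dictionaries, and the sample objects are the ones from the m=2, k=1, t=min example in the talk. Only the current top k and the m bottom values are kept between rounds.

```python
from typing import Callable, Dict, List, Tuple

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [sorted(((o, f[i]) for o, f in db.items()), key=lambda p: -p[1])
            for i in range(m)]

def threshold_algorithm(lists: List[List[Tuple[str, float]]],
                        t: Callable[[Tuple[float, ...]], float],
                        k: int) -> Tuple[List[Tuple[float, str]], int]:
    """Return (top k as (score, object) pairs, depth reached).

    Assumes non-empty lists of equal length.
    """
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    top: Dict[str, float] = {}              # bounded buffer: at most k entries
    depth = 0
    for depth in range(1, n + 1):
        bottom = []                         # bottom value seen in each list
        for i in range(m):
            obj, x_i = lists[i][depth - 1]  # sorted access
            bottom.append(x_i)
            if obj not in top:
                top[obj] = t(tuple(lookup[j][obj] for j in range(m)))
                if len(top) > k:            # keep only the k best so far
                    del top[min(top, key=top.get)]
        threshold = t(tuple(bottom))
        if len(top) == k and min(top.values()) >= threshold:
            break                           # k objects with t value >= threshold
    return sorted(((s, o) for o, s in top.items()), reverse=True), depth

db = {"c": (0.9, 1/12), "b": (0.7, 1/11), "r": (0.4, 1/8), "a": (0.1, 1/13),
      "s": (0.05, 3/4), "w": (0.07, 2/3), "z": (0.09, 1/2), "q": (0.08, 1/4)}
ta_result = threshold_algorithm(sorted_lists(db), min, 1)
print(ta_result)
```

With t = min the threshold falls from 3/4 to 2/3 to 0.4 to 0.1 over four rounds, and the run stops in round 4 because t(r) = 1/8 is at least 0.1.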

20

Example: m=2, k=1, t is min

Objects (x1, x2):

c = (0.9, 1/12), b = (0.7, 1/11), r = (0.4, 1/8), a = (0.1, 1/13)

s = (0.05, 3/4), w = (0.07, 2/3), z = (0.09, 1/2), q = (0.08, 1/4)

Top of list L1: c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1

Top of list L2: s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4

Maintained Information

Step   Bottom x1   Bottom x2   Threshold   Top object (so far)
1      0.9         3/4         3/4         c, t(c) = 1/12
2      0.7         2/3         2/3         b, t(b) = 1/11
3      0.4         1/2         0.4         r, t(r) = 1/8
4      0.1         1/4         0.1         r, t(r) = 1/8 (at least the threshold: stop)

21

Correctness of TA

For any monotone t and any database D of objects, TA finds the top k objects.

Proof: If object z was not seen, then

for all 1 ≤ i ≤ m: zi ≤ x̲i

so t(z1, z2, …, zm) ≤ t(x̲1, x̲2, …, x̲m) = τ

22

Implementation of TA

Requires only bounded buffers:

• Top k objects

• Bottom m values x1,x2,…xm

23

Robustness of TA

Approximation: Suppose we want a (1+ε)-approximation - for any R returned and R' not returned

t(R') ≤ (1+ε) · t(R)

Modified stopping condition: Stop when found k objects with t value at least τ/(1+ε).

Early Stopping: can modify TA so that at any point the user is

– Given the current view of the top k list

– Given a guarantee about the approximation
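The modified stopping condition changes a single comparison in TA. The following Python sketch shows this under stated assumptions: `approx_threshold_algorithm` and `sorted_lists` are hypothetical names, random access is simulated with dictionaries, and the sample values are the two example lists from the start of the talk.

```python
from typing import Callable, Dict, List, Tuple

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [sorted(((o, f[i]) for o, f in db.items()), key=lambda p: -p[1])
            for i in range(m)]

def approx_threshold_algorithm(lists: List[List[Tuple[str, float]]],
                               t: Callable[[Tuple[float, ...]], float],
                               k: int,
                               eps: float) -> Tuple[List[Tuple[float, str]], int]:
    """TA with the relaxed stop rule: k objects with t >= threshold / (1 + eps).

    Assumes non-empty lists of equal length; eps = 0 gives plain TA.
    """
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    top: Dict[str, float] = {}
    depth = 0
    for depth in range(1, n + 1):
        bottom = []
        for i in range(m):
            obj, x_i = lists[i][depth - 1]
            bottom.append(x_i)
            if obj not in top:
                top[obj] = t(tuple(lookup[j][obj] for j in range(m)))
                if len(top) > k:
                    del top[min(top, key=top.get)]
        threshold = t(tuple(bottom))
        if len(top) == k and min(top.values()) >= threshold / (1.0 + eps):
            break
    return sorted(((s, o) for o, s in top.items()), reverse=True), depth

db = {"c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
      "r": (0.5, 0.75), "a": (0.4, 0.84)}
lists = sorted_lists(db)
exact = approx_threshold_algorithm(lists, min, 1, eps=0.0)
relaxed = approx_threshold_algorithm(lists, min, 1, eps=0.25)
print(exact, relaxed)
```

On this data the exact run needs depth 3, while the run with ε = 0.25 stops one round earlier; here the relaxed answer happens to coincide with the exact one, which is not guaranteed in general.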

24

Instance Optimality

Intuition: Cannot stop any sooner, since the next object to be explored might have the threshold value.

But, life is a bit more delicate...

25

Wild Guesses

Wild guesses: random access for a field i of object R that has not been sequentially accessed before

• Neither FA nor TA use wild guesses

• Subsystem might not allow wild guesses

More exotic queries: jth position in ith list...

26

Instance Optimality- No Wild Guesses

Theorem: For any monotone t let

• 𝒜 be the class of algorithms that

– correctly find the top k answers for every database with aggregation function t

– do not make wild guesses

• 𝒟 be the class of all databases.

Then TA is instance optimal over 𝒜 and 𝒟

Optimality ratio is m + m² · cR/cS - best possible!

27

Proof of Optimality

Claim: If TA gets to iteration d, then any (correct) algorithm A' must get to depth d-1

Proof: let Rmax be the top object returned by TA, and let τ^(d) be the threshold value at depth d

τ^(d) ≤ t(Rmax) < τ^(d-1)

There exists D' with R' at level d-1

R' = (x̲1^(d-1), x̲2^(d-1), …, x̲m^(d-1))

where A' fails

28

Do wild guesses help?

Aggregation function - min, k=1

Database:

Object:  1   2   …   n   n+1   n+2   …   2n+1
x1:      1   1   …   1   1     0     …   0
x2:      0   0   …   0   1     1     …   1

L1 : 1 2 … n n+1 … 2n+1

L2 : 2n+1 … n+1 n … 1

Wild guess: access object n+1 and top elements
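The gap in this example can be checked with a short Python sketch. The helper names `wild_guess_database` and `ta_depth_for_min` are hypothetical; the second function only replays TA's stopping rule for t = min and k = 1 on two lists, with random access simulated by dictionaries.

```python
from typing import List, Tuple

def wild_guess_database(n: int) -> Tuple[List[Tuple[int, float]],
                                         List[Tuple[int, float]]]:
    """Objects 1..2n+1; x1 = 1 for objects <= n+1, x2 = 1 for objects >= n+1."""
    objects = range(1, 2 * n + 2)
    l1 = [(o, 1.0 if o <= n + 1 else 0.0) for o in objects]            # 1, 2, ..., 2n+1
    l2 = [(o, 1.0 if o >= n + 1 else 0.0) for o in reversed(objects)]  # 2n+1, ..., 1
    return l1, l2

def ta_depth_for_min(l1: List[Tuple[int, float]],
                     l2: List[Tuple[int, float]]) -> int:
    """Depth at which TA (t = min, k = 1) can stop on these two lists."""
    x1, x2 = dict(l1), dict(l2)           # stands in for random access
    best = -1.0
    for depth, ((o1, v1), (o2, v2)) in enumerate(zip(l1, l2), start=1):
        for o in (o1, o2):
            best = max(best, min(x1[o], x2[o]))
        if best >= min(v1, v2):           # threshold = min of bottom values
            return depth
    return len(l1)

n = 50
l1, l2 = wild_guess_database(n)
depth = ta_depth_for_min(l1, l2)
# A wild guess needs only two random accesses: both fields of object n+1.
guess_score = min(dict(l1)[n + 1], dict(l2)[n + 1])
print(depth, guess_score)
```

TA has to walk down to depth n+1 before it meets object n+1, whereas an algorithm that guesses object n+1 directly learns with two random accesses that its score is 1, the largest value possible.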

29

Strict Monotonicity

• An aggregation function t is strictly monotone if, whenever xi < x'i for all 1 ≤ i ≤ m, then

t(x1, x2, …, xm) < t(x'1, x'2, …, x'm)

Examples: min, max, avg...

30

Instance Optimality - Wild Guesses

Theorem: For any strictly monotone t let

• 𝒜 be the class of algorithms that

– correctly find the top k answers for every database.

• 𝒟 be the class of all databases with distinct values in each field.

Then TA is instance optimal over 𝒜 and 𝒟

Optimality ratio is c · m where c = max{cR/cS, cS/cR}

31

Related Work

An algorithm similar to TA was discovered independently by two other groups

• Nepal and Ramakrishna

• Güntzer, Balke and Kiessling

No instance optimality analysis. Hence they proposed modifications that are not instance optimal.

Power of Abstraction?

32

Dealing with the Cost of Random Access

In some scenarios random access may be impossible

– Cannot ask a major search engine for its internal score on some document

In some scenarios random access may be expensive

– Cost corresponds to disk access (seq. vs. random)

Need algorithms to deal with these scenarios

• NRA - No Random Access

• CA - Combined Algorithm

33

No Random Access - NRA

March down the lists getting the next object

Maintain:

• For any object R with discovered fields S ⊆ {1,…,m}:

– W(R) = t(x1, x2, …, x|S|, 0, …, 0)

Worst (smallest) value t(R) can obtain

– B(R) = t(x1, x2, …, x|S|, x̲|S|+1, …, x̲m)

Best (largest) value t(R) can obtain

34

…maintained information (NRA)

• Top k list, based on k largest W(R) seen so far

– Ties broken according to B values

Define Mk to be the kth largest W(R) in top k list

• An object R is viable if B(R) > Mk

Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R ∉ top list

Return the top k list
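The NRA bookkeeping can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `nra` and `sorted_lists` are hypothetical names, W and B are recomputed from scratch every round instead of being maintained incrementally, and unseen objects are covered by also comparing t of the bottom values against Mk. The sample values are the two example lists from the start of the talk.

```python
from typing import Callable, Dict, List, Tuple

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [sorted(((o, f[i]) for o, f in db.items()), key=lambda p: -p[1])
            for i in range(m)]

def nra(lists: List[List[Tuple[str, float]]],
        t: Callable[[Tuple[float, ...]], float],
        k: int) -> Tuple[List[str], int]:
    """Return (top k objects, depth reached) using sorted access only."""
    m, n = len(lists), len(lists[0])
    known: Dict[str, Dict[int, float]] = {}   # object -> discovered fields
    top: List[str] = []
    depth = 0
    for depth in range(1, n + 1):
        bottom = [lists[i][depth - 1][1] for i in range(m)]
        for i in range(m):
            obj, x_i = lists[i][depth - 1]
            known.setdefault(obj, {})[i] = x_i

        def worst(obj: str) -> float:         # W(R): missing fields set to 0
            return t(tuple(known[obj].get(i, 0.0) for i in range(m)))

        def best(obj: str) -> float:          # B(R): missing fields set to bottom values
            return t(tuple(known[obj].get(i, bottom[i]) for i in range(m)))

        # Top k list by W, ties broken by B.
        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            m_k = worst(top[-1])
            no_viable_seen = all(best(o) <= m_k for o in rest)
            no_viable_unseen = t(tuple(bottom)) <= m_k
            if no_viable_seen and no_viable_unseen:
                break
    return top, depth

db = {"c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
      "r": (0.5, 0.75), "a": (0.4, 0.84)}
nra_top1 = nra(sorted_lists(db), min, 1)
nra_top2 = nra(sorted_lists(db), min, 2)
print(nra_top1, nra_top2)
```

Note that NRA returns the top k objects without necessarily knowing their exact t values, since some fields may never be discovered.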

35

Correctness

For any monotone t and any database D of objects, NRA finds the top k objects.

Proof: At any point, for all objects t(R) ≤ B(R)

Once B(R) ≤ Mk for all objects outside the top list,

no other object has t(R) > Mk

36

Optimality

Theorem: For any monotone t let

• 𝒜 be the class of algorithms that

– correctly find the top k answers for every database.

– make only sequential access

• 𝒟 be the class of all databases.

Then NRA is instance optimal over 𝒜 and 𝒟

Optimality ratio is m

37

Implementation of NRA

• Not so simple - need to update B(R) for all existing R when the bottom values x̲1, x̲2, …, x̲m change

• For specific aggregation functions (min) there are good data structures

Open Problem: Which aggregation functions have good data structures?

38

Combined Algorithm CA

Can combine TA and NRA

Let h = cR /cS

Maintain information as in NRA

For every h sequential accesses:

• Do m random accesses on an object from each list. Choose the top viable object for which not all fields are known
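One possible reading of CA is sketched below in Python. Several points are assumptions made for illustration rather than details taken from the slide: h is rounded down to an integer (at least 1), "every h sequential accesses" is interpreted as every h rounds of parallel sorted access, and a single object (the viable one with the largest B value and some unknown field) is resolved per random-access phase. The names `combined_algorithm` and `sorted_lists` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def sorted_lists(db: Dict[str, Tuple[float, ...]]) -> List[List[Tuple[str, float]]]:
    """Build lists L1..Lm; list Li holds (object, xi), highest xi first."""
    m = len(next(iter(db.values())))
    return [sorted(((o, f[i]) for o, f in db.items()), key=lambda p: -p[1])
            for i in range(m)]

def combined_algorithm(lists: List[List[Tuple[str, float]]],
                       t: Callable[[Tuple[float, ...]], float],
                       k: int, c_s: float, c_r: float) -> Tuple[List[str], int, int]:
    """Return (top k objects, depth reached, number of random accesses)."""
    m, n = len(lists), len(lists[0])
    h = max(1, int(c_r // c_s))               # assumption: h = floor(cR / cS), >= 1
    lookup = [dict(lst) for lst in lists]     # stands in for random access
    known: Dict[str, Dict[int, float]] = {}   # maintained as in NRA
    top: List[str] = []
    random_accesses = 0
    depth = 0
    for depth in range(1, n + 1):
        bottom = [lists[i][depth - 1][1] for i in range(m)]
        for i in range(m):
            obj, x_i = lists[i][depth - 1]
            known.setdefault(obj, {})[i] = x_i

        def worst(obj: str) -> float:         # W(R)
            return t(tuple(known[obj].get(i, 0.0) for i in range(m)))

        def best(obj: str) -> float:          # B(R)
            return t(tuple(known[obj].get(i, bottom[i]) for i in range(m)))

        def ranking() -> List[str]:
            return sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)

        if depth % h == 0:                    # random-access phase
            ranked = ranking()
            m_k = worst(ranked[k - 1]) if len(ranked) >= k else float("-inf")
            candidates = [o for o in known
                          if len(known[o]) < m and best(o) > m_k]
            if candidates:
                target = max(candidates, key=best)
                for i in range(m):
                    if i not in known[target]:
                        known[target][i] = lookup[i][target]
                        random_accesses += 1

        ranked = ranking()
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            m_k = worst(top[-1])
            if all(best(o) <= m_k for o in rest) and t(tuple(bottom)) <= m_k:
                break                         # no viable object left
    return top, depth, random_accesses

db = {"c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
      "r": (0.5, 0.75), "a": (0.4, 0.84)}
ca_result = combined_algorithm(sorted_lists(db), min, 1, c_s=1.0, c_r=2.0)
print(ca_result)
```

Choosing cR much larger than cS makes the random-access phase rare, so the run behaves like NRA; choosing h = 1 resolves one object per round.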

39

Instance Optimality

Instance optimality statement a bit more complex

Under certain assumptions (including t = min, sum) CA is instance optimal with optimality ratio ~ 2m

40

Further Research

• Middleware Scenario:

– Better implementations of NRA

– Is large storage essential?

– Additional useful information in each list?

• How widely applicable is instance optimality?

– String Matching, Stable Marriage...

• Aggregation functions and methods in other scenarios

– Rank Aggregation of Search Engines

• P=NP?

41

More Details

See

www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html