1 Aggregation Algorithms and Instance Optimality Moni Naor Weizmann Institute Joint work with Ron...
Date posted: 21-Dec-2015
1
Aggregation Algorithms and Instance Optimality
Moni Naor
Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem
2
Aggregating information from several lists/sources
• Define the problem
• Ways to evaluate algorithms
• New algorithms
• Further Research
3
The problem
• Database D of N objects
• An object R has m fields - (x1, x2, …, xm)
– Each xi ∈ [0,1]
• The objects are given in m lists L1, L2, …, Lm
– list Li: all objects sorted by xi value.
• An aggregation function t(x1,x2,…xm)
– t(x1,x2,…xm) - a monotone increasing function
Wanted: top k objects according to t
4
Goal
• Touch as few objects as possible
• Access to object?
List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4
List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2
5
Where?
Problem arises when combining information from several sources/criteria
Concentrate on middleware complexity without changing subsystems
6
Example: Combining Fuzzy Information
Lists are results of query: "find object with color 'red' and shape 'round'"
• Subsystems for color and for shape.
– Each returns a score in [0,1] for each object
• Aggregating function t is how the middleware system should combine the two criteria
– Example: t(R=(x1,x2)) could be min(x1,x2)
7
Example: scheduling pages
Each object - page in a data broadcast system
• 1st field - # of users requesting the page
• 2nd field - longest time a user has been waiting
Combining function t - product of the two fields (geometric mean)
Goal: find the page with the largest product
8
Example: Information Retrieval
Query T1, T2, T3: find documents with largest sum of entries
[Figure: Terms × Documents matrix with rows T1, T2, …, Tn, columns D1, D2, …, Dk, and entries such as W12]
Aggregation function t is Σ xi
9
Modes of Access to the Lists
• Sequential/sorted access: obtain next object in list Li
– cost cS
• Random access: for object R and i ≤ m obtain xi
– cost cR
Cost of an execution:
cS · (# of seq. accesses) + cR · (# of random accesses)
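The cost measure can be written as a one-line helper. This is only a sketch: the function name, parameter names and the sample numbers are made up for illustration.

```python
def middleware_cost(num_sorted, num_random, c_s=1.0, c_r=1.0):
    """Cost of an execution: cS * (# sorted accesses) + cR * (# random accesses)."""
    return c_s * num_sorted + c_r * num_random


# Hypothetical run: 8 sorted accesses and 8 random accesses,
# with a random access five times as expensive as a sorted one.
example_cost = middleware_cost(num_sorted=8, num_random=8, c_s=1.0, c_r=5.0)
print(example_cost)  # 48.0
```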
11
Fagin’s Algorithm - FA
• For all lists L1, L2, …, Lm get next object in sorted order.
• Stop when there is a set of k objects that appeared in all lists.
• For every object R encountered
– Retrieve all fields x1, x2, …, xm.
– Compute t(x1, x2, …, xm)
• Return top k objects
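A minimal Python sketch of the FA steps, assuming the database is an in-memory dict from object name to its tuple of fields (so a "random access" is just a dict lookup). The function and variable names are illustrative, and the sample data is the two-list example from the Goal slide.

```python
import heapq


def fagin_algorithm(db, t, k):
    """Return (top k (object, score) pairs, sorted-access depth reached)."""
    m = len(next(iter(db.values())))
    # List i holds all objects sorted by field i, highest first.
    lists = [sorted(db, key=lambda o, i=i: db[o][i], reverse=True)
             for i in range(m)]
    seen_in = {}  # object -> set of lists in which it appeared so far
    depth = 0
    # Sorted access in parallel until k objects have appeared in all lists.
    while sum(len(s) == m for s in seen_in.values()) < k and depth < len(db):
        for i in range(m):
            seen_in.setdefault(lists[i][depth], set()).add(i)
        depth += 1
    # Random access: fetch all fields of every object encountered.
    scores = {obj: t(db[obj]) for obj in seen_in}
    return heapq.nlargest(k, scores.items(), key=lambda p: p[1]), depth


fa_db = {
    "c": (0.9, 0.2), "b": (0.8, 0.3), "s": (0.65, 0.85),
    "r": (0.5, 0.75), "a": (0.4, 0.84),
}
fa_result = fagin_algorithm(fa_db, min, k=1)
print(fa_result)  # ([('s', 0.65)], 3)
```

With t = min and k = 1, object s is the first to appear in both lists (at depth 3), and it is also the top object.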
12
Correctness of FA...
For any monotone t and any database D of objects, FA finds the top k objects.
Proof: any object that beats the k objects in the intersection must be better than them in at least one field, so it has already been seen under sorted access in that list.
13
Performance of FA
Performance: O(N^((m-1)/m)), assuming that the fields are independent.
Better performance - correlation between fields
Worse performance - negative correlation
Bad aggregating function: max
14
Goals of this work
• Improve complexity and analysis - worst case not meaningful
Instead consider Instance Optimality
• Expand the range of functions: want to handle all monotone aggregating functions
• Simplify implementation
15
Instance Optimality
𝐀 = class of algorithms,
𝐃 = class of legal inputs.
For A ∈ 𝐀 and D ∈ 𝐃 measure cost(A, D) ≥ 0.
• An algorithm A ∈ 𝐀 is instance optimal over 𝐀 and 𝐃 if there are constants c1 and c2 s.t.
for every A’ ∈ 𝐀 and D ∈ 𝐃
cost(A, D) ≤ c1 · cost(A’, D) + c2.
c1 is called the optimality ratio
16
…Instance Optimality
• Common in competitive online analysis
– Compare an online decision making algorithm to the best offline one.
• Approximation Algorithms
– Compare the size that the best algorithm can find to the one the approx. algorithm finds
In our case
– Offline Nondeterminism
17
…Instance Optimality
• We show algorithms that are instance optimal for a variety of
– Classes of algorithms
• Deterministic, Probabilistic, Approximate
– Databases
– Access cost functions
18
Guidelines for Design of Algorithms
• Format: do sequential/sorted access (with random access on other fields) until you know that you have seen the top k.
• In general: greedy gathering of information; if a query might allow you to know the top k objects, do it.
Works in all considered scenarios
19
The Threshold Algorithm - TA
• For all lists L1, L2, …, Lm get next object in sorted order.
• For each object R returned
– Retrieve all fields x1, x2, …, xm.
– Compute t(x1, x2, …, xm)
– If one of top k answers so far - remember it.
• For 1 ≤ i ≤ m let xi be bottom value seen in Li (so far)
– Define the threshold value τ to be t(x1, x2, …, xm)
• Stop when found k objects with t value ≥ τ.
– Return top k objects
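The TA loop can be sketched in Python as follows, again assuming an in-memory dict so that random access is a lookup. Names are illustrative. For brevity this sketch remembers the score of every object seen, whereas TA itself only needs to remember the current top k. The sample data is the m=2, k=1, t=min example.

```python
import heapq


def threshold_algorithm(db, t, k):
    """Return (top k (object, score) pairs, sorted-access depth reached)."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda o, i=i: db[o][i], reverse=True)
             for i in range(m)]
    scores = {}  # object -> t value, for every object seen so far
    for depth in range(len(db)):
        for i in range(m):
            obj = lists[i][depth]        # sorted access on list i
            scores[obj] = t(db[obj])     # random access for the other fields
        # Bottom value of each list so far, and the threshold built from them.
        bottom = tuple(db[lists[i][depth]][i] for i in range(m))
        tau = t(bottom)
        top = heapq.nlargest(k, scores.items(), key=lambda p: p[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, depth + 1
    return heapq.nlargest(k, scores.items(), key=lambda p: p[1]), len(db)


ta_db = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}
ta_result = threshold_algorithm(ta_db, min, k=1)
print(ta_result)  # ([('r', 0.125)], 4)
```

The run stops at depth 4: the threshold has dropped to min(0.1, 1/4) = 0.1, which is no larger than t(r) = 1/8.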
20
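Example: m=2, k=1, t is min
Objects:
c = (0.9, 1/12)
b = (0.7, 1/11)
r = (0.4, 1/8)
a = (0.1, 1/13)
s = (0.05, 3/4)
w = (0.07, 2/3)
z = (0.09, 1/2)
q = (0.08, 1/4)
Sorted access order:
List L1: c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1
List L2: s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4
Maintained Information
Step 1: top object (so far) = c, t(c) = 1/12; bottom values x1 = 0.9, x2 = 3/4; threshold = 3/4
Step 2: top object (so far) = b, t(b) = 1/11; bottom values x1 = 0.7, x2 = 2/3; threshold = 2/3
Step 3: top object (so far) = r, t(r) = 1/8; bottom values x1 = 0.4, x2 = 1/2; threshold = 0.4
Step 4: top object (so far) = r, t(r) = 1/8; bottom values x1 = 0.1, x2 = 1/4; threshold = 0.1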
21
Correctness of TA
For any monotone t and any database D of objects, TA finds the top k objects.
Proof: If object z was not seen, then
for all 1 ≤ i ≤ m: zi ≤ xi
so t(z1, z2, …, zm) ≤ t(x1, x2, …, xm), the threshold value
23
Robustness of TA
Approximation: Suppose want a (1+ε) approx. - for any R returned and R’ not returned
t(R’) ≤ (1+ε) t(R)
Modified stopping condition: Stop when found k objects with t value at least τ/(1+ε), where τ is the threshold value.
Early Stopping: can modify TA so that at any point user is
– Given current view of top k list
– Given a guarantee about approximation
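The modified stopping condition can be sketched as a small predicate. It assumes the approximation factor is written 1 + eps and the threshold value is called tau; both names, and the function itself, are illustrative only.

```python
def can_stop(top_k_scores, tau, k, eps=0.0):
    """True when k objects have t value at least tau / (1 + eps).

    eps = 0 gives the exact stopping rule of TA.
    """
    return len(top_k_scores) >= k and min(top_k_scores) >= tau / (1.0 + eps)


# Hypothetical state: best t value seen is 1/8, current threshold is 0.4.
exact = can_stop([0.125], tau=0.4, k=1)             # exact rule: keep going
relaxed = can_stop([0.125], tau=0.4, k=1, eps=3.0)  # 0.4 / 4 = 0.1 <= 0.125
print(exact, relaxed)  # False True
```

The same quantity gives the early-stopping guarantee: at any point, the ratio of tau to the kth best score seen bounds how far the current top k list can be from the true one.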
24
Instance Optimality
Intuition: Cannot stop any sooner, since the next object to be explored might have the threshold value.
But, life is a bit more delicate...
25
Wild Guesses
Wild guesses: random access for a field i of object R that has not been sequentially accessed before
• Neither FA nor TA use wild guesses
• Subsystem might not allow wild guesses
More exotic queries: jth position in ith list...
26
Instance Optimality- No Wild Guesses
Theorem: For any monotone t let
• 𝐀 be the class of algorithms that
– correctly find top k answers for every database with aggregation function t.
– do not make wild guesses
• 𝐃 be the class of all databases.
Then TA is instance optimal over 𝐀 and 𝐃
Optimality ratio is m + m²·cR/cS - best possible!
27
Proof of Optimality
Claim: If TA gets to iteration d, then any (correct) algorithm A’ must get to depth d-1
Proof: let Rmax be top object returned by TA
τ(d) ≤ t(Rmax) < τ(d-1), where τ(j) is the threshold value at depth j
There exists D’ with R’ at level d-1
R’ = (x1^(d-1), x2^(d-1), …, xm^(d-1))
where A’ fails
28
Do wild guesses help?
Aggregation function - min, k=1
Database:
Object: 1 2 … n n+1 n+2 … 2n+1
x1: 1 1 … 1 1 0 … 0
x2: 0 0 … 0 1 1 … 1
L1 : 1 2 … n n+1 … 2n+1
L2 : 2n+1 … n+1 n … 1
Wild guess: access object n+1 and top elements
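A small sketch of this database: 2n+1 objects, of which only object n+1 has value 1 in both fields. The list orders are the ones shown on the slide, and the helper name is made up. The code only checks how deep sorted access must go before object n+1 is first seen; it does not run any of the algorithms.

```python
def wild_guess_database(n):
    """Objects 1..2n+1: the first n are (1, 0), object n+1 is (1, 1), the rest are (0, 1)."""
    db = {obj: (1 if obj <= n + 1 else 0, 1 if obj >= n + 1 else 0)
          for obj in range(1, 2 * n + 2)}
    l1 = list(range(1, 2 * n + 2))        # L1: 1, 2, ..., 2n+1
    l2 = list(range(2 * n + 1, 0, -1))    # L2: 2n+1, ..., 1
    return db, l1, l2


n = 5
wg_db, l1, l2 = wild_guess_database(n)
winner = max(wg_db, key=lambda o: min(wg_db[o]))  # the only object with min = 1
depth_l1 = l1.index(winner) + 1
depth_l2 = l2.index(winner) + 1
print(winner, depth_l1, depth_l2)  # 6 6 6
```

Without wild guesses, object n+1 is reached only after n+1 sorted accesses in either list, whereas a wild guess fetches its two fields directly.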
29
Strict Monotonicity
• An aggregation function t is strictly monotone if, whenever xi < x’i for all 1 ≤ i ≤ m, then
t(x1, x2, …, xm) < t(x’1, x’2, …, x’m)
Examples: min, max, avg...
30
Instance Optimality - Wild Guesses
Theorem: For any strictly monotone t let
• 𝐀 be the class of algorithms that
– correctly find top k answers for every database.
• 𝐃 be the class of all databases with distinct values in each field.
Then TA is instance optimal over 𝐀 and 𝐃
Optimality Ratio is c · m where c = max{cR/cS, cS/cR}
31
Related Work
An algorithm similar to TA was discovered independently by two other groups
• Nepal and Ramakrishna
• Güntzer, Balke and Kiessling
No instance optimality analysis
Hence proposed modifications that are not instance optimal algorithms
Power of Abstraction?
32
Dealing with the Cost of Random Access
In some scenarios random access may be impossible
Cannot ask a major search engine for its internal score on some document
In some scenarios random access may be expensive
Cost corresponds to disk access (seq. vs. random)
Need algorithms to deal with these scenarios
• NRA - No Random Access
• CA - Combined Algorithm
33
No Random Access - NRA
March down the lists getting the next object
Maintain:
• For any object R with discovered fields S ⊆ {1, …, m}:
– W(R) = t(x1, x2, …, x|S|, 0, …, 0)
Worst (smallest) value t(R) can obtain
– B(R) = t(x1, x2, …, x|S|, x|S|+1, …, xm), where each undiscovered field is replaced by the bottom value seen so far in its list
Best (largest) value t(R) can obtain
34
…maintained information (NRA)
• Top k list, based on k largest W(R) seen so far
– Ties broken according to B values
Define Mk to be the kth largest W(R) in top k list
• An object R is viable if B(R) > Mk
Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R ∉ top list
Return the top k list
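A compact, unoptimized Python sketch of NRA under these definitions. It recomputes W and B for every seen object at each depth, which is exactly the bookkeeping cost a real implementation would want to avoid. It also treats objects not yet seen as viable while t of the bottom values exceeds Mk, which is an assumption about how the stopping rule covers unseen objects. All names are illustrative, and the sample lists are the two-list example from the Goal slide.

```python
def nra(lists, t, k):
    """lists[i]: (object, x_i) pairs sorted by x_i, highest first. Sorted access only."""
    m = len(lists)
    known = {}   # object -> {field index: value discovered so far}
    top = []
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, x = lists[i][depth]
            known.setdefault(obj, {})[i] = x
        bottom = [lists[i][depth][1] for i in range(m)]
        # W: missing fields replaced by 0; B: missing fields replaced by bottom values.
        W = {o: t([f.get(i, 0.0) for i in range(m)]) for o, f in known.items()}
        B = {o: t([f.get(i, bottom[i]) for i in range(m)]) for o, f in known.items()}
        ranked = sorted(known, key=lambda o: (W[o], B[o]), reverse=True)
        top = ranked[:k]
        if len(top) < k:
            continue
        m_k = W[top[-1]]
        viable = [o for o in ranked[k:] if B[o] > m_k]
        # An unseen object can score at most t(bottom).
        if not viable and t(bottom) <= m_k:
            return top, depth + 1
    return top, len(lists[0])


nra_l1 = [("c", 0.9), ("b", 0.8), ("s", 0.65), ("r", 0.5), ("a", 0.4)]
nra_l2 = [("s", 0.85), ("a", 0.84), ("r", 0.75), ("b", 0.3), ("c", 0.2)]
nra_result = nra([nra_l1, nra_l2], min, k=1)
print(nra_result)  # (['s'], 4)
```

Note that the returned list contains object names only: NRA can certify the top k without knowing their exact t values.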
35
Correctness
For any monotone t and any database D of objects, NRA finds the top k objects.
Proof: At any point, for all objects t(R) ≤ B(R)
Once B(R) ≤ Mk for all but top list
no other objects with t(R) > Mk
36
Optimality
Theorem: For any monotone t let
• 𝐀 be the class of algorithms that
– correctly find top k answers for every database.
– make only sequential access
• 𝐃 be the class of all databases.
Then NRA is instance optimal over 𝐀 and 𝐃
Optimality Ratio is m
37
Implementation of NRA
• Not so simple - need to update B(R) for all existing R when the bottom values x1, x2, …, xm change
• For specific aggregation functions (min) there are good data structures
Open Problem: Which aggregation functions have good data structures?
38
Combined Algorithm CA
Can combine TA and NRA
Let h = cR /cS
Maintain information as in NRA
For every h sequential accesses:
• Do m random accesses on an object from each list. Choose the top viable object for which not all fields are known
39
Instance Optimality
Instance optimality statement a bit more complex
Under certain assumptions (including t = min, sum) CA is instance optimal with optimality ratio ~ 2m
40
Further Research
• Middleware Scenario:
– Better implementations of NRA
– Is large storage essential?
– Additional useful information in each list?
• How widely applicable is instance optimality?
– String Matching, Stable Marriage...
• Aggregation functions and methods in other scenarios
– Rank Aggregation of Search Engines
• P=NP?