Optimal Top-k Generation of Attribute Combinations based on
Ranked Lists
Jiaheng Lu, Renmin University of China
Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong Du, Shan Wang, and Xinxing Chen
Motivation & Problem Statement
Goal: Select a combination of three players including forward, center, and guard positions.
CenterForward GuardF1 F2 C1 C2 G1 G2
Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris
(G02, 8.91)(G08, 8.07)(G05, 7.54)(G10, 7.52)(G03, 6.14)(G01, 5.05)(G04, 5.01)(G09, 3.34)……(G06, 3.01)
(G01, 3.81)(G06, 3.59)(G04, 3.21)(G07, 3.03)(G09, 2.07)(G11, 1.70)(G10, 1.62)(G02, 1.59)……(G08, 1.19)
(G02, 6.59)(G03, 6.19)(G04, 5.81)(G05, 4.01)(G01, 3.38)(G09, 2.25)(G06, 1.52)(G08, 1.51)……(G07, 1.00)
(G09, 7.10)(G03, 6.01)(G04, 3.79)(G08, 3.02)(G05, 2.89)(G02, 2.52)(G01, 2.00)(G10, 1.59)……(G06, 1.52)
(a) Source data of three groups
(G01, 9.31)(G07, 9.02)(G03, 8.87)(G04, 5.02)(G11, 4.81)(G08, 4.02)(G06, 4.31)(G05, 3.59)……(G09, 2.06)
(G05, 7.21)(G02, 6.01)(G06, 5.58)(G10, 5.51)(G04, 5.00)(G11, 3.09)(G01, 2.06)(G08, 2.03)……(G09, 1.98)
Methods
select players with the highest score in each group
calculate the average scores of players across all games
… …
Limitation:overlook team spirit
Game IDScore
Motivation & Problem Statement
consider their combined scores in the same game
Select the top-k combinations according to top-m aggregate scores
Top-k,m Problem
Tuple aggregation functionInstance aggregation
function
Motivation & Problem StatementTop-1,2: select the top-1 combination of players according to top-2 aggregate scores for games where they played together.
F2C1G1 is the best combination, since (21.51 + 18.76) is the highest overall score.
CenterForward GuardF1 F2 C1 C2 G1 G2
Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris
(G02, 8.91)(G08, 8.07)(G05, 7.54)(G10, 7.52)(G03, 6.14)(G01, 5.05)(G04, 5.01)(G09, 3.34)……(G06, 3.01)
(G01, 3.81)(G06, 3.59)(G04, 3.21)(G07, 3.03)(G09, 2.07)(G11, 1.70)(G10, 1.62)(G02, 1.59)……(G08, 1.19)
(G02, 6.59)(G03, 6.19)(G04, 5.81)(G05, 4.01)(G01, 3.38)(G09, 2.25)(G06, 1.52)(G08, 1.51)……(G07, 1.00)
(G09, 7.10)(G03, 6.01)(G04, 3.79)(G08, 3.02)(G05, 2.89)(G02, 2.52)(G01, 2.00)(G10, 1.59)……(G06, 1.52)
(a) Source data of three groups
(b) Top-2 aggregate scores for each combination
(G01, 9.31)(G07, 9.02)(G03, 8.87)(G04, 5.02)(G11, 4.81)(G08, 4.02)(G06, 4.31)(G05, 3.59)……(G09, 2.06)
(G05, 7.21)(G02, 6.01)(G06, 5.58)(G10, 5.51)(G04, 5.00)(G11, 3.09)(G01, 2.06)(G08, 2.03)……(G09, 1.98)
F1C1G1
(G04, 15.83)(G05, 14.81)
F1C1G2 F1C2G1 F1C2G2 F2C1G1 F2C1G2 F2C2G1 F2C2G2
(G04, 13.81)(G05, 13.69)
(G01, 16.50)(G04, 14.04)
(G01, 15.12)(G07, 12.05)
(G02, 21.51)(G05, 18.76)
(G05, 17.64)(G02, 17.44)
(G02, 17.09)(G04, 14.03)
(G02, 13.02)(G09, 12.51)
… … … … … … … …
Difference between top-k queries and top-k,m queries
Top-k Top-k,m
Return the top-k tuplesReturn the top-k
combinations of attributes
Can be transformed into a SQL
Cannot be transformed into a SQL
ApplicationXML keyword refinement
ExampleExample
Q = {DB;UC Irvine; 2002} Groups: G1 = {"DB"; "database"}, G2={"UCI";"UC Irvine"}G3 = {"2002"}.
Answer:
Q’={DB, UCI, 2002}
Consider a top-1,2 query
Application (Cont.)
• Evidence combination mining in medical databases
• Package recommendation systems
• …
Top-k,m Query ProcessingAccess Model: Sorted Accesses
(a, 9.0)(b, 8.7)(c, 8.7)(d, 7.4)
(i, 5.3)
……
(i, 8.8)
(f, 6.9)
(a, 7.5)
(d, 4.7)
(c, 7.9)
……
Top-k,m Query ProcessingAccess Model: Random Accesses
(a, 9.0)(b, 8.7)(c, 8.7)(d, 7.4)
(i, 5.3)
……
(i, 8.8)
(f, 6.9)
(a, 7.5)
(d, 4.7)
(c, 7.9)
……
Top-k,m Query ProcessingBaseline Method: ETA
Compute top-m tuples for each
combinationThreshold Algorithm
(TA)
Calculate aggregate score for each combination
Return the top-k combinations
(a, 10)(b,7.4)
……
(c, 8.3)(d, 7.1)
……
(b, 9)(a, 8)
……
(e, 8)(f, 7)
……
A1 A2 B1 B2
G1 G2
…... …...(b, 9)(a, 8)
……
(e, 8)(f, 7)
……
C1 C2
G3
…...
(A1, B1, C1) 100 (A1, B1, C2) 105
Top-k,m Query ProcessingUpper and Lower bounds Algorithm: ULA
Consider top-m seen match instances
Lower Bound
Consider threshold value and top-m match instances
Upper Bound
Compute the upper and lower bounds for
each combination
Termination condition:
k combinations meet the hit-condition
A1 A2 B1 B2
Group 1 Group 2
(G1, 9.3)
(G2, 8.3)
(G5,7.8)
(G11,7.3)
(G2, 7.9)
(G1,7.0)
(G4,8.0)
(G8,7.3)
(G8, 3.0)
(G4, 2.6)
(G4, 1.8)
(G2, 1.5)
(G11, 4.2)
(G5, 3.3)
(G2, 4.4)
(G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1
A1 A2 B1 B2
Group 1 Group 2
(G1, 9.3)
(G2, 8.3)
(G5,7.8)
(G11,7.3)
(G2, 7.9)
(G1,7.0)
(G4,8.0)
(G8,7.3)
(G8, 3.0)
(G4, 2.6)
(G4, 1.8)
(G2, 1.5)
(G11, 4.2)
(G5, 3.3)
(G2, 4.4)
(G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1U: 34.4
L: 32.5
A1B2U: 34.6
L: 22.2
A2B1U: 31.4
L: 20.5
A2B2U: 31.6
L: 9.8
L: 32.5A1B1 ( G1, 9.3+7.0 ) , ( G2, 7.9+8.3)
U: 31.4A2B1 Threshold value=7.8+7.9=15.7, 15.7*2=31.4
A1 A2 B1 B2
Group 1 Group 2
(G1, 9.3)
(G2, 8.3)
(G5,7.8)
(G11,7.3)
(G2, 7.9)
(G1,7.0)
(G4,8.0)
(G8,7.3)
(G8, 3.0)
(G4, 2.6)
(G4, 1.8)
(G2, 1.5)
(G11, 4.2)
(G5, 3.3)
(G2, 4.4)
(G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1U: 32.5
L: 32.5
A1B2U: 31.2
L: 22.2
A1B1
Optimization heuristics (1)Pruning combinations without computing the bounds
(A3,B2) is dominated by (A2,B1)
6.3<7.1 and 8.0<8.2
Optimization heuristics (2)Reducing the number of accesses
Avoiding both sorted and random accesses for specific lists
(a, 10)(b,7.4)
……
(c, 8.3)(d, 7.1)
……
(a, 6.3)(d, 4)
……
(b, 9)(a, 8)
……
(e, 8)(f, 7)
……
A1 A2 A3 B1 B2
G1 G2
(A1,B1)and(A1,B2) cannot be part of answers, all sorted accesses and random accesses on list A1 are unnecessary.
Optimization heuristics (3)Reducing the number of accesses
Reducing random accesses across two lists
(a, 10)(b,7.4)
……
(c, 8.3)(d, 7.1)
……
(b, 9)(a, 8)
……
(e, 8)(f, 7)
……
A1 A2 B1 B2
G1 G2
(e, 9)(d, 7)
……
(k, 8)(f, 7)
……
C1 C2
G3
(A1,B1,C1)and(A1,B1,C2) cannot be part of answers, random accesses between A1 and B1 are unnecessary.
Optimization heuristics (4)Reducing the number of accesses
Eliminating random accesses for specific tuples
Random access from Le to Lt for tuple x is useless
Prune dominated
combinations
Compute upper and lower bounds for unterminated combinations
Terminate combinations by reducing number of accesses
Until k combinations meet hit-condition
Top-k,m Query Processing
ULA+
Interesting theoretical results Optimality properties
Instance OptimalityInstance Optimality
If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal.
The upper bound of the optimality ratio is tightThe upper bound of the optimality ratio is tight
for every instance there exist two constants a and bsuch that cost(A) <= a*cost(A’) + b
Interesting theoretical results (Cont.) Optimality properties
No Instance Optimal AlgorithmsNo Instance Optimal Algorithms
If wild guesses are allowed, Then there is no deterministic algorithm that is instance-optimal.
Experimental Results
Experimental Setup
Language: Java; OS: Windows XP; CPU: 2.0GHz; Disk:320GB
Data sets
Experimental ResultsExperimental results on NBA and YQL datasets
ULA+ outperforms ETA by 1-2 orders of magnitude both in running time and access number.
Experimental ResultsPerformance of optimization to reduce combinations
More than 60% combinations are pruned without computing their bounds
Experimental ResultsPerformance of different optimizations
Combination of all optimizations has the most powerful pruning capability.
Experimental ResultsExperimental results on XML DBLP dataset
XULA and XULA+ perform better than XETA and scale well in both running time and number of accesses.
U. Güntzer etc, VLDB2000 S. Nepal etc, ICDE1999
Fagin etc, PODS 2001
Top-k with both random and
sorted accesses
R. Fagin etc, JCSS2003 N. Mamoulis etc, TDS2007
Fagin etc, PODS 2001
Top-k with only sorted accesses
Related Works
I. F. Ilya etc, VLDB2002Top-k with no need for exact
aggregate score
C. Li etc, SIGMOD2006 M. L. Yiu etc, DKE2008
Ad-hoc top-k queries
Related Works
N. Bruno etc, ICDE2002 K. C. C. Chang etc, SIGMOD2002
Top-k with sorted access on
restricted lists
Related Works
T. Deng, W, Fan and F. Geerts,On the Complexity of Package
Recommendation Problems PODS 2012
Top-k Package recommendation
Conclusion
• Propose a new problem called top-k,m query evaluation
• Developed a family of efficient algorithms, including ULA and ULA+
• Study the optimality properties of our algorithms
• Apply top-k,m query to the context of XML keyword query refinement
Top Related