Post on 10-May-2015
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
Institute of Applied Informatics and Formal Description Methods (AIFB)
Top-k Linked Data Query Processing
Andreas Wagner, Duc Thanh Tran, Günter Ladwig, Andreas Harth, and Rudi Studer
Introduction and Motivation
Top-k Linked Data Query Processing
Evaluation Results
INTRODUCTION & MOTIVATION
Linked Data Query Processing
Problems: Efficiency and Scalability
[Figure: a Linked Data query processing engine retrieves data from data sources (Src.), each identified by a URI, via HTTP lookup]
Top-K Query Processing
Users are usually interested in only a few results
Top-K query processing addresses the efficiency and scalability issues
Src. 1: ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.
Src. 2: ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".
Src. 3: ex:help foaf:name "Help!"; ex:song "Help!".
Query: SELECT * WHERE { ex:beatles ex:album ?album . ?album ex:song ?song . }
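The example query can be traced by hand with a minimal Python sketch. It assumes all three sources have already been retrieved and holds their triples in plain tuples; the names `SOURCES` and `evaluate` are illustrative only and not part of the presented system.

```python
# Triples of the running example, grouped by the source that serves them.
SOURCES = {
    "src1": [("ex:beatles", "foaf:name", "The Beatles"),
             ("ex:beatles", "ex:album", "ex:sgt_pepper"),
             ("ex:beatles", "ex:album", "ex:help")],
    "src2": [("ex:sgt_pepper", "foaf:name", "Sgt. Pepper"),
             ("ex:sgt_pepper", "ex:song", "Lucy")],
    "src3": [("ex:help", "foaf:name", "Help!"),
             ("ex:help", "ex:song", "Help!")],
}


def evaluate(sources):
    """Join TP1 (ex:beatles ex:album ?album) with TP2 (?album ex:song ?song)."""
    triples = [t for ts in sources.values() for t in ts]
    # Bindings for ?album from the first triple pattern.
    albums = [o for s, p, o in triples
              if s == "ex:beatles" and p == "ex:album"]
    # For each album, look up its songs (second triple pattern).
    return sorted((album, o) for album in albums
                  for s, p, o in triples
                  if s == album and p == "ex:song")


results = evaluate(SOURCES)
for album, song in results:
    print(album, song)
```

Without ranking, every source has to be loaded before the complete result is known; this is the cost that top-k processing tries to avoid.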
Contributions
Transfer top-k query processing to the Linked Data setting
Linked Data specific improvements of the top-k approach
Evaluation using real-world data
TOP-K LINKED DATA QUERY PROCESSING
Top-K Query Processing in a Linked Data Setting (1) – Requirements (1)
Source index mapping triple patterns to sources containing bindings (e.g., [1,2])
Ranking function determining the relevance of triple pattern bindings
Src. 1 (score ∈ [0,1]): ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.
Src. 2 (score ∈ [1,2]): ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".
Src. 3 (score ∈ [2,3]): ex:help foaf:name "Help!"; ex:song "Help!".
TP1: ex:beatles ex:album ?album .
TP2: ?album ex:song ?song .
Source index: TP1 → Src. 1; TP2 → Src. 2, Src. 3
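As a rough illustration of the two requirements, the source index of the example can be written as a small Python lookup table. This is only a sketch: the names `SourceEntry`, `SOURCE_INDEX`, and `relevant_sources` are hypothetical, and the score intervals are the per-source ranges shown in the figure.

```python
from typing import NamedTuple


class SourceEntry(NamedTuple):
    source: str
    low: float   # lower bound on the scores of triples in this source
    high: float  # upper bound on the scores of triples in this source


# Source index of the example: triple pattern -> sources containing bindings.
SOURCE_INDEX = {
    "TP1": [SourceEntry("src1", 0, 1)],
    "TP2": [SourceEntry("src2", 1, 2), SourceEntry("src3", 2, 3)],
}


def relevant_sources(pattern):
    """Return the sources for a pattern, highest score upper bound first."""
    return sorted(SOURCE_INDEX[pattern], key=lambda e: e.high, reverse=True)


tp2_order = [entry.source for entry in relevant_sources("TP2")]
print(tp2_order)
```

Ordering sources by their upper bound is one simple way to decide which source is most promising to load next.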
Top-K Query Processing in a Linked Data Setting (2) – Requirements (2)
Sorted access on each join input
Bindings with descending scores
Scheduling strategy
TP1: ex:beatles ex:album ?album (bindings from Src. 1, score ∈ [0,1])
TP2: ?album ex:song ?song (bindings from Src. 3, score ∈ [2,3], and Src. 2, score ∈ [1,2])
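One possible way to realize sorted access over several sources is sketched below in Python. It is an assumption-laden illustration, not the authors' scheduling strategy: sources are loaded in descending order of their score upper bound, and a binding is emitted only once no unloaded source could contain a higher score. The helper names and the score 2 for the "Lucy" triple are hypothetical (the slides only state that Src. 2 scores lie in [1,2]).

```python
import heapq


def sorted_access(entries, fetch):
    """Yield (score, triple) pairs in descending score order.

    entries: list of (source, score_upper_bound)
    fetch:   callable returning the (score, triple) pairs of a source
    """
    pending = sorted(entries, key=lambda e: e[1], reverse=True)
    heap = []  # max-heap emulated with negated scores
    i = 0
    while i < len(pending) or heap:
        # Load further sources while an unloaded source could still
        # contain a binding that beats the best buffered one.
        while i < len(pending) and (not heap or -heap[0][0] < pending[i][1]):
            for score, triple in fetch(pending[i][0]):
                heapq.heappush(heap, (-score, triple))
            i += 1
        if heap:
            neg_score, triple = heapq.heappop(heap)
            yield -neg_score, triple


# Hypothetical per-triple scores for the TP2 bindings of the example.
DATA = {
    "src3": [(3, ("ex:help", "ex:song", "Help!"))],
    "src2": [(2, ("ex:sgt_pepper", "ex:song", "Lucy"))],  # assumed score
}
loaded = []


def fetch(source):
    loaded.append(source)  # record which sources were actually retrieved
    return DATA[source]


stream = sorted_access([("src2", 2), ("src3", 3)], fetch)
first = next(stream)
loaded_after_first = list(loaded)
rest = list(stream)
```

In this sketch the first binding is available after loading only Src. 3, because Src. 2 cannot contain a score above 2.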
Top-K Query Processing in a Linked Data Setting (3) – Push Bound Rank Join (1)
Sorted access for TP1: ex:beatles ex:album ?album .
Sorted access for TP2: ?album ex:song ?song .
Scheduling strategy: load Src. 1
Src. 1: ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.
Seen triples (TP1), with scores:
1  ex:beatles ex:album ex:sgt_pepper
1  ex:beatles ex:album ex:help
Scheduling strategy: load Src. 3
Src. 3: ex:help foaf:name "Help!"; ex:song "Help!".
Seen triples (TP2), with scores:
3  ex:help ex:song "Help!"
Query bindings (output queue), with scores:
Top-K Query Processing in a Linked Data Setting (4) – Push Bound Rank Join (2)
Sorted access for TP1: ex:beatles ex:album ?album .
Sorted access for TP2: ?album ex:song ?song . (Src. 2 not yet loaded)
Seen triples (TP1), with scores:
1  ex:beatles ex:album ex:sgt_pepper
1  ex:beatles ex:album ex:help
Seen triples (TP2), with scores:
3  ex:help ex:song "Help!"
Query bindings (output queue), with scores:
4  ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
Threshold: 4
Found query binding with score ≥ threshold: STOP
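The walkthrough above can be reproduced with a simplified rank join in Python. This is a sketch under stated assumptions rather than the push bound rank join itself: it pulls bindings round-robin from two score-sorted inputs, uses the corner-bound threshold max { max_1 + min_2 , max_2 + min_1 }, and stops once k results reach the threshold. The function name and the score 2 for the "Lucy" triple are hypothetical.

```python
def rank_join_top_k(tp1, tp2, k):
    """Join two inputs sorted by descending score; stop early via a threshold.

    Returns (top-k results, number of bindings read).
    A result is (score, tp1_triple, tp2_triple), joined on
    object of TP1 == subject of TP2.
    """
    inputs = (tp1, tp2)
    seen = ([], [])
    pos = [0, 0]
    results = []
    turn = 0
    while pos[0] < len(tp1) or pos[1] < len(tp2):
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn  # skip an exhausted input
        score, triple = inputs[turn][pos[turn]]
        pos[turn] += 1
        seen[turn].append((score, triple))
        # Join the new binding with everything seen on the other input.
        for other_score, other in seen[1 - turn]:
            left, right = (triple, other) if turn == 0 else (other, triple)
            if left[2] == right[0]:
                results.append((score + other_score, left, right))
        results.sort(key=lambda r: r[0], reverse=True)
        turn = 1 - turn
        if seen[0] and seen[1]:
            # First seen score is max_i, last seen score is min_i.
            threshold = max(seen[0][0][0] + seen[1][-1][0],
                            seen[1][0][0] + seen[0][-1][0])
            if len(results) >= k and results[k - 1][0] >= threshold:
                break
    return results[:k], pos[0] + pos[1]


TP1 = [(1, ("ex:beatles", "ex:album", "ex:sgt_pepper")),
       (1, ("ex:beatles", "ex:album", "ex:help"))]
TP2 = [(3, ("ex:help", "ex:song", "Help!")),
       (2, ("ex:sgt_pepper", "ex:song", "Lucy"))]  # score 2 is assumed

top, reads = rank_join_top_k(TP1, TP2, k=1)
```

For k = 1 the sketch reads three of the four bindings: the "Lucy" triple from Src. 2 is never touched, which mirrors the early stop in the example.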
Improving the Threshold Estimation (1)
Threshold estimation:
max_i: upper bound for the scores of seen bindings of input i (seen triples for TP1, TP2)
min_i: upper bound for the scores of unseen bindings of input i
Threshold: max { max_1 + min_2 , max_2 + min_1 }
We improve the threshold estimation:
Star-shaped entity query bounds
Look-ahead bounds
Improving the Threshold Estimation (2) Star-shaped Entity Query Bounds
Observation: Results for entity queries come from one single source
Idea: Upper bound scores for triple pattern bindings via the maximal possible triple score
Src. 2 (score ∈ [1,2]): ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".
Src. 3 (score ∈ [2,3]): ex:help foaf:name "Help!"; ex:song "Help!".
Star-shaped entity query: entity ?x with predicates foaf:name and ex:song (variables ?x, ?y, ?z)
Upper bound for triple bindings (foaf:name): 3
Upper bound for triple bindings (ex:song): 3
Upper bound for entity query bindings: 3 + 3
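The arithmetic of this bound can be sketched in a few lines of Python. The function name and input layout are hypothetical; the sketch assumes each triple pattern of the star is bounded by the largest score upper bound among the sources that can answer it, and that the pattern bounds are summed.

```python
def entity_query_upper_bound(pattern_sources):
    """Sum, over all triple patterns of a star-shaped entity query,
    the maximal possible triple score for that pattern.

    pattern_sources: {pattern: [(low, high) score interval per source]}
    """
    return sum(max(high for _, high in intervals)
               for intervals in pattern_sources.values())


# Both patterns of the example can be answered by Src. 2 ([1,2]) and Src. 3 ([2,3]).
bound = entity_query_upper_bound({
    "?x foaf:name": [(1, 2), (2, 3)],
    "?x ex:song": [(1, 2), (2, 3)],
})
print(bound)  # 3 + 3
```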
Improving the Threshold Estimation (3) Look-ahead Bounds
Idea: Provide a more accurate upper bound for the scores of unseen bindings via the "next possible" score
Sorted access for TP1: ex:beatles ex:album ?album .
Sorted access for TP2: ?album ex:song ?song . (Src. 3 loaded; Src. 2 next, score ∈ [1,2])
Seen triples (TP1), with scores:
1  ex:beatles ex:album ex:sgt_pepper
1  ex:beatles ex:album ex:help
Seen triples (TP2), with scores:
3  ex:help ex:song "Help!"
Query bindings (output queue), with scores:
4  ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
Without look-ahead: max_1 = 1, min_1 = 1, max_2 = 3, min_2 = 3
Threshold: max { 1 + 3 , 1 + 3 } = 4
With look-ahead: min_2 = 2
Threshold: max { 1 + 2 , 1 + 3 } = 4
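The two threshold computations can be checked with a short Python sketch. The helper names are hypothetical, and the look-ahead rule shown here (cap the unseen bound by the upper bound of the next source to be loaded) is an assumption inferred from the example values.

```python
def threshold(max1, min1, max2, min2):
    """Upper bound on the score of any query binding not yet produced."""
    return max(max1 + min2, max2 + min1)


def look_ahead_min(last_seen_score, next_source_upper_bound):
    """Tighter bound for unseen bindings: they cannot exceed the last seen
    score nor the upper bound of the next source (assumed rule)."""
    return min(last_seen_score, next_source_upper_bound)


# Without look-ahead: min_2 is the last seen TP2 score (3).
standard = threshold(max1=1, min1=1, max2=3, min2=3)

# With look-ahead: the next TP2 source is Src. 2 with scores in [1,2].
min2 = look_ahead_min(last_seen_score=3, next_source_upper_bound=2)
improved = threshold(max1=1, min1=1, max2=3, min2=min2)

print(standard, min2, improved)
```

In this particular example the threshold stays at 4 because the term max_2 + min_1 dominates; the look-ahead bound only tightens the term that involves min_2.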
EVALUATION
Evaluation – Setting
We implemented three systems:
Push-based symmetric hash join operator [2,5]
Standard top-k operator [6]
Improved top-k operator
Query set: 20 queries (8 FedBench and 12 own queries), with varying result size (1 to ~10,000) and complexity (2 to 5 triple patterns)
Data set: ~2,000,000 triples, distributed over ~700,000 sources
Parameters: k ∈ {1, 5, 10, 20} and score distributions ∈ {uniform, normal, exponential}
Evaluation – Results (1)
Overall Results
Top-k strategies lead to a runtime improvement of 35% on average (compared to standard Linked Data processing)
Tighter bounding leads to a further improvement of 12% on average (compared to standard top-k processing)
Overview of processing times for all queries (k = 1, d = n)
Evaluation – Results (2)
Effect of K and Score Distributions
CONCLUSION
Conclusion
We showed that top-k processing techniques are applicable to the Linked Data setting.
Top-k strategies lead to significant time savings for small values of k (in our experiments 35% on average)
We showed that our improved top-k strategy leads to further runtime advantages (in our experiments 12% on average)
QUESTIONS
REFERENCES
References
[1] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In World Wide Web, 2010.
[2] G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC, 2010.
[3] M. Wu, L. Berti-Equille, A. Marian, C. M. Procopiuc, and D. Srivastava. Processing top-k join queries. Proc. VLDB Endow., pages 860–870, 2010.
[4] A. Harth, S. Kinsella, and S. Decker. Using naming authority to rank data and ontologies for web search. In ISWC, pages 277–292, 2009.
[5] G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In ESWC, 2011.
[6] K. Schnaitter and N. Polyzotis. Optimal algorithms for evaluating rank joins in database systems. ACM Trans. Database Syst., 35:6:1–6:47, 2010.
BACKUP SLIDES
Early Pruning of Partial Results
Motivation: Top-k join processing can be quite costly in terms of memory consumption
Idea: Prune partial query results that cannot contribute to a final top-k result
Star-shaped entity query: entity ?x with predicates foaf:name and ex:song (variables ?x, ?y, ?z)
Upper bound for triple bindings: 3
Currently known partial results (rank, triple pattern binding):
1  ex:sgt_pepper ex:song "Getting Better".
Maximal score: 3 + 1 = 4
Currently known top-2 results (rank, query bindings in the output queue):
6  ex:help foaf:name "Help!". ex:help ex:song "Help!" .
4  ex:sgt_pepper foaf:name "Sgt. Pepper". ex:sgt_pepper ex:song "Lucy".
Maximal score of the partial result (4) ≤ lowest score among the top-2 results (4)
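The pruning test in this example can be expressed as a small Python check. It is a sketch with hypothetical names: a partial result is dropped when even its best possible completion cannot exceed the k-th best result found so far.

```python
def can_prune(partial_score, remaining_patterns, triple_upper_bound,
              top_scores, k):
    """Return True if a partial result cannot improve the current top-k."""
    if len(top_scores) < k:
        return False  # top-k not yet filled, every result may still count
    best_possible = partial_score + remaining_patterns * triple_upper_bound
    kth_best = sorted(top_scores, reverse=True)[k - 1]
    return best_possible <= kth_best


# Partial result with score 1, one pattern still unbound, triple bound 3,
# current top-2 scores 6 and 4: maximal score 3 + 1 = 4 <= 4.
prune = can_prune(partial_score=1, remaining_patterns=1,
                  triple_upper_bound=3, top_scores=[6, 4], k=2)
print(prune)
```

Discarding such partial results early keeps the join buffers small, which addresses the memory cost named in the motivation.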