Post on 17-Feb-2017
Distributed Query Processing for Federated RDF Data Management
Olaf Görlitz
07.11.2014
Olaf Görlitz: Distributed Query Processing for Federated RDF Data Management
07.11.2014Slide 2
The Linked Open Data Cloud
Use as one large database!
Life Science Scenario
Find drugs for nutritional supplementation
SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
Linked Data Querying Paradigms
Data Warehouse
Link Traversal
Federation
Linked Data Querying Paradigms
Requirements compared for Data Warehouse, Link Traversal, and Federation:
● Query Expressiveness
● Schema Mapping
● Data Freshness
● Result Completeness
● Scalability
● Flexibility
● Availability
● Performance
Contributions
● Large Scale Information Retrieval: PINTS (Peer-to-Peer Statistics Management)
● RDF Federation & Query Optimization: SPLENDID (Distributed SPARQL Query Processing)
● Benchmarking RDF Federation Systems: SPLODGE (Linked Data Query Generation)
Görlitz, Staab: SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD'11
Görlitz, Thimm, Staab: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. ISWC'12
Görlitz, Sizov, Staab: PINTS: Peer-to-Peer Infrastructure for Tagging Systems. IPTPS'08
SPLENDID Federation
Federated Databases          Federated RDF
● Relational Schema          ● Implicit Schema, Ontologies
● Specific Data Wrappers     ● SPARQL endpoints
● Rich Data Statistics       ● Limited Statistics (voiD)
Execute complex SPARQL queries over federated RDF data sources
SPLENDID Federation
SPARQL Query → Source Selection → Query Optimization → Query Execution
SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug purl:title ?title .
}
Query plan: joins ⋈ ?drug, ⋈ ?id, ⋈ ?keggDrug, ⋈ ?keggDrug over the triple patterns
  ?drug drugbank:drugCategory category:micronutrient
  ?drug drugbank:casRegistryNumber ?id
  ?keggDrug rdf:type kegg:Drug
  ?keggDrug bio2rdf:xRef ?id
  ?keggDrug purl:title ?title
Source Selection Objectives
SPARQL Query → Source Selection → Query Optimization → Query Execution
Determine all relevant data sources
DARQ
● Explicit 'capabilities'
● Query restrictions (bound predicates)
FedX
● ASK queries + caching
– many (initial) requests
● Sub query aggregation
SPLENDID
● VoiD descriptions + ASK queries
● Sub query aggregation
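The combination of voiD descriptions and ASK queries can be illustrated with a minimal Python sketch. It is not the SPLENDID implementation: the function `select_sources`, the two index dictionaries, the stub `fake_ask`, and all index contents are hypothetical names and data chosen for illustration. The sketch looks up candidate sources by predicate (or by class for `rdf:type` patterns) and only falls back to an ASK request when a bound subject or object could prune the candidates further.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A triple pattern (s, p, o); None stands for a variable.
TriplePattern = Tuple[Optional[str], Optional[str], Optional[str]]


def select_sources(
    patterns: List[TriplePattern],
    void_index: Dict[str, Set[str]],   # predicate -> sources listing it in voiD
    type_index: Dict[str, Set[str]],   # class -> sources listing it in voiD
    all_sources: Set[str],
    ask: Callable[[str, TriplePattern], bool],
) -> Dict[TriplePattern, Set[str]]:
    """Map each triple pattern to the sources that may contribute results."""
    selected: Dict[TriplePattern, Set[str]] = {}
    for tp in patterns:
        s, p, o = tp
        if p is None:
            candidates = set(all_sources)            # unbound predicate: any source
        elif p == "rdf:type" and o is not None:
            candidates = set(type_index.get(o, set()))
        else:
            candidates = set(void_index.get(p, set()))
        # Send ASK requests only if a bound subject/object may prune sources.
        if (s is not None or o is not None) and len(candidates) > 1:
            candidates = {src for src in candidates if ask(src, tp)}
        selected[tp] = candidates
    return selected


# Hypothetical index contents and ASK stub, for illustration only.
void_index = {
    "drugbank:drugCategory": {"DrugBank"},
    "purl:title": {"KEGG", "DBpedia", "ChEBI"},
}
type_index = {"kegg:Drug": {"KEGG"}}
sources = {"DrugBank", "KEGG", "DBpedia", "ChEBI"}


def fake_ask(source: str, pattern: TriplePattern) -> bool:
    """Stand-in for a SPARQL ASK request sent to a remote endpoint."""
    return source == "DrugBank"


patterns: List[TriplePattern] = [
    (None, "drugbank:drugCategory", "category:micronutrient"),
    (None, "purl:title", None),
    (None, "rdf:type", "kegg:Drug"),
    (None, None, "category:micronutrient"),
]
selected = select_sources(patterns, void_index, type_index, sources, fake_ask)
```

In this sketch the index lookup costs no network requests, so ASK requests are spent only on patterns where the statistics leave more than one candidate.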
Source Selection Example
[Figure: data sources, each annotated with its voiD description]
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
Source annotations shown next to the triple patterns:
→ KEGG, DBpedia, ChEBI
→ KEGG
→ DrugBank
SPARQL ASK
→ DrugBank, ChEBI
→ KEGG
Source Selection Result
[Figure: query plan with joins ⋈ ?drug, ⋈ ?id, ⋈ ?keggDrug, ⋈ ?keggDrug over the five triple patterns of the example query]
Query Optimization
SPARQL Query → Source Selection → Query Optimization → Query Execution
Find best (fastest) query execution plan
DARQ
● Dynamic Programming
● Custom Statistics
● Only bound predicates
● Bind Join
FedX
● Join Order Heuristics
● No Statistics
● Join Chains
● Bind Join
SPLENDID
● Dynamic Programming
● Extended voiD statistics
● Bind + Hash Join
Dynamic Programming
● Iterate over all possible execution plans
● Compare cost (execution time)
Join operators: Bind Join, Hash Join
[Figure: query plan with joins ⋈ ?drug, ⋈ ?id, ⋈ ?keggDrug, ⋈ ?keggDrug over the five triple patterns of the example query]
Cost Model: $\mathrm{cost}_{send\text{-}query}$, $\mathrm{cost}_{receive\text{-}tuple}$, $\mathrm{card}(R(q_i))$
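A minimal Python sketch of dynamic programming over join plans is given here, under strong simplifying assumptions that are not taken from the slides: the constants `C_QUERY`, `C_TUPLE`, and `JOIN_SEL` are invented, every join uses the same selectivity, join variables are ignored (so cross products are not excluded), and a bind join is only considered when the right input is a single triple pattern. The function name `best_plan` is hypothetical.

```python
from itertools import combinations

C_QUERY = 50.0    # assumed cost of sending one (sub)query
C_TUPLE = 1.0     # assumed cost of receiving one result tuple
JOIN_SEL = 0.001  # assumed uniform join selectivity (placeholder estimate)


def best_plan(cards):
    """cards: pattern name -> estimated cardinality.

    Returns (cost, cardinality, plan) of the cheapest plan found.
    """
    names = sorted(cards)
    # Base case: one query sent, all matching tuples received.
    best = {
        frozenset([n]): (C_QUERY + cards[n] * C_TUPLE, cards[n], n)
        for n in names
    }
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            key = frozenset(subset)
            # Try every split of the subset into a left and a right input.
            for k in range(1, size):
                for left in combinations(subset, k):
                    lkey = frozenset(left)
                    rkey = key - lkey
                    lcost, lcard, lplan = best[lkey]
                    rcost, rcard, rplan = best[rkey]
                    out = max(1.0, lcard * rcard * JOIN_SEL)
                    # Hash join: both inputs are retrieved independently.
                    options = [(lcost + rcost, out, ("hash", lplan, rplan))]
                    if len(rkey) == 1:
                        # Bind join: one request per left tuple,
                        # only the matching tuples are returned.
                        bind = lcost + lcard * C_QUERY + out * C_TUPLE
                        options.append((bind, out, ("bind", lplan, rplan)))
                    for cand in options:
                        if key not in best or cand[0] < best[key][0]:
                            best[key] = cand
    return best[frozenset(names)]


cost, card, plan = best_plan({"P1": 10, "P2": 100000})
```

With these assumed constants, a small left input joined with a very large right input comes out cheaper as a bind join, while two inputs of similar size favor the hash join.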
Cardinality Estimation
[Figure: query plan with joins ⋈ ?drug, ⋈ ?id, ⋈ ?keggDrug, ⋈ ?keggDrug over the five triple patterns of the example query]
Cardinality Estimation (Triple Pattern)
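$\mathrm{card}_d(s, p, o) = |d| \cdot \mathrm{sel}_d(s) \cdot \mathrm{sel}_d(p) \cdot \mathrm{sel}_d(o), \quad d \in D$
Assuming independence of s, p, o
$\mathrm{card}_d(?, ?, ?) = \mathrm{voiD}_d \to |d|$
$\mathrm{card}_d(s, p, o) = 1$
$\mathrm{card}_d(?, p, ?) = \mathrm{voiD}_d \to p$
$\mathrm{card}_d(s, ?, ?) = \frac{\mathrm{voiD}_d \to |d|}{\mathrm{voiD}_d \to |s|}$
$\mathrm{card}_d(?, ?, o) = \frac{\mathrm{voiD}_d \to |d|}{\mathrm{voiD}_d \to |o|}$
$\mathrm{card}_d(s, ?, o) = 1$
$\mathrm{card}_d(s, p, ?) = \frac{\mathrm{voiD}_d \to p}{\mathrm{voiD}_d \to |sp|}$
$\mathrm{card}_d(?, p, o) = \frac{\mathrm{voiD}_d \to p}{\mathrm{voiD}_d \to |op|}$
$\mathrm{card}_d(?, \text{rdf:type}, T) = \mathrm{voiD}_d \to T$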
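The case distinction above can be written down as a small Python sketch. The class `VoidStats` and the function `card` are hypothetical names, not part of SPLENDID. The triple count, the `chebi:Compound` count, and the `bio:formula` count are the example values from the voiD statistics slide; the distinct subject and object counts are invented purely to make the example computable.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VoidStats:
    triples: int                       # |d|
    distinct_subjects: int             # |s|
    distinct_objects: int              # |o|
    predicate_triples: Dict[str, int] = field(default_factory=dict)   # p
    predicate_subjects: Dict[str, int] = field(default_factory=dict)  # |sp|
    predicate_objects: Dict[str, int] = field(default_factory=dict)   # |op|
    class_entities: Dict[str, int] = field(default_factory=dict)      # T


def card(stats: VoidStats, s: Optional[str], p: Optional[str],
         o: Optional[str]) -> float:
    """Estimate the result size of one triple pattern; None is a variable."""
    if p == "rdf:type" and s is None and o is not None:
        return float(stats.class_entities.get(o, 0))
    if p is None:
        if s is None and o is None:
            return float(stats.triples)
        if s is not None and o is not None:
            return 1.0
        if s is not None:
            return stats.triples / stats.distinct_subjects
        return stats.triples / stats.distinct_objects
    n = stats.predicate_triples.get(p, 0)
    if n == 0:
        return 0.0                     # predicate not listed for this source
    if s is None and o is None:
        return float(n)
    if s is not None and o is not None:
        return 1.0
    if s is not None:
        return n / stats.predicate_subjects[p]
    return n / stats.predicate_objects[p]


chebi = VoidStats(
    triples=732744,                                 # from the voiD slide
    distinct_subjects=91593,                        # hypothetical value
    distinct_objects=183186,                        # hypothetical value
    predicate_triples={"bio:formula": 39555},       # from the voiD slide
    predicate_subjects={"bio:formula": 13185},      # hypothetical value
    predicate_objects={"bio:formula": 39555},       # hypothetical value
    class_entities={"chebi:Compound": 50477},       # from the voiD slide
)
estimate = card(chebi, None, "bio:formula", None)
```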
Cardinality Estimation (Basic Graph Pattern)
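Star Pattern
[Figure: star around ?keggDrug with edges rdf:type to kegg:Drug, purl:title to ?title, and bio2rdf:xRef to rn:R01786]
$\mathrm{card}^{*}_d(P_1 \bowtie P_2 \bowtie P_3) = \min(\mathrm{card}_d(P_1), \mathrm{card}_d(P_2)) \cdot \frac{\mathrm{voiD}_d \to p_3}{\mathrm{voiD}_d \to |sp_3|}$
Path Pattern
[Figure: path connecting drugbank:Drug, ?drug, ?keggDrug, and kegg:Drug via the edges rdf:type, owl:sameAs, and rdf:type]
$\mathrm{card}^{\sim}_{d,d'}(P_1 \bowtie P_2) = \mathrm{card}_d(P_1) \cdot \mathrm{card}_{d'}(P_2) \cdot \mathrm{sel}_{d,d'}(P_1 \bowtie P_2)$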
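The two estimates can be restated as plain Python functions. This is only a sketch of the formulas as they appear on the slide; the function and parameter names are hypothetical, and the input numbers in the example are invented.

```python
def star_card(card_p1: float, card_p2: float,
              p3_triples: float, p3_subjects: float) -> float:
    """Star pattern: min of the first two patterns, scaled by p3 / |sp3|."""
    return min(card_p1, card_p2) * p3_triples / p3_subjects


def path_card(card_p1: float, card_p2: float, join_sel: float) -> float:
    """Path pattern across two datasets d and d': product times selectivity."""
    return card_p1 * card_p2 * join_sel


# Invented inputs, for illustration only.
star_estimate = star_card(100, 40, 600, 300)
path_estimate = path_card(100, 200, 0.01)
```

The star estimate assumes that the shared subject limits the result to the smaller of the first two patterns, and that each such subject contributes the average number of triples per subject for the third predicate.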
Query Optimization
SPARQL Query → Source Selection → Query Optimization → Query Execution
[Figure: optimized query plan over the five triple patterns of the example query, with joins ⋈ ?drug, ⋈B(?id), ⋈ ?keggDrug, ⋈H(?keggDrug); B marks a bind join, H a hash join]
Evaluation Methodology
Compare with state-of-the-art federation systems
– Use multiple linked datasets
– With representative characteristics
– Execute 'typical' SPARQL queries
– In a reproducible benchmark setup
FedBench
Evaluation Results
Conclusion
● Federation for Linked Open Data
– Database + Semantic Web technology
– Efficient Distributed Query Processing
– Extension of voiD statistics
● Query generation for Federation Benchmarks
● Efficient statistics management in P2P networks
Thank You
VoiD Descriptions/Statistics
● General Information
● Basic statistics: triples = 732744
● Type statistics: chebi:Compound = 50477
● Predicate statistics: bio:formula = 39555
VoiD statistics extension
State of the Art
                    DARQ                     AliBaba      FedX                          SPLENDID
Statistics          ServiceDesc              –            –                             VoiD
Source Selection    Statistics (predicates)  All sources  ASK queries                   Statistics + ASK queries
Query Optimization  DynProg                  Heuristics   Heuristics                    DynProg
Query Execution     Bind join                Bind join    Bound Join + parallelization  Bind Join + Hash Join
SPARQL limitations
● Query protocol
● Only SPARQL endpoints
● Endpoint limitations
– SPARQL version
– Result size
– Data rate
– Availability
Join Implementation
Bind Join (R1 ⋈B R2) and Hash Join (R1 ⋈H R2)
Example relations, joined on ?id:

?id  ?y
1    42
2    13
3    20
4    50
5    3

?id  ?x
1    'A'
1    'G'
4    'A'
7    'A'
7    'C'
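The difference between the two join strategies can be sketched in Python using the two example relations from this slide. The function names and the stub `probe_x`, which stands in for a remote endpoint, are hypothetical; a real federation engine would send SPARQL requests instead of scanning a local list.

```python
from collections import defaultdict

R_Y = [(1, 42), (2, 13), (3, 20), (4, 50), (5, 3)]          # (?id, ?y)
R_X = [(1, "A"), (1, "G"), (4, "A"), (7, "A"), (7, "C")]    # (?id, ?x)


def hash_join(left, right):
    """Retrieve both relations completely, then join locally on ?id."""
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def bind_join(left, probe):
    """For each left tuple, request only the tuples matching its ?id."""
    result = []
    for key, lv in left:
        for rv in probe(key):            # one remote request per binding
            result.append((key, lv, rv))
    return result


def probe_x(key):
    """Stand-in for an endpoint evaluating the pattern with ?id bound."""
    return [x for k, x in R_X if k == key]


hash_result = hash_join(R_Y, R_X)
bind_result = bind_join(R_Y, probe_x)
```

Both strategies produce the same three result rows (for ?id 1 twice and ?id 4 once). They differ in communication: the hash join transfers all ten input tuples in two requests, while the bind join issues five requests but transfers only the three matching tuples of the second relation.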
Join Cost Model
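Bind Join: $R(q_1) \bowtie_B R(q_2')$
$\mathrm{cost}_{\bowtie_B}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_1)| \cdot \mathrm{cost}_{query} + |R(q_2')| \cdot \mathrm{cost}_{tuple}$
Hash Join: $R(q_1) \bowtie_H R(q_2)$
$\mathrm{cost}_{\bowtie_H}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_2)| \cdot \mathrm{cost}_{tuple} + 2 \cdot \mathrm{cost}_{query}$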
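As a worked illustration, the two cost formulas can be evaluated directly in Python. The function names are hypothetical, and the cardinalities and unit costs in the example are invented; only the structure of the formulas comes from the slide.

```python
def cost_bind_join(card_q1, card_q2_bound, cost_tuple, cost_query):
    """|R(q1)|*cost_tuple + |R(q1)|*cost_query + |R(q2')|*cost_tuple"""
    return (card_q1 * cost_tuple
            + card_q1 * cost_query
            + card_q2_bound * cost_tuple)


def cost_hash_join(card_q1, card_q2, cost_tuple, cost_query):
    """|R(q1)|*cost_tuple + |R(q2)|*cost_tuple + 2*cost_query"""
    return card_q1 * cost_tuple + card_q2 * cost_tuple + 2 * cost_query


# Invented numbers: a selective q1 joined with a very large q2.
bind_cost = cost_bind_join(card_q1=10, card_q2_bound=50,
                           cost_tuple=1, cost_query=20)
hash_cost = cost_hash_join(card_q1=10, card_q2=100000,
                           cost_tuple=1, cost_query=20)
```

With these numbers the bind join costs 260 and the hash join 100050, because the bind join avoids transferring the full result of q2. When |R(q1)| grows, the per-binding request term |R(q1)| · cost_query dominates and the hash join becomes the cheaper option.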
SPARQL Semi Join
SPLENDID Architecture
FedBench Datasets
● Cross Domain
● Life Science
● Linked Data
Data Source Selection: Requests
Conclusion
Linked Open Data voiD
Web-scale Query Processing
SPLENDID