Distributed Query Processing for Federated RDF Data Management

Page 1: Distributed Query Processing for Federated RDF Data Management

Distributed Query Processing for Federated RDF Data Management

Olaf Görlitz

07.11.2014

Page 2: Distributed Query Processing for Federated RDF Data Management

The Linked Open Data Cloud

Use as one large database!

Page 3: Distributed Query Processing for Federated RDF Data Management

Life Science Scenario

Find drugs for nutritional supplementation

SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Page 4: Distributed Query Processing for Federated RDF Data Management

Linked Data Querying Paradigms

Data Warehouse

Link Traversal

Federation

Page 5: Distributed Query Processing for Federated RDF Data Management

Linked Data Querying Paradigms

Requirements (compared for Data Warehouse, Link Traversal, and Federation):

Query Expressiveness

Schema Mapping

Data Freshness

Result Completeness

Scalability

Flexibility

Availability

Performance

Page 6: Distributed Query Processing for Federated RDF Data Management

Contributions

● Large Scale Information Retrieval: PINTS (Peer-to-Peer Statistics Management)

● RDF Federation & Query Optimization: SPLENDID (Distributed SPARQL Query Processing)

● Benchmarking RDF Federation Systems: SPLODGE (Linked Data Query Generation)

Görlitz, Staab: SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD'11

Görlitz, Thimm, Staab: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. ISWC'12

Görlitz, Sizov, Staab: PINTS: Peer-to-Peer Infrastructure for Tagging Systems. IPTPS'08

Page 7: Distributed Query Processing for Federated RDF Data Management

SPLENDID Federation

Federated Databases          Federated RDF
● Relational Schema          ● Implicit Schema, Ontologies
● Specific Data Wrappers     ● SPARQL endpoints
● Rich Data Statistics       ● Limited Statistics (voiD)

Execute complex SPARQL queries over federated RDF data sources

Page 8: Distributed Query Processing for Federated RDF Data Management

SPLENDID Federation

SPARQL Query → Source Selection → Query Optimization → Query Execution

SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug purl:title ?title .
}

Query plan (figure):

Join operators: ⋈?drug, ⋈?id, ⋈?keggDrug, ⋈?keggDrug

Triple patterns:
  ?drug drugbank:drugCategory category:micronutrient
  ?drug drugbank:casRegistryNumber ?id
  ?keggDrug rdf:type kegg:Drug
  ?keggDrug bio2rdf:xRef ?id
  ?keggDrug purl:title ?title

Page 9: Distributed Query Processing for Federated RDF Data Management

Source Selection Objectives

SPARQL Query → Source Selection → Query Optimization → Query Execution

Determine all relevant data sources

DARQ
● Explicit 'capabilities'
● Query restrictions (bound predicates)

FedX
● ASK queries + caching; many (initial) requests
● Sub query aggregation

SPLENDID
● VoiD descriptions + ASK queries
● Sub query aggregation

Page 10: Distributed Query Processing for Federated RDF Data Management

Source Selection Example

[Figure: data sources, each with its own voiD description]

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Source annotations (in slide order):
→ KEGG, DBpedia, ChEBI → KEGG
→ DrugBank
SPARQL ASK
→ DrugBank, ChEBI
→ KEGG
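
The two-step idea on this slide (look up candidate sources in voiD statistics, then prune with SPARQL ASK) can be sketched as follows. This is only an illustration under assumptions: the index contents, the `select_sources` function, and the `ask` callback are hypothetical and are not taken from the SPLENDID code base, and the sketch does not claim to reproduce the source annotations above.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); variables start with "?"

# Hypothetical indexes, as they might be derived from voiD descriptions.
PREDICATE_INDEX: Dict[str, Set[str]] = {
    "drugbank:drugCategory": {"DrugBank"},
    "drugbank:casRegistryNumber": {"DrugBank"},
    "bio2rdf:xRef": {"KEGG", "ChEBI"},
    "purl:title": {"KEGG", "DBpedia", "ChEBI"},
}
TYPE_INDEX: Dict[str, Set[str]] = {"kegg:Drug": {"KEGG"}}
ALL_SOURCES: Set[str] = {"DrugBank", "KEGG", "DBpedia", "ChEBI"}


def is_var(term: str) -> bool:
    return term.startswith("?")


def select_sources(
    pattern: Triple,
    ask: Optional[Callable[[str, Triple], bool]] = None,
) -> Set[str]:
    """Pick candidate sources from the indexes, then optionally prune via ASK."""
    s, p, o = pattern
    if is_var(p):
        candidates = set(ALL_SOURCES)          # nothing to look up
    elif p == "rdf:type" and not is_var(o):
        candidates = set(TYPE_INDEX.get(o, set()))
    else:
        candidates = set(PREDICATE_INDEX.get(p, set()))
    # The indexes know nothing about concrete subjects/objects, so a bound
    # term is checked with one ASK request per remaining candidate source.
    bound_term = not is_var(s) or not is_var(o)
    if ask is not None and len(candidates) > 1 and bound_term:
        candidates = {src for src in candidates if ask(src, pattern)}
    return candidates
```

In a real federator the `ask` callback would send a SPARQL ASK query to the endpoint; here it is left as a parameter so the sketch stays self-contained.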

Page 11: Distributed Query Processing for Federated RDF Data Management

Source Selection Result

[Figure: query plan joining the five triple patterns on ?drug, ?id, and ?keggDrug]

Page 12: Distributed Query Processing for Federated RDF Data Management

Query Optimization

SPARQL Query → Source Selection → Query Optimization → Query Execution

Find best (fastest) query execution plan

DARQ
● Dynamic Programming
● Custom Statistics
● Only bound predicates
● Bind Join

FedX
● Join Order Heuristics
● No Statistics
● Join Chains
● Bind Join

SPLENDID
● Dynamic Programming
● Extended voiD statistics
● Bind + Hash Join

Page 13: Distributed Query Processing for Federated RDF Data Management

Dynamic Programming

● iterate over all possible execution plans
● compare cost (execution time)

Join operators: Bind Join, Hash Join

[Figure: query plan joining the five triple patterns on ?drug, ?id, and ?keggDrug]

Cost model parameters: $\mathrm{cost}_{\text{send-query}}$, $\mathrm{cost}_{\text{receive-tuple}}$, $\mathrm{card}(R(q_i))$
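
A minimal sketch of dynamic-programming plan enumeration is given below. It makes simplifying assumptions that are not on the slide: the cost of a plan is the sum of its estimated intermediate result sizes (a stand-in for the send-query and receive-tuple costs), join selectivities are given per pair of patterns, and the choice between bind join and hash join is left out. The function names `best_plan` and `join_selectivity` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet


def join_selectivity(left, right, sel: Dict[FrozenSet[str], float]) -> float:
    """Multiply the selectivities of all joinable pairs across the two sides."""
    s = 1.0
    for a in left:
        for b in right:
            s *= sel.get(frozenset((a, b)), 1.0)  # 1.0 means cross product
    return s


def best_plan(card: Dict[str, float], sel: Dict[FrozenSet[str], float]):
    """Return (cost, estimated cardinality, plan tree) for joining all patterns."""
    names = sorted(card)
    # best[subset] holds the cheapest plan found so far for that subset.
    best = {frozenset((n,)): (0.0, float(card[n]), n) for n in names}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            for k in range(1, size):
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    lcost, lcard, lplan = best[left]
                    rcost, rcard, rplan = best[right]
                    out = lcard * rcard * join_selectivity(left, right, sel)
                    cost = lcost + rcost + out  # pay for the new intermediate result
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out, (lplan, rplan))
    return best[frozenset(names)]
```

Because every subset reuses the best plans of its smaller parts, plans that start with an expensive cross product are discarded early instead of being re-evaluated for every larger plan.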

Page 14: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation

[Figure: query plan joining the five triple patterns on ?drug, ?id, and ?keggDrug]

Page 15: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation (Triple Pattern)

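$\mathrm{card}_d(s,p,o) = |d| \cdot \mathrm{sel}_d(s) \cdot \mathrm{sel}_d(p) \cdot \mathrm{sel}_d(o), \quad d \in D$

Assuming independence of s, p, o

$\mathrm{card}_d(?,?,?) = \mathrm{voiD}_d \to |d|$
$\mathrm{card}_d(s,p,o) = 1$
$\mathrm{card}_d(?,p,?) = \mathrm{voiD}_d \to p$
$\mathrm{card}_d(s,?,?) = \frac{\mathrm{voiD}_d \to |d|}{\mathrm{voiD}_d \to |s|}$
$\mathrm{card}_d(?,?,o) = \frac{\mathrm{voiD}_d \to |d|}{\mathrm{voiD}_d \to |o|}$
$\mathrm{card}_d(s,?,o) = 1$
$\mathrm{card}_d(s,p,?) = \frac{\mathrm{voiD}_d \to p}{\mathrm{voiD}_d \to |sp|}$
$\mathrm{card}_d(?,p,o) = \frac{\mathrm{voiD}_d \to p}{\mathrm{voiD}_d \to |op|}$

$\mathrm{card}_d(?, \mathrm{rdf{:}type}, T) = \mathrm{voiD}_d \to T$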
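
The case distinction above can be written down as a small lookup function. The following is a sketch only: the `VoidStats` container and the `card` function are hypothetical names. The triple, type, and predicate counts are the ChEBI figures from the "VoiD Descriptions/Statistics" slide, while the distinct subject and object counts are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VoidStats:
    triples: int                                   # |d|
    subjects: int                                  # |s|, distinct subjects
    objects: int                                   # |o|, distinct objects
    pred_triples: Dict[str, int] = field(default_factory=dict)   # voiD -> p
    pred_subjects: Dict[str, int] = field(default_factory=dict)  # voiD -> |sp|
    pred_objects: Dict[str, int] = field(default_factory=dict)   # voiD -> |op|
    type_entities: Dict[str, int] = field(default_factory=dict)  # voiD -> T


def card(st: VoidStats, s: Optional[str] = None, p: Optional[str] = None,
         o: Optional[str] = None) -> float:
    """Estimate the result size of one triple pattern; None marks a variable."""
    if s is None and p == "rdf:type" and o is not None:
        return float(st.type_entities.get(o, 0))
    if s is not None and o is not None:
        return 1.0                                 # (s,?,o) and (s,p,o)
    if p is None:
        if s is not None:
            return st.triples / st.subjects        # (s,?,?)
        if o is not None:
            return st.triples / st.objects         # (?,?,o)
        return float(st.triples)                   # (?,?,?)
    n = st.pred_triples.get(p, 0)
    if n == 0:
        return 0.0                                 # predicate unknown to this source
    if s is not None:
        return n / st.pred_subjects[p]             # (s,p,?)
    if o is not None:
        return n / st.pred_objects[p]              # (?,p,o)
    return float(n)                                # (?,p,?)


CHEBI = VoidStats(
    triples=732744,
    subjects=50477,                                # assumed value
    objects=200000,                                # assumed value
    pred_triples={"bio:formula": 39555},
    pred_subjects={"bio:formula": 39555},          # assumed value
    pred_objects={"bio:formula": 20000},           # assumed value
    type_entities={"chebi:Compound": 50477},
)
```

For example, `card(CHEBI, p="bio:formula")` returns the predicate count directly, and `card(CHEBI, p="rdf:type", o="chebi:Compound")` returns the type count.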

Page 16: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation (Basic Graph Pattern)

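Star Pattern (figure): ?keggDrug with edges rdf:type → kegg:Drug, bio2rdf:xRef → rn:R01786, purl:title → ?title

Path Pattern (figure): nodes drugbank:Drug, ?drug, ?keggDrug, kegg:Drug; edges rdf:type, owl:sameAs, rdf:type

Star pattern:
$\mathrm{card}^{*}_{d}(P_1 \bowtie P_2 \bowtie P_3) = \min(\mathrm{card}_d(P_1), \mathrm{card}_d(P_2)) \cdot \frac{\mathrm{voiD}_d \to p_3}{\mathrm{voiD}_d \to |sp_3|}$

Path pattern:
$\mathrm{card}^{\sim}_{d,d'}(P_1 \bowtie P_2) = \mathrm{card}_d(P_1) \cdot \mathrm{card}_{d'}(P_2) \cdot \mathrm{sel}_{d,d'}(P_1 \bowtie P_2)$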
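
As a worked illustration of the two estimates, the sketch below evaluates both formulas for made-up input values. The function names and all numbers are assumptions for illustration; only the formulas themselves come from the slide.

```python
def star_cardinality(card_p1: float, card_p2: float,
                     p3_triples: float, p3_subjects: float) -> float:
    """Star join on a shared subject: the smaller of the two bound patterns
    limits the number of subjects, and each subject contributes on average
    p3_triples / p3_subjects results for the third (unbound-object) pattern."""
    return min(card_p1, card_p2) * p3_triples / p3_subjects


def path_cardinality(card_p1: float, card_p2: float, join_sel: float) -> float:
    """Path join across two sources d and d': cross product of both pattern
    cardinalities scaled by an estimated join selectivity."""
    return card_p1 * card_p2 * join_sel


# Hypothetical inputs: 100 and 20 matching subjects for P1 and P2,
# 3000 triples with predicate p3 spread over 1500 distinct subjects.
star_estimate = star_cardinality(100, 20, 3000, 1500)

# Hypothetical inputs: 100 and 200 results with a join selectivity of 0.25.
path_estimate = path_cardinality(100, 200, 0.25)
```

With these inputs the star estimate is 20 subjects times 2 titles per subject, and the path estimate is one quarter of the 20000-row cross product.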

Page 17: Distributed Query Processing for Federated RDF Data Management

Query Optimization

SPARQL Query → Source Selection → Query Optimization → Query Execution

Optimized query plan (figure):

Join operators: ⋈?drug, ⋈B(?id) (bind join), ⋈?keggDrug, ⋈H(?keggDrug) (hash join)

Triple patterns:
  ?drug drugbank:drugCategory category:micronutrient
  ?drug drugbank:casRegistryNumber ?id
  ?keggDrug rdf:type kegg:Drug
  ?keggDrug bio2rdf:xRef ?id
  ?keggDrug purl:title ?title

Page 18: Distributed Query Processing for Federated RDF Data Management

Evaluation Methodology

Compare with state-of-the-art federation systems

– Use multiple linked datasets

– With representative characteristics

– Execute 'typical' SPARQL queries

– In a reproducible benchmark setup

FedBench

Page 19: Distributed Query Processing for Federated RDF Data Management

Evaluation Results

Page 20: Distributed Query Processing for Federated RDF Data Management

Conclusion

● Federation for Linked Open Data
– Database + Semantic Web technology
– Efficient Distributed Query Processing
– Extension of voiD statistics

● Query generation for Federation Benchmarks

● Efficient statistics management in P2P networks

Page 21: Distributed Query Processing for Federated RDF Data Management

Thank You

Page 22: Distributed Query Processing for Federated RDF Data Management

VoiD Descriptions/Statistics

General Information

Basic statistics: triples = 732744

Type statistics: chebi:Compound = 50477

Predicate statistics: bio:formula = 39555

Page 23: Distributed Query Processing for Federated RDF Data Management

VoiD statistics extension

Page 24: Distributed Query Processing for Federated RDF Data Management

State of the Art

                    DARQ                     AliBaba       FedX                          SPLENDID
Statistics          ServiceDesc              –             –                             VoiD
Source Selection    Statistics (predicates)  All sources   ASK queries                   Statistics + ASK queries
Query Optimization  DynProg                  Heuristics    Heuristics                    DynProg
Query Execution     Bind join                Bind join     Bound Join + parallelization  Bind Join + Hash Join

Page 25: Distributed Query Processing for Federated RDF Data Management

SPARQL limitations

● Query protocol

● Only SPARQL endpoints

● Endpoint limitations

– SPARQL version

– Result size

– Data rate

– Availability

Page 26: Distributed Query Processing for Federated RDF Data Management

Join Implementation

Bind Join (⋈B) and Hash Join (⋈H), each joining R1 and R2

?id  ?y
1    42
2    13
3    20
4    50
5    3

?id  ?x
1    'A'
1    'G'
4    'A'
7    'A'
7    'C'
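
The difference between the two operators can be sketched with the two example tables. The code is illustrative only: it assumes the first table is R1 and the second is R2, and it replaces the remote SPARQL endpoint with a local function (`lookup_r2`, a hypothetical name).

```python
from collections import defaultdict

# Example tables from the slide; the R1/R2 assignment is an assumption.
R1 = [{"id": 1, "y": 42}, {"id": 2, "y": 13}, {"id": 3, "y": 20},
      {"id": 4, "y": 50}, {"id": 5, "y": 3}]
R2 = [{"id": 1, "x": "A"}, {"id": 1, "x": "G"}, {"id": 4, "x": "A"},
      {"id": 7, "x": "A"}, {"id": 7, "x": "C"}]


def hash_join(left, right, key):
    """Fetch both inputs completely, build a hash table on one, probe with the other."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


def bind_join(left, remote_lookup, key):
    """For each left tuple, send one request with the join variable bound."""
    out = []
    for l in left:
        for r in remote_lookup(l[key]):
            out.append({**l, **r})
    return out


def lookup_r2(id_value):
    """Stand-in for a remote query with ?id bound to id_value."""
    return [r for r in R2 if r["id"] == id_value]


hash_result = hash_join(R1, R2, "id")
bind_result = bind_join(R1, lookup_r2, "id")
```

Both operators return the same three joined rows (two for ?id 1, one for ?id 4). They differ in communication: the bind join issues five lookups but only transfers matching R2 rows, while the hash join issues one request per input and transfers all ten rows.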

Page 27: Distributed Query Processing for Federated RDF Data Management

Join Cost Model

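Bind Join (⋈B) over R(q1) and R(q2'); Hash Join (⋈H) over R(q1) and R(q2)

$\mathrm{cost}_{\bowtie_B}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_1)| \cdot \mathrm{cost}_{query} + |R(q_2')| \cdot \mathrm{cost}_{tuple}$

$\mathrm{cost}_{\bowtie_H}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_2)| \cdot \mathrm{cost}_{tuple} + 2 \cdot \mathrm{cost}_{query}$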
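
The two cost formulas translate directly into code. The sketch below shows how an optimizer might pick the cheaper operator; the function names and the numeric inputs are assumptions chosen for illustration, not measured values.

```python
def bind_join_cost(card_q1: int, card_q2_bound: int,
                   cost_tuple: int, cost_query: int) -> int:
    """Receive R(q1), send one bound query per q1 tuple, receive R(q2')."""
    return card_q1 * cost_tuple + card_q1 * cost_query + card_q2_bound * cost_tuple


def hash_join_cost(card_q1: int, card_q2: int,
                   cost_tuple: int, cost_query: int) -> int:
    """Receive R(q1) and R(q2) in full, using exactly two queries."""
    return card_q1 * cost_tuple + card_q2 * cost_tuple + 2 * cost_query


def cheaper_join(card_q1, card_q2, card_q2_bound, cost_tuple, cost_query) -> str:
    bind = bind_join_cost(card_q1, card_q2_bound, cost_tuple, cost_query)
    hashed = hash_join_cost(card_q1, card_q2, cost_tuple, cost_query)
    return "bind" if bind < hashed else "hash"


# Small left input, huge unbound right input: binding avoids the large transfer.
selective = cheaper_join(card_q1=10, card_q2=100000, card_q2_bound=30,
                         cost_tuple=1, cost_query=50)

# Large left input: thousands of bound queries cost more than two full transfers.
unselective = cheaper_join(card_q1=5000, card_q2=6000, card_q2_bound=4000,
                           cost_tuple=1, cost_query=50)
```

The design point the formulas capture is that the bind join pays the per-query cost once per left tuple, so it only wins when the left input is small and binding shrinks the right input substantially.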

Page 28: Distributed Query Processing for Federated RDF Data Management

SPARQL Semi Join

Page 29: Distributed Query Processing for Federated RDF Data Management

SPLENDID Architecture

Page 30: Distributed Query Processing for Federated RDF Data Management

FedBench Datasets

● Cross Domain

● Life Science

● Linked Data

Page 31: Distributed Query Processing for Federated RDF Data Management

Data Source Selection: Requests

Page 32: Distributed Query Processing for Federated RDF Data Management

Conclusion

Linked Open Data voiD

Web-scale Query Processing

SPLENDID