FedX - Optimization Techniques for Federated Query Processing on Linked Data

of 23 /23
Optimization Techniques for Federated Query Processing on Linked Data FedX: Andreas Schwarte 1 , Peter Haase 1 , Katja Hose 2 , Ralf Schenkel 2 , and Michael Schmidt 1 1 fluid Operations AG, Walldorf 2 Max-Planck Institute for Informatics, Saarbrücken ISWC 2011 – October 26 th

Embed Size (px)

description

The final slides of our talk about FedX at the 10th International Semantic Web Conference in Bonn. For details about FedX see http://www.fluidops.com/fedx/

Transcript of FedX - Optimization Techniques for Federated Query Processing on Linked Data

  • 1. FedX:Optimization Techniques for Federated Query Processing on Linked DataISWC 2011 October 26th Andreas Schwarte1, Peter Haase1, Katja Hose2, Ralf Schenkel2, and Michael Schmidt1 1fluidOperations AG, Walldorf2Max-Planck Institute for Informatics, Saarbrcken

2. Outline Motivation Federated Query Processing FedX Query Processing Model Optimization Techniques Evaluation Conclusion & Outlook2 3. Motivation Query processing involving multiple distributed datasources, e.g. Linked Open Data cloudNew York TimesDBpedia Query both data collections in an integrated way3 4. Federated Query ProcessingFederation mediator at the server Virtual integration of (remote) data sources Communication via SPARQL protocol SPARQLData Source SPARQLFederationQuery Data Source Mediator SPARQLData Source4 5. Federated Query Processing Example Query Find US presidents and associated news articles SELECT ?President ?Party ?TopicPage WHERE {?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates .?nytPresident owl:sameAs ?President .?President dbpedia:party ?Party .?nytPresident nytimes:topicPage ?TopicPage . } 5 6. Federated Query ProcessingQuery:SELECT ?President ?Party ?TopicPage WHERE {?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates .?nytPresident owl:sameAs ?President . SPARQL...}DBpediaFederationBarack Obama Mediator George W. Bush...?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates . SPARQL Barack Obama George W. Bush NYTimes ...6 7. Federated Query ProcessingQuery: SELECT ?President ?Party ?TopicPage WHERE { ?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates . ?nytPresident owl:sameAs ?President .SPARQL ... } DBpedia Federationyago:ObamaMediator?nytPresident owl:sameAs Barack Obama .SPARQLBarack ObamaInput:George W. Bush NYTimes...Barack Obama, yago:Obama nyt:ObamaOutput:Barack Obama, nyt:Obama 7 8. Federated Query ProcessingQuery: SELECT ?President ?Party ?TopicPage WHERE { ?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates . SPARQL ... } DBpedia FederationMediator ?nytPresident owl:sameAs George W. Bush . SPARQL Barack ObamaInput: George W. BushNYTimes ... Barack Obama, yago:Obama nyt:BushOutput: Barack Obama, nyt:Obama George W. Bush, nyt:Bush ... and so on for the other intermediatemappings and triple patterns ... 8 9. FedX Query Processing ModelScenario: Efficient SPARQL query processing on multiple distributed sources Data sources are known and accessible as SPARQL endpoints FedX is designed to be fully compatible with SPARQL 1.0 No a-priori knowledge about data sources No local preprocessing of the data sources required No need for pre-computed statistics On-demand federation setup9 10. Challenges to Federated Query Processing1) Involve only relevant sources in the evaluation Avoid: Subqueries are sent to all sources, although potentially irrelevant2) Compute joins close to the data Avoid: All joins are executed locally in a nested loop fashion3) Reduce remote communication Avoid: Nested loop join that causes many remote requests10 11. Optimization Techniques1. Source Selection:Idea:Triple patterns are annotated with relevant sources Sources that can contribute information for a particular triple pattern SPARQL ASK requests in conjunction with a local cache After a warm-up period the cache learns the capabilities of the data sources During Source Selection remote requests can be avoided2. Exclusive Groups:Idea:Group triple patterns with the same single relevant source Evaluation in a single (remote) subquery Push join to the endpoint11 12. Optimization Techniques (2)Example: Source Selection + Exclusive GroupsSELECT ?President ?Party ?TopicPage WHERE {Source Selection ?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates . @ DBpedia ?President dbpedia:party ?Party [email protected] DBpedia Exclusive Group ?nytPresident owl:sameAs ?President [email protected] DBpedia, NYTimes ?nytPresident nytimes:topicPage ?TopicPage . @ NYTimes} Advantages: Avoid sending subqueries to sources that are not relevant Delegate joins to the endpoint by forming exclusive groups (i.e. executing the respective patterns in a single subquery) 12 13. Optimization Techniques (3)3. Join Order:Idea:Iteratively determine the join order based on count-heuristic: Count free variables of triple patterns and groups Consider "resolved" variable mappings from earlier iteration4. Bound Joins:Idea:Compute joins in a block nested loop fashion: Reduce the number of requests by "vectored" evaluation of a set of input bindings Renaming and Post-Processing technique for SPARQL 1.013 14. Optimization Techniques (4)Example: Bound JoinsSELECT ?President ?Party ?TopicPage WHERE { ?President rdf:type dbpedia:PresidentsOfTheUnitedStates . ?President dbpedia:party ?Party . ?nytPresident owl:sameAs ?President . ?nytPresident nytimes:topicPage ?TopicPage .}Assume that the following intermediate results have been computed as input for the last triple pattern Block Input Before (NLJ) Barack ObamaSELECT ?TopicPage WHERE { Barack Obama nytimes:topicPage ?TopicPage } George W. BushSELECT ?TopicPage WHERE { George W. Bush nytimes:topicPage ?TopicPage } Now: Evaluation in a single remote request using a SPARQL UNION construct + local post processing (SPARQL 1.0) 14 15. Optimization Example1 SPARQL Query 2 Unoptimized Internal RepresentationCompute Micronutrients using Drugbank and KEGG 4x Local JoinSELECT ?drug ?title WHERE { = ?drug drugbank:drugCategory drugbank-cat:micronutrient . ?drug drugbank:casRegistryNumber ?id . 4x NLJ ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug dc:title ?title .}3Optimized Internal RepresentationExlusive Group Remote Join 15 16. EvaluationBased on FedBench benchmark suite 14 queries from the Cross Domain (CD) and Life Science (LS) collections Real-World Data from the Linked Open Data cloud Federation with 5 (CD) and 4 (LS) data sources Queries vary in complexity, size, structure, and sources involvedBenchmark environment HP Proliant 2GHz 4Core, 32GB RAM 20GB RAM for server (federation mediator) Local copies of the SPARQL endpoint to ensure reproducibility andreliability of the service Provided by the FedBench Framework16 17. Evaluation (2)a) Evaluation times of Cross Domain (CD) and Life Science (LS) queries 17 18. Evaluation (3)b) Number of requests sent to the endpointsRuntimesAliBaba: >600s DARQ: >600s FedX: 0.109sRuntimesAliBaba: >600s DARQ: 133s FedX: 1.4s 18 19. FedX The Bigger Picture Information Workbench:Integration of Virtualized Data Sources as a ServiceApplication LayerSemantic WikiCollaboration Reporting & AnalyticsVisual ExplorationVirtualization Layer Transparent & On-Demand Integration of Data SourcesData LayerSPARQLSPARQLSPARQLEndpointEndpointEndpoint Data Registries Metadata CKAN, data.gov, etc.Registry Data Source Data Source+ Enterprise Data Data Source19 20. Conclusion and OutlookFedX: Efficient SPARQL query processing in federated setting Available as open source software: http://www.fluidops.com/fedx Implementation as a Sesame SAIL Novel join processing strategies, grouping techniques, source selection Significant improvement of query performance compared to state-of-the-art systems On-demand federation setup without any preprocessing of data sources Compatible with existing SPARQL 1.0 endpointsOutlook SPARQL 1.1 Federation Extension in Sesame 2.6 Ongoing integration of FedX optimization techniques into Sesame Federation Extensions in FedX SERVICE keyword for optional user input to improve source selection BINDINGS clauses for bound joins (for SPARQL 1.1 endpoints) (Remote) statistics to improve join order (e.g. VoID) 20 21. Visit our exhibitor boothor arrange a private live demo!Further information on FedXhttp://www.fluidops.com/fedx THANK YOUCONTACT:fluid OperationsAltrottstr. 31Walldorf, GermanyEmail: [email protected]: www.fluidOps.comTel.: +49 6227 3846-527 22. Query Cross Domain 3 (CD3)Find US presidents, their party and associated newsSELECT ?president ?party ?page WHERE {?president .?president .?president ?party .?x ?page .?x ?president .} 23. Query Life Science 3 (LS3) Find drugs and interaction drugs (+ the effect)SELECT ?Drug ?IntDrug ?IntEffect WHERE {?Drug .?y ?Drug .?Int ?y .?Int ?IntDrug .?Int ?IntEffect .}