balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information


DESWeb 2014, ICDE 2014, Chicago IL, USA, March 3


Kai Schlegel, Florian Stegmaier, Sebastian Bayerl, Michael Granitzer, Harald Kosch


Outline

• Motivation
• SPARQL Rewriting & Federation
• Intermediate Results

supported by the European Commission under the Seventh Framework Program


“Linked Data is the heart of Semantic Web” - W3C Semantic Web Group


Huge Potential!


Developing with Linked Open Data


Motivation

• Easy access to Linked Data
  • Query Linked Open Data with SPARQL
  • Plethora of tools available
• Problems:
  • Business oriented
  • Complex setup
  • Maintenance
  • „Paper-only“
  • Not developer friendly
• Simple and „instant“ SPARQL Query Federation (-as-a-Service)

Nothing-as-a-Service


Querying LOD

• How to get information about the German city „Passau“?

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

[Figure: the SPARQL query is sent to de.dbpedia.org, which returns RDF: relations, coordinates, leader, etc.]

What about the population?

• Problem: LOD is not a single database!
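As a minimal sketch of what "sending the query to one endpoint" means on the wire, the snippet below builds a SPARQL Protocol request URL (query passed as the `query` parameter of a GET request). The endpoint path `/sparql` is an assumption; the slide only names the host de.dbpedia.org. No network call is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint URL; only the host name appears on the slide.
ENDPOINT = "http://de.dbpedia.org/sparql"

QUERY = "SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }"

def build_request_url(endpoint: str, query: str) -> str:
    """Return the URL for a SPARQL Protocol 'query via GET' request."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again yields the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

The request only reaches this one dataset, which is why facts stored elsewhere (such as the population) are missing from the result.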


Distributed Querying!

• Problem: Selection of appropriate endpoints
• Send query to some endpoints and aggregate the results?

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

[Figure: the same SPARQL query is sent to de.dbpedia.org and to linkedgeodata.org; linkedgeodata.org answers "WHAT?"]


Misunderstanding: Co-Referencing

• Problem: Different identifiers for the same semantic concept

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

[Figure: the same SPARQL query is sent to de.dbpedia.org and to linkedgeodata.org; linkedgeodata.org answers "WHAT?"]

Known problem in linguistics:

“It’s a spud!” “What?” “I mean potato!”

Co-Referencing: Multiple expressions refer to the same thing.
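The misunderstanding can be illustrated with a toy model, sketched here under loud assumptions: the two dictionaries stand in for the endpoints, and the predicate/object pairs are placeholders, not real data. Only the two subject URIs come from the slides.

```python
# Toy stand-ins for two endpoints. Each maps a subject URI to a list of
# (predicate, object) pairs. The triples are placeholders, not real data.
DBPEDIA_DE = {
    "http://de.dbpedia.org/resource/Passau": [("ex:somePredicate", "ex:someValue")],
}
LINKEDGEODATA = {
    "http://linkedgeodata.org/triplify/node240057351": [("ex:otherPredicate", "ex:otherValue")],
}

def describe(dataset, subject):
    """Mimic SELECT ?p ?o WHERE { <subject> ?p ?o. } against one dataset."""
    return dataset.get(subject, [])

dbpedia_uri = "http://de.dbpedia.org/resource/Passau"
lgd_uri = "http://linkedgeodata.org/triplify/node240057351"

# Same identifier sent everywhere: the second endpoint does not know it.
same_uri_everywhere = describe(DBPEDIA_DE, dbpedia_uri) + describe(LINKEDGEODATA, dbpedia_uri)

# Identifier rewritten to the endpoint's own synonym: both endpoints answer.
with_synonym = describe(DBPEDIA_DE, dbpedia_uri) + describe(LINKEDGEODATA, lgd_uri)

print(len(same_uri_everywhere), len(with_synonym))
```

The second endpoint only contributes once the subject is replaced by the identifier it uses itself, which is the gap that co-reference information closes.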


Problem = Solution?

SPARQL-based crawling of co-reference information

Exploit co-reference information for

• accomplishing immediate SPARQL rewriting
• performing endpoint selection
• executing automatic query federation

Basic idea: Unifying distributed co-reference information

Main principle: Semantic entities over identifiers!


Components

balloon toolsuite


balloon Overflight

• SPARQL based crawling of LOD endpoints
  • Query: Ask for subjects and objects which are related by a special predicate
• Simplified global view on
  • Equivalence: owl:sameAs, skos:exactMatch, coref:coreferenceData, ...
• Graph database Neo4j
  • Equivalence cluster: Multiple synonym URIs representing the same semantic entity, including provenance
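The crawl-and-cluster idea above can be sketched as follows. This is an illustration, not the balloon implementation: it swaps the Neo4j graph database for an in-memory union-find, the function names are hypothetical, and the two example statements (including their direction and their source) are assumptions built from the URIs used elsewhere in these slides.

```python
from collections import defaultdict

def crawl_query(predicate_uri):
    """Query asking an endpoint for all subject/object pairs of one predicate."""
    return f"SELECT ?s ?o WHERE {{ ?s <{predicate_uri}> ?o. }}"

def build_clusters(statements):
    """Group URIs linked by co-reference statements into equivalence clusters.

    statements: list of (subject_uri, object_uri, source_endpoint).
    Returns a list of (set_of_uris, set_of_source_endpoints) pairs.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    # Merge the two sides of every statement into one set.
    for s, o, _ in statements:
        parent[find(s)] = find(o)

    # Collect members and provenance per cluster root.
    uris = defaultdict(set)
    provenance = defaultdict(set)
    for s, o, source in statements:
        root = find(s)
        uris[root].update((s, o))
        provenance[root].add(source)
    return [(uris[root], provenance[root]) for root in uris]

# Assumed example statements, as they might be crawled from linkedgeodata.org.
STATEMENTS = [
    ("http://linkedgeodata.org/triplify/node240057351",
     "http://de.dbpedia.org/resource/Passau", "linkedgeodata.org"),
    ("http://linkedgeodata.org/triplify/node240057351",
     "http://rdf.freebase.com/ns/m.01h5td", "linkedgeodata.org"),
]

print(crawl_query("http://www.w3.org/2002/07/owl#sameAs"))
for members, sources in build_clusters(STATEMENTS):
    print(sorted(members), sorted(sources))
```

Keeping the source endpoint next to each cluster is what later allows provenance based endpoint selection.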


balloon Fusion

SPARQL Federation setup using co-reference information

SPARQL Transformation for each BGP (basic graph pattern):

1. Determine synonym URIs
2. Select suitable endpoints
3. Adapt sub-queries to endpoints
4. Federated querying

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

1. Determine synonym URIs

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
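A minimal sketch of this step, assuming the equivalence clusters are already available as plain sets. The cluster content is taken from the URIs in the example queries, and the regular expression is a naive way to find IRIs written as `<...>`; a real rewriter would parse the query instead.

```python
import re

# Assumed cluster content, taken from the URIs used in the example queries.
CLUSTERS = [
    {
        "http://de.dbpedia.org/resource/Passau",
        "http://rdf.freebase.com/ns/m.01h5td",
        "http://linkedgeodata.org/triplify/node240057351",
    },
]

# Naive IRI extraction: anything between < and > without whitespace.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")

def synonyms_in_query(query, clusters):
    """Map every IRI found in the query to its set of synonym URIs."""
    result = {}
    for iri in IRI_PATTERN.findall(query):
        # Unknown IRIs form a cluster of their own.
        cluster = next((c for c in clusters if iri in c), {iri})
        result[iri] = set(cluster)
    return result

QUERY = "SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }"
for iri, synonyms in synonyms_in_query(QUERY, CLUSTERS).items():
    print(iri, "->", sorted(synonyms))
```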


2. Select suitable endpoints

• Provenance based selection (PBS)
  • Endpoints which are involved in cluster composition
• Namespace based selection (NBS)
  • Prefix and namespace matching of synonym URIs

Summarized: origin of co-reference information and origin of synonym URIs

2. Select suitable endpoints (2)

Assumption:

• Provenance information only contains „linkedgeodata.org“ as co-reference origin
• Namespaces for freebase and dbpedia available (datahub.io)

Selected endpoints:

• PBS: Linked-Geo-Data endpoint
• NBS: DBPedia endpoint
• NBS: Freebase endpoint
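The two selection strategies can be sketched as one small function. The endpoint URLs are the ones that appear in the federated query later in these slides; the namespace prefixes in `NAMESPACES` and the function name are assumptions made for illustration.

```python
LGD_ENDPOINT = "http://linkedgeodata.org/sparql/"
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"
FREEBASE_ENDPOINT = "http://www.freebase.com/base/sparql"

# Assumed namespace registry (the slides name datahub.io as its source).
NAMESPACES = {
    "http://de.dbpedia.org/resource/": DBPEDIA_ENDPOINT,
    "http://rdf.freebase.com/ns/": FREEBASE_ENDPOINT,
}

def select_endpoints(cluster_uris, provenance_endpoints, namespaces):
    """Return (PBS endpoints, NBS mapping of synonym URI to endpoint)."""
    # PBS: every endpoint that contributed statements to the cluster.
    pbs = set(provenance_endpoints)
    # NBS: endpoints whose registered namespace prefixes a synonym URI.
    nbs = {}
    for uri in cluster_uris:
        for namespace, endpoint in namespaces.items():
            if uri.startswith(namespace):
                nbs[uri] = endpoint
    return pbs, nbs

CLUSTER = {
    "http://de.dbpedia.org/resource/Passau",
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
}

pbs, nbs = select_endpoints(CLUSTER, {LGD_ENDPOINT}, NAMESPACES)
print(sorted(pbs))
print(sorted(nbs.items()))
```

Under the slide's assumption this yields one PBS endpoint (Linked-Geo-Data) and two NBS endpoints (DBPedia and Freebase).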


3. Adapt sub-queries to endpoints

Original query:

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

NBS: DBPedia endpoint

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

NBS: Freebase endpoint

SELECT ?p ?o WHERE { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. }

PBS: Linked-Geo-Data endpoint

SELECT ?p ?o WHERE { { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } UNION { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. } UNION { <http://de.dbpedia.org/resource/Passau> ?p ?o. } }
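A minimal sketch of the adaptation, assuming plain string substitution is enough for the single-triple-pattern query used in the example. Both function names are hypothetical, and `rewrite_for_pbs` only covers this one query shape.

```python
def rewrite_for_nbs(query, original_uri, synonym_uri):
    """NBS endpoint: swap the original URI for the synonym in its namespace."""
    return query.replace(f"<{original_uri}>", f"<{synonym_uri}>")

def rewrite_for_pbs(synonym_uris):
    """PBS endpoint: ask for all synonyms at once with a UNION construct."""
    branches = " UNION ".join(f"{{ <{uri}> ?p ?o. }}" for uri in synonym_uris)
    return f"SELECT ?p ?o WHERE {{ {branches} }}"

QUERY = "SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }"
SYNONYMS = [
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
    "http://de.dbpedia.org/resource/Passau",
]

freebase_query = rewrite_for_nbs(
    QUERY, "http://de.dbpedia.org/resource/Passau", "http://rdf.freebase.com/ns/m.01h5td"
)
pbs_query = rewrite_for_pbs(SYNONYMS)
print(freebase_query)
print(pbs_query)
```

The PBS endpoint gets every synonym because it is known to hold co-reference links, but not which identifier it uses for its own data.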


4. Federated Querying

• W3C SPARQL 1.1 Federated Query Extension (SERVICE)
  • (Partial) Query can be executed against a remote SPARQL endpoint
  • Distributed sub-queries don’t contain SPARQL 1.1 features

SELECT ?p ?o WHERE {
  { SERVICE <http://dbpedia.org/sparql> { <http://de.dbpedia.org/resource/Passau> ?p ?o. } }
  UNION
  { SERVICE <http://www.freebase.com/base/sparql> { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } }
  UNION
  { SERVICE <http://linkedgeodata.org/sparql/> { { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } UNION { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. } UNION { <http://de.dbpedia.org/resource/Passau> ?p ?o. } } }
}
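Assembling the federated query from per-endpoint sub-patterns can be sketched as below. The function name `federate` is hypothetical; the endpoint URLs and patterns are the ones from the example. Each SERVICE clause is wrapped in its own group so that the UNION operands are valid group graph patterns.

```python
def federate(subpatterns):
    """Combine (endpoint_url, graph_pattern) pairs into one SERVICE/UNION query."""
    branches = [
        f"{{ SERVICE <{endpoint}> {{ {pattern} }} }}"
        for endpoint, pattern in subpatterns
    ]
    return "SELECT ?p ?o WHERE { " + " UNION ".join(branches) + " }"

SUBPATTERNS = [
    ("http://dbpedia.org/sparql",
     "<http://de.dbpedia.org/resource/Passau> ?p ?o."),
    ("http://www.freebase.com/base/sparql",
     "<http://rdf.freebase.com/ns/m.01h5td> ?p ?o."),
    ("http://linkedgeodata.org/sparql/",
     "{ <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } UNION "
     "{ <http://linkedgeodata.org/triplify/node240057351> ?p ?o. } UNION "
     "{ <http://de.dbpedia.org/resource/Passau> ?p ?o. }"),
]

federated_query = federate(SUBPATTERNS)
print(federated_query)
```

The resulting string can be submitted to any SPARQL 1.1 processor that supports SERVICE, which then contacts the remote endpoints itself.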


Optimizations (ongoing)

• Endpoint status check
  • Check routine in terms of availability and latency
• Minimize sub-queries
  • Group sub-queries with common endpoint
  • Push join to endpoint
• SPARQL Features
  • Condense PBS UNION-construct of synonym URIs
  • SPARQL 1.1 VALUES or FILTER with IN operator
  • Not well implemented in Linked Data endpoints
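Two of the optimizations listed above can be sketched with hypothetical helpers: grouping sub-patterns that target the same endpoint, and condensing the PBS UNION construct into a SPARQL 1.1 VALUES block or a FILTER with IN. The variable name `?s` is an assumption introduced for the condensed form.

```python
from collections import defaultdict

def group_by_endpoint(subpatterns):
    """Collect all graph patterns that target the same endpoint."""
    grouped = defaultdict(list)
    for endpoint, pattern in subpatterns:
        grouped[endpoint].append(pattern)
    return dict(grouped)

def condense_with_values(uris):
    """Replace the UNION over synonym URIs with a VALUES block."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return f"SELECT ?p ?o WHERE {{ VALUES ?s {{ {values} }} ?s ?p ?o. }}"

def condense_with_filter_in(uris):
    """Alternative form using FILTER with the IN operator."""
    items = ", ".join(f"<{uri}>" for uri in uris)
    return f"SELECT ?p ?o WHERE {{ ?s ?p ?o. FILTER (?s IN ({items})) }}"

SYNONYMS = [
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
    "http://de.dbpedia.org/resource/Passau",
]

print(condense_with_values(SYNONYMS))
print(condense_with_filter_in(SYNONYMS))
```

Both condensed forms need SPARQL 1.1 support on the remote endpoint, which is the limitation the last bullet points at.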


balloon Overflight Results

Results from a sounding balloon

balloon toolsuite


Statistics

• Datahub.io: Linked Open Data Cloud catalog
  • 337 datasets in total
  • 237 expose a SPARQL endpoint
  • 112 successfully queried for co-reference information
• Balloon Dataset (first run)
  • 17.6M co-reference statements
  • 22.4M distinct URLs
  • 8.4M equivalence clusters (~ 2.68 identifiers per cluster)
• Pending Analysis
  • Distribution of cluster sizes, number of different hosts per cluster
  • Main representative per cluster & False-Friends
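The ratios behind these figures can be recomputed from the rounded numbers on the slide. This is a quick arithmetic check, not additional results; the variable names are made up for the sketch.

```python
datasets_total = 337
with_sparql_endpoint = 237
successfully_queried = 112
distinct_urls = 22.4e6   # rounded figure from the slide
clusters = 8.4e6         # rounded figure from the slide

endpoint_share = with_sparql_endpoint / datasets_total
success_share = successfully_queried / with_sparql_endpoint
ids_per_cluster = distinct_urls / clusters

print(f"datasets exposing a SPARQL endpoint: {endpoint_share:.1%}")
print(f"endpoints successfully queried:      {success_share:.1%}")
print(f"identifiers per cluster:             {ids_per_cluster:.2f}")
```

With the rounded inputs the last ratio comes out at about 2.67; the slide's ~2.68 was presumably computed from the unrounded counts.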


Open Source:

• Demo, information and sources available (MIT License)
• X as a Service
  • SPARQL Rewriting (HTTP API)
  • Query Federation (SPARQL)

http://schlegel.github.io/balloon


Summary:

• SPARQL-based crawling of distributed co-reference information
• Exploit co-reference information for SPARQL federation

Single Point of Access


Any questions?

“Research is formalized curiosity. It is poking and prying with a purpose.” - Zora Neale Hurston