Web of Data Usage Mining

Markus Luczak-Roesch @mluczak | http://markus-luczak.de

What you should learn:

•  describe the architectural differences between content negotiation and Linked Data queries;

•  develop applications that use different strategies to consume Linked Data;

•  develop usage mining methods that exploit the atomic parts of the SPARQL query language.

Linked Data principles

1.  Use URIs as names for “Things” (resources).

2.  Use HTTP URIs to allow the access to resources on the Web.

3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL).

4.  Set RDF links to resources published by other parties to allow the discovery of more resources.

http://dbpedia.org/resource/Berlin → http://dbpedia.org/page/Berlin | http://dbpedia.org/data/Berlin

yago-res:Berlin (S) owl:sameAs (P) dbpedia:Berlin (O)

http://www.w3.org/DesignIssues/LinkedData.html

Content Negotiation
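
A minimal sketch of the client side of content negotiation, assuming Python's standard `urllib`. The helper name `negotiated_request` is hypothetical; the requests are only built, not sent, so the example runs offline.

```python
from urllib.request import Request


def negotiated_request(uri: str, media_type: str) -> Request:
    """Build (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# The same resource URI, two different requested representations.
html_req = negotiated_request("http://dbpedia.org/resource/Berlin", "text/html")
rdf_req = negotiated_request("http://dbpedia.org/resource/Berlin", "application/rdf+xml")

print(html_req.get_header("Accept"))
print(rdf_req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would let the server decide, based on the Accept header, which document describing the resource to deliver.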

Linked Data exploits RDF

<http://markus-luczak.de#me> foaf:name “Markus Luczak-Roesch”

<http://markus-luczak.de#me> foaf:knows <http://hannes.muehleisen.org#me>

Linked Data vocabularies

•  Vocabulary reuse:
–  Geo
–  FOAF
–  GoodRelations
–  SIOC
–  DOAP
–  …

•  Vocabulary development:
–  Thing
   •  Person
      –  OfficeHolder
      –  …
   •  …

http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/OfficeHolder
http://xmlns.com/foaf/0.1/knows

Linked Data vocabularies

•  Mixing:
–  Geo
–  FOAF
–  Dublin Core
–  DBpedia Ontology
–  ...

http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://dbpedia.org/ontology/leader
http://dbpedia.org/ontology/City

Linked Data is self-descriptive

[Figure: RDF graph spanning an instance level and a schema level. Nodes: int:resA, int:resB, ext:resA, ont:ClassA, foaf:Person, foaf:Agent, literal “ABC”. Edge labels: foaf:name, rdf:type, owl:sameAs, owl:equivalentClass, rdfs:subClassOf.]

u_id   firstname   surname
45     Markus      Luczak-Roesch
…      …           …

<http://markus-luczak.de#me> rdf:type foaf:Person
<http://markus-luczak.de#me> foaf:name “Markus Luczak-Roesch”

c_id   city     country   inhabitants
67     Berlin   Germany   3.375.222
…      …        …         …

dbpedia:Berlin dbp:population “3.375.222”

<http://markus-luczak.de#me> rdf:type foaf:Person
<http://markus-luczak.de#me> foaf:name “Markus Luczak-Roesch”
<http://markus-luczak.de#me> dbp:birthPlace dbpedia:Berlin
dbpedia:Berlin dbp:population “3.375.222”

<http://markus-luczak.de#me> foaf:basedNear <http://markus-luczak.de/res/Soton>
<http://markus-luczak.de#me> dbp:birthPlace dbpedia:Berlin
dbpedia:Berlin skos:subject dbpedia:Cities_in_Europe
dbpedia:Southampton skos:subject dbpedia:Cities_in_Europe
<http://markus-luczak.de/res/Soton> rdfs:seeAlso dbpedia:Southampton

<http://markus-luczak.de#me> foaf:basedNear <http://markus-luczak.de/res/Soton>
<http://markus-luczak.de/res/Soton> rdfs:seeAlso dbpedia:Southampton
<http://markus-luczak.de#me> rdf:type foaf:Person
foaf:Person owl:equivalentClass dbp:Person
dbpedia:Benny_Hill rdf:type dbp:Person
dbpedia:Benny_Hill dbp:birthPlace dbpedia:Southampton

Linked Data Infrastructure

Image source: Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.

Consuming Linked Data

•  stateless
•  request-response

[Figure: sequence diagram over time t between client and server: open connection, request, response, close connection (TCP lifecycle). Derived from R. Tolksdorf.]

Consuming Linked Data

Client:

    GET / HTTP/1.1
    User-Agent: Mozilla/5.0 … Firefox/10.0.3
    Host: markus-luczak.de:80
    Accept: */*

Server:

    HTTP/1.1 200 OK
    Server: Apache/2.0.49
    Content-Language: en
    Content-Type: text/html
    Content-length: 2990

    <!DOCTYPE html> <html xml:lang="en" …

(derived from R. Tolksdorf)

Consuming Linked Data

Client → HTTP GET → Server

Information Resource: http://example.com/content/index

•  Representation 1: index.html
•  Representation 2: index.rdf
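
How a server might pick one of the two representations can be sketched as follows. This is a simplified illustration, not a full implementation of HTTP content negotiation: the function name `choose_representation` and the fallback to HTML are assumptions, and q-values in the Accept header are ignored.

```python
# Representations of the information resource from the slide.
REPRESENTATIONS = {
    "text/html": "index.html",
    "application/rdf+xml": "index.rdf",
}


def choose_representation(accept_header: str, default: str = "text/html") -> str:
    """Return the file name of the first acceptable representation.

    Simplification: media types are tried in the order listed by the client;
    q-values are ignored and unknown types fall back to the default.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
        if media_type == "*/*":
            return REPRESENTATIONS[default]
    return REPRESENTATIONS[default]


print(choose_representation("application/rdf+xml"))
print(choose_representation("*/*"))
```

A stricter server could answer with 406 Not Acceptable instead of falling back to the default.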

Consuming Linked Data

•  Discover URIs
–  Lookup services
   •  http://rkbexplorer.com
–  Web of Data search engines
   •  http://sindice.com
   •  http://ws.nju.edu.cn/falcons/objectsearch/index.jsp

Consuming Linked Data

•  Discover additional data for the resource at hand
•  follow links (“follow your nose”)
–  rdfs:seeAlso
–  owl:sameAs
•  Co-Reference services
–  http://sameas.org
•  Web of Data search engines

Linked Data

Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/

The server can trace this usage.

Linked Data is queryable

?s foaf:name “Markus Luczak-Roesch”

<http://markus-luczak.de#me> foaf:knows ?o

SPARQL-recap

•  Basic principle: pattern matching
–  describe pattern
–  query RDF triple set (“RDF graph”)
–  matching subset comes into results

[Figure: pattern with the variable ?s pointing at http://dbpedia.org/resource/Berlin]

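SPARQL-recap

[Figure: the pattern ?s dbp:birthPlace dbp:Berlin is matched against an RDF graph containing dbp:Klaus_Wowereit, dbp:Reinhard_Mey, dbp:Axel_Springer, dbp:Berlin and the literal “Berlino”. Results for ?s: dbp:Klaus_Wowereit, dbp:Reinhard_Mey.]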
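
The pattern-matching principle from the recap can be sketched in a few lines of Python. The toy graph is an assumption built from the resource names on the slide; in particular the `rdfs:label` predicate for “Berlino” is a guess, since the slide only shows the literal.

```python
Triple = tuple[str, str, str]

# Toy RDF graph (assumed for illustration).
GRAPH: list[Triple] = [
    ("dbp:Klaus_Wowereit", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Reinhard_Mey", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Berlin", "rdfs:label", "Berlino"),
]


def is_var(term: str) -> bool:
    """Terms starting with '?' are variables."""
    return term.startswith("?")


def match(pattern: Triple, graph: list[Triple]) -> list[dict[str, str]]:
    """Return one variable binding per triple that matches the pattern."""
    results = []
    for triple in graph:
        binding: dict[str, str] = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice must bind to the same term.
                if binding.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            results.append(binding)
    return results


print(match(("?s", "dbp:birthPlace", "dbp:Berlin"), GRAPH))
```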

SPARQL queries on the Web

•  RESTful service endpoint

    GET /sparql?query=PREFIX+rdf… HTTP/1.1
    Host: dbpedia.org

http://www.w3.org/TR/rdf-sparql-XMLres/
http://www.w3.org/TR/rdf-sparql-json-res/
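
As a small illustration of the GET request above, the query string can be URL-encoded with Python's standard library. The helper name `sparql_get_url` and the example query are assumptions; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query into the `query` parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


query = (
    "SELECT ?s WHERE { ?s <http://dbpedia.org/property/birthPlace> "
    "<http://dbpedia.org/resource/Berlin> } LIMIT 10"
)
url = sparql_get_url("http://dbpedia.org/sparql", query)
print(url)

# The server side can recover the original query text from the URL.
recovered = parse_qs(urlsplit(url).query)["query"][0]
```

Because the complete query travels in the request URL, it typically ends up in the endpoint's access log, which is what makes the usage analysis discussed later possible.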

Querying Linked Data

[Figure: dbp:Klaus_Wowereit, dbp:Reinhard_Mey and http://www.markus-luczak.de/me are connected to dbp:Berlin by dbp:birthPlace edges.]

Querying Linked Data

•  distribution of data creates challenges for querying them

•  Query approaches
–  follow-up queries ← application-dependent, proprietary
–  query a central data repository (e.g. LOD cache) ← trivial
–  federated queries ← more interesting
   •  idea: query a mediator that distributes the sub-queries and returns the aggregated result (as of SPARQL 1.1)
–  link traversal ← very interesting
   •  idea: follow links in the results retrieved from a source to expand the data dynamically
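
The link traversal idea can be sketched as a breadth-first walk. This is a toy sketch under stated assumptions: the `WEB` dictionary stands in for dereferencing URIs over HTTP, its contents are loosely taken from the earlier example graph, and `traverse` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical mini "Web": URI -> triples served when the URI is dereferenced.
WEB = {
    "http://markus-luczak.de#me": [
        ("http://markus-luczak.de#me", "foaf:basedNear", "http://markus-luczak.de/res/Soton"),
    ],
    "http://markus-luczak.de/res/Soton": [
        ("http://markus-luczak.de/res/Soton", "rdfs:seeAlso", "dbpedia:Southampton"),
    ],
    "dbpedia:Southampton": [
        ("dbpedia:Benny_Hill", "dbp:birthPlace", "dbpedia:Southampton"),
    ],
}


def traverse(seed, fetch, max_docs=10):
    """Breadth-first link traversal: dereference terms found in retrieved triples."""
    seen, queue, triples = set(), deque([seed]), []
    while queue and len(seen) < max_docs:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in fetch(uri):
            triples.append((s, p, o))
            # Simplification: every subject/object is treated as a link;
            # a real client would only follow HTTP URIs, not literals.
            queue.extend(term for term in (s, o) if term not in seen)
    return triples


found = traverse("http://markus-luczak.de#me", lambda uri: WEB.get(uri, []))
print(found)
```

Starting from one seed URI, the data set grows with every document fetched, which is why the result of a link traversal query depends on where the traversal starts and when it stops.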

[Figure: a user's client/application (data consumer) accesses a dataset (data publisher) over HTTP in two ways. Resource-centered access: GET /resource/resA, answered after graph creation and content negotiation with application/rdf+xml, …. Query pattern access: GET /sparql?query=SELECT…, answered after query processing with text/xml, …. Steps shown: “Evaluate and perform query, create result set” and “Process and select result”.]

http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/

A game of pairs with SPARQL

SPARQL queries are self-descriptive data themselves

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

TP + TP = BGP (triple patterns forming a basic graph pattern)

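SPARQL queries are self-descriptive data themselves

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

<http://markus-luczak.de#me> rdf:type foaf:Person
<http://markus-luczak.de#me> foaf:name “Markus Luczak-Roesch”

✔ ?s1 foaf:name “Markus Luczak-Roesch”
✗ ?s1 rdf:type dbp:Person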

SPARQL queries are self-descriptive data themselves

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 }

✔ ✗

✗ ✗

SPARQL queries are self-descriptive data themselves

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 } ✔

SPARQL queries are self-descriptive data themselves

{ ?s dbp:basedNear ?o1 } ✔

SPARQL queries are self-descriptive data themselves

{ ?s foaf:name ?o2 } ✔
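
The “game of pairs” can be sketched as follows: decompose a basic graph pattern into its triple patterns, then test each pattern and the whole BGP against the data set. The data set and query are the first example from the slides; the helper names `extend` and `solve` are hypothetical, and the join is a naive nested loop, not an optimized engine.

```python
DATASET = [
    ("http://markus-luczak.de#me", "rdf:type", "foaf:Person"),
    ("http://markus-luczak.de#me", "foaf:name", "Markus Luczak-Roesch"),
]

BGP = [
    ("?s1", "foaf:name", "Markus Luczak-Roesch"),
    ("?s1", "rdf:type", "dbp:Person"),
]


def extend(binding, pattern, triple):
    """Return the binding extended by matching pattern against triple, or None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def solve(bgp, dataset):
    """Evaluate a BGP as a join over its triple patterns."""
    bindings = [{}]
    for pattern in bgp:
        bindings = [
            b2
            for b in bindings
            for t in dataset
            if (b2 := extend(b, pattern, t)) is not None
        ]
    return bindings


# Success of each triple pattern on its own, and of the BGP as a whole.
tp_success = [bool(solve([tp], DATASET)) for tp in BGP]
bgp_success = bool(solve(BGP, DATASET))
print(tp_success, bgp_success)
```

The first pattern succeeds while the second fails because the data uses foaf:Person rather than dbp:Person, so the whole BGP fails; that per-pattern outcome is the signal the usage analysis builds on.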

[Figure: nested sets of triple patterns (TP)]

•  all TP
•  all TP in successful BGP
•  all TP in successful queries
•  all TP in failing queries
•  all TP in failing BGP

Statistical analysis

missing facts

inconsistent data

•  ns:Band ns:knownFor ?x
•  ns:Band ns:nationality ?y

•  ns:Band ns:instrument ?x
•  ns:Band ns:genre ?y
•  ns:Band ns:associatedBand ?z
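
One simple statistic over such patterns is how often each predicate occurs in triple patterns that returned nothing. The sketch below assumes a hypothetical log excerpt; the success flags are invented for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical log excerpt: (triple pattern, did it match anything?)
LOGGED_PATTERNS = [
    (("ns:Band", "ns:knownFor", "?x"), False),
    (("ns:Band", "ns:nationality", "?y"), False),
    (("ns:Band", "ns:knownFor", "?x"), False),
    (("ns:Band", "ns:genre", "?y"), True),
]


def failing_predicates(logged):
    """Count how often each predicate occurs in triple patterns without results."""
    return Counter(tp[1] for tp, ok in logged if not ok)


print(failing_predicates(LOGGED_PATTERNS).most_common())
```

Predicates that are requested often but never matched are candidates for missing facts or for vocabulary mismatches in the data set.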

Statistical analysis

(a) SWC (b) DBpedia (c) LGD

Figure 20: Usage of the concepts of the multi-ontologies (edges are hidden). Source: own illustration.

… of these data sets still has great potential for improvement. For example, it would be possible to use a larger number of specialized concepts. Likewise, more concepts from areas other than persons, places, and documents could in theory be used.

Experiment 3. This experiment evaluated which results the developed ubKCE algorithm yields depending on the weighting of the aspects density and usage. The aim is to answer the question of whether key concepts can be determined on the basis of usage data at all. This can clearly be answered with yes. The results of Experiment 2 already showed that many of the KCE key concepts are also the ones used most by users of the data set. Depending on the chosen weighting of the aspects, however, the agreement of the ubKCE results with the key concepts determined by KCE varies. For example, a weighting of 100% usage to 0% density generally yields different key concepts than a 30:70 ratio. Regarding the agreement with KCE, it generally increases the higher the weighting of density in ubKCE is. This is not surprising, since the density aspect was adopted from KCE.

For the evaluation of the KCE algorithm, [27] stated that from a 70% agreement of the KCE results with the key concepts determined by experts onwards, it is no longer possible to tell whether a particular result was determined by a human or by the algorithm. In order to find a weighting that agrees as well as possible with the KCE result but at the same time relies as strongly as possible on the usage aspect, this 70% agreement is also the target in the present work. According to [27], this agreement means that one could not distinguish whether a result stems from an expert, from KCE, or from ubKCE. Due to the different weightings of the aspects, the summary of the …

Source: Master thesis of Markus Bischoff

Estimating the effects of change

Usage-dependent maintenance of structured Web data sets

… to be added to the DBpedia 3.4 data set conforming to our approach (footnote 16).

Table 7.14: Recommended predicates to be added to the data set and the estimated effects of change.

Primitive to add      Effects of change   Exists in data set
dbp:manufacturer      0.004505372         x
dbp:firstFlight       0.004505372         x
dbp:introduced        0.004505372         x
dbp:nationalOrigin    0.004505372
dbo:thumbnail         0.021986718         x
dbo:director          0.025047524
dbp:director          0.02503915          x
dbp:abstract          0.025797024         x
dbo:starring          0.034066643
dbp:starring          0.034066643         x
dbp:stars             0.034066643         x
skos:Concept          0.040946128         x
skos:broader          0.04116386          x
dbp:redirect          0.066441677         x

Since this change recommendation is only additive it is clear that no negative effects are estimated. However, it is possible to estimate the positive potential of a change and consequently to prioritize the changes to be performed in case of conflicting or contradicting recommendations.

More complex and also subtractive change recommendations may emerge from additive ones. This is typified by the recommendation to add dbo:director and dbp:director for example to the data set which appear to be contradicting. Hence, they should be either matched to each other by an owl:equivalentProperty relation or one of the two should be eliminated.

7.3.3 Further data set analysis

Our case study has shown how the usage-dependent data set maintenance approach performs in the context of a cross-domain data set like DBpedia. We will now present results from our studies with SWDF and LGD as two different domain-specific data sets.

Footnote 16: To save space we apply the following namespace prefixes in addition to the ones defined before: dbo: http://dbpedia.org/ontology/, dbp: http://dbpedia.org/property/.

[Figure: processing pipeline] Log files → Selected log files → Preprocessed queries → Decomposed queries and transaction tables → Patterns → Change recommendations [0,1]

What’s in your SPARQL shopping bag?

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 }

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 }

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 }

Transactions: T1, T2, T1, … (30 mins., same IP, same user agent)
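
Grouping logged queries into such transactions (“shopping bags”) could look like the sketch below. It assumes that a transaction continues as long as the next query from the same IP and user agent arrives within 30 minutes of the previous one; the slides only name the three criteria, so the exact windowing rule, the function name `to_transactions` and the sample log are assumptions.

```python
from datetime import datetime, timedelta


def to_transactions(entries, gap=timedelta(minutes=30)):
    """Group (timestamp, ip, user_agent, query) entries into transactions.

    Entries with the same IP and user agent belong to one transaction as long
    as each follows the previous one within `gap` (assumed rule).
    """
    transactions = []
    open_tx = {}  # (ip, user_agent) -> (time of last entry, list of queries)
    for ts, ip, agent, query in sorted(entries, key=lambda e: e[0]):
        key = (ip, agent)
        last = open_tx.get(key)
        if last is None or ts - last[0] > gap:
            queries = []
            transactions.append(queries)
        else:
            queries = last[1]
        queries.append(query)
        open_tx[key] = (ts, queries)
    return transactions


t0 = datetime(2014, 10, 17, 7, 0)
log = [
    (t0, "10.0.0.1", "agentA", "Q1"),
    (t0 + timedelta(minutes=10), "10.0.0.1", "agentA", "Q2"),
    (t0 + timedelta(minutes=12), "10.0.0.2", "agentB", "Q1"),
    (t0 + timedelta(minutes=50), "10.0.0.1", "agentA", "Q1"),
]
print(to_transactions(log))
```

Each resulting list plays the role of one basket in classical market-basket analysis, with the query primitives as items.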

Figure 7.20: Visualization of association rules computed by application of the unknown predicates restriction in the context of the LGD log file (size: support, color: lift).

LGD

Linked Data

Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/

The server can trace this usage.

SPARQL

7. Evaluation

The visualization shows how primitives on the left hand side (LHS) of a rule imply particular ones on the right hand side (RHS) and which likelihood such an association has. In our specific case this allows us to analyze which primitives are queried together frequently in failing queries. We spot two characteristic usage patterns: (1) the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft; (2) the properties and classes queried in the context of an object variable. These can be further analyzed by exporting the association rules to GraphML and visualizing the network by use of a network visualization and analysis tool like Gephi (footnote 15) for example. Figure 7.13 depicts one filtered network representation for our example case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5) to derive a well-arranged visualization of the most important primitives in failing queries. Nodes represent LHS and RHS of the computed rules. Edges point from the LHS to the RHS of the particular rules.

[Figure: association rule network. Node labels:]

{?pred_variable,http://dbpedia.org/property/name}
{?pred_variable,http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/ontology/Aircraft}
{http://dbpedia.org/property/abstract,http://dbpedia.org/property/name}
{http://dbpedia.org/property/abstract,http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/property/firstFlight}
{http://dbpedia.org/property/introduced}
{http://dbpedia.org/property/manufacturer}
{http://dbpedia.org/property/name,http://dbpedia.org/property/type}
{http://dbpedia.org/property/name,http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/property/nationalOrigin}
{http://dbpedia.org/property/type,http://xmlns.com/foaf/0.1/depiction}

Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter applied to reduce nodes with degree lower than 5).
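
For readers unfamiliar with the rule measures used in these figures, the sketch below computes support, confidence and lift for a single rule LHS → RHS using the standard textbook definitions. The transactions are invented for illustration (prefixed names instead of full URIs), and the thesis's actual tooling is not shown on the slides.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs.

    Assumes that lhs and rhs each occur in at least one transaction.
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)
    n_rhs = sum(rhs <= t for t in transactions)
    n_both = sum((lhs | rhs) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift


# Hypothetical transactions: sets of primitives queried together in failing queries.
TRANSACTIONS = [
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"},
    {"dbo:Aircraft", "dbp:manufacturer"},
    {"dbo:Aircraft", "dbp:introduced"},
    {"dbp:name", "foaf:depiction"},
]

support, confidence, lift = rule_metrics(TRANSACTIONS, {"dbo:Aircraft"}, {"dbp:manufacturer"})
print(support, confidence, lift)
```

A lift above 1 indicates that the primitives on both sides are queried together more often than would be expected if they were independent.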

Table 7.14 lists an exemplary set of primitives which would be recommended …

Footnote 15: http://gephi.org/

{ ?s1 foaf:name “Markus Luczak-Roesch” . ?s1 rdf:type dbp:Person }

<http://markus-luczak.de#me> rdf:type foaf:Person
<http://markus-luczak.de#me> foaf:name “Markus Luczak-Roesch”

✔ ?s1 foaf:name “Markus Luczak-Roesch”
✗ ?s1 rdf:type dbp:Person

✗ query applied to dataset

The server can trace detailed usage.

Linked Data Fragments

Querying Datasets on the Web with High Availability

[Figure: axis from “generic requests, high client effort, high server availability” to “specific requests, high server effort, low server availability”. Interfaces shown: data dump, Linked Data document, SPARQL result, triple pattern fragments; all are various types of Linked Data Fragments.]

Fig. 1: All HTTP triple interfaces offer Linked Data Fragments of a dataset. They differ in the specificity of the data they contain, and thus the effort needed to create them.

3.2 Formal definitions

As a basis for our formalization, we use the following concepts of the RDF data model [16] and the SPARQL query language [12]. We write $\mathcal{U}$, $\mathcal{B}$, $\mathcal{L}$, and $\mathcal{V}$ to denote the sets of all URIs, blank nodes, literals, and variables, respectively. Then, $\mathcal{T} = (\mathcal{U} \cup \mathcal{B}) \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$ is the (infinite) set of all RDF triples. Any tuple $tp \in (\mathcal{U} \cup \mathcal{V}) \times (\mathcal{U} \cup \mathcal{V}) \times (\mathcal{U} \cup \mathcal{L} \cup \mathcal{V})$ is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL graph pattern, typically denoted by $P$, combines triple patterns (or BGPs) using specific operators [12,20]. The standard (set-based) query semantics for SPARQL defines the query result of such a graph pattern $P$ over a set of RDF triples $G \subseteq \mathcal{T}$ as a set that we denote by $[[P]]_G$ and that consists of partial mappings $\mu : \mathcal{V} \to (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$, which are called solution mappings. An RDF triple $t$ is a matching triple for a triple pattern $tp$ if there exists a solution mapping $\mu$ such that $t = \mu[tp]$, where $\mu[tp]$ denotes the triple (pattern) that we obtain by replacing the variables in $tp$ according to $\mu$.

For the sake of a more straightforward formalization, in this paper, we assume without loss of generality that every dataset $G$ published via some kind of fragments on the Web is a finite set of blank-node-free RDF triples; i.e., $G \subseteq \mathcal{T}^*$ where $\mathcal{T}^* = \mathcal{U} \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{L})$. Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector:

Definition 1 (selector). A selector is a partial function $s : 2^{\mathcal{T}} \to \{\mathit{true}, \mathit{false}\}$.

A more concrete type of this abstract notion are triple pattern selectors, which select triples that match a certain triple pattern:

Definition 2 (triple pattern selector). Given a triple pattern $tp$, the triple pattern selector for $tp$ is the selector $s_{tp}$ that, for any singleton set $\{t\} \subseteq 2^{\mathcal{T}}$, is defined by

$$s_{tp}(\{t\}) = \begin{cases} \mathit{true} & \text{if } t \text{ is a matching triple for } tp, \\ \mathit{false} & \text{else.} \end{cases}$$

When publishing data on the Web, we should equip its representations with hypermedia controls [1, 8, 9]. We encounter them on a daily basis when browsing HTML pages; they are usually present as hyperlinks or forms. What all these controls have in common is that, given some (possibly empty) input, they result in our browser performing a request for a specific URL.

Definition 3 (control). A control is a function that maps from some set to $\mathcal{U}$.

Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com.

xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] "GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309 "http://fragments.dbpedia.org/2014/en" …
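
A triple pattern fragments request exposes the requested pattern directly in the URL, so it can be recovered from the access log. The sketch below parses a log line shaped like the one above (truncated after the response size); the regular expression and the function name `triple_pattern_from_log` are assumptions, and empty parameters are interpreted as variables.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request part of a common-log-format line.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def triple_pattern_from_log(line: str):
    """Extract the requested triple pattern from one access log line.

    Empty parameters are treated as variables and returned as None.
    """
    m = REQUEST_RE.search(line)
    if m is None:
        return None
    params = parse_qs(urlsplit(m.group("target")).query, keep_blank_values=True)
    return {
        key: (params.get(key, [""])[0] or None)
        for key in ("subject", "predicate", "object")
    }


line = (
    'xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] '
    '"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309'
)
print(triple_pattern_from_log(line))
```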

Data: (predecessor $I_p$, BGP $B = \{tp_1, \ldots, tp_n\}$ with $n \geq 2$, start page $\phi_0$)

    $I \leftarrow$ nil; $c \leftarrow$ the triple pattern control in the control set $C_0$ of $\phi_0$;
    Function BasicGraphPatternIterator.next()
        $\mu \leftarrow$ nil;
        while $\mu =$ nil do
            while $I =$ nil do
                $\mu_p \leftarrow I_p$.next();
                return nil if $\mu_p =$ nil;
                $\Phi \leftarrow \{\phi^i_1 \mid \phi^i_1 =$ HTTP GET first fragment page using URL $c(\mu_p[tp_i])\}$;
                $\epsilon \leftarrow i$ such that $cnt_{\phi^i_1} = \min(\{cnt_{\phi^1_1}, \ldots, cnt_{\phi^n_1}\})$;
                $I_\epsilon \leftarrow$ TriplePatternIterator(StartIterator(), $\mu_p[tp_\epsilon]$, $\phi^\epsilon_1$);
                $I \leftarrow$ BasicGraphPatternIterator($I_\epsilon$, $\{\mu[tp] \mid tp \in B \setminus \{tp_\epsilon\}\}$, $\phi^\epsilon_1$);
            $\mu \leftarrow I$.next();
        return $\mu \cup \mu_p$;

Algorithm 1: For all mappings $\mu_p$ of a predecessor $I_p$, a BGP iterator for a pattern $B = \{tp_1, \ldots, tp_n\}$ creates a triple pattern iterator $I_\epsilon$ for the least frequent pattern $tp_\epsilon$, passed to a BGP iterator for the remainder of $P$.

… fetches the first page of the corresponding LDF. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new BGP iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests.

To evaluate a SPARQL query over a triple pattern fragment collection, we proceed as follows. For each BGP of the query, a BGP iterator is created. Dedicated iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL, but their implementation need not be LDF-specific; they can reuse the triple pattern fragment BGP iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally.

[Figure: decomposition of a BGP]

B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }

?city subject Capitals_in_Europe. → Zagreb, Budapest, Rome, ...

B′ = { ?person a Architect. ?person birthPlace Zagreb. }

?person birthPlace Zagreb. → Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...

B″ = { Drago_Ibler a Architect. }

...

Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
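
The decomposition strategy can be imitated in memory. The sketch below is a simplified analogue, not the paper's HTTP-based client: counts are computed by scanning a local list instead of reading the cnt metadata of a fragment page, and the toy data is loosely modelled on Fig. 2. All function names are hypothetical.

```python
def substitute(pattern, binding):
    """Replace already bound variables in a triple pattern."""
    return tuple(binding.get(term, term) for term in pattern)


def matches(pattern, dataset):
    """Yield bindings for one triple pattern (terms starting with '?' are variables)."""
    for triple in dataset:
        b = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if b.setdefault(p, t) != t:
                    break
            elif p != t:
                break
        else:
            yield b


def evaluate(bgp, dataset, binding=None):
    """Yield solution mappings, always continuing with the rarest pattern."""
    binding = binding or {}
    if not bgp:
        yield binding
        return
    bound = [substitute(tp, binding) for tp in bgp]
    # Stand-in for the per-pattern match counts reported by the server.
    counts = [sum(1 for _ in matches(tp, dataset)) for tp in bound]
    i = counts.index(min(counts))
    rest = bound[:i] + bound[i + 1:]
    for b in matches(bound[i], dataset):
        yield from evaluate(rest, dataset, {**binding, **b})


DATA = [
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Budapest", "subject", "Capitals_in_Europe"),
    ("Drago_Ibler", "a", "Architect"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
]
B = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]
solutions = list(evaluate(B, DATA))
print(solutions)
```

Because the next pattern is chosen again for every partial mapping, the evaluation order can differ between mappings, which is the point of the dynamic pipeline.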

4.2 Dynamic iterator pipelines

A common approach to implement query execution in database systems is through iterators that are typically arranged in a tree or a pipeline, based on which query results are computed recursively [10]. Such a pipelined approach has also been studied for Linked Data query processing [13, 15]. In order to enable incremental results and allow the straightforward addition of SPARQL operators, we implement a triple pattern fragments client using iterators.

The previous algorithm, however, cannot be implemented by a static iterator pipeline. For instance, consider a query for architects born in European capitals:

    SELECT ?person ?city WHERE {
      ?person a dbpedia-owl:Architect.                 # tp1
      ?person dbpprop:birthPlace ?city.                # tp2
      ?city dc:subject dbpedia:Capitals_in_Europe.     # tp3
    } LIMIT 100

Suppose the pipeline begins by finding ?city mappings for tp3. It then needs to choose whether it will next consider tp1 or tp2. The optimal choice, however, differs depending on the value of ?city:

–  For dbpedia:Paris, there are ±1,900 matches for tp2, and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp1.
–  For dbpedia:Vilnius, there are 164 matches for tp2, and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp2.

With a static pipeline, we would have to choose the pipeline structure in advance and subsequently reuse it.

In order to generate an optimized pipeline for each (sub-)query, we propose a divide-and-conquer strategy in which a query is decomposed dynamically into subqueries depending on partial solution mappings. The main function of an iterator is next(), which either returns a mapping or nil if no mappings are left.

We first introduce a trivial start iterator, which outputs the empty mapping $\mu_0$ on the first call to next(), and nil on all subsequent calls.

Next, we implement a previously defined triple pattern iterator [15] for triple pattern fragments. This iterator $I_{tp}$ is initialized with a predecessor iterator $I_p$, a triple pattern $tp$, and a page $\phi_0$ of an arbitrary triple pattern fragment of a collection $F$. The iterator then extends mappings from its predecessor by reading triples from the LDF corresponding to triple pattern $tp$. The URL of this LDF is retrieved through the collection control in the start page $\phi_0$. Each call to $I_{tp}$.next() results in mappings for $tp$ in $F$, depending on the predecessor’s mappings.

To solve BGPs of SPARQL queries, we introduce a triple pattern fragment BGP iterator. Such a BGP iterator is initialized with a predecessor $I_p$, a BGP $B = \{tp_1, \ldots, tp_n\}$, and an arbitrary triple pattern fragment page $\phi_0$ of a collection $F$. For an empty pattern ($n = 0$), a BGP iterator is equal to a start iterator. For a pattern length $n = 1$, it is constructed by creating a triple pattern iterator for $(I_p, tp_1, \phi_0)$. For $n \geq 2$, a BGP iterator uses Algorithm 1.

BGP iterators evaluate a BGP by recursively decomposing it into smaller iterators. For each triple pattern in the BGP mapped by each result of $I_p$, the iterator …

Wikidata

•  API access to
–  items
–  edit history
–  items’ discussions
–  items’ access statistics
–  and more

•  Linked Data interface
•  MediaWiki API
•  Wikidata Query
•  SPARQL
•  Linked Data Fragments

Access to more than “just” usage.

Thank you very much! @mluczak | http://markus-luczak.de

http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/

References

•  Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn2011), CEUR WS.
•  Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin, Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
•  Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective Query Patterns. COLD, 782.
•  Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data (ILD 2012), 61.
•  Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International Workshop on Usage Analysis and the Web of Data, Lyon, France.
•  Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv preprint arXiv:1103.5043.
•  Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin Heidelberg.
•  Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
•  Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through Linked Data Fragments. In LDOW.