Scalable Integration and Processing of Linked Data

Tutorial at WWW 2011 Scalable Integration and Processing of Linked Data Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani


Page 1: Scalable Integration and Processing of Linked Data

Tutorial at WWW 2011

Scalable Integration and Processing of Linked Data

Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani

Page 2: Scalable Integration and Processing of Linked Data

Outline

Session 1: Introduction to Linked Data
  Foundations and Architectures
  Crawling and Indexing
  Querying

Session 2: Integrating Web Data with Reasoning
  Introduction to RDFS/OWL on the Web
  Introduction and Motivation for Reasoning

Session 3: Distributed Reasoning: Because Size Matters
  Problems and Challenges
  MapReduce and WebPIE

Session 4: Putting Things Together (Demo)
  The LarKC Platform
  Implementing a LarKC Workflow

Page 3: Scalable Integration and Processing of Linked Data

PART 1: How can we query Linked Data?

PART 2: How can we reason over Linked Data? (start of Session 2)

Page 4: Scalable Integration and Processing of Linked Data


Answer: SPARQL (W3C Rec. 2008)

…SPARQL 1.1 upcoming (W3C Rec. 201?)

Page 5: Scalable Integration and Processing of Linked Data

Introducing SPARQL

SPARQL Protocol and RDF Query Language (SPARQL)

Standardised query language (and supporting recommendations) for querying RDF

~SQL-like language…
  …but only if you squint
  …and without the vendor-specific headaches

Page 6: Scalable Integration and Processing of Linked Data

The anatomy of a typical SPARQL query

Give me a list of names of professors in Southampton and their expertise (if available), in order of their surname

PREFIX DECLARATIONS
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX oo: <http://purl.org/openorg/>

RESULT CLAUSE
  SELECT ?name ?expertise

DATASET CLAUSE
  FROM NAMED <http://data.southampton.ac.uk/>

QUERY CLAUSE
  WHERE {
    ?person foaf:name ?name ;
            foaf:familyName ?surname .
    ?person rdf:type foaf:Person .
    ?person foaf:title ?title .
    FILTER regex(?title, "^Prof")
    OPTIONAL {
      ?person oo:availableToCommentOn ?expertiseURI .
      ?expertiseURI rdfs:label ?expertise
    }
  }

SOLUTION MODIFIERS
  ORDER BY ?surname

Page 8: Scalable Integration and Processing of Linked Data

Prefix Declarations

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>

foaf:Person ⇔ <http://xmlns.com/foaf/0.1/Person>

Use http://prefix.cc/ …

Page 10: Scalable Integration and Processing of Linked Data

Result Clause

SELECT ?name ?expertise

1. SELECT
2. CONSTRUCT (RDF)
3. ASK
4. DESCRIBE (RDF)

Page 11: Scalable Integration and Processing of Linked Data

Result Clause 1. SELECT…

SELECT ?name ?expertise

Return all tuples for the bindings of the variables ?name and ?expertise

--------------------------------------------------------
| "Professor Robert Allen"  | "Control engineering"    |
| "Professor Robert Allen"  | "Biomedical engineering" |
| "Prof Carl Leonetto Amos" |                          |
| "Professor Peter Ashburn" | "Silicon technology"     |
| "Professor Robert Allen"  | "Control engineering"    |
--------------------------------------------------------

Page 12: Scalable Integration and Processing of Linked Data

Result Clause 1. SELECT DISTINCT…

SELECT DISTINCT ?name ?expertise

Return all unique tuples for the bindings of the variables ?name and ?expertise

--------------------------------------------------------
| "Professor Robert Allen"  | "Control engineering"    |
| "Professor Robert Allen"  | "Biomedical engineering" |
| "Prof Carl Leonetto Amos" |                          |
| "Professor Peter Ashburn" | "Silicon technology"     |
--------------------------------------------------------

(The repeated "Professor Robert Allen" | "Control engineering" row is dropped.)
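The effect of DISTINCT can be mimicked on plain tuples. The following is a minimal Python sketch, assuming the result rows from the slide are held as a list of (name, expertise) tuples; the `distinct` helper is hypothetical and not part of any SPARQL library.

```python
def distinct(rows):
    """Drop repeated rows, keeping the first occurrence of each in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Rows as shown on the slide; None stands for an unbound ?expertise.
rows = [
    ("Professor Robert Allen", "Control engineering"),
    ("Professor Robert Allen", "Biomedical engineering"),
    ("Prof Carl Leonetto Amos", None),
    ("Professor Peter Ashburn", "Silicon technology"),
    ("Professor Robert Allen", "Control engineering"),
]

unique_rows = distinct(rows)
print(len(rows), "rows before,", len(unique_rows), "rows after DISTINCT")
```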

Page 13: Scalable Integration and Processing of Linked Data

Result Clause 2. CONSTRUCT…

CONSTRUCT { ?person foaf:name ?name ; ex:expertise ?expertise . }

Return RDF using bindings for the variables:

ex:RAllen foaf:name "Professor Robert Allen" ;
          ex:expertise "Biomedical engineering" , "Control engineering" .
ex:PAshburn foaf:name "Professor Peter Ashburn" ;
            ex:expertise "Silicon technology" .

Page 14: Scalable Integration and Processing of Linked Data

Result Clause 3. ASK…

ASK … WHERE { … }

Are there any results?

Returns: true or false

Page 15: Scalable Integration and Processing of Linked Data

Result Clause 4. DESCRIBE…

DESCRIBE ?person … WHERE { ?person … }

Returns some RDF which “describes” the given resource…

No standard for what to return! Typically returns:
  all triples where the given resource appears as subject and/or object
  OR
  Concise Bounded Descriptions…

Page 16: Scalable Integration and Processing of Linked Data

Result Clause 4. DESCRIBE (DIRECT)…

DESCRIBE ex:RAllen

(…can give URIs directly without need for a WHERE clause.)

Page 18: Scalable Integration and Processing of Linked Data

Dataset clause (FROM/FROM NAMED)

FROM NAMED <http://data.southampton.ac.uk/>

(Briefly)

Restrict the dataset against which you wish to query
SPARQL stores named graphs: sets of triples which are associated with (URI) names
Can match across graphs!
Named graphs typically correspond with data provenance (i.e., documents)!
Default graph typically corresponds to the merge of all graphs
Many engines will typically dereference a graph if not available locally!

Page 20: Scalable Integration and Processing of Linked Data

Query clause (WHERE)

WHERE {
  ?person foaf:name ?name ;
          foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title .
  FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}

[Figure: the WHERE clause drawn as a graph pattern. ?person has the edges rdf:type → foaf:Person, foaf:name → ?name, foaf:familyName → ?surname, foaf:title → ?title [FILTER "^Prof"] and oo:availableToCommentOn → ?expertiseURI, which in turn has rdfs:label → ?expertise. Example match ✓: ?person = ex:PAshburn, ?name = "Professor Peter Ashburn", ?surname = "Ashburn", ?title = "Professor", ?expertiseURI = ex:Silicon, ?expertise = "Silicon technology".]
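How a WHERE clause is matched against data can be sketched with a toy pattern matcher. The following is a minimal Python sketch, not how any real SPARQL engine is implemented: triples are plain string tuples, variables are strings starting with `?`, and the helper names (`match_pattern`, `match_bgp`) and the `ex:JDoe` resource are hypothetical. The FILTER is applied before the OPTIONAL part, which is safe here only because it mentions ?title alone.

```python
import re


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return new


def match_bgp(patterns, triples, binding=None):
    """Return all bindings satisfying every pattern (naive nested-loop join)."""
    bindings = [binding or {}]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for t in triples:
                extended = match_pattern(pattern, t, b)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


triples = [
    ("ex:PAshburn", "rdf:type", "foaf:Person"),
    ("ex:PAshburn", "foaf:name", "Professor Peter Ashburn"),
    ("ex:PAshburn", "foaf:familyName", "Ashburn"),
    ("ex:PAshburn", "foaf:title", "Professor"),
    ("ex:PAshburn", "oo:availableToCommentOn", "ex:Silicon"),
    ("ex:Silicon", "rdfs:label", "Silicon technology"),
    # Hypothetical non-professor, added to show the FILTER at work.
    ("ex:JDoe", "rdf:type", "foaf:Person"),
    ("ex:JDoe", "foaf:name", "Dr Jane Doe"),
    ("ex:JDoe", "foaf:familyName", "Doe"),
    ("ex:JDoe", "foaf:title", "Dr"),
]

required = [
    ("?person", "foaf:name", "?name"),
    ("?person", "foaf:familyName", "?surname"),
    ("?person", "rdf:type", "foaf:Person"),
    ("?person", "foaf:title", "?title"),
]
optional = [
    ("?person", "oo:availableToCommentOn", "?expertiseURI"),
    ("?expertiseURI", "rdfs:label", "?expertise"),
]

results = []
for b in match_bgp(required, triples):
    if not re.search("^Prof", b["?title"]):   # FILTER regex(?title, "^Prof")
        continue
    extended = match_bgp(optional, triples, b)
    # OPTIONAL: keep the original binding when the optional part has no match.
    results.extend(extended if extended else [b])

for r in results:
    print(r["?name"], "|", r.get("?expertise", ""))
```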

Page 21: Scalable Integration and Processing of Linked Data

Quick mention for UNION

WHERE { …
  { ?person oo:availableToCommentOn ?expertiseURI . }
  UNION
  { ?person foaf:interest ?expertiseURI . }
… }

Represent disjunction (OR)

Useful when there’s more than one property/class that represents the same information you’re interested in (heterogeneity)

Reasoning can also help, assuming terms are mapped (more later)

Page 23: Scalable Integration and Processing of Linked Data

Solution Modifiers

ORDER BY ?surname
  Order output results by surname (as you probably guessed)

…also…

LIMIT
  ORDER BY ?surname LIMIT 10
  Only return 10 results

OFFSET
  ORDER BY ?surname LIMIT 10 OFFSET 20
  Skip the first 20 results and return the next 10 (results 21‒30)
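The three solution modifiers behave like a sort followed by a slice. The following is a minimal Python sketch over made-up rows; `apply_modifiers` and the `Surname01` to `Surname40` values are hypothetical and only illustrate the arithmetic of OFFSET and LIMIT.

```python
def apply_modifiers(rows, key, limit=None, offset=0):
    """Mimic ORDER BY, then OFFSET, then LIMIT on a list of result rows."""
    ordered = sorted(rows, key=key)                  # ORDER BY
    end = None if limit is None else offset + limit
    return ordered[offset:end]                       # OFFSET, then LIMIT


# Hypothetical result rows with zero-padded surnames so they sort predictably.
rows = [{"surname": f"Surname{i:02d}"} for i in range(1, 41)]

# ORDER BY ?surname LIMIT 10 OFFSET 20
page = apply_modifiers(rows, key=lambda r: r["surname"], limit=10, offset=20)
print(page[0]["surname"], "to", page[-1]["surname"])
```

With OFFSET 20 the first 20 rows are skipped, so the slice starts at the 21st result.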

Page 24: Scalable Integration and Processing of Linked Data

The summary of a typical SPARQL query

Give me a list of names of professors in Southampton and their expertise (if available), in order of their surname

PREFIX DECLARATIONS: Shortcuts for URIs
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX oo: <http://purl.org/openorg/>

RESULT CLAUSE: Which results do you want?
  SELECT ?name ?expertise

DATASET CLAUSE: Where should we look?
  FROM NAMED <http://data.southampton.ac.uk/>

QUERY CLAUSE: What are you looking for?
  WHERE {
    ?person foaf:name ?name ;
            foaf:familyName ?surname .
    ?person rdf:type foaf:Person .
    ?person foaf:title ?title .
    FILTER regex(?title, "^Prof")
    OPTIONAL {
      ?person oo:availableToCommentOn ?expertiseURI .
      ?expertiseURI rdfs:label ?expertise
    }
  }

SOLUTION MODIFIERS: How should results be ordered/split?
  ORDER BY ?surname

Page 25: Scalable Integration and Processing of Linked Data

Trying out a typical SPARQL query

Give me a list of names of professors in Southampton and their expertise (if available), in order of their surname

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>

SELECT ?name ?expertise

FROM NAMED <http://data.southampton.ac.uk/>

WHERE {
  ?person foaf:name ?name ;
          foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title .
  FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}

ORDER BY ?surname
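To try the query from a script rather than a web form, the SPARQL Protocol lets a client send it in the `query` parameter of an HTTP GET. The following is a minimal Python sketch that only builds the request URL and makes no network call; the endpoint address `http://example.org/sparql` and the `build_query_url` helper are placeholders, not a real service or library function.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>
SELECT ?name ?expertise
FROM NAMED <http://data.southampton.ac.uk/>
WHERE {
  ?person foaf:name ?name ;
          foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title .
  FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}
ORDER BY ?surname
"""


def build_query_url(endpoint, query):
    """Build a SPARQL Protocol GET URL: the query text goes in `query`."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint; substitute one from the public endpoint list.
url = build_query_url("http://example.org/sparql", QUERY)
print(url[:60] + "...")
```

The resulting URL could then be fetched with any HTTP client; which result formats an endpoint offers varies by service.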

Page 26: Scalable Integration and Processing of Linked Data

List of Public SPARQL Endpoints:

SparqlEndpoints (W3C Wiki)

http://www.w3.org/wiki/SparqlEndpoints (or just use Google)

Page 27: Scalable Integration and Processing of Linked Data

Coming Soon:

SPARQL 1.1

Currently a W3C Working Draft

http://www.w3.org/TR/sparql11-query/ (or just use Google)

Page 28: Scalable Integration and Processing of Linked Data

Highly recommend checking out:

“SPARQL by example”

By Cambridge Semantics
Lee Feigenbaum & Eric Prud'hommeaux

http://www.cambridgesemantics.com/2008/09/sparql-by-example/ (or just use Google)

Page 29: Scalable Integration and Processing of Linked Data

After the break…

Session 1: Introduction to Linked Data
  Foundations and Architectures
  Crawling and Indexing
  Querying

Session 2: Integrating Web Data with Reasoning
  Introduction to RDFS/OWL on the Web
  Introduction and Motivation for Reasoning

Session 3: Distributed Reasoning: Because Size Matters
  Problems and Challenges
  MapReduce and WebPIE

Session 4: Putting Things Together (Demo)
  The LarKC Platform
  Implementing a LarKC Workflow

Page 30: Scalable Integration and Processing of Linked Data

During the break…

Question: Find the people who have won both an Academy Award for best director and a Raspberry Award for worst director

Endpoint: (that is, if you want to use SPARQL… feel free to use whatever) http://dbpedia.org/sparql/ or http://google.com/ (to make it fair)

Hint: Look at http://dbpedia.org/page/Michael_Bay and http://dbpedia.org/page/Woody_Allen for examples (the same prefixes therein are understood by the endpoint, …so no need to declare them in the query)

Page 31: Scalable Integration and Processing of Linked Data

And the answer is…

The Winning (?) Query:

SELECT DISTINCT ?name
WHERE {
  ?director dcterms:subject category:Worst_Director_Golden_Raspberry_Award_winners ,
                            category:Best_Director_Academy_Award_winners ;
            foaf:name ?name .
}

The Answer:

Page 32: Scalable Integration and Processing of Linked Data

PART 1: How can we query Linked Data?

PART 2: How can we reason over Linked Data? …and why?!

Page 33: Scalable Integration and Processing of Linked Data

… A Web of Data

[Figure: Linking Open Data cloud diagrams for August 2007, November 2007, February 2008, March 2008, September 2008, March 2009, July 2009 and September 2010. Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch]

Page 34: Scalable Integration and Processing of Linked Data

Reasoning

explicit data

implicit data

How can consumers query the implicit data?

Page 35: Scalable Integration and Processing of Linked Data


…so what’s The Problem?…

…heterogeneity

…need to integrate data from different sources

Page 36: Scalable Integration and Processing of Linked Data

Take Query Answering…

Gimme webpages relating to Tim Berners-Lee

timbl:i foaf:page ?pages .

Page 37: Scalable Integration and Processing of Linked Data

Heterogeneity in schema…

webpage: properties
  foaf:page
  foaf:homepage
  foaf:isPrimaryTopicOf
  foaf:weblog
  doap:homepage
  foaf:topic
  foaf:primaryTopic
  mo:musicBrainz
  mo:myspace

[Figure legend: one edge style = rdfs:subPropertyOf, the other = owl:inverseOf]

Page 38: Scalable Integration and Processing of Linked Data

Linked Data, RDFS and OWL: Linked Vocabularies

[Figure: constellation of linked vocabularies (SKOS, …). Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman]

Page 39: Scalable Integration and Processing of Linked Data

Heterogeneity in naming…

Tim Berners-Lee: URIs
  timbl:i
  dblp:100007
  identica:45563
  adv:timbl
  fb:en.tim_berners-lee
  db:Tim-Berners_Lee

[Figure legend: edges = owl:sameAs]

Page 40: Scalable Integration and Processing of Linked Data

Returning to our simple query…

Gimme webpages relating to Tim Berners-Lee

timbl:i foaf:page ?pages .

... 7 x 6 = 42 possible patterns

Properties (7): foaf:page, foaf:homepage, foaf:isPrimaryTopicOf, doap:homepage, foaf:topic, foaf:primaryTopic, mo:myspace

Identifiers (6): timbl:i, dblp:100007, identica:45563, adv:timbl, fb:en.tim_berners-lee, db:Tim-Berners_Lee
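The 7 x 6 count can be reproduced by enumerating every property and identifier combination. The following is a minimal Python sketch under two assumptions: the property and identifier lists are the ones named on this slide, and foaf:topic and foaf:primaryTopic are treated as inverse properties, so the person appears in object position for them. The UNION query it assembles is only an illustration of what a client would have to write without reasoning.

```python
from itertools import product

# Properties where the person is the subject of the triple.
forward = ["foaf:page", "foaf:homepage", "foaf:isPrimaryTopicOf",
           "doap:homepage", "mo:myspace"]
# Inverse properties: the page is the subject and the person the object.
inverse = ["foaf:topic", "foaf:primaryTopic"]

identifiers = ["timbl:i", "dblp:100007", "identica:45563", "adv:timbl",
               "fb:en.tim_berners-lee", "db:Tim-Berners_Lee"]


def patterns():
    """One triple pattern per (identifier, property) combination."""
    out = []
    for ident, prop in product(identifiers, forward + inverse):
        if prop in inverse:
            out.append(f"?pages {prop} {ident} .")
        else:
            out.append(f"{ident} {prop} ?pages .")
    return out


all_patterns = patterns()
union_query = ("SELECT DISTINCT ?pages WHERE {\n  { "
               + " }\n  UNION\n  { ".join(all_patterns)
               + " }\n}")
print(len(all_patterns), "patterns")
```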

Page 41: Scalable Integration and Processing of Linked Data


…reasoning to the rescue?

Page 42: Scalable Integration and Processing of Linked Data

Challenges…

…what (OWL) reasoning is feasible for Linked Data?

Page 43: Scalable Integration and Processing of Linked Data


Linked Data Reasoning: Challenges

Page 44: Scalable Integration and Processing of Linked Data

Linked Data Reasoning: Challenges

Scalability
  At least tens of billions of statements (for the moment)
  Near linear scale!!!

Noisy data
  Inconsistencies galore
  Publishing errors

Page 45: Scalable Integration and Processing of Linked Data

Linked Data Reasoning: Challenges

Challenges (Semantic Web Wikipedia Article)

Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.

Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.

Vagueness: These are imprecise concepts like "young" or "tall". This arises from the vagueness of user queries, of concepts represented by content providers, of matching query terms to provider terms and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.

Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of different distinct diagnoses each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty.

Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.

Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptography techniques are currently utilized to ameliorate this threat.

Page 46: Scalable Integration and Processing of Linked Data

Noisy Data: Omnipotent Being

Proposition 1: Web data is noisy.

Proof: 08445a31a78661b5c746feff39a9db6e4e2cc5cf
  sha1-sum of ‘mailto:’
  common value for foaf:mbox_sha1sum
  An inverse-functional (uniquely identifying) property!!!
  Any person who shares the same value will be considered the same

Q.E.D.
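The claim about the hash can be checked locally. The following is a minimal Python sketch, assuming an exporter that hashes the mailbox URI even when the e-mail address is empty; `mbox_sha1sum` is a hypothetical helper, and the comparison against the value quoted on the slide is printed rather than taken for granted.

```python
import hashlib


def mbox_sha1sum(mbox_uri):
    """SHA-1 hex digest of a mailbox URI, with the 'mailto:' scheme included."""
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()


SLIDE_VALUE = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"

# What a careless exporter would publish for a user with no e-mail address.
empty = mbox_sha1sum("mailto:")
print(empty)
print("matches the value on the slide:", empty == SLIDE_VALUE)

# Two unrelated (hypothetical) users with missing e-mail addresses share the
# value, so an inverse-functional reading would merge them into one person.
print(mbox_sha1sum("mailto:") == mbox_sha1sum("mailto:"))
```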

Page 47: Scalable Integration and Processing of Linked Data

Noisy Data: Redefining everything …and home in time for tea

Alternate proof (courtesy of http://www.eiao.net/rdf/1.0)

rdf:type rdf:type owl:Property .
rdf:type rdfs:label "type"@en .
rdf:type rdfs:comment "Type of resource" .
rdf:type rdfs:domain eiao:testRun .
rdf:type rdfs:domain eiao:pageSurvey .
rdf:type rdfs:domain eiao:siteSurvey .
rdf:type rdfs:domain eiao:scenario .
rdf:type rdfs:domain eiao:rangeLocation .
rdf:type rdfs:domain eiao:startPointer .
rdf:type rdfs:domain eiao:endPointer .
rdf:type rdfs:domain eiao:header .
rdf:type rdfs:domain eiao:runs .

Page 48: Scalable Integration and Processing of Linked Data

Inconsistent Data: Cannot compute…

foaf:Person owl:disjointWith foaf:Document .

Page 49: Scalable Integration and Processing of Linked Data


…herein, we look at (monotonic) rules.

Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale

Page 50: Scalable Integration and Processing of Linked Data

Rules

IF ⇒ THEN
Body/Antecedent/Condition ⇒ Head/Consequent

?c1 rdfs:subClassOf ?c2 .    (Schema/Terminology/Ontological)
?x rdf:type ?c1 .            (Instance/Assertional)
⇒ ?x rdf:type ?c2 .

foaf:Person rdfs:subClassOf foaf:Agent .
timbl:me rdf:type foaf:Person .
⇒ timbl:me rdf:type foaf:Agent .
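The sub-class rule on this slide can be applied mechanically. The following is a minimal Python sketch of a single pass of that one rule over triples held as string tuples; `apply_subclass_rule` is a hypothetical helper, not taken from any reasoner.

```python
def apply_subclass_rule(triples):
    """One pass of: ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . => ?x rdf:type ?c2 ."""
    # Schema (terminological) part of the body.
    subclass_pairs = [(s, o) for s, p, o in triples if p == "rdfs:subClassOf"]
    inferred = set()
    # Instance (assertional) part of the body.
    for x, p, c in triples:
        if p != "rdf:type":
            continue
        for c1, c2 in subclass_pairs:
            if c == c1:
                inferred.add((x, "rdf:type", c2))
    # Report only statements that were not already present.
    return inferred - set(triples)


triples = [
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    ("timbl:me", "rdf:type", "foaf:Person"),
]
new_triples = apply_subclass_rule(triples)
print(new_triples)
```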

Page 51: Scalable Integration and Processing of Linked Data

Rules (Inconsistencies [a.k.a. Contradictions])

IF ⇒ THEN
Body/Antecedent/Condition ⇒ Head/Consequent

?c1 owl:disjointWith ?c2 .
?x rdf:type ?c1 .
?x rdf:type ?c2 .
⇒ false

foaf:Person owl:disjointWith foaf:Document .
ex:sleepygirl rdf:type foaf:Person .
ex:sleepygirl rdf:type foaf:Document .
⇒ false
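A rule with `false` in the head does not add data; it flags a contradiction. The following is a minimal Python sketch of the disjointness check above over string-tuple triples; `find_inconsistencies` is a hypothetical helper that reports each offending resource instead of halting.

```python
def find_inconsistencies(triples):
    """Report (x, c1, c2) whenever x is typed with two disjoint classes."""
    disjoint = {(s, o) for s, p, o in triples if p == "owl:disjointWith"}
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for x, classes in sorted(types.items()):
        for c1, c2 in sorted(disjoint):
            if c1 in classes and c2 in classes:
                violations.append((x, c1, c2))
    return violations


triples = [
    ("foaf:Person", "owl:disjointWith", "foaf:Document"),
    ("ex:sleepygirl", "rdf:type", "foaf:Person"),
    ("ex:sleepygirl", "rdf:type", "foaf:Document"),
    ("timbl:me", "rdf:type", "foaf:Person"),
]
violations = find_inconsistencies(triples)
print(violations)
```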

Page 52: Scalable Integration and Processing of Linked Data

Executing rules: Materialisation

Materialisation (Forward-Chaining):

Write the consequences of the rules down

Page 53: Scalable Integration and Processing of Linked Data

Materialisation

Forward-chaining Materialisation
  Avoid runtime expense
  Users taught impatience by Google
  Pre-compute for quick retrieval
  Web-scale systems should scale well
  More data = more disk-space/machines

Page 54: Scalable Integration and Processing of Linked Data

INPUT:
• Flat file of triples (quads)

OUTPUT:
• Flat file of (partial) inferred triples (quads)
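The input and output described here can be sketched as a small forward-chaining loop. The following is a minimal Python sketch, assuming a simplified line format of three whitespace-separated terms ending in a dot (no literals containing spaces) and only two rules, sub-class and sub-property; the function names and the sample lines are hypothetical, and the quadratic `step` is meant for illustration rather than scale.

```python
def parse(lines):
    """Parse 'subject predicate object .' lines into a set of triples."""
    triples = set()
    for line in lines:
        parts = line.strip().rstrip(".").split()
        if len(parts) == 3:
            triples.add(tuple(parts))
    return triples


def step(triples):
    """Apply the sub-class and sub-property rules once; return new triples only."""
    new = set()
    for s, p, o in triples:
        for s2, p2, o2 in triples:
            # ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . => ?x rdf:type ?c2 .
            if p2 == "rdfs:subClassOf" and p == "rdf:type" and o == s2:
                new.add((s, "rdf:type", o2))
            # ?p1 rdfs:subPropertyOf ?p2 . ?x ?p1 ?y . => ?x ?p2 ?y .
            if p2 == "rdfs:subPropertyOf" and p == s2:
                new.add((s, o2, o))
    return new - triples


def materialise(triples):
    """Repeat until nothing new is inferred; return only the inferred triples."""
    closure = set(triples)
    while True:
        new = step(closure)
        if not new:
            return closure - set(triples)
        closure |= new


input_lines = [
    "foaf:Person rdfs:subClassOf foaf:Agent .",
    "foaf:homepage rdfs:subPropertyOf foaf:page .",
    "timbl:i rdf:type foaf:Person .",
    "timbl:i foaf:homepage w3c:timblhomepage .",
]
inferred = materialise(parse(input_lines))
for s, p, o in sorted(inferred):
    print(s, p, o, ".")
```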

Page 55: Scalable Integration and Processing of Linked Data

What rulesets?

“Standard”
  RDFS
  OWL 2 RL (W3C Rec: 27 Oct. 2009)

“Non-standard”
  DLP
  pD* (OWL Horst)
  OWL–

Page 56: Scalable Integration and Processing of Linked Data

What rules?

Let’s look at a recent corpus of Linked Data and see what schema’s inside (and what the rulesets support)

Open-domain crawl, May 2010
  1.1 billion quadruples
  3.985 million sources (docs)
  780 pay-level domains (e.g., dbpedia.org)
  Ran “special” PageRank over documents

86 thousand docs contained some RDFS/OWL schema data (2.2% of docs... but <0.2% of triples)
  Summated ranks of docs using each primitive

Page 57: Scalable Integration and Processing of Linked Data

Survey of Linked Data schema: Top 15 ranks

#    Axiom                           Rank(Σ)  RDFS  Horst  O2R
1.   rdfs:subClassOf                 0.295    ✓     ✓      ✓
2.   rdfs:range                      0.294    ✓     ✓      ✓
3.   rdfs:domain                     0.292    ✓     ✓      ✓
4.   rdfs:subPropertyOf              0.090    ✓     ✓      ✓
5.   owl:FunctionalProperty          0.063    ✘     ✓      ✓
6.   owl:disjointWith                0.049    ✘     ✘      ✓
7.   owl:inverseOf                   0.047    ✘     ✓      ✓
8.   owl:unionOf                     0.035    ✘     ✘      ✓
9.   owl:SymmetricProperty           0.033    ✘     ✓      ✓
10.  owl:TransitiveProperty          0.030    ✘     ✓      ✓
11.  owl:equivalentClass             0.021    ✘     ✓      ✓
12.  owl:InverseFunctionalProperty   0.030    ✘     ✓      ✓
13.  owl:equivalentProperty          0.030    ✘     ✓      ✓
14.  owl:someValuesFrom              0.030    ✘     ✓      ✓
15.  owl:hasValue                    0.028    ✘     ✓      ✓

Page 58: Scalable Integration and Processing of Linked Data

What about noise? …

…need to consider the provenance of Web data

Page 59: Scalable Integration and Processing of Linked Data

Authoritative Reasoning

Consider source of schema data

Class/property URIs dereference to their authoritative document
  FOAF spec authoritative for foaf:Person ✓
  MY spec not authoritative for foaf:Person ✘

Allow “extension” in third-party documents
  my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships
  foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications
  foaf:knows a owl:SymmetricProperty . (MY spec) ✘
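The accept and reject decisions on this slide can be sketched as a simple check. The following is a minimal Python sketch under a loud simplification: a document is taken to be authoritative for a term exactly when the term carries that document's prefix, standing in for real dereferencing, and an axiom is accepted when its source is authoritative for the axiom's subject, which is the term that matters for the examples shown here. Both helper names are hypothetical.

```python
def namespace(term):
    """Prefix part of a prefixed name, e.g. 'foaf' for 'foaf:Person'."""
    return term.split(":", 1)[0]


def is_authoritative(source, term):
    """Assumption: the document for prefix `source` is what `term` dereferences to."""
    return namespace(term) == source


def accept_axiom(source, axiom):
    """Accept a schema axiom only if its source is authoritative for its subject."""
    subject, _predicate, _object = axiom
    return is_authoritative(source, subject)


# The examples from the slide, with "my" and "foaf" naming the two specs.
print(is_authoritative("foaf", "foaf:Person"))
print(is_authoritative("my", "foaf:Person"))

extension = accept_axiom("my", ("my:Person", "rdfs:subClassOf", "foaf:Person"))
obscure = accept_axiom("my", ("foaf:Person", "rdfs:subClassOf", "my:Person"))
hijack = accept_axiom("my", ("foaf:knows", "rdf:type", "owl:SymmetricProperty"))
print(extension, obscure, hijack)
```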

Page 60: Scalable Integration and Processing of Linked Data

Noisy Data: Redefining everything …and home in time for tea

More proof (courtesy of http://www.eiao.net/rdf/1.0)

rdf:type rdf:type owl:Property .
rdf:type rdfs:label "type"@en .
rdf:type rdfs:comment "Type of resource" .
rdf:type rdfs:domain eiao:testRun .
rdf:type rdfs:domain eiao:pageSurvey .
rdf:type rdfs:domain eiao:siteSurvey .
rdf:type rdfs:domain eiao:scenario .
rdf:type rdfs:domain eiao:rangeLocation .
rdf:type rdfs:domain eiao:startPointer .
rdf:type rdfs:domain eiao:endPointer .
rdf:type rdfs:domain eiao:header .
rdf:type rdfs:domain eiao:runs .

Page 61: Scalable Integration and Processing of Linked Data

Authoritative Reasoning: read more …w/ essential plugs

Gong Cheng, Yuzhong Qu. "Integrating Lightweight Reasoning into Class-Based Query Refinement for Object Search." ASWC 2008.

Aidan Hogan, Andreas Harth, Axel Polleres. "Scalable Authoritative OWL Reasoning for the Web." IJSWIS 2009.

Aidan Hogan, Jeff Z. Pan, Axel Polleres and Stefan Decker. "SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples." ISWC 2010.

My thesis: http://aidanhogan.com/docs/thesis/ (or use Google).

Page 62: Scalable Integration and Processing of Linked Data

Alternative to Authoritative Reasoning?

Quarantined reasoning!

Separate and cache hierarchy of schema documents/dependencies…

Page 63: Scalable Integration and Processing of Linked Data


Quarantined Reasoning [Delbru et al.; 2008]

Page 66: Scalable Integration and Processing of Linked Data

Quarantined Reasoning [Delbru et al.; 2008]

A-Box / Instance Data (e.g., a FOAF file)

T-Box / Ontology Data (e.g., the FOAF ontology and its indirect imports)

Page 67: Scalable Integration and Processing of Linked Data

Noisy Data: Redefining everything …and home in time for tea

More proof (courtesy of http://www.eiao.net/rdf/1.0)

rdf:type rdf:type owl:Property .
rdf:type rdfs:label "type"@en .
rdf:type rdfs:comment "Type of resource" .
rdf:type rdfs:domain eiao:testRun .
rdf:type rdfs:domain eiao:pageSurvey .
rdf:type rdfs:domain eiao:siteSurvey .
rdf:type rdfs:domain eiao:scenario .
rdf:type rdfs:domain eiao:rangeLocation .
rdf:type rdfs:domain eiao:startPointer .
rdf:type rdfs:domain eiao:endPointer .
rdf:type rdfs:domain eiao:header .
rdf:type rdfs:domain eiao:runs .

Page 68: Scalable Integration and Processing of Linked Data

Quarantined Reasoning: read more

R. Delbru, A. Polleres, G. Tummarello and S. Decker. "Context Dependent Reasoning for Semantic Documents in Sindice." 4th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2008.

Page 69: Scalable Integration and Processing of Linked Data


…what about owl:sameAs?

Page 70: Scalable Integration and Processing of Linked Data


Consolidation for Linked Data

Page 71: Scalable Integration and Processing of Linked Data

Consolidation: Baseline

Use provided owl:sameAs mappings in the data

timbl:i owl:sameAs identica:45563 .
dbpedia:Berners-Lee owl:sameAs identica:45563 .

Store “equivalences” found

timbl:i ->              { timbl:i, identica:45563, dbpedia:Berners-Lee }
identica:45563 ->       { timbl:i, identica:45563, dbpedia:Berners-Lee }
dbpedia:Berners-Lee ->  { timbl:i, identica:45563, dbpedia:Berners-Lee }

Page 72: Scalable Integration and Processing of Linked Data

Consolidation: Baseline

For each set of equivalent identifiers, choose a canonical term

{ timbl:i, identica:45563, dbpedia:Berners-Lee }

Page 73: Scalable Integration and Processing of Linked Data

Canonicalisation

Afterwards, rewrite identifiers to their canonical version:

{ timbl:i, identica:45563, dbpedia:Berners-Lee }

timbl:i rdf:type foaf:Person .
identica:48404 foaf:knows identica:45563 .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .

⇒

dbpedia:Berners-Lee rdf:type foaf:Person .
identica:48404 foaf:knows dbpedia:Berners-Lee .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .
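The baseline steps (collect owl:sameAs pairs, group them into equivalence sets, pick a canonical term, rewrite) can be sketched with a union-find structure. The following is a minimal Python sketch; choosing the lexicographically lowest identifier as canonical is an assumption made here because it happens to reproduce the rewriting shown on the slide, and the helper names are hypothetical.

```python
def build_canonical_map(same_as_pairs):
    """Union-find over owl:sameAs pairs; maps each identifier to its canonical term."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically lowest term becomes the root.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {x: find(x) for x in list(parent)}


def canonicalise(triples, canon):
    """Rewrite subjects and objects to their canonical identifier."""
    return [(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples]


same_as = [
    ("timbl:i", "identica:45563"),
    ("dbpedia:Berners-Lee", "identica:45563"),
]
canon = build_canonical_map(same_as)

data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
rewritten = canonicalise(data, canon)
for triple in rewritten:
    print(*triple, ".")
```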

Page 74: Scalable Integration and Processing of Linked Data

Extended Consolidation

Infer owl:sameAs through reasoning (OWL 2 RL/RDF)
1. explicit owl:sameAs (again)
2. owl:InverseFunctionalProperty
3. owl:FunctionalProperty
4. owl:cardinality 1 / owl:maxCardinality 1

foaf:homepage a owl:InverseFunctionalProperty .
timbl:i foaf:homepage w3c:timblhomepage .
adv:timbl foaf:homepage w3c:timblhomepage .
⇒ timbl:i owl:sameAs adv:timbl .

…then apply consolidation as before
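The inverse-functional case can be sketched by grouping subjects that share a value for such a property. The following is a minimal Python sketch over string-tuple triples, with `rdf:type` written out in place of the `a` shorthand; `infer_same_as` is a hypothetical helper and covers only item 2 of the list, not functional properties or cardinalities.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples):
    """Infer owl:sameAs between subjects sharing a value for an inverse-functional property."""
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)
    inferred = set()
    for subjects in subjects_by_value.values():
        # One owl:sameAs triple per unordered pair; the relation is symmetric.
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, "owl:sameAs", b))
    return inferred


triples = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
]
same_as = infer_same_as(triples)
print(same_as)
```

The inferred pairs can then be fed into the same consolidation step used for explicit owl:sameAs statements.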

Page 75: Scalable Integration and Processing of Linked Data

Consolidation: Results

For our Linked Data corpus:
1. ~12 million explicit owl:sameAs triples (as before)
2. ~8.7 million thru. owl:InverseFunctionalProperty
3. ~106 thousand thru. owl:FunctionalProperty
4. none thru. owl:cardinality/owl:maxCardinality

In terms of equivalences found (baseline vs. extended):
~2.8 million sets of equivalent identifiers (1.31x baseline)
~14.86 million identifiers involved (2.58x baseline)
~5.8 million URIs !!(1.014x baseline)!!

Page 76: Scalable Integration and Processing of Linked Data


Conclusion…

Page 77: Scalable Integration and Processing of Linked Data

Conclusions

Heterogeneity poses a significant problem for consuming Linked Data
1. Heterogeneity in schema
2. Heterogeneity in naming

…but we can use the mappings provided by publishers to integrate heterogeneous Linked Data corpora (with a little caution)

3. Lightweight rule-based reasoning can go a long way
4. Deceit/Noise ≠ End Of World
   Consider source of data!
5. Inconsistency ≠ End Of World
   Useful for finding noise in fact!
6. Explicit owl:sameAs vs. extended consolidation:
   Extended consolidation mostly (but not entirely) for consolidating blank-nodes from older FOAF exporters

Page 78: Scalable Integration and Processing of Linked Data

Next up…

How can we reason at Web scale?

Scalable/distributed rule-based materialisation over MapReduce using the WebPIE system

Page 79: Scalable Integration and Processing of Linked Data

timbl:i foaf:page ?pages .

{ timbl:i, identica:45563, dbpedia:Berners-Lee }

dbpedia:Berners-Lee foaf:page ?pages .

Page 80: Scalable Integration and Processing of Linked Data

Authoritative Reasoning (Appendix)

OWL 2 RL rule prp-inv1
?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x .

OWL 2 RL rule prp-inv2
?p1 owl:inverseOf ?p2 . ?x ?p2 ?y . ⇒ ?y ?p1 ?x .

TBOX:
foo:doesntKnow owl:inverseOf foaf:knows . (from foo:)

ABOX:
bar:Aidan foo:doesntKnow bar:Axel .
bar:Stefan foaf:knows bar:Jeff .

AUTHORITATIVE INFERENCE:
bar:Axel foaf:knows bar:Aidan . ✓
bar:Jeff foo:doesntKnow bar:Stefan . ✘
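The appendix example can be traced in code. The following is a minimal Python sketch assuming, as a simplification, that a source is authoritative for a property exactly when the property carries the source's prefix, and that each inverse rule requires authority for the property variable it shares with the instance pattern (?p1 for prp-inv1, ?p2 for prp-inv2). The function name is hypothetical.

```python
def namespace(term):
    """Prefix part of a prefixed name, e.g. 'foo' for 'foo:doesntKnow'."""
    return term.split(":", 1)[0]


def authoritative_inverse_inferences(tbox, abox):
    """Split prp-inv1/prp-inv2 inferences into accepted and rejected lists.

    tbox: (source, p1, p2) for each '?p1 owl:inverseOf ?p2' served by `source`.
    abox: (x, p, y) instance triples.
    """
    accepted, rejected = [], []
    for source, p1, p2 in tbox:
        for x, p, y in abox:
            if p == p1:
                # prp-inv1: source must be authoritative for ?p1.
                target = accepted if namespace(p1) == source else rejected
                target.append((y, p2, x))
            if p == p2:
                # prp-inv2: source must be authoritative for ?p2.
                target = accepted if namespace(p2) == source else rejected
                target.append((y, p1, x))
    return accepted, rejected


tbox = [("foo", "foo:doesntKnow", "foaf:knows")]
abox = [
    ("bar:Aidan", "foo:doesntKnow", "bar:Axel"),
    ("bar:Stefan", "foaf:knows", "bar:Jeff"),
]
accepted, rejected = authoritative_inverse_inferences(tbox, abox)
print("accepted:", accepted)
print("rejected:", rejected)
```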