4V - WP3 Progress Report (TIN2013-46238)

Post on 12-Apr-2017


4V: Volumen, Velocidad, Variedad y Validez en la gestión innovadora de datos

(TIN2013-46238)

Progress Report – WP3
Zaragoza, 15 June 2016

Ontology Engineering Group (OEG)
Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid
Campus de Montegancedo, Boadilla del Monte, 28660, Spain

Outline

• Loupe
  • On-going work
• Quality Assessment and Repair
  • Conciseness
  • Consistency
• Collaborations
  • A two-fold quality assurance approach for dynamic KBs: The 3cixty use case

Nandana Mihindukulasooriya, OEG

Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud
Demo @ ISWC2015

Loupe - Overview

Explore the vocabularies used and the abstract triple patterns in 2+ billion triples, including all DBpedia datasets, Wikidata, LinkedBrainz, and Bio2RDF.

Loupe helps to understand data, uncover patterns, formulate queries, and detect quality issues.

Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud. Demo @ ISWC2015.


Loupe – Google Analytics

Ontology Engineering Group, Universidad Politécnica de Madrid

Loupe – Google Analytics (II)

• Users from 84 countries
  • Spain (23.76%), US (16.69%), Germany (10.64%), UK (9.14%), Italy (4.51%)

Loupe On-going work

Loupe – Use Case Analysis

• Dataset descriptions
  • Dataset statistics
  • Dataset profiling
• Dataset exploration
  • Class/property browsing
  • Triple pattern browsing
• Dataset discovery and recommendation
  • Keywords, vocabularies
  • SPARQL queries
  • RDF shapes
• Quality assessment
  • Consistency
  • Misused vocabularies
• Guided SPARQL query generation
  • Auto-complete based on abstract triple patterns
• Vocabulary reuse and recommendation
  • Recommendation of vocabularies based on popularity
• Ontology development feedback
  • Common properties

Loupe – LOD Laundromat integration

• Current status of Loupe
  • 2 billion triples from 32 datasets
• LOD Laundromat
  • 32 billion triples from 650K documents
  • Cleaned for syntax errors and duplicates
  • Coverage of smaller documents
• Collaboration with VU University Amsterdam
• Steps
  • Fully automatic dataset download, SPARQL endpoint creation, indexing, clean up
  • UI changes to handle a large number of datasets
  • Vocabulary usage datasets

Loupe Ontology – Vocabulary Usage Statistics of LOD

• Analysis of existing metrics
  • VoID
  • DCAT
  • RDFStats
  • LODStats
  • VoID-Ext
• Analysis of use case requirements
  • Statistics
  • Profiling
  • Discovery
  • Recommendation

Loupe Ontology

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia
CAEPIA 2015, Albacete

Analyzed Quality Dimensions

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia. CAEPIA 2015.

A. Conciseness. A dataset does not contain redundant concepts with different identifiers.

B. Consistency. A dataset does not contain conflicting or contradictory data.

C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules.

D. Semantic Accuracy. Values correctly represent real-world facts.

Conciseness

• Many redundant properties in esDBpedia
  • 97.93% are auto-generated
• Causes
  • Capitalization (857): partidosEnPrimera, partidosenprimera
  • Synonyms: causaDeMuerte, causaDeFallecimiento
  • Prepositions: causaDeFallecimiento, causaFallecimiento
  • Spelling (7,495): apeliido, apelldio, apellid
  • Singular/plural: apellido, apellidos
  • Gender: administrador, administradora
  • Accent usage (1,252): administracion, administración
  • Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s

Consistency

• Diverse and incorrect domain and range types
  • esdbpedia:edad has range of type dbo:Place
  • esdbpedia:lugarmuerte has range of type dbo:Person
  • esdbpedia:pais has range of type dbo:Actor
• OWL properties with IRI and literal values
  • 3,380 properties
  • Use of strings and URLs interchangeably
    • esdbpedia:lugarDeEntierro
      • "Madrid"@es
      • http://es.dbpedia.org/resource/Madrid
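The mixed use of IRIs and literals for one property can be detected with a simple scan over the triples. The following is a minimal sketch, not the method used in the study: the helper names, the sample subjects, and the crude "starts with http" test for IRIs are all assumptions made for illustration.

```python
from collections import defaultdict

def is_iri(value):
    """Crude check (assumption): http(s) strings count as IRIs, everything else as a literal."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def mixed_value_properties(triples):
    """Return the properties that appear with both IRI and literal objects."""
    kinds = defaultdict(set)
    for _, prop, obj in triples:
        kinds[prop].add("iri" if is_iri(obj) else "literal")
    return sorted(p for p, seen in kinds.items() if seen == {"iri", "literal"})

# Hypothetical sample triples; only the property names come from the slide.
sample = [
    ("es:Persona_A", "esdbpedia:lugarDeEntierro", "Madrid"),
    ("es:Persona_B", "esdbpedia:lugarDeEntierro", "http://es.dbpedia.org/resource/Madrid"),
    ("es:Persona_A", "esdbpedia:edad", "42"),
]
mixed = mixed_value_properties(sample)
print(mixed)
```

On real data the object kind would come from the RDF parser rather than from a string prefix test.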

Conciseness

How to query for the birth place of a person in DBpedia?

DBpedia (lang): English
  Syntactically similar: birthplace, birthplace, placeofbirth, birthplace, birthdplace, birthPalce, birthplace, PlaceOfBirth, laceOfBirth, oplaceOfBirth, birthplace, birthplace, birthPalce, birthPlae, birthPace, birthPlaxe, birtPlace, birthPlcace, bithPlace, brithPlace, nbirthPlace, birthplace, birghPlace, birthdplace, biRthPlace, birth, placebirth, placeOfBirth, placOfBirth, birthPlaceOf, birthPlae
  Semantically similar: cityofbirth, cityofbirthPlace, cityOfBirth, birthLocation

DBpedia (lang): Spanish
  Syntactically similar: birthPlace, placeOfBirth, birthPlace, birthplace, lugarDeNacimiento, lugarNacimiento, lugarNacimiento, lugarnacimiento, lugardenacimiento, lugarNacimento, lugarNaciento
  Semantically similar: ciudaddenacimiento, ciudadDenacimiento, paisdenacimiento, paisNacimiento

DBpedia (lang): German
  Syntactically similar: geburtsort, birthplace, birthPlace, placeOfBirth, placeofbirth
  Semantically similar: geburtsland, countryofbirth

Conciseness

• Less-concise datasets
  • Multiple identifiers with the same semantics
• Issues
  • Harder to understand data and vocabularies used
  • Harder to write queries
  • Harder to reuse
• Causes
  • Less concise mappings
    • Diverse distributed mappings created by multiple teams
    • No policies or guidance on consistent vocabulary usage
    • No tools for recommending classes / properties
  • Crowd-sourced ontologies
    • No or minimal labels / descriptions

RDF generation process

[Figure: layered diagram of the RDF generation process.
  Data sources: structured data, unstructured data.
  Transformation: Bulk RDF Transformation (e.g., LOD Refine, DBpedia extraction framework, ad-hoc programs); Query Rewriting; RDF Mappings (e.g., R2RML, Mappings Wiki, D2R mappings, LOD Refine RDF skeletons).
  Storage: Triple Store with SPARQL Endpoint (e.g., Virtuoso, Fuseki); RDF Dumps; Web Server with Linked Data Resources (e.g., Pubby, ELDA).
  Access: SPARQL Clients, Linked Data Clients.]

DBpedia extraction process

[Figure: DBpedia extraction process. Labels: infobox, mappings, RDF, Triplestore, Rendering.]

Issues in DBpedia mappings

• 16 DBpedia chapters
• Crowd-sourced mappings using the mappings wiki
  • 5553 template mappings
• Mostly using the DBpedia ontology
  • 739 classes, 3049 properties
• In-concise usage of similar properties
  • elevation & height, formationYear & foundingYear, team & club, occupation & profession, foundedBy & founder
• Plan for repair
  • Detection of inconsistent property usage
  • Feedback to the ontology team
  • Feedback and guidance to the mapping teams
  • Automatic cleaning of the mappings (in RML)

Repairing conciseness issues in mappings

[Figure: the RDF generation process diagram (data sources, transformation, storage, access), showing Bulk RDF Transformation, Query Rewriting with RDF Mappings, SPARQL Endpoint, RDF Dumps, and Linked Data Resources.]

Detecting in-concise mapping based on data

dbr:Adobe_Systems dbo:formationYear "1982"^^xsd:gYear

dbr:Adobe_Systems dbo:foundingYear "1982"^^xsd:gYear

Graphs compared: DBpedia EN and DBpedia ES

Detection of in-concise mappings

Graph 1 (e.g., DBpedia EN), Graph 2 (e.g., DBpedia ES); S_C denotes a subject of class C.

M1(C,P1,P2): S_C P1 ?o in Graph 1 and S_C P2 ?o in Graph 2
M2(C,P1,P2): S_C P1 O in Graph 1 and S_C P2 O in Graph 2
M3(C,P1,P2): S_C P1 O1 in Graph 1 and S_C P2 O2 in Graph 2
M4(G1,C,P1,P2): S_C P1 ?o and S_C P2 ?o, both in Graph 1
M5(G2,C,P1,P2): S_C P1 ?o and S_C P2 ?o, both in Graph 2

C        P1                  P2             M1    M2/M1  M3/M1  M4/M1  M5/M1
Company  foundingYear        formationYear  170   0.72   0.24   0      0.05
Person   activeYearsEndYear  year           150   0.84   0.16   0      0
Person   birthPlace          deathPlace     2845  0.59   0.43   0.53   0.31

in-concise mappings
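The five counts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each graph is taken to be a set of (subject, property, object) tuples already restricted to subjects of class C, the name `pair_metrics` and the toy triples are hypothetical, and the metrics are read as follows: M1 counts subjects using P1 in graph 1 and P2 in graph 2, M2 and M3 count those with a shared value and with a differing value, and M4 and M5 count subjects using both properties inside a single graph.

```python
from collections import defaultdict

def values_by_subject(graph, prop):
    """Map each subject to the set of objects it has for `prop`."""
    index = defaultdict(set)
    for s, p, o in graph:
        if p == prop:
            index[s].add(o)
    return index

def pair_metrics(g1, g2, p1, p2):
    """Return (M1, M2, M3, M4, M5) under the reading described in the lead-in."""
    g1_p1, g1_p2 = values_by_subject(g1, p1), values_by_subject(g1, p2)
    g2_p1, g2_p2 = values_by_subject(g2, p1), values_by_subject(g2, p2)

    shared = set(g1_p1) & set(g2_p2)                     # P1 in G1 and P2 in G2
    m1 = len(shared)
    m2 = sum(1 for s in shared if g1_p1[s] & g2_p2[s])   # at least one common value
    m3 = sum(1 for s in shared if g1_p1[s] ^ g2_p2[s])   # at least one differing value
    m4 = len(set(g1_p1) & set(g1_p2))                    # both properties inside G1
    m5 = len(set(g2_p1) & set(g2_p2))                    # both properties inside G2
    return m1, m2, m3, m4, m5

# Toy data (hypothetical): three companies described in two graphs.
graph_1 = {
    ("Adobe_Systems", "formationYear", "1982"),
    ("Acme", "formationYear", "1990"),
    ("Initech", "formationYear", "1999"),
}
graph_2 = {
    ("Adobe_Systems", "foundingYear", "1982"),
    ("Acme", "foundingYear", "1991"),
    ("Initech", "foundingYear", "1999"),
    ("Initech", "formationYear", "1999"),
}

m1, m2, m3, m4, m5 = pair_metrics(graph_1, graph_2, "formationYear", "foundingYear")
ratios = [round(m / m1, 2) for m in (m2, m3, m4, m5)]
print(m1, ratios)
```

Under this reading, a pair with a high M2/M1 and near-zero M4/M1 and M5/M1 looks like one property under two names, which is the profile of the foundingYear/formationYear row. With multi-valued properties a subject can count towards both M2 and M3, which would explain ratios that sum to slightly more than 1, as in the birthPlace/deathPlace row.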

RDF generation process

[Figure: the RDF generation process diagram (data sources, transformation, storage, access), showing Bulk RDF Transformation, Query Rewriting with RDF Mappings, SPARQL Endpoint, RDF Dumps, and Linked Data Resources.]

Property Maps

Property Map Generation

• Step 1: group properties into clusters according to their domain and range

• Step 2: Multilingual NL preprocessing

• Step 3: aggregate properties by similarity (syntactic and semantic)
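The three steps can be sketched as follows. This is a deliberately simplified illustration, not the project's pipeline: step 2 is reduced to accent and case folding, step 3 covers only syntactic similarity (via `difflib` from the standard library) with a greedy clustering and an arbitrary threshold of 0.85, and the domain and range assigned to each sample property are assumptions. Semantic similarity (synonyms, translations) would need additional resources and is left out.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(name):
    """Step 2 (simplified): drop accents and case; no real NLP here."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_property_map(properties, threshold=0.85):
    """properties: iterable of (name, domain, range) tuples."""
    # Step 1: bucket properties by (domain, range).
    buckets = defaultdict(list)
    for name, domain, rng in properties:
        buckets[(domain, rng)].append(name)

    # Step 3 (syntactic part only): greedy clustering on string similarity,
    # comparing each name with the first member of every existing cluster.
    property_map = {}
    for key, names in buckets.items():
        clusters = []
        for name in names:
            for cluster in clusters:
                ratio = SequenceMatcher(None, normalize(name), normalize(cluster[0])).ratio()
                if ratio >= threshold:
                    cluster.append(name)
                    break
            else:
                clusters.append([name])
        property_map[key] = clusters
    return property_map

# Property names from the slides; domains and ranges are assumed for the example.
props = [
    ("lugarDeNacimiento", "Person", "Place"),
    ("lugardenacimiento", "Person", "Place"),
    ("lugarNacimiento", "Person", "Place"),
    ("lugarDeEntierro", "Person", "Place"),
    ("administracion", "Place", "Organisation"),
    ("administración", "Place", "Organisation"),
]
result = build_property_map(props)
for key, clusters in result.items():
    print(key, clusters)
```

Grouping by domain and range first keeps the pairwise comparison small and prevents look-alike names with unrelated meanings from being merged.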


Enhance SPARQL queries with property mappings
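One possible way to use a property map at query time is to expand a single property into all members of its cluster with a SPARQL 1.1 `VALUES` clause. The sketch below only builds the query string; the function name `expand_query`, the example resource, and the namespace assigned to each property name are assumptions, and the slide may show a different rewriting strategy.

```python
def expand_query(subject_iri, properties):
    """Build a SPARQL query that matches any property in one property-map cluster."""
    if not properties:
        raise ValueError("the property cluster must not be empty")
    alternatives = " ".join(f"<{p}>" for p in properties)
    return (
        "SELECT DISTINCT ?value WHERE {\n"
        f"  VALUES ?p {{ {alternatives} }}\n"
        f"  <{subject_iri}> ?p ?value .\n"
        "}"
    )

# Hypothetical cluster for "birth place"; the namespace of each name is assumed.
birth_place_cluster = [
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/property/placeOfBirth",
    "http://dbpedia.org/property/cityOfBirth",
]
query = expand_query("http://dbpedia.org/resource/Pablo_Picasso", birth_place_cluster)
print(query)
```

The user writes the query against one property, and the expansion keeps results from the other spellings and synonyms in the same cluster.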

Consistency

Consistency

• A consistent dataset does not contain conflicting or contradictory data.

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:City ;
    rdfs:range dbo:Country .

Consistency (II)

• Consistency issues
  • Data does not comply with the formal definitions or schema

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Zaragoza a dbo:City ;
    dbo:populationTotal 666058 ;
    dbo:populationTotal 684953 ;
    dbo:country dbr:Aragón ;
    dbo:mayor dbr:Juan_Alberto_Belloch ;
    dbo:mayor dbr:Pedro_Santisteve_Roche .

dbr:Aragón a dbo:AutonomousCommunity .

(The slide marks three issues in this data with callouts 1, 2, and 3.)
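Read as validation rules, the two maximum-cardinality restrictions on dbo:City can be checked by simply counting values. The sketch below is a closed-world illustration with hypothetical helper names; it treats every distinct name as a distinct thing, which is not how an OWL reasoner interprets the same axioms.

```python
from collections import Counter

# The Zaragoza description from the slide, as (subject, property, object) tuples.
triples = [
    ("dbr:Zaragoza", "rdf:type", "dbo:City"),
    ("dbr:Zaragoza", "dbo:populationTotal", 666058),
    ("dbr:Zaragoza", "dbo:populationTotal", 684953),
    ("dbr:Zaragoza", "dbo:country", "dbr:Aragón"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Pedro_Santisteve_Roche"),
    ("dbr:Aragón", "rdf:type", "dbo:AutonomousCommunity"),
]

# Maximum cardinalities taken from the dbo:City restrictions.
max_cardinality = {"dbo:populationTotal": 1, "dbo:mayor": 1}

def cardinality_violations(triples, target_class, limits):
    """List (subject, property, count) for instances of target_class over the limit."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    counts = Counter((s, p) for s, p, _ in set(triples) if s in instances)
    return sorted(
        (s, p, n) for (s, p), n in counts.items()
        if p in limits and n > limits[p]
    )

violations = cardinality_violations(triples, "dbo:City", max_cardinality)
for subject, prop, n in violations:
    print(f"{subject}: {prop} has {n} values, at most {max_cardinality[prop]} allowed")
```

A check of this kind reports both the population and the mayor problems directly, whereas reasoning under the open world assumption may draw new conclusions from the same data.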

populationTotal - Cardinality Violation

(Callout 1.)

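Consistency – (Incorrect) inferences

dbr:Juan_Alberto_Belloch owl:sameAs dbr:Pedro_Santisteve_Roche .

dbr:Aragón a dbo:Country .

(Callouts 2 and 3.)

• Open World Assumption and Non-Unique Name Assumption
  • Works better for inferencing than validation
  • An OWL reasoner derives the two triples above from the maximum cardinality on dbo:mayor and the range of dbo:country instead of reporting a violation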

Consistency – Rich Semantics

• Checking consistency with OWL.

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:Place ;
    rdfs:range dbo:Country .

dbo:AutonomousCommunity owl:disjointWith dbo:Country .

dbr:Juan_Alberto_Belloch owl:differentFrom dbr:Pedro_Santisteve_Roche .

(Callouts 2 and 3.)

Consistency – SHACL constraints

• Checking consistency with W3C SHACL.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

_:cityShape a sh:Shape ;
    sh:scopeClass dbo:City ;
    sh:property [
        sh:predicate dbo:mayor ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Person schema:Person foaf:Person )
    ] ;
    sh:property [
        sh:predicate dbo:country ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Country ) ;
        sh:stem "http://dbpedia.org/"
    ] .

Data validation with semi-automatically generated RDF Shapes

[Figure: workflow with the stages Pattern Extraction, Domain Expert Review, RDF Shape Generation, Data Validation, and Data Repair; the generated artifacts are SHACL Shapes.]

Cardinality constraints example

Property cardinalities of schema:Place class (extracted from data). Columns 0 to 5: percentage of instances with that many values.

schema:Place         Min  Max  P1  P99  Mean    0        1        2        3       4        5
rdf:type             1    2    1   1    1.0002  0        99.9793  0.0207   0       0        0
rdfs:label           1    6    1   6    4.2508  0        4.4048   36.6743  1.7445  0.4831   0
rdfs:seeAlso         0    4    1   2    1.5717  0.0340   42.7702  57.1905  0.0041  0.0011   0
owl:sameAs           0    6    0   0    0.0058  99.4455  0.5339   0.0146   0.0041  0.0015   0
schema.org:review    0    2    0   2    0.0329  98.3175  0.0717   1.6108   0       0        0
schema.org:url       0    40   0   10   0.5085  89.8340  1.8947   3.7013   0.3008  1.2155   0.3434
events:poster        0    23   0   1    0.0155  98.9609  0.5900   0.4237   0.0097  0.0120   0.0007
dc:publisher         0    2    0   2    1.0677  39.1777  14.8776  45.9447  0       0        0
events:businessType  0    4    0   2    1.5273  4.1889   38.9255  56.8673  0.0041  0.0142   0
schema:description   0    28   1   12   3.0573  0.0886   30.5193  32.8359  1.9605  19.1139  0.1226
geo:location         0    24   0   4    0.2040  92.7525  0.6819   3.2436   0.2634  2.9831   0.0060

Common cardinalities

Pat.  Min  Max  Description
A     0    N    No restrictions
B     0    1    Maximum 1
C     1    N    Minimum 1
D     1    1    Exactly 1

Cardinality Classifier

Extracted Patterns (schema:Place class)

rdf:type           D (Exactly 1)
rdfs:label         C (Minimum 1)
rdfs:seeAlso       C (Minimum 1)
owl:sameAs         A (No restrictions)
schema.org:review  A (No restrictions)

Expert Review

Approved Patterns (schema:Place class)

rdf:type           C (Minimum 1)
rdfs:label         C (Minimum 1)
rdfs:seeAlso       C (Minimum 1)
owl:sameAs         A (No restrictions)
schema.org:review  A (No restrictions)

Restrictions in SHACL

_:placeShape a sh:Shape ;
    sh:scopeClass schema:Place ;
    sh:property [ sh:predicate rdf:type ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:label ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:seeAlso ; sh:minCount 1 ] .
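The path from extracted statistics to SHACL restrictions can be sketched as below. The slides do not state the classifier's rule, so the one used here is an assumption: the minimum is 1 when P1 is at least 1, and the maximum is 1 when P99 equals 1. That rule happens to reproduce the five extracted patterns shown for schema:Place. The function names are hypothetical, and the output uses the draft SHACL terms that appear on the slides (sh:Shape, sh:scopeClass, sh:predicate).

```python
PATTERNS = {
    (0, "N"): ("A", "No restrictions"),
    (0, 1): ("B", "Maximum 1"),
    (1, "N"): ("C", "Minimum 1"),
    (1, 1): ("D", "Exactly 1"),
}

def classify(p1, p99):
    """Assumed rule: minimum 1 if P1 >= 1, maximum 1 if P99 == 1."""
    lower = 1 if p1 >= 1 else 0
    upper = 1 if p99 == 1 else "N"
    return PATTERNS[(lower, upper)][0]

def to_shacl(shape, scope_class, approved):
    """Render approved patterns as a shape; pattern A adds no constraint."""
    lines = [f"{shape} a sh:Shape ;", f"    sh:scopeClass {scope_class}"]
    for prop, pattern in approved.items():
        lower, upper = next(k for k, v in PATTERNS.items() if v[0] == pattern)
        facets = []
        if lower == 1:
            facets.append("sh:minCount 1")
        if upper == 1:
            facets.append("sh:maxCount 1")
        if facets:
            body = " ; ".join([f"sh:predicate {prop}"] + facets)
            lines[-1] += " ;"
            lines.append(f"    sh:property [ {body} ]")
    lines[-1] += " ."
    return "\n".join(lines)

# (P1, P99) pairs copied from the schema:Place table.
stats = {
    "rdf:type": (1, 1),
    "rdfs:label": (1, 6),
    "rdfs:seeAlso": (1, 2),
    "owl:sameAs": (0, 0),
    "schema.org:review": (0, 2),
}
extracted = {prop: classify(*pair) for prop, pair in stats.items()}
print(extracted)

# Expert review step from the slide: rdf:type is relaxed from D to C.
approved = {**extracted, "rdf:type": "C"}
shacl = to_shacl("_:placeShape", "schema:Place", approved)
print(shacl)
```

Using percentiles instead of the absolute minimum and maximum makes the classification tolerant of a few outliers, and the expert review step is where such a tolerance can be confirmed or overridden.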

W3C SHACL restrictions

• Value type constraints
  • sh:class, sh:classIn, sh:datatype, sh:datatypeIn, sh:nodeKind
• Cardinality constraints
  • sh:minCount, sh:maxCount
• Value range constraints
  • sh:minInclusive, sh:minExclusive, sh:maxInclusive, sh:maxExclusive
• String based constraints
  • sh:minLength, sh:maxLength, sh:pattern, sh:stem, sh:uniqueLang
• Property pair constraints
  • sh:equals, sh:disjoint, sh:lessThan, sh:lessThanOrEquals

A Two-Fold Quality Assurance Approach for Dynamic Knowledge Bases: The 3cixty Use Case

Continuous Integration is essential

Exploratory testing with Loupe

Automated testing with SPARQL Interceptor

• A set of user-defined SPARQL queries (as unit tests)
  • Knowledge-base specific

[Figure: Test SPARQL Queries, derived from System Requirements, Schema Constraints, Conventions and other restrictions, and Inputs from Exploratory Testing.]

SPARQL Interceptor

Designed and implemented by Localidata.