Assessing and Refining Mappings to RDF to Improve Dataset Quality

Anastasia Dimou 1, Dimitris Kontokostas 2, Markus Freudenberg 2, Ruben Verborgh 1, Jens Lehmann 2, Erik Mannens 1, Sebastian Hellmann 2, Rik Van de Walle 1

@natadimou, @jimkont

1 Ghent University – iMinds – MMLab

2 AKSW – Leipzig University

http://RML.io ● http://RDFUnit.aksw.org


Linked Open Data

semantically annotated using different vocabularies or ontologies and interlinked data representations

published in the form of RDF datasets

derived from originally heterogeneous (semi-)structured data

RDF Dataset Quality

varies significantly ranging from expensively curated to relatively low quality datasets

RDF Dataset Quality - Intrinsic Dimension

determines the RDF Dataset Quality by assessing it for possible violations with respect to

accuracy (e.g. malformed datatype literals)

consistency (e.g. disjoint classes/properties)

RDF Dataset Quality Assessment (DQA) DQA with RDFUnit

Mappings Quality Assessment (MQA) MQA with RDFUnit over RML

Mapping & Dataset Quality Assessment Workflow Mapping Refinements

Mappings & Quality Assessment Results


Violations

Most frequent violations are related to the dataset's schema (vocabularies or ontologies)

Schema: dbo:birthDate rdfs:range xsd:date ; rdfs:domain dbo:Person .

Data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .

RDF DQA with RDFUnit

test-driven data-debugging framework

based on SPARQL-patterns

http://rdfunit.aksw.org

D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, 2014.


RDF DQA with RDFUnit

Test pattern:
… WHERE { ?resource %%P1%% ?c . FILTER (DATATYPE(?c) != %%D1%%) }

Test case, with %%P1%% bound to dbo:birthDate and %%D1%% bound to xsd:date:
… WHERE { ?resource dbo:birthDate ?c . FILTER (DATATYPE(?c) != xsd:date) }

Violating data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .

http://rdfunit.aksw.org
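The step from a test pattern to a concrete test case is plain placeholder substitution. The sketch below illustrates that idea only; the `instantiate` helper and the `PATTERN` constant are hypothetical names introduced here and are not part of RDFUnit's API.

```python
import re

# The WHERE clause of the datatype pattern shown on the slide.
PATTERN = "WHERE { ?resource %%P1%% ?c . FILTER (DATATYPE(?c) != %%D1%%) }"


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value (hypothetical helper)."""

    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder {name}")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", substitute, pattern)


# Bindings taken from the schema: dbo:birthDate has range xsd:date.
test_case = instantiate(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
print(test_case)
```

One pattern can be instantiated once per schema axiom, which is how a small set of patterns yields many test cases.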

Violations

Similar violations occur repeatedly within a single RDF dataset

<http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
<http://example.com/Matt_McBride> a dbo:Event ; dbo:birthDate "1985-05-23"^^xsd:gYear .
<http://example.com/Steve_Meilinger> a dbo:Event ; dbo:birthDate "1930-12-12"^^xsd:gYear .
<http://example.com/Brick_Bronsky> a dbo:Event ; dbo:birthDate "1964"^^xsd:gYear .
<http://example.com/Giddeon_Massie> a dbo:Event ; dbo:birthDate "1981-08-27"^^xsd:gYear .

Sets of triples of a dataset have repetitive patterns

Pattern: subject http://example.com/{Name}_{Surname}, class dbo:Event, predicate dbo:birthDate, object "Birth" with datatype xsd:gYear

Mapping languages formalize patterns into rules to generate the RDF dataset from the original data

Instead of applying Quality Assessment to the already published RDF dataset as part of data consumption

Apply Quality Assessment to the Mappings that generate the RDF dataset

Incorporate Quality Assessment in the publishing workflow

DQA: Dataset Quality Assessment

is applied by third parties to an already published RDF dataset

Adjustments to the dataset

are applied manually, but rarely

are not applied at the root of the violation (hard to identify)

are overwritten if a new version of the original data is mapped & published

[Figure: DQA reports violations]


Sets of triples of a dataset have repetitive patterns

Name     Surname     Birth
Chuck    Bednarik    1925-05-01
Matt     McBride     1985-05-23
Steve    Meilinger   1930-12-12
Brick    Bronsky     1964
Giddeon  Massie      1981-08-27

Pattern: subject http://example.com/{Name}_{Surname}, class dbo:Event, predicate dbo:birthDate, object "Birth" with datatype xsd:gYear
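To make the link between the table and the pattern concrete, the sketch below applies the pattern to each row and emits one pair of triples per person. It is a toy illustration, not an RML processor: the `generate` function and the tuple representation of triples are assumptions made for this example.

```python
# Rows of the original data, as shown in the table.
ROWS = [
    {"Name": "Chuck", "Surname": "Bednarik", "Birth": "1925-05-01"},
    {"Name": "Matt", "Surname": "McBride", "Birth": "1985-05-23"},
    {"Name": "Steve", "Surname": "Meilinger", "Birth": "1930-12-12"},
    {"Name": "Brick", "Surname": "Bronsky", "Birth": "1964"},
    {"Name": "Giddeon", "Surname": "Massie", "Birth": "1981-08-27"},
]

SUBJECT_TEMPLATE = "http://example.com/{Name}_{Surname}"


def generate(rows):
    """Apply the pattern to every row (hypothetical stand-in for a mapping processor)."""
    triples = []
    for row in rows:
        subject = "<" + SUBJECT_TEMPLATE.format(**row) + ">"
        triples.append((subject, "rdf:type", "dbo:Event"))
        # The datatype is fixed by the pattern, so every row inherits the same choice.
        triples.append((subject, "dbo:birthDate", f'"{row["Birth"]}"^^xsd:gYear'))
    return triples


triples = generate(ROWS)
for triple in triples:
    print(" ".join(triple), ".")
```

Because the class and the datatype come from the pattern rather than from the rows, a single wrong choice in the pattern is repeated for every entity.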

RDF Mapping Language (RML)

specifies the mapping definitions that generate RDF representations from heterogeneous data sources

extends the W3C-recommended R2RML

http://rml.io

A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.

RDF Mapping Language (RML)

<#Mapping>
    rr:subjectMap [
        rr:class dbo:Event ;
        rr:template "http://example.com/{Name}_{Surname}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate dbo:birthDate ;
        rr:objectMap [
            rml:reference "Birth" ;
            rr:datatype xsd:gYear
        ]
    ] .

http://rml.io

DQA: Dataset Quality Assessment

[Figure: data and map doc are input to the Mapping Processor; DQA reports violations]

MQA with RDFUnit over RML

Dataset-level test pattern:
… WHERE { ?resource %%P1%% ?c . FILTER (DATATYPE(?c) != %%D1%%) }

Dataset-level test case:
… WHERE { ?resource dbo:birthDate ?c . FILTER (DATATYPE(?c) != xsd:date) }

Mapping-level test pattern:
… WHERE { ?resource rr:predicateObjectMap ?poMap . ?poMap rr:predicate %%P1%% ; rr:objectMap ?objM . ?objM rr:datatype ?c . FILTER (?c != %%D1%%) }

Mapping under assessment:
<#Mapping>
    rr:subjectMap [
        rr:class dbo:Event ;
        rr:template "http://example.com/{Name}_{Surname}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate dbo:birthDate ;
        rr:objectMap [
            rml:reference "Birth" ;
            rr:datatype xsd:gYear
        ]
    ] .

MQA: Mapping Quality Assessment

[Figure: data and map doc are input to the Mapping Processor; MQA reports violations]

MDQA: Uniform Mapping & Dataset Quality Assessment

[Figure: data and map doc are input to the Mapping Processor; MDQA reports violations]


RDFUnit over RML

Mapping-level test pattern:
… WHERE { ?resource rr:predicateObjectMap ?poMap . ?poMap rr:predicate %%P1%% ; rr:objectMap ?objM . ?objM rr:datatype ?c . FILTER (?c != %%D1%%) }

Mapping-level test case:
… WHERE { ?resource rr:predicateObjectMap ?poMap . ?poMap rr:predicate dbo:birthDate ; rr:objectMap ?objM . ?objM rr:datatype ?c . FILTER (?c != xsd:date) }

Reported violation:
<#Result> rut:testCase rut:datatypeError ;
    spin:violationRoot <#ObjectMap> ;
    spin:violationPath rr:datatype ;
    spin:violationValue xsd:gYear ;
    rut:missingValue xsd:date .

Data the mapping would generate: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
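The mapping-level test case walks the mapping document itself rather than the generated data. The sketch below mimics that join over a mapping held as plain tuples. It is only an illustration under stated assumptions: the tuple representation, the blank node labels `_:sm` and `_:pom`, the choice to name the object map `<#ObjectMap>`, and the `datatype_violations` function are introduced here and do not reflect how RDFUnit evaluates SPARQL.

```python
# The mapping document as (subject, predicate, object) tuples.
MAPPING = [
    ("<#Mapping>", "rr:subjectMap", "_:sm"),
    ("_:sm", "rr:class", "dbo:Event"),
    ("_:sm", "rr:template", "http://example.com/{Name}_{Surname}"),
    ("<#Mapping>", "rr:predicateObjectMap", "_:pom"),
    ("_:pom", "rr:predicate", "dbo:birthDate"),
    ("_:pom", "rr:objectMap", "<#ObjectMap>"),
    ("<#ObjectMap>", "rml:reference", "Birth"),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
]


def objects(triples, subject, predicate):
    """All objects of the triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def datatype_violations(triples, predicate, expected):
    """Find object maps of `predicate` whose rr:datatype differs from `expected`."""
    violations = []
    for _resource, p, po_map in triples:
        if p != "rr:predicateObjectMap":
            continue
        if predicate not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for declared in objects(triples, obj_map, "rr:datatype"):
                if declared != expected:
                    violations.append({
                        "root": obj_map,        # cf. spin:violationRoot
                        "path": "rr:datatype",  # cf. spin:violationPath
                        "value": declared,      # cf. spin:violationValue
                        "expected": expected,   # cf. rut:missingValue
                    })
    return violations


violations = datatype_violations(MAPPING, "dbo:birthDate", "xsd:date")
print(violations)
```

The check reads eight mapping triples and reports one violation, regardless of how many rows the original data contains.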

MQA: Mapping Quality Assessment

discover violations before they are even generated

specify the origin of the violation

easily apply structural adjustments to the mapping definitions


MDQA: Uniform Mapping & Dataset Quality Assessment

[Figure: data and map doc are input to the Mapping Processor; MDQA reports violations, which lead to Mapping Refinements]

Reported violation:
<#Result> rut:testCase rut:datatypeError ;
    spin:violationRoot <#ObjectMap> ;
    spin:violationPath rr:datatype ;
    spin:violationValue xsd:gYear ;
    rut:missingValue xsd:date .

Mapping Refinements

DEL: <#ObjectMap> rr:datatype xsd:gYear
ADD: <#ObjectMap> rr:datatype xsd:date
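A reported violation already names the root, the path, the offending value and the expected value, so the DEL/ADD refinement can be derived from it mechanically. The sketch below shows one way this could look; the dictionary keys and the `refine` function are hypothetical names chosen for this example, and the mapping is again held as plain tuples.

```python
# Relevant part of the mapping document as (subject, predicate, object) tuples.
MAPPING = [
    ("<#ObjectMap>", "rml:reference", "Birth"),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
]

# The violation result from the slide, flattened into a dictionary.
VIOLATION = {
    "root": "<#ObjectMap>",   # spin:violationRoot
    "path": "rr:datatype",    # spin:violationPath
    "value": "xsd:gYear",     # spin:violationValue
    "expected": "xsd:date",   # rut:missingValue
}


def refine(triples, violation):
    """Return (deleted, added, refined_triples) for a single violation."""
    deleted = (violation["root"], violation["path"], violation["value"])
    added = (violation["root"], violation["path"], violation["expected"])
    refined = [added if triple == deleted else triple for triple in triples]
    return deleted, added, refined


deleted, added, refined = refine(MAPPING, VIOLATION)
print("DEL:", " ".join(deleted))
print("ADD:", " ".join(added))
```

The refined mapping can then be handed back to the Mapping Processor, so the corrected datatype applies to every generated entity.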

MQA with RDFUnit over RML

Data generated after the refinements: <http://example.com/Chuck_Bednarik> a dbo:Person ; dbo:birthDate "1925-05-01"^^xsd:date .

<#Result> rut:testCase rut:datatypeError ;
    spin:violationRoot <#ObjectMap> ;
    spin:violationPath rr:datatype ;
    spin:violationValue xsd:float ;
    rut:missingValue xsd:int .

Uniform Mapping & Dataset Quality Assessment Workflow

[Figure: data and map doc are input to the Mapping Processor; MDQA reports violations; Mapping Refinements produce a new map doc; one step is marked (optional)]

Beyond Mapping Quality Assessment

certain test cases inevitably require the RDF dataset: cardinality, functionality, symmetry

they reflect the data and are NOT affected by the mapping definitions
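Functionality is a useful example of a check that the mapping alone cannot settle: whether a subject ends up with more than one value for a property depends on the rows of the original data. The sketch below runs such a check over generated triples. The data is made up for illustration (the second birth date is invented to trigger the check), and `functional_violations` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up generated triples: the second birth date for Brick_Bronsky is invented
# here purely to show a functionality violation.
DATASET = [
    ("<http://example.com/Chuck_Bednarik>", "dbo:birthDate", '"1925-05-01"^^xsd:date'),
    ("<http://example.com/Brick_Bronsky>", "dbo:birthDate", '"1964-01-01"^^xsd:date'),
    ("<http://example.com/Brick_Bronsky>", "dbo:birthDate", '"1964-02-02"^^xsd:date'),
]


def functional_violations(triples, predicate):
    """Subjects that carry more than one distinct value for `predicate`."""
    values = defaultdict(set)
    for subject, p, obj in triples:
        if p == predicate:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


print(functional_violations(DATASET, "dbo:birthDate"))
```

The same mapping definition produces both the clean and the violating subject, which is why this kind of test case stays at the dataset level.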

Mapping Quality Assessment (MQA)

prevents violations from being generated

prevents the same violations from appearing repeatedly over distinct entities

allows different ontologies and vocabularies to be combined intuitively


Dataset Vs Mapping Quality Assessment: Number of Violations

              Dataset Quality Assessment        Mapping Quality Assessment
              #fail test cases   #violations    #fail test cases   #violations
DBpedia EN    1,128              3.2M           1                  160
DBpedia NL    683                815k           1                  124
DBLP          7                  8.1M           2                  8

*DBpedia and D2RQ mappings were translated to RML mappings

Dataset Vs Mapping Quality Assessment: Time

              Dataset Quality Assessment    Mapping Quality Assessment
              size      time                size      time
DBpedia EN    62M       16h                 115K      11s
DBpedia NL    21M       1.5h                53K       6s
DBLP          12M       12h                 368       12s
CEUR-WS*      2.4k      6s                  702       5s
iLastic       150k      12s                 825       15s

*CEUR-WS submission to the ESWC Semantic Publishing Challenge (2014 Vs 2015)

Mapping Quality Assessment

               size     time
DBpedia EN     115K     11s
DBpedia NL     53K      6s
DBpedia All    511K     32s

Live update of DBpedia Mapping Quality Assessment results every night!*

* http://mappings.dbpedia.org/validation

Violations Most frequent violations are related to the dataset's schema (vocabularies or ontologies)

Similar violations occur repeatedly within a single RDF dataset

The situation is aggravated the more ontologies and vocabularies are reused and combined

Quality Assessment is shifted from data consumption to data publication

it is integrated systematically in the publishing workflow

violations are identified and resolved, and will not re-appear

an RDF dataset of higher quality is generated