Mappings Validation

Data Quality Tutorial - SEMANTiCS 2016

Anastasia Dimou

@natadimou

Ghent University – iMinds

Linked (Open) Data

semantically annotated & interlinked data using different vocabularies or ontologies

published in the form of RDF datasets

Linked (Open) Data

derives from originally heterogeneous (semi-)structured data

e.g.
Eurostat from TSV
DBLP from the DBLP database
DBpedia from Wikipedia
LinkedBrainz from the MusicBrainz database
...

Linked Data Quality

in the context of the Linked Data generation and publication workflow

Linked Data Quality dimensions

Representational dimension

Intrinsic dimension

Accessibility dimension

Contextual dimension

A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web Journal, 2016.

Linked Data Quality dimensions

Representational dimension: data modeling

Intrinsic dimension: Linked Data generation

Accessibility dimension: Linked Data publication

Contextual dimension: Linked Data consumption

Linked Data Quality - Intrinsic Dimension

determines the RDF Dataset Quality by assessing it for possible violations

with respect to

accuracy (e.g. malformed datatype literals)

consistency (e.g. disjoint classes/properties)

Instead of applying Quality Assessment to the already published Linked Data as part of Linked Data consumption

Apply Quality Assessment to the Mappings that generate the Linked Data as part of Linked Data production

Linked Dataset Quality Assessment (DQA)

Mappings Quality Assessment (MQA)

Mapping & Dataset Quality Assessment Workflow

Mappings & Quality Assessment Evaluation Results

[Figure: example diagram with the labels dbo:Person and xsd:date]

Linked Data Quality Assessment (DQA)

RDFUnit http://rdfunit.aksw.org

test-driven data-debugging framework

based on SPARQL-patterns

D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web.

DQA with RDFUnit

…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }

…WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
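The step from the generic pattern to the concrete test case above is a placeholder binding. The following is a minimal sketch of that step using plain string substitution; the helper name `bind_pattern` is hypothetical and is not part of RDFUnit, and only the graph pattern inside the braces is modeled because the beginning of the query is elided on the slide.

```python
# Graph pattern from the slide, without the elided query head.
PATTERN = "?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%)"


def bind_pattern(pattern: str, bindings: dict[str, str]) -> str:
    """Replace every %%NAME%% placeholder with its bound value (hypothetical helper)."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace(f"%%{name}%%", value)
    if "%%" in query:
        # A leftover marker means a placeholder was never bound.
        raise ValueError(f"unbound placeholder left in: {query}")
    return query


# Bind the property and its expected datatype, as on the slide.
test_case = bind_pattern(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
print(test_case)
```

One pattern can be bound once per (property, datatype) pair found in the schema, which is how a small set of patterns can yield many test cases.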

10 domain violations

10 datatype violations

1,000,000 domain violations!!!

1,000,000 datatype violations!!!

Linked Data Quality Assessment (DQA)

Similar violations occur repeatedly within a single Linked Data set

Sets of triples of a dataset have repetitive patterns

DQA: Linked Data Quality Assessment

is applied by third parties to already published Linked Data sets

Adjustments are NOT applied at the root of the problem

Adjustments are overwritten if a new version of the original data is annotated and published as Linked Data

Instead of applying Quality Assessment to the already published Linked Data set

as part of data consumption

Apply Quality Assessment to the Mappings that generate the Linked Data

A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.

Mapping languages formalize patterns into rules to generate Linked Data from some original data

RDF Mapping Language (RML) http://rml.io

extends the W3C-recommended R2RML

specifies the mapping rules to generate Linked Data from heterogeneous data sources

mapping rules are Linked Data sets too!

A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.

RDF Mapping Language (RML) http://rml.io

<#Mapping>
  rr:subjectMap [
    rr:class dbo:Event ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:gYear ] ] .
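To make concrete how one mapping rule turns into many triples, here is a simplified sketch that applies the rule above to a few records. The dictionary layout, the function `generate_triples`, and the sample records are assumptions for illustration only; this is not the API of any RML processor.

```python
# The mapping rule from the slide, flattened into a plain dictionary (assumed layout).
MAPPING = {
    "class": "dbo:Event",
    "template": "http://example.com/{Name}",
    "predicate": "dbo:birthDate",
    "reference": "Birth",
    "datatype": "xsd:gYear",
}


def generate_triples(mapping, records):
    """Apply the single mapping rule to every record and collect the triples."""
    triples = []
    for record in records:
        # The subject IRI is built by filling the template with record fields.
        subject = "<" + mapping["template"].format(**record) + ">"
        triples.append((subject, "rdf:type", mapping["class"]))
        # The object is the referenced field, typed with the declared datatype.
        value = record[mapping["reference"]]
        literal = '"' + value + '"^^' + mapping["datatype"]
        triples.append((subject, mapping["predicate"], literal))
    return triples


# Made-up input records, standing in for rows of the original data.
records = [
    {"Name": "Alice", "Birth": "1980-01-01"},
    {"Name": "Bob", "Birth": "1975-06-30"},
]
triples = generate_triples(MAPPING, records)
for triple in triples:
    print(*triple, ".")
```

Every generated subject is typed dbo:Event and every birth date is typed xsd:gYear, so whatever is wrong in the rule is copied into each record's triples.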

[Figure: workflow diagram with the labels data, map doc, Mapping Processor]

DQA: Linked Data Quality Assessment

[Figure: the same diagram with the added labels violations, DQA]

MQA: Mapping Quality Assessment

[Figure: the same diagram with the added labels violations, MQA]

DQA → MQA with RDFUnit over RML

Dataset pattern:

…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }

Instantiated for dbo:birthDate:

…WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }

Mapping pattern:

… WHERE {
  ?resource rr:predicateObjectMap ?poMap .
  ?poMap rr:predicate %%P1%% ;
         rr:objectMap ?objM .
  ?objM rr:datatype ?c .
  FILTER (?c != %%D1%%)
}

<#Mapping>
  rr:subjectMap [
    rr:class dbo:Event ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:gYear ] ] .
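The mapping-level pattern walks from a predicate-object map to its object map and compares the declared datatype with the expected one. Below is a rough sketch of that traversal over the mapping written out as plain triples. The blank node labels (`_:sm`, `_:pom`, `_:om`) and both function names are hypothetical, and a real run would evaluate the SPARQL pattern with RDFUnit rather than this hand-written join.

```python
# The mapping document as triples; blank node labels are made up for the sketch.
MAPPING_TRIPLES = [
    ("<#Mapping>", "rr:subjectMap", "_:sm"),
    ("_:sm", "rr:class", "dbo:Event"),
    ("_:sm", "rr:template", '"http://example.com/{Name}"'),
    ("<#Mapping>", "rr:predicateObjectMap", "_:pom"),
    ("_:pom", "rr:predicate", "dbo:birthDate"),
    ("_:pom", "rr:objectMap", "_:om"),
    ("_:om", "rml:reference", '"Birth"'),
    ("_:om", "rr:datatype", "xsd:gYear"),
]


def objects(triples, subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def datatype_violations(triples, prop, expected):
    """Mirror the mapping pattern: P1 is `prop`, D1 is `expected`."""
    violations = []
    for resource, predicate, po_map in triples:
        if predicate != "rr:predicateObjectMap":
            continue
        if prop not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for declared in objects(triples, obj_map, "rr:datatype"):
                if declared != expected:  # FILTER (?c != %%D1%%)
                    violations.append((resource, obj_map, declared))
    return violations


found = datatype_violations(MAPPING_TRIPLES, "dbo:birthDate", "xsd:date")
print(found)
```

The check reads only the mapping document, so its cost does not depend on how many records the mapping will later be applied to.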

MQA with RDFUnit over RML

ONLY 1 domain violation!!!

ONLY 1 datatype violation!!!
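The contrast between the dataset-level and mapping-level counts can be illustrated with a toy calculation. This is only a sketch under the assumption that every record produces exactly one triple per faulty rule; the function names and the rule dictionary are invented for the example.

```python
EXPECTED_DATATYPE = "xsd:date"

# One faulty rule, as in the example mapping.
rule = {"predicate": "dbo:birthDate", "datatype": "xsd:gYear"}


def count_dataset_violations(rule, num_records):
    """Each generated triple inherits the rule's datatype, so errors repeat per record."""
    generated = (rule["datatype"] for _ in range(num_records))
    return sum(1 for datatype in generated if datatype != EXPECTED_DATATYPE)


def count_mapping_violations(rules):
    """The mapping is checked once per rule, independent of the data size."""
    return sum(1 for r in rules if r["datatype"] != EXPECTED_DATATYPE)


for num_records in (10, 1_000_000):
    print(
        num_records,
        count_dataset_violations(rule, num_records),
        count_mapping_violations([rule]),
    )
```

The dataset count grows with the number of records while the mapping count stays at one, which is the point the slides make with 10, 1,000,000, and 1.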

MDQA: Uniform Mapping & Dataset Quality Assessment

[Figure: workflow diagram with the labels data, map doc, Mapping Processor, violations, MDQA]

MQA: Mapping Quality Assessment

discover not only the violations but also their origin, before they are even generated

easily apply structural adjustments

prevent the same violations from appearing repeatedly across distinct entities

allow intuitively combining different ontologies and vocabularies

MDQA: Uniform Mapping & Dataset Quality Assessment

<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:gYear ;
  rut:missingValue xsd:date .
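A violation report of this shape carries enough information to propose a fix. The sketch below assumes the simplest possible reading: the triple formed by root, path, and violating value is deleted, and the same triple with the missing value is added. The dictionary keys and the function `suggest_refinement` are hypothetical and do not reflect how any tool actually implements the step.

```python
# The violation result from the slide as a plain dictionary (assumed layout).
violation = {
    "testCase": "rut:datatypeError",
    "violationRoot": "<#ObjectMap>",
    "violationPath": "rr:datatype",
    "violationValue": "xsd:gYear",
    "missingValue": "xsd:date",
}


def suggest_refinement(v):
    """Turn one violation into a (delete, add) pair of mapping triples."""
    delete = (v["violationRoot"], v["violationPath"], v["violationValue"])
    add = (v["violationRoot"], v["violationPath"], v["missingValue"])
    return delete, add


delete, add = suggest_refinement(violation)
print("DEL:", *delete, ".")
print("ADD:", *add, ".")
```

Because the violation root points at a node of the mapping document, the proposed change lands on the rule itself rather than on individual generated triples.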

Uniform Mapping & Dataset Quality Assessment Workflow

[Figure: workflow diagram with the labels data, map doc, Mapping Processor, Mapping Refinements, violations, MDQA]

Correcting MQA violations with RML Editor

MDQA: Uniform Mapping & Dataset Quality Assessment

DEL: <#ObjectMap> rr:datatype xsd:gYear .
ADD: <#ObjectMap> rr:datatype xsd:date .

MQA with RDFUnit over RML

<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:float ;
  rut:missingValue xsd:int .

DEL: <#ObjectMap> rr:datatype xsd:gYear .
ADD: <#ObjectMap> rr:datatype xsd:date .

DEL: <#SubjectMap> rr:class dbo:Event .
ADD: <#SubjectMap> rr:class dbo:Person .

<#Mapping>
  rr:subjectMap [
    rr:class dbo:Person ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:date ] ] .
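Applying the DEL/ADD pairs amounts to a small edit on the set of mapping triples. The following sketch shows one way to do that; the node names follow the slide, while the set-of-tuples representation and the function `apply_refinements` are assumptions made for the example.

```python
# The relevant mapping triples before refinement (assumed representation).
mapping = {
    ("<#SubjectMap>", "rr:class", "dbo:Event"),
    ("<#SubjectMap>", "rr:template", '"http://example.com/{Name}"'),
    ("<#ObjectMap>", "rml:reference", '"Birth"'),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
}

# Each refinement is a (DEL, ADD) pair, as listed on the slide.
refinements = [
    (("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
     ("<#ObjectMap>", "rr:datatype", "xsd:date")),
    (("<#SubjectMap>", "rr:class", "dbo:Event"),
     ("<#SubjectMap>", "rr:class", "dbo:Person")),
]


def apply_refinements(triples, refinements):
    """Return a refined copy of the mapping; the input set is left untouched."""
    refined = set(triples)
    for delete, add in refinements:
        if delete not in refined:
            raise KeyError(f"triple to delete not found: {delete}")
        refined.remove(delete)
        refined.add(add)
    return refined


new_mapping = apply_refinements(mapping, refinements)
for triple in sorted(new_mapping):
    print(*triple, ".")
```

Returning a new set instead of editing in place keeps the original mapping document available, which matches the workflow's distinction between the map doc and the new map doc.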

Uniform Mapping & Dataset Quality Assessment Workflow

[Figure: workflow diagram with the labels data, map doc, new map doc, Mapping Processor, Mapping Refinements, violations, MDQA, (optional)]

Mapping Quality Assessment: Limitations

certain test cases inevitably require the complete Linked Data set

cardinality, functionality, symmetry

in defense of the Mappings: these are more of a data issue, NOT affected by the mapping rules
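A functionality check shows why some test cases need the generated data. The sketch below treats dbo:birthDate as a functional property purely for illustration, and the sample triples and the function name are made up. Whether a subject ends up with two values depends on the source records, which the mapping rule alone does not reveal.

```python
from collections import defaultdict


def functional_violations(triples, functional_property):
    """Report subjects that have more than one distinct value for the property."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == functional_property:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


# Made-up generated triples: the same rule produced two birth dates for ex:Alice
# because the source data contained two conflicting rows.
dataset = [
    ("ex:Alice", "dbo:birthDate", "1980-01-01"),
    ("ex:Alice", "dbo:birthDate", "1981-01-01"),
    ("ex:Bob", "dbo:birthDate", "1975-06-30"),
]
conflicts = functional_violations(dataset, "dbo:birthDate")
print(conflicts)
```

The mapping rule behind these triples can be perfectly valid, so this kind of violation has to be caught at the dataset level.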

Dataset vs. Mapping Quality Assessment: Number of Violations

#violations    Dataset Assessment    Mappings Assessment
DBpedia EN     3.2M                  160
DBLP           8.1M                  8

* DBpedia and DBLP D2RQ mappings were translated to RML mappings

A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.

Dataset vs. Mapping Quality Assessment: Time

              Dataset Quality Assessment    Mappings Quality Assessment
              size      time                size      time
DBpedia EN    62M       16h                 115K      11s
DBpedia NL    21M       1.5h                53K       6s
DBLP          12M       12h                 368       12s

A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
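The reported times can be put on a common scale to see the rough speedup. This is a small arithmetic sketch over the figures in the table; the helper `to_seconds` is hypothetical and only handles the two unit suffixes that appear there.

```python
# Seconds per unit suffix used in the table ("h" for hours, "s" for seconds).
UNIT_SECONDS = {"s": 1, "h": 3600}


def to_seconds(duration: str) -> float:
    """Convert a value such as '1.5h' or '11s' to seconds."""
    return float(duration[:-1]) * UNIT_SECONDS[duration[-1]]


# (dataset assessment time, mapping assessment time) as reported in the table.
TIMES = {
    "DBpedia EN": ("16h", "11s"),
    "DBpedia NL": ("1.5h", "6s"),
    "DBLP": ("12h", "12s"),
}

speedups = {
    name: to_seconds(dataset_time) / to_seconds(mapping_time)
    for name, (dataset_time, mapping_time) in TIMES.items()
}
for name, factor in speedups.items():
    print(f"{name}: mapping assessment is about {factor:,.0f}x faster")
```

The ratios compare assessments over inputs of very different sizes, so they describe the practical gain of checking the mappings rather than a like-for-like benchmark.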

Mapping Quality Assessment

* http://mappings.dbpedia.org/validation

Live update of DBpedia Mapping Quality Assessment results every night! ☺

Mapping Quality Assessment

               size    time
DBpedia EN     115K    11s
DBpedia NL     53K     6s
DBpedia All    511K    32s

A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.

DBpedia Mappings Quality Assessment

A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, and S. Hellmann. DBpedia Mappings Quality Assessment. To be published in Proceedings of the 15th International Semantic Web Conference: Posters and Demos, 2016.

Violations are related to the dataset's schema (vocabularies or ontologies)

and occur repeatedly within a single RDF dataset

The situation worsens as more ontologies and vocabularies are reused and combined

Linked Data Quality Assessment shifted from data consumption to data publication

integrated systematically in the publishing workflow

violations are identified, resolved, and will not reappear

Linked Data of higher Quality is generated!!!
