Post on 16-Apr-2017
Mappings Validation: Data Quality Tutorial - SEMANTiCS 2016
Anastasia Dimou
Anastasia.Dimou@ugent.be ● @natadimou
Ghent University – iMinds
Linked (Open) Data
semantically annotated & interlinked data using different vocabularies or ontologies
published in the form of RDF datasets
derive from originally heterogeneous (semi-)structured data
e.g.
Eurostat from TSV
DBLP from DBLP database
DBpedia from Wikipedia
LinkedBrainz from MusicBrainz database
...
Linked Data Quality dimensions
Representational dimension
Intrinsic dimension
Accessibility dimension
Contextual dimension
A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web Journal, 2016.
Linked Data Quality dimensions
Representational dimension: data modeling
Intrinsic dimension: Linked Data generation
Accessibility dimension: Linked Data publication
Contextual dimension: Linked Data consumption
Linked Data Quality - Intrinsic Dimension
determines the RDF dataset quality by assessing it for possible violations with respect to
accuracy (e.g. malformed datatype literals)
consistency (e.g. disjoint classes/properties)
Instead of applying Quality Assessment to the already published Linked Data
as part of Linked Data consumption
Apply Quality Assessment to the Mappings that generate the Linked Data
as part of Linked Data production
Linked Dataset Quality Assessment (DQA)
Mappings Quality Assessment (MQA)
Mapping & Dataset Quality Assessment Workflow
Mappings & Quality Assessment Evaluation Results
Linked Data Quality Assessment (DQA)
RDFUnit http://rdfunit.aksw.org
test-driven data-debugging framework
based on SPARQL-patterns
D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, 2014.
DQA with RDFUnit
…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
…WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
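The step from the generic pattern to the concrete test case is a placeholder substitution. A minimal Python sketch of that step, assuming plain string replacement is an adequate stand-in for RDFUnit's own binding mechanism; `bind_pattern` is a hypothetical helper, not an RDFUnit API:

```python
# Illustrative only: RDFUnit binds patterns internally.
# bind_pattern is a hypothetical helper, not part of RDFUnit.
PATTERN = "?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%)"


def bind_pattern(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value."""
    for name, value in bindings.items():
        pattern = pattern.replace(f"%%{name}%%", value)
    if "%%" in pattern:
        raise ValueError("unbound placeholder left in pattern")
    return pattern


test_case = bind_pattern(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
print(test_case)
# ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date)
```

One pattern can be bound many times, once per property and expected datatype taken from the schema.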
Linked Data Quality Assessment (DQA)
Similar violations occur repeatedly within a single Linked Data set
DQA: Linked Data Quality Assessment
is applied by third parties to already published Linked Data sets
adjustments are NOT applied at the root of the problem
adjustments are overwritten if a new version of the original data is annotated and published as Linked Data
Instead of applying Quality Assessment to the already published Linked Data set
as part of data consumption
Apply Quality Assessment to the Mappings that generate the Linked Data
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
RDF Mapping Language (RML) http://rml.io
extends the W3C-recommended R2RML
specify the mapping rules to generate Linked Data from heterogeneous data sources
mapping rules are Linked Data sets too!
A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
RDF Mapping Language (RML) http://rml.io
<#Mapping>
  rr:subjectMap [
    rr:class dbo:Event ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:gYear ] ] .
DQA → MQA with RDFUnit over RML
DQA pattern:
…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
DQA test case:
…WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }
MQA pattern:
… WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) }
Mapping under assessment:
<#Mapping>
  rr:subjectMap [
    rr:class dbo:Event ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:gYear ] ] .
ONLY 1 domain violation!
ONLY 1 datatype violation!
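Only one violation of each kind is reported because the mapping states each rule once, whereas every generated triple repeats it. A small Python sketch of that effect for the datatype case, assuming in-memory dictionaries and tuples instead of real RDF; the three source records and the helper names are made up for illustration:

```python
# Hypothetical sample data; not taken from the slides.
EXPECTED = {"dbo:birthDate": "xsd:date"}

# Mapping level: the datatype is stated once per object map.
mapping_rules = [
    {"predicate": "dbo:birthDate", "datatype": "xsd:gYear"},
]

# Dataset level: every generated triple carries the datatype on its literal.
records = [("Alice", "1980"), ("Bob", "1975"), ("Carol", "1990")]
dataset = [
    (f"http://example.com/{name}", rule["predicate"], birth, rule["datatype"])
    for name, birth in records
    for rule in mapping_rules
]


def mapping_violations(rules):
    """Rules whose declared datatype differs from the expected one."""
    return [
        r for r in rules
        if r["predicate"] in EXPECTED and EXPECTED[r["predicate"]] != r["datatype"]
    ]


def dataset_violations(triples):
    """Generated triples whose literal datatype differs from the expected one."""
    return [t for t in triples if t[1] in EXPECTED and EXPECTED[t[1]] != t[3]]


print(len(mapping_violations(mapping_rules)))  # 1
print(len(dataset_violations(dataset)))        # 3
```

The dataset-level count grows with the number of source records, while the mapping-level count stays at one per faulty rule.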
MQA: Mapping Quality Assessment
discover not only the violations but also their origin, before they are even generated
easily apply structural adjustments
prevent the same violations from appearing repeatedly across distinct entities
allow intuitively combining different ontologies and vocabularies
MDQA: Uniform Mapping & Dataset Quality Assessment
[Diagram labels: data, map doc, Mapping Processor, MDQA, violations, Mapping Refinements]
<#Result>
  rut:testCase rut:datatypeError ;
  spin:violationRoot <#ObjectMap> ;
  spin:violationPath rr:datatype ;
  spin:violationValue xsd:gYear ;
  rut:missingValue xsd:date .
DEL: <#ObjectMap> rr:datatype xsd:gYear .
ADD: <#ObjectMap> rr:datatype xsd:date .
MQA with RDFUnit over RML
DEL: <#ObjectMap> rr:datatype xsd:gYear .
ADD: <#ObjectMap> rr:datatype xsd:date .
DEL: <#SubjectMap> rr:class dbo:Event .
ADD: <#SubjectMap> rr:class dbo:Person .
Refined mapping:
<#Mapping>
  rr:subjectMap [
    rr:class dbo:Person ;
    rr:template "http://example.com/{Name}" ] ;
  rr:predicateObjectMap [
    rr:predicate dbo:birthDate ;
    rr:objectMap [
      rml:reference "Birth" ;
      rr:datatype xsd:date ] ] .
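Each refinement pairs a triple to delete with a triple to add, and both can be read off a violation result. A minimal Python sketch of that derivation, assuming triples are plain tuples and the violation is a dictionary mirroring the `spin:` and `rut:` properties of the result; the function names are hypothetical and no RDF library is involved:

```python
# Illustrative sketch: a mapping fragment as a set of (subject, predicate, object).
mapping = {
    ("<#ObjectMap>", "rml:reference", '"Birth"'),
    ("<#ObjectMap>", "rr:datatype", "xsd:gYear"),
}

# Violation result, keyed after the spin:/rut: properties of the result.
violation = {
    "violationRoot": "<#ObjectMap>",
    "violationPath": "rr:datatype",
    "violationValue": "xsd:gYear",
    "missingValue": "xsd:date",
}


def refinement(v):
    """Turn one violation into a (delete, add) pair of triples."""
    delete = (v["violationRoot"], v["violationPath"], v["violationValue"])
    add = (v["violationRoot"], v["violationPath"], v["missingValue"])
    return delete, add


def apply_refinement(triples, delete, add):
    """Return a new triple set with the DEL triple swapped for the ADD triple."""
    if delete not in triples:
        raise KeyError(f"triple to delete not found: {delete}")
    return (triples - {delete}) | {add}


delete, add = refinement(violation)
new_mapping = apply_refinement(mapping, delete, add)
print(sorted(new_mapping))
```

Returning a new set keeps the original mapping untouched, so the old and the refined mapping document can be compared or re-assessed.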
Uniform Mapping & Dataset Quality Assessment Workflow
[Diagram labels: data, map doc, new map doc, Mapping Processor, Mapping Refinements, MDQA, violations, (optional)]
Mapping Quality Assessment: Limitations
certain test cases inevitably require the complete Linked Data set
cardinality, functionality, symmetry
in the mappings' defense: these are more of a data issue, NOT affected by the mapping rules
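A functionality check illustrates why some test cases need the generated data. The mapping rule for a property can be perfectly valid while two source records still yield conflicting values for the same subject. A Python sketch, assuming for illustration that dbo:birthDate is treated as functional and using made-up triples:

```python
from collections import defaultdict

# Hypothetical generated triples: (subject, predicate, object).
dataset = [
    ("ex:Alice", "dbo:birthDate", "1980-01-01"),
    ("ex:Alice", "dbo:birthDate", "1981-05-17"),  # conflicting second value
    ("ex:Bob", "dbo:birthDate", "1975-03-02"),
]


def functional_violations(triples, prop):
    """Subjects carrying more than one distinct value for a functional property."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return sorted(s for s, objs in values.items() if len(objs) > 1)


print(functional_violations(dataset, "dbo:birthDate"))  # ['ex:Alice']
```

Nothing in the mapping document reveals the conflict for ex:Alice; it only becomes visible once the source records have been turned into triples.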
Dataset vs. Mapping Quality Assessment: Number of Violations
              Dataset Assessment    Mappings Assessment
DBpedia EN    3.2M                  160
DBLP          8.1M                  8
* DBpedia and DBLP D2RQ mappings were translated to RML mappings
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
Dataset vs. Mapping Quality Assessment: Time
              Dataset Quality Assessment    Mappings Quality Assessment
              size      time                size      time
DBpedia EN    62M       16h                 115K      11s
DBpedia NL    21M       1.5h                53K       6s
DBLP          12M       12h                 368       12s
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
Mapping Quality Assessment
* http://mappings.dbpedia.org/validation
Live update of DBpedia Mapping Quality Assessment results every night! ☺
               size    time
DBpedia EN     115K    11s
DBpedia NL     53K     6s
DBpedia All    511K    32s
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of The Semantic Web - ISWC 2015.
DBpedia Mappings Quality Assessment
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, and S. Hellmann. DBpedia Mappings Quality Assessment. To be published in Proceedings of the 15th International Semantic Web Conference: Posters and Demos, 2016.
Violations are related to the dataset's schema (vocabularies or ontologies)
and occur repeatedly within a single RDF dataset
The situation worsens as more ontologies and vocabularies are reused and combined
Linked Data Quality Assessment is shifted from data consumption to data publication
and integrated systematically in the publishing workflow
violations are identified and resolved, and will not reappear
Linked Data of higher quality is generated!