Assessing and Refining Mappings to RDF to Improve Dataset Quality
Anastasia Dimou (1), Dimitris Kontokostas (2), Markus Freudenberg (2), Ruben Verborgh (1), Jens Lehmann (2), Erik Mannens (1), Sebastian Hellmann (2), Rik Van de Walle (1)
(1) Ghent University – iMinds – MMLab
(2) AKSW – Leipzig University
@natadimou ● @jimkont
http://RML.io ● http://RDFUnit.aksw.org
Linked Open Data
data representations that are semantically annotated using different vocabularies or ontologies, and interlinked
published in the form of RDF datasets
derived from originally heterogeneous (semi-)structured data
RDF Dataset Quality
varies significantly, ranging from expensively curated to relatively low-quality datasets
RDF Dataset Quality - Intrinsic Dimension
determines the RDF dataset quality by assessing it for possible violations with respect to
accuracy (e.g. malformed datatype literals)
consistency (e.g. disjoint classes/properties)
Outline
RDF Dataset Quality Assessment (DQA): DQA with RDFUnit
Mappings Quality Assessment (MQA): MQA with RDFUnit over RML
Mapping & Dataset Quality Assessment Workflow: Mapping Refinements
Mappings & Quality Assessment Results
Violations
Most frequent violations are related to the dataset's schema (vocabularies or ontologies)
Schema: dbo:birthDate rdfs:range xsd:date ; rdfs:domain dbo:Person .
Data: <http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
RDF DQA with RDFUnit
test-driven data-debugging framework
based on SPARQL-patterns
http://rdfunit.aksw.org
D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web.
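RDFUnit test pattern, with placeholders for the property (%%P1%%) and the expected datatype (%%D1%%):
…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
The same pattern instantiated for dbo:birthDate, whose range is xsd:date:
…WHERE { ?resource dbo:birthDate ?c. FILTER (DATATYPE(?c) != xsd:date) }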
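The pattern instantiation and the check it expresses can be sketched in a few lines of Python. This is only an illustration, not RDFUnit's implementation: no RDF library or SPARQL engine is involved, triples are modelled as plain tuples, the helper names `instantiate` and `datatype_violations` are hypothetical, and the second triple (typed xsd:date) is an invented conforming example.

```python
# Stand-in for an RDFUnit-style datatype test; no RDF library is used.
PATTERN = "WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }"


def instantiate(pattern, bindings):
    """Replace each %%NAME%% placeholder with its bound value."""
    for name, value in bindings.items():
        pattern = pattern.replace("%%" + name + "%%", value)
    return pattern


def datatype_violations(triples, prop, expected):
    """Return (subject, lexical form, datatype) for every object of `prop`
    whose datatype differs from `expected`."""
    return [
        (s, lexical, datatype)
        for (s, p, (lexical, datatype)) in triples
        if p == prop and datatype != expected
    ]


# Triples are (subject, predicate, (lexical form, datatype)) tuples.
triples = [
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", ("1925-05-01", "xsd:gYear")),
    # Hypothetical conforming triple, added only for contrast.
    ("http://example.com/Matt_McBride", "dbo:birthDate", ("1985-05-23", "xsd:date")),
]

query = instantiate(PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"})
violations = datatype_violations(triples, "dbo:birthDate", "xsd:date")
```

Here `query` is the instantiated test text, and `violations` holds only the Chuck_Bednarik triple, because its literal is typed xsd:gYear instead of xsd:date.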
Similar violations occur repeatedly within a single RDF dataset:
<http://example.com/Giddeon_Massie> a dbo:Event ; dbo:birthDate "1981-08-27"^^xsd:gYear .
<http://example.com/Brick_Bronsky> a dbo:Event ; dbo:birthDate "1964"^^xsd:gYear .
<http://example.com/Steve_Meilinger> a dbo:Event ; dbo:birthDate "1930-12-12"^^xsd:gYear .
<http://example.com/Chuck_Bednarik> a dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear .
<http://example.com/Matt_McBride> a dbo:Event ; dbo:birthDate "1985-05-23"^^xsd:gYear .
Sets of triples of a dataset have repetitive patterns:
<http://example.com/{Name}_{Surname}> a dbo:Event ; dbo:birthDate "{Birth}"^^xsd:gYear .
Mapping languages formalize patterns into rules to generate the RDF dataset from the original data
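To make the idea of a pattern turned into a rule concrete, the following Python sketch applies one fixed rule to rows of source data. It is an assumption-laden illustration, not an RML processor: the rule is hard-coded, the helper names `expand` and `apply_rule` are hypothetical, and IRI-safe encoding of template values is not handled.

```python
import re

# Subject template taken from the slides; the rule itself is hard-coded here.
SUBJECT_TEMPLATE = "http://example.com/{Name}_{Surname}"


def expand(template, row):
    """Fill every {Field} placeholder with the matching value of the row."""
    return re.sub(r"\{(\w+)\}", lambda m: row[m.group(1)], template)


def apply_rule(row):
    """Generate the two triples that the pattern describes for one row."""
    subject = expand(SUBJECT_TEMPLATE, row)
    return [
        (subject, "rdf:type", "dbo:Event"),
        (subject, "dbo:birthDate", (row["Birth"], "xsd:gYear")),
    ]


rows = [
    {"Name": "Chuck", "Surname": "Bednarik", "Birth": "1925-05-01"},
    {"Name": "Brick", "Surname": "Bronsky", "Birth": "1964"},
]

dataset = [triple for row in rows for triple in apply_rule(row)]
```

Because every row goes through the same rule, a wrong class or datatype in the rule is reproduced once per row, which is why the same violation shows up repeatedly in the generated dataset.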
Instead of applying quality assessment to the already published RDF dataset, as part of data consumption,
apply quality assessment to the mappings that generate the RDF dataset:
incorporate quality assessment in the publishing workflow
DQA: Dataset Quality Assessment
is applied by third parties to an already published RDF dataset
Adjustments to the dataset
are applied manually, and only rarely
are not applied at the root of the violation (which is hard to identify)
are overwritten if a new version of the original data is mapped & published
Original data, from which the mapping rules generate the RDF dataset:
Name | Surname | Birth
Chuck | Bednarik | 1925-05-01
Matt | McBride | 1985-05-23
Steve | Meilinger | 1930-12-12
Brick | Bronsky | 1964
Giddeon | Massie | 1981-08-27
RDF Mapping Language (RML)
specifies the mapping definitions that generate RDF representations from heterogeneous data sources
extends the W3C-recommended R2RML
http://rml.io
A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
RDF Mapping Language (RML): example mapping definition
<#Mapping>
  rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}_{Surname}" ] ;
  rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
MQA with RDFUnit over RML
Test pattern over the generated dataset:
…WHERE { ?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%) }
The same test, rewritten to query the mapping definitions instead of the generated triples:
… WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate %%P1%%; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != %%D1%%) }
Mapping definition under test:
<#Mapping>
  rr:subjectMap [ rr:class dbo:Event ; rr:template "http://example.com/{Name}_{Surname}" ] ;
  rr:predicateObjectMap [ rr:predicate dbo:birthDate ; rr:objectMap [ rml:reference "Birth" ; rr:datatype xsd:gYear ] ] .
RDFUnit over RML
Mapping test instantiated for dbo:birthDate with expected datatype xsd:date:
… WHERE { ?resource rr:predicateObjectMap ?poMap. ?poMap rr:predicate dbo:birthDate; rr:objectMap ?objM. ?objM rr:datatype ?c. FILTER (?c != xsd:date) }
Reported violation:
<#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
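The mapping-level test can be mimicked without a SPARQL engine by treating the mapping document itself as a small set of triples and following the same path as the query. The sketch below is a simplified stand-in, not how RDFUnit evaluates the test: the blank-node labels `_:sm` and `_:pom` and the function names are hypothetical, and the result is a plain dictionary whose keys only echo the property names of the violation report.

```python
# The mapping document as (subject, predicate, object) triples.
# "_:sm" and "_:pom" are made-up labels for the blank nodes.
mapping = [
    ("#Mapping", "rr:subjectMap", "_:sm"),
    ("_:sm", "rr:class", "dbo:Event"),
    ("_:sm", "rr:template", "http://example.com/{Name}_{Surname}"),
    ("#Mapping", "rr:predicateObjectMap", "_:pom"),
    ("_:pom", "rr:predicate", "dbo:birthDate"),
    ("_:pom", "rr:objectMap", "#ObjectMap"),
    ("#ObjectMap", "rml:reference", "Birth"),
    ("#ObjectMap", "rr:datatype", "xsd:gYear"),
]


def objects(triples, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def mapping_datatype_violations(triples, prop, expected):
    """Follow predicateObjectMap -> objectMap -> datatype for `prop` and
    report every declared datatype that differs from `expected`."""
    results = []
    for (_resource, p, po_map) in triples:
        if p != "rr:predicateObjectMap":
            continue
        if prop not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for datatype in objects(triples, obj_map, "rr:datatype"):
                if datatype != expected:
                    results.append({
                        "violationRoot": obj_map,
                        "violationPath": "rr:datatype",
                        "violationValue": datatype,
                        "missingValue": expected,
                    })
    return results


report = mapping_datatype_violations(mapping, "dbo:birthDate", "xsd:date")
```

The single entry in `report` points at the object map rather than at any generated triple, which is what makes the origin of the violation explicit.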
MQA: Mapping Quality Assessment
discover violations before they are even generated
specify the origin of the violation
easily apply structural adjustments to the mapping definitions
MDQA: Uniform Mapping & Dataset Quality Assessment
[Figure: workflow with data, map doc, Mapping Processor, MDQA, violations]
Violation reported by MDQA:
<#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:gYear ; rut:missingValue xsd:date .
Mapping refinement derived from the violation:
DEL: <#ObjectMap> rr:datatype xsd:gYear
ADD: <#ObjectMap> rr:datatype xsd:date
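How a violation report turns into a DEL/ADD refinement can be sketched as follows. This is an illustrative assumption about the mechanics, not the authors' refinement code: the report is a plain dictionary, the mapping is a list of tuples, and `refinement_from` and `apply_refinement` are hypothetical helper names.

```python
def refinement_from(result):
    """Derive the triple to delete and the triple to add from one report."""
    root, path = result["violationRoot"], result["violationPath"]
    return {
        "delete": [(root, path, result["violationValue"])],
        "add": [(root, path, result["missingValue"])],
    }


def apply_refinement(mapping, refinement):
    """Return a new mapping with the DEL triples removed and ADD triples appended."""
    kept = [t for t in mapping if t not in refinement["delete"]]
    return kept + [t for t in refinement["add"] if t not in kept]


result = {
    "violationRoot": "#ObjectMap",
    "violationPath": "rr:datatype",
    "violationValue": "xsd:gYear",
    "missingValue": "xsd:date",
}

mapping = [
    ("#ObjectMap", "rml:reference", "Birth"),
    ("#ObjectMap", "rr:datatype", "xsd:gYear"),
]

refined = apply_refinement(mapping, refinement_from(result))
```

The refined mapping keeps the reference to the Birth field and only swaps the declared datatype, so a single change to the mapping corrects every triple generated from it.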
MQA with RDFUnit over RML
Triples generated by the refined mapping:
<http://example.com/Chuck_Bednarik> a dbo:Person ; dbo:birthDate "1925-05-01"^^xsd:date .
A further datatype violation report:
<#Result> rut:testCase rut:datatypeError ; spin:violationRoot <#ObjectMap> ; spin:violationPath rr:datatype ; spin:violationValue xsd:float ; rut:missingValue xsd:int .
Uniform Mapping & Dataset Quality Assessment Workflow
[Figure: workflow with data, map doc, new map doc, Mapping Processor, Mapping Refinements, MDQA, violations; one step is marked "(optional)"]
Beyond Mapping Quality Assessment
certain test cases inevitably require the RDF dataset: cardinality, functionality, symmetry
these reflect the data and are NOT affected by the mapping definitions
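A functionality test is a simple example of a check that only the generated data can answer. In the sketch below, dbo:birthDate is treated as a functional property purely for illustration, the conflicting second birth date is invented, and `functional_violations` is a hypothetical helper name.

```python
from collections import defaultdict


def functional_violations(triples, prop):
    """Map each subject to its values for `prop` when it has more than one."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


triples = [
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-01"),
    # Invented conflicting value, e.g. from a second source record.
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-02"),
    ("http://example.com/Matt_McBride", "dbo:birthDate", "1985-05-23"),
]

conflicts = functional_violations(triples, "dbo:birthDate")
```

Whether a subject ends up with two values depends on the source records, not on the mapping rule, so no inspection of the mapping definitions alone could detect this.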
Mapping Quality Assessment (MQA)
prevents violations from being generated
prevents the same violations from appearing repeatedly across distinct entities
allows different ontologies and vocabularies to be combined intuitively
Dataset vs Mapping Quality Assessment: Number of Violations
Dataset | DQA #failed test cases | DQA #violations | MQA #failed test cases | MQA #violations
DBpedia EN | 1,128 | 3.2M | 1 | 160
DBpedia NL | 683 | 815k | 1 | 124
DBLP | 7 | 8.1M | 2 | 8
* DBpedia and D2RQ mappings were translated to RML mappings
Dataset vs Mapping Quality Assessment: Time
Dataset | DQA size | DQA time | MQA size | MQA time
DBpedia EN | 62M | 16h | 115K | 11s
DBpedia NL | 21M | 1.5h | 53K | 6s
DBLP | 12M | 12h | 368 | 12s
CEUR-WS* | 2.4k | 6s | 702 | 5s
iLastic | 150k | 12s | 825 | 15s
* CEUR-WS submission to the ESWC Semantic Publishing Challenge (2014 vs 2015)
Mapping Quality Assessment
Dataset | size | time
DBpedia EN | 115K | 11s
DBpedia NL | 53K | 6s
DBpedia All | 511K | 32s
http://mappings.dbpedia.org/validation
Live update of DBpedia Mapping Quality Assessment results every night!
Violations
Most frequent violations are related to the dataset's schema (vocabularies or ontologies)
Similar violations occur repeatedly within a single RDF dataset
The situation worsens as more ontologies and vocabularies are reused and combined