Amrapali Zaveri Defense

17th April, 2015, Leipzig, Germany

Linked Data Quality Assessment and its Application to Societal Progress Measurement

Amrapali Zaveri

Faculty of Mathematics and Computer Science

Supervisors: Prof. Dr. Ing. habil. Klaus-Peter Fähnrich, University of Leipzig; Dr. Jens Lehmann, University of Leipzig; Prof. Dr. Sören Auer, University of Bonn

Transcript of Amrapali Zaveri Defense


Outline

Motivation — Linked Data Quality

Linked Data Quality Assessment Methodologies

Use Case Leveraging Data Quality

Contributions

Future Work


Motivation

— Linked Data Quality


Data on the Web

Accessible

Re-usable

Understandable

Discoverable

Linked Data Principles

Use URIs as names for things.

Use HTTP URIs, so that people can look up those names.

When someone looks up a URI, provide useful information, using the standards (RDF, RDFS, OWL, SPARQL).

Include links to other URIs, so that they can discover more things.
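The first two principles can be checked mechanically. The following is a minimal sketch, not part of the talk: the helper name `is_http_uri` and the sample names are assumptions chosen for illustration, and the check only inspects the syntax of a name, not whether it actually resolves.

```python
from urllib.parse import urlparse


def is_http_uri(name: str) -> bool:
    """Return True if `name` is an absolute HTTP(S) URI (hypothetical helper)."""
    parts = urlparse(name)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Illustrative names only.
names = [
    "http://dbpedia.org/resource/Leipzig",  # HTTP URI, can be looked up
    "urn:isbn:0451450523",                  # a URI, but not an HTTP one
    "Leipzig",                              # plain string, not a URI
]
checks = {name: is_http_uri(name) for name in names}
```

Only the first name satisfies both principles; the third and fourth principles (useful information on lookup, links to other URIs) need an actual HTTP request and are outside this sketch.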

Linked Data

What about the quality?

Data Quality

Data Quality is defined as: “fitness for use”*

* Juran, J. (1974). The Quality Control Handbook. McGraw-Hill, New York.

Consequences of Poor Quality

Propagation of errors in integrated datasets

Major hindrance in acquiring reliable results

Loss of important information

Loss in productivity, additional costs*#

* http://www.gartner.com/newsroom/id/501733

# http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information

Data Quality Assessment

How can one assess the quality of data and make this information explicit?

Which criteria should be assessed?

Which measures should be used?

Which methodologies/tools can be utilized?

Main Research Question

How can we exploit Linked Data for a particular use case and ensure good data quality?

Overview

Systematic literature review

Linked Data Quality Assessment Methodologies

Evaluation

User-driven

Crowdsourcing

Semi-automated

Use case leveraging quality

Systematic Literature Review

Current State

Lack of unified descriptions for data quality dimensions and metrics for Linked Data

Lack of use-case-driven data quality assessment methodologies for Linked Data

Lack of quality assessment of datasets before utilisation in particular use cases

Research Questions

RQ1 What are the existing approaches to assess the quality of Linked Data employing a conceptual framework integrating prior approaches?

RQ1.1 What are the data quality problems that each approach assesses?

RQ1.2 Which are the data quality dimensions and metrics supported by the proposed approaches?

RQ1.3 Which tools already exist to assess the quality of Linked Data?

Qualitative Analysis

30 core articles

18 dimensions - definitions

69 metrics

12 tools compared using 8 attributes

Quality assessment methodologies for Linked Data: A Survey. Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann and Sören Auer. Semantic Web Journal 2015.

Dimensions

[Figure: the 18 quality dimensions arranged in four groups, with edges between related dimensions]

Groups: Accessibility, Intrinsic, Contextual, Representation

Dimensions: Availability, Licensing*, Interlinking*, Security*, Performance*, Syntactic Validity, Semantic Accuracy, Consistency, Conciseness, Completeness, Relevancy, Trustworthiness, Understandability, Timeliness, Rep.-Conciseness, Interoperability, Interpretability, Versatility*

Legend: an edge between Dim1 and Dim2 means the two dimensions are related

*specific for Linked Data

Metrics

Linked Data Quality Metrics

Dimension | Metric | Description | QN/QL*
Completeness | Schema completeness | No. of classes and properties represented / total no. of classes and properties | QN
Interlinking | Detection of good quality interlinks | (i) detection of (a) interlinking degree, (b) clustering coefficient, (c) centrality, (d) open sameAs chains and (e) description richness through sameAs by using network measures, (ii) via crowdsourcing | QN
Timeliness | Freshness of datasets | max{0, 1 − currency / volatility} | QN
Trustworthiness | Trustworthiness of information provider | Indicating the level of trust for the publisher on a scale of 1−9 | QL

*QN: quantitative metric; QL: qualitative metric
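The two ratio-style metrics in the table (schema completeness and freshness) translate directly into code. This is a minimal sketch under stated assumptions: the function names and the sample numbers are illustrative and not from the talk, and `currency` and `volatility` are assumed to be expressed in the same time unit (for example, days since the last update and days for which the data stays valid).

```python
def schema_completeness(represented: int, total: int) -> float:
    """Classes and properties represented, divided by the total number expected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return represented / total


def freshness(currency: float, volatility: float) -> float:
    """max{0, 1 - currency / volatility}; both arguments in the same time unit."""
    if volatility <= 0:
        raise ValueError("volatility must be positive")
    return max(0.0, 1.0 - currency / volatility)


# Illustrative numbers only (not results from the talk).
completeness_score = schema_completeness(represented=45, total=60)
fresh_score = freshness(currency=30, volatility=120)
stale_score = freshness(currency=200, volatility=120)  # older than its validity window
```

The `max` clamps the score at zero once the data is older than its validity window, so freshness always stays in the range [0, 1].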


Tools

Attribute | Trellis | TrustBOT | tSPARQL | WIQA | ProLOD | Flemming
Availability | - | - | ✔ | - | - | ✔
Licensing | Open-source | - | GPL v3 | Apache v2 | - | -
Automation | Semi-automated | Semi-automated | Semi-automated | Semi-automated | Semi-automated | Semi-automated
Collaboration | Yes | No | No | No | No | No
Customizability | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Scalability | - | No | Yes | - | - | No
Usability | 2 | 4 | 4 | 2 | 2 | 3
Maintenance | 2005 | 2003 | 2012 | 2006 | 2010 | 2010

Attribute | LinkQA | Sieve | RDFUnit | DaCura | TripleCheckMate | LiQuate
Availability | ✔ | ✔ | ✔ | - | ✔ | ✔
Licensing | Open-source | Apache | Apache | - | Apache | -
Automation | Automated | Semi-automated | Semi-automated | Semi-automated | Semi-automated | Semi-automated
Collaboration | No | No | No | Yes | Yes | No
Customizability | No | ✔ | ✔ | ✔ | ✔ | No
Scalability | Yes | Yes | Yes | No | Yes | No
Usability | 2 | 4 | 3 | 1 | 5 | 1
Maintenance | 2011 | 2012 | 2014 | 2013 | 2013 | 2013

Problems in Current Approaches

Not catered to the use case

Results difficult to interpret

Do not report the root cause of the quality issues

Require a considerable amount of configuration

Do not allow the user to choose the input dataset


User-Driven Quality Assessment

Research Questions

RQ2 How can we assess the quality of Linked Data using a user-driven methodology?

RQ2.1 How feasible is it to employ Linked Data experts to assess the quality issues of LD?

RQ2.2 How feasible is it to use a combination of user-driven and semi-automated methodology to assess the quality of LD?

Methodology

[Figure: workflow of the user-driven quality assessment methodology. Steps: Resource Selection (Per Class, Manual, Random); Evaluation mode selection (Manual, Semi-automatic, Automatic); Pre-selection of triples; Resource Evaluation; Data Quality Improvement. Artifacts: Resource, Triples, List of invalid facts, Patch Ontology. Modes highlighted: Manual, Semi-automated.]

Manual — Phase I

Linked Data Quality Problem Taxonomy

Dimension | Category
Accuracy | Triple incorrectly extracted; Datatype problems; Implicit relationships between attributes
Relevancy | Irrelevant information extracted
Representational consistency | Representation of number values
Interlinking | External links; Interlinks with other datasets


Manual — Phase II

Invited Linked Data experts

Triple-based evaluation

Contest-based, 3 weeks

Phase II — TripleCheckMate

https://github.com/AKSW/TripleCheckMate

Choose a resource

Identify erroneous triples

Map to the quality problem taxonomy

Manual — Results

Total no. of users | 58
Total no. of distinct resources evaluated | 521
Total no. of distinct incorrect triples | 2928
% of triples affected | 11.93%
Resource-based inter-rater agreement (Cohen’s kappa) | 0.34
Total no. of triples evaluated for correctness | 700
% of triples evaluated incorrectly | 19%
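The inter-rater agreement reported above is a Cohen's kappa value. As a reminder of how the statistic works, here is a minimal two-rater sketch; the function name and the toy labels are assumptions for illustration and are not data from the study, whose resource-based computation may differ in detail.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (not study data): does the resource contain an error?
a = ["err", "err", "ok", "ok", "err", "ok", "ok", "ok", "err", "ok"]
b = ["err", "ok", "ok", "ok", "err", "ok", "err", "ok", "err", "ok"]
kappa = cohens_kappa(a, b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a value such as 0.34 can coexist with raters agreeing on a majority of items.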


Semi-automated — Step 1

Generate schema axioms for properties via DL-Learner*

Functionality

Inverse functionality

Asymmetry

Irreflexivity

Example:

Domain: Formula One Racer

Range: Grand Prix

Only 1 first win of each Formula One Racer (Functional)

*Lehmann, J. (2009). DL-Learner: learning concepts in description logics. Journal of Machine Learning Research, 10:2639–2642.
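Once a property has been declared functional, violations can be found by looking for subjects with more than one distinct value. The sketch below illustrates that check on in-memory triples; it is not the DL-Learner implementation, and the function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def functional_violations(triples, prop):
    """Return subjects that have more than one distinct object for `prop`."""
    objects_by_subject = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            objects_by_subject[s].add(o)
    return {s: sorted(objs) for s, objs in objects_by_subject.items() if len(objs) > 1}


# Hypothetical triples; identifiers are illustrative, not taken from DBpedia.
triples = [
    ("ex:RacerA", "ex:firstWin", "ex:GrandPrix1"),
    ("ex:RacerA", "ex:firstWin", "ex:GrandPrix2"),  # second "first win": violation
    ("ex:RacerB", "ex:firstWin", "ex:GrandPrix3"),
    ("ex:RacerB", "ex:team", "ex:TeamX"),
]
violations = functional_violations(triples, "ex:firstWin")
```

A reported violation only says that two values conflict, not which one is wrong, so a manual check of the flagged triples is still needed.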


Semi-automated — Step 2

Manual evaluation of generated axioms

100 random axioms per type

Only those axioms where at least one violation can be found

Also taking target context into account

Semi-automated — Results

[Charts: evaluation results per axiom type: Inverse functionality, Functionality, Asymmetry, Irreflexivity]

Summary

Quality analysis of over 500 resources

Errors detected in about 12% of triples

Linked Data experts performed quality analysis but evaluated correct triples as errors

75% functionality violations of property characteristics detected, but manual verification was required


Crowdsourcing Linked Data Quality Assessment

Research Questions

RQ2.3 Is it possible to detect quality issues in LD datasets via crowdsourcing mechanisms?

RQ2.4 What type of crowd is most suitable for each type of quality issue?

RQ2.5 Which types of assessment errors are made by lay users and experts?

Concepts

AMT - Amazon Mechanical Turk

HITs - Human Intelligence Tasks/microtasks

MTurk Workers - monetary reward for each HIT

Find-Fix-Verify phases

- Crowdsourcing Linked Data quality assessment. Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer and Jens Lehmann. ISWC 2013.

- Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Fabian Flöck and Jens Lehmann. SWJ (Submitted) 2015.

Methodology

[Figure: two-stage crowdsourcing workflow, with swimlanes for Experts and Workers. Find stage (LD experts in contest): resource selection (Manual, Any, Per Class); evaluation of the resource’s triples; selection of quality issues for incorrect triples; result is a list of incorrect triples classified by quality issue. HIT generation. Verify stage (workers in paid microtasks): accept HIT; assess triple according to the given quality issue (Correct, Incorrect, Data doesn’t make sense, I don’t know); submit HIT; repeat while there are more triples to assess.]

Quality Issue Types

Incorrect/incomplete object value

Incorrect datatypes/literals

Example: dbpedia:Oreye dbpedia-owl:postalCode “4360”@en

Incorrect interlink
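The postalCode triple above shows a numeric value carrying a language tag. A check for this pattern can be sketched as follows; the regular expression and the function name are assumptions for illustration (not part of the talk or of any tool mentioned in it), and literals are assumed to be written in Turtle-style syntax with straight quotes.

```python
import re

# Matches a Turtle-style literal with a language tag, e.g. "4360"@en.
LANG_LITERAL = re.compile(r'^"(?P<value>.*)"@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)$')


def numeric_with_language_tag(literal: str) -> bool:
    """Flag literals whose value is purely numeric but which carry a language tag."""
    match = LANG_LITERAL.match(literal)
    return bool(match) and match.group("value").strip().isdigit()


suspicious = numeric_with_language_tag('"4360"@en')  # postal code tagged as English text
harmless = numeric_with_language_tag('"Oreye"@en')   # ordinary language-tagged string
untagged = numeric_with_language_tag('"4360"')       # no language tag at all
```

A rule like this only pre-selects candidates; whether a flagged literal is really an error still depends on the property, which is where human assessment comes in.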

Results - Experts vs. Crowd

Row | LD Expert | MTurk Worker
Participants | 58 | 80
Duration | 3 weeks | 4 days
(unlabeled) | 1512 | 1073
(unlabeled) | 0.38 | 0.73

Summary - Experts vs. Crowd

Quality issue | LD experts | MTurk workers
Object values | Fair - required validation | Fair - simple comparisons
Datatypes & literals | Fair - required validation | Poor - inexperienced with RDF
Interlinks | Poor - high effort required | Good - high inter-rater agreement
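The summary refers to inter-rater agreement. A minimal sketch of one common agreement statistic, Cohen's kappa for two raters; the slides do not say which statistic was used, so this choice and the sample labels are assumptions:

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical judgements of four triples by two raters.
a = ["Correct", "Incorrect", "Incorrect", "Correct"]
b = ["Correct", "Incorrect", "Correct", "Correct"]
kappa = cohens_kappa(a, b)
```

With more than two raters per triple, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.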

Overview

- Systematic literature review
- Linked Data Quality Assessment Methodologies Evaluation: User-driven, Crowdsourcing, Semi-automated
- Use case leveraging quality

Research Questions

Use Case Leveraging Data Quality

RQ2.6 How can we semi-automatically assess the quality of datasets and provide meaningful results to the user?
RQ3 How can we exploit Linked Data for building a use case and ensure good data quality?

Motivation — User Scenario

Healthcare policy maker
- interested in: Which diseases? Deaths per disease? Where to allocate funds?
- looks at: Databases, e.g. WHO, ClinicalTrials.gov
- analysis: Data in disparate datasets, in different formats; data quality problems; subset of data; error-prone analysis etc.
- translates to: Inadequate allocations of funds

Use Case — Societal Progress Indicators

Evaluate the impact of Research & Development (R&D), i.e. educational performance, on a country's performance in:

- Economy
- Healthcare

Using Linked Data to evaluate the impact of Research and Development in Europe: a Structural Equation Model. Amrapali Zaveri, Joao Ricardo Nickenig Vissoci, Cinzia Daraio and Ricardo Pietrobon. ISWC 2013.

Datasets & Variables

4 datasets
- World Bank
- LinkedCT
- Scimago
- USPTO

17 variables, for example:
- GDP (economical)
- Birth rate, death rate (healthcare)
- h-index (educational)

Methodology

[Figure: data is extracted from World Bank, Scimago, LinkedCT and USPTO with RSPARQL*, then Quality Assessment is performed]

*van Hage, W. R., Kauppinen, T., Graeler, B., Davis, C., Hoeksema, J., Ruttenberg, A., and Bahls, D. (2014). SPARQL Package, v1.6. R Foundation for Statistical Computing.
*https://github.com/amrapalijz/R-LOD-SEM/blob/master/RSPARQL
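The extraction step queries each dataset with SPARQL. A minimal sketch of turning a SPARQL 1.1 JSON result into rows, as a Python stand-in for the R-based extraction; the variable names and sample values are made up for illustration, and no real endpoint is contacted:

```python
import json


def bindings_to_rows(result_json):
    """Flatten a SPARQL 1.1 JSON result into a list of dicts.

    Unbound variables are returned as None.
    """
    result = json.loads(result_json)
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hypothetical response; the values are placeholders, not real indicator data.
sample = json.dumps({
    "head": {"vars": ["country", "year", "value"]},
    "results": {"bindings": [
        {"country": {"type": "literal", "value": "DE"},
         "year": {"type": "literal", "value": "2010"},
         "value": {"type": "literal", "value": "1.5"}},
        {"country": {"type": "literal", "value": "FR"},
         "year": {"type": "literal", "value": "2010"}},
    ]},
})
rows = bindings_to_rows(sample)
```

Keeping unbound variables as `None` rather than dropping the row makes missing observations visible to a later quality assessment step.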

Semi-automated Quality Assessment

R2RLint tool*
- 7 dimensions: Availability, Completeness, Interlinking, Syntactic validity, Consistency, Interpretability, Representational conciseness
- 13 quality metrics
- Use case specific

*https://github.com/AKSW/R2RLint

Quality Assessment Results

[Figure: total no. detected per quality issue: Interlinking completeness, Population incompleteness, Inconsistency]
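Population incompleteness is one of the issues reported above. A minimal sketch of how a population completeness ratio could be computed, assuming it is defined as the share of expected (country, year) observations that are present; this definition, the function name and the data are illustrative assumptions, not R2RLint's implementation:

```python
def population_completeness(observed, countries, years):
    """Share of expected (country, year) pairs that have an observation."""
    expected = {(c, y) for c in countries for y in years}
    if not expected:
        raise ValueError("no expected observations")
    present = expected & set(observed)
    return len(present) / len(expected)


# Hypothetical observations for one indicator.
observed = [("DE", 2010), ("DE", 2011), ("FR", 2010)]
ratio = population_completeness(observed, ["DE", "FR"], [2010, 2011])
```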

11/17 Variables

Latent variables | Observed variables
Educational performance | Number of articles (h) that have at least h citations (h-index); Total no. of documents published per country per year; High-technology export (HTE)
Healthcare performance | Adolescent fertility rate (AFR); Birth rate (BR); Death rate (DR); Health expenditure public (HEP); Immunization DPT (IDPT); Immunization measles (IM); Mortality rate, infant (MR)
Economic performance | GDP per capita (current US$)

Methodology

[Figure: Structural Equation Modeling is applied to the World Bank and Scimago data in two steps]

Step I: EFA*-CFA*-EFA-CFA
Step II: Apply SEM to hypothesis variables

*EFA - Exploratory Factor Analysis
*CFA - Confirmatory Factor Analysis

Theoretical Framework

[Figure: Educational performance, Healthcare performance and Economical performance, each pair connected by a correlation]

Structural Equation Modeling

https://github.com/amrapalijz/R-LOD-SEM/blob/master/sem_script.R

library(sem)     # specifyModel(), sem(), modIndices()
library(qgraph)  # qgraph()

# Insert covariance matrix
var <- var(semdata)
cov <- cov(datanew)
cor <- cor(datanew)

# Acquire data
data <- with(data, data.frame(hindex, noOfDocs, IDPT, IM, MR,
                              AFR, BR, DR, GDP, HEP, HET))
semmodel <- specifyModel()

# Latent variables
HealthCare -> IDPT, efa14, NA;
HealthCare -> IM, efa11, NA;
HealthCare -> MR, efa12, NA;
HealthCare -> AFR, efa13, NA;
….

# Running SEM model
sem <- sem::sem(semmodel, cor, N = 781)
summary(sem, fit.indices = c("GFI", "AGFI", "RMSEA", "NFI", "NNFI", "CFI", "RNI", "IFI", "SRMR", "AIC", "AICc"))
modIndices(sem)
qgraph(sem, cut = 0.8, gray = TRUE)
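The script passes a correlation matrix (`cor`) and the sample size to the SEM fit. A minimal sketch of how a Pearson correlation matrix is computed from observed variables, written in plain Python as an illustration; the column names reuse those in the script, while the values are made up:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Made-up values for three of the observed variables.
data = {
    "BR": [10.0, 12.0, 14.0, 16.0],
    "DR": [9.0, 8.0, 7.0, 6.0],
    "GDP": [1.0, 2.0, 2.0, 3.0],
}
cor = correlation_matrix(data)
```

The matrix is symmetric with ones on the diagonal, which is the form a covariance-based SEM fit takes as input alongside the number of observations.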

Conclusions

- Performing robust statistical analysis on Linked Data can lead to important and meaningful insights on publicly available data for societal progress measurement.
- Use-case-driven data quality assessment of datasets is important before their utilisation.

Contributions

- Comprehensive survey
  - 18 data quality dimensions with definitions; 69 metrics
  - 12 tools compared according to 8 attributes
- Development and evaluation of data quality assessment methodologies
  - User-driven - manual and semi-automated
  - Crowdsourcing - experts vs. workers
  - Semi-automated - application to a use case
- Consumption of Linked Data leveraging data quality

Future Work

- Standardized quality assessment methodology for Linked Data
- Quality assessment tools for Linked Data
- Detection as well as improvement of quality issues before utilization in Linked Data use cases

Conference Publications

- Using Linked Data to evaluate the impact of Research and Development in Europe: a Structural Equation Model. Amrapali Zaveri, Joao Ricardo Nickenig Vissoci, Cinzia Daraio and Ricardo Pietrobon. ISWC 2013.
- Crowdsourcing Linked Data quality assessment. Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer and Jens Lehmann. ISWC 2013.
- User-driven Quality Evaluation of DBpedia. Amrapali Zaveri, Dimitris Kontokostas, Mohamed A. Sherif, Lorenz Bühmann, Mohamed Morsey, Sören Auer and Jens Lehmann. I-SEMANTICS 2013.

Journal Publications

- Quality assessment methodologies for Linked Data: A Survey. Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann and Sören Auer. Semantic Web Journal 2015.
- Using Linked Data to build an Observatory of Societal Progress Indicators. Amrapali Zaveri, Joao Ricardo Nickenig Vissoci, Patrick Westphal, Jose Roberto Nascimento Junior, Luciano de Andrade, Cinzia Daraio, Jens Lehmann. Journal of Web Semantics 2014 (under review).
- Publishing and Interlinking the USPTO Patent Data. Amrapali Zaveri, Mofeed M. Hassan, Tariq Yousef, Sören Auer, Jens Lehmann. Semantic Web Journal 2014 (under review).

Publications

- No. of publications: 34 (Google Scholar), 16 (DBLP)
- Citations: 251
- h-index: 9; i10-index: 8 (Google Scholar)

Thank you for your attention!

Questions?