LDQ 2014 DQ Methodology

Post on 02-Jul-2015



"Methodology for Assessment of Linked Data Quality: A Framework", presented at the Workshop on Linked Data Quality.
Paper: https://dl.dropboxusercontent.com/u/2265375/LDQ/ldq2014_submission_3.pdf

Transcript of LDQ 2014 DQ Methodology

A Methodology for Assessment of Linked Data Quality

Anisa Rula
Amrapali Zaveri

Outline
➢ Linked Data Quality
  ○ Current State
  ○ Limitations
➢ Quality Assessment Methodology
  ○ 3 phases, 6 steps
➢ Conclusion
  ○ Future Work

Linked Data Quality
● ca. 50 billion facts in the Linked Data Cloud
● But what about the quality?
● Data is only as good as its quality!

Linked Data Quality
➢ 30 approaches, 18 dimensions, 69 metrics*
➢ 12 tools
  ○ Automated
  ○ Semi-automated
➢ No generalized methodology
➢ Not taking into account the actual use case/user requirements
➢ Only assessment, no improvement
* http://www.semantic-web-journal.net/content/quality-assessment-linked-data-survey

Quality Assessment Methodology for Linked Data
➢ 3 phases
➢ 6 steps

Phase I: Requirement Analysis
Step I: Use Case Analysis
➢ A description that best illustrates the intended usage of the dataset(s)
➢ Two types of users
  ○ Consumers
  ○ Potential consumers

Phase II: Quality Assessment
Step II: Identification of quality issues
➢ Based on the use case
➢ Checklist-based approach
➢ Yes = 1, No = 0
➢ List of quality dimensions
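A minimal sketch of how the checklist step could be recorded, assuming each checklist entry is a yes/no question tied to one quality dimension. The questions, the dimension names chosen for the example, and the helper `select_dimensions` are hypothetical and are not taken from the slides or the paper.

```python
# Hypothetical checklist for one use case: (quality dimension, question, answer).
# The questions and answers below are illustrative assumptions only.
checklist = [
    ("Interlinking", "Does the use case rely on links to other datasets?", True),
    ("Completeness", "Are missing property values a problem for the use case?", True),
    ("Timeliness", "Does the use case need up-to-date values?", False),
]


def select_dimensions(entries):
    """Score each answer (Yes = 1, No = 0) and list the dimensions scored 1."""
    scores = {dimension: 1 if answer else 0 for dimension, _question, answer in entries}
    selected = [dimension for dimension, score in scores.items() if score == 1]
    return scores, selected


scores, selected = select_dimensions(checklist)
print(scores)
print(selected)
```

The resulting list of dimensions is what the later assessment steps would then measure.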

Phase II: Quality Assessment
Step III: Statistics and Low-level Analysis
➢ Generic statistics
➢ Examples
  ○ Interlinking degree
  ○ Blank nodes
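The two example statistics can be sketched over a toy set of triples. This is an illustration under stated assumptions: triples are plain string tuples, blank nodes are written with the `_:` prefix, and "interlinking degree" is taken here to mean the share of triples whose object is an IRI outside the dataset's own namespace. The slides do not define the metric, and `LOCAL_NS`, the sample data, and `low_level_stats` are hypothetical.

```python
# Hypothetical namespace of the dataset being assessed.
LOCAL_NS = "http://example.org/"

# Toy triples (subject, predicate, object) as plain strings; illustrative only.
triples = [
    ("http://example.org/Milan", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Milan"),
    ("http://example.org/Milan", "http://example.org/population", '"1350000"'),
    ("_:b0", "http://example.org/locatedIn", "http://example.org/Italy"),
    ("http://example.org/Italy", "http://example.org/capital", "_:b1"),
]


def is_blank(term):
    """Blank nodes are assumed to use the '_:' label prefix."""
    return term.startswith("_:")


def is_external_iri(term, local_ns):
    """An object counts as an interlink if it is an IRI outside the local namespace."""
    return term.startswith("http") and not term.startswith(local_ns)


def low_level_stats(data, local_ns):
    """Return the triple count, distinct blank nodes, and interlinking degree."""
    total = len(data)
    blank_nodes = {t for s, _p, o in data for t in (s, o) if is_blank(t)}
    interlinks = sum(1 for _s, _p, o in data if is_external_iri(o, local_ns))
    return {
        "triples": total,
        "blank_nodes": len(blank_nodes),
        "interlinking_degree": interlinks / total if total else 0.0,
    }


stats = low_level_stats(triples, LOCAL_NS)
print(stats)
```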

Phase II: Quality Assessment
Step IV: Advanced Analysis
➢ High-level metrics
➢ Examples
  ○ Accuracy
  ○ Completeness
➢ Requires (i) input and (ii) target dataset

Data Quality Score
➢ Ratio
  ○ DQscore = 1 - (V / T)
    ■ V - total no. of instances that violate a DQ rule
    ■ T - total no. of relevant instances
    ■ for each property
  ○ DQweightedscore = (DQscore * w_i / W)
    ■ w_i - weight
    ■ W - sum of all weighted factors of the properties
    ■ for quality of overall properties
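A worked sketch of the two scores, under two assumptions that the slide does not state explicitly: the weighted score is summed over all properties, and W is the sum of the weights w_i. The property names, counts, and weights are made up for illustration.

```python
def dq_score(violations, total):
    """Per-property score: 1 - (V / T)."""
    if total <= 0:
        raise ValueError("T (relevant instances) must be positive")
    return 1 - violations / total


def dq_weighted_score(scores, weights):
    """Overall score, assumed here to be sum(DQscore_i * w_i) / W with W = sum(w_i)."""
    total_weight = sum(weights[prop] for prop in scores)
    if total_weight <= 0:
        raise ValueError("W (sum of weights) must be positive")
    return sum(scores[prop] * weights[prop] for prop in scores) / total_weight


# Hypothetical counts: (V, T) per property.
counts = {"birthDate": (5, 100), "label": (20, 100)}
# Hypothetical weights expressing how much each property matters to the use case.
weights = {"birthDate": 3, "label": 1}

scores = {prop: dq_score(v, t) for prop, (v, t) in counts.items()}
overall = dq_weighted_score(scores, weights)
print(scores)
print(round(overall, 4))
```

With these numbers the per-property scores are 0.95 and 0.80, and the weighted score works out to (0.95 * 3 + 0.80 * 1) / 4 = 0.9125.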

Phase III: Quality Improvement
Step V: Root Cause Analysis
➢ Analyze the cause of each quality issue
➢ Helps the user interpret the results
➢ Detect whether the problem occurs in the original dataset
➢ If the original dataset is unavailable, analyze the available dataset to determine the cause

Phase III: Quality Improvement
Step VI: Fixing Quality Problems
➢ Semi-automatic
  ○ Consistency
  ○ Completeness
  ○ Syntactic validity
➢ Crowdsourcing*
  ○ Semantic accuracy
  ○ Datatypes
  ○ Interlinks

* Acosta et al., Crowdsourcing Linked Data Quality Assessment. ISWC 2013.

Conclusion and Future Work
➢ Assessment methodology - 3 phases, 6 steps
➢ Focus on use case
➢ Improvement phase
Future Work
➢ Application to an actual use case
➢ Build a tool

Questions, Suggestions, Comments

Thank you

@AnisaRula @amrapaliz