2014-May-07. What is the problem? What have others done? What is our solution? Does it work? Outline...

Post on 16-Jan-2016


2014-May-07

Outline

• What is the problem?
• What have others done?
• What is our solution?
• Does it work?

What is the problem?

• Linked Open Data (LOD):
  ▫ Realizing the Semantic Web by interlinking existing but dispersed data

• Main components of LOD:
  ▫ URIs to identify things
  ▫ RDF to describe data
  ▫ HTTP to access data

Datasets: 295
Triples: over 30,000,000,000 (30 B)
Links: over 500,000,000 (500 M)

Inclusion criteria for publishing and interlinking datasets into the LOD cloud:

• Resolvable http:// or https:// URIs
• Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)
• Contains at least 1,000 triples
• Connected via at least 50 RDF links to the existing datasets of LOD
• Accessible via RDF crawling, an RDF dump, or a SPARQL endpoint

Is the dataset ready to publish?
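The countable criteria above can be checked mechanically. The following is a minimal sketch, not a tool from the slides: it assumes triples are plain (subject, predicate, object) string tuples, and the names `check_inclusion` and `local_namespace` are hypothetical. It only checks that URIs use the http or https scheme; real resolvability would require dereferencing them. It also assumes every subject is a URI, so blank-node subjects would fail the first check.

```python
from urllib.parse import urlparse

MIN_TRIPLES = 1000  # threshold taken from the criteria above
MIN_LINKS = 50      # threshold taken from the criteria above


def is_http_uri(term):
    """True if the term looks like an http(s) URI (syntax check only)."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_inclusion(triples, local_namespace):
    """Check the countable inclusion criteria for a dataset.

    triples: iterable of (subject, predicate, object) strings.
    local_namespace: URI prefix of the dataset's own resources (assumed).
    """
    triples = list(triples)
    # An RDF link is approximated as a triple whose subject is local and
    # whose object is an http(s) URI outside the local namespace.
    links = [
        t for t in triples
        if t[0].startswith(local_namespace)
        and is_http_uri(t[2])
        and not t[2].startswith(local_namespace)
    ]
    return {
        "http_uris": all(is_http_uri(s) for s, _, _ in triples),
        "enough_triples": len(triples) >= MIN_TRIPLES,
        "enough_links": len(links) >= MIN_LINKS,
    }
```

A dataset passes this sketch when all three values in the returned dictionary are true. Format validity and accessibility (crawling, dump, SPARQL endpoint) are outside its scope.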


Idea of the LOD: publishing first, improving later

Results in: quality problems in the published datasets

Missing link: data quality evaluation before release

What have others done?

Data quality in the context of LOD

Validators:
• General validators
• Parsing and syntax
• Accessibility / dereferenceability

Quality assessment of published data:
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies

Limitations of related work:

• Syntax validation, not quality evaluation
• Not scalable
• Not fully automated
• Evaluation after publishing

What is our solution?

Proposing a set of metrics for inherent quality assessment of datasets before interlinking to the LOD cloud

1. Selecting Inherent Quality Dimensions
2. Proposing Metrics
3. Developing a Quality Model
4. Theoretical Validation
5. Empirical Evaluation
6. Quality Prediction

1. Selecting Inherent Quality Dimensions

• Studying data quality models
• Defining inherent quality of LOD
• Selecting the basic model (ISO-25012)
• Mapping quality dimensions of ISO to LOD

Inherent Quality of LOD

Interlinking

Completeness

Semantic Accuracy

Syntactic Accuracy

Uniqueness

Consistency

2. Proposing Metrics

• Defining metrics using GQM (Goal-Question-Metric)
• Implementing an automated tool
• Formal definition

Example:
Goal: Assessment of the consistency of a dataset in the context of LOD
Question: What is the degree of conflict in the context of data value?
Metric: The number of functional properties with inconsistent values
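The example metric can be computed directly from the triples. The sketch below is an illustration under stated assumptions, not the tool from the slides: triples are plain string tuples, the set of functional properties is supplied by the caller (in practice it would come from the ontology), and two values count as conflicting whenever they are different strings. The function name is hypothetical.

```python
from collections import defaultdict


def count_inconsistent_functional_properties(triples, functional_properties):
    """Count functional properties that have conflicting values.

    A functional property should have at most one value per subject, so a
    property is flagged when any subject carries two or more distinct values.
    """
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in functional_properties:
            values[(subject, predicate)].add(obj)
    # Collect each offending property once, however many subjects violate it.
    inconsistent = {
        predicate
        for (_, predicate), objects in values.items()
        if len(objects) > 1
    }
    return len(inconsistent)
```

Plain string comparison is a simplification: two different URIs may denote the same resource, in which case the values would not really conflict.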

3. Developing LODQM

LODQM: Linked Open Data Quality Model

• 6 quality dimensions
• 32 metrics

4. Theoretical Validation

• Using a theoretical measurement framework
• Identifying properties of desirable metrics
• Validating metrics

Metric type  No. of metrics  Null-Value  Non-Negativity  Symmetry  Monotonicity  Disjoint Module Additivity  Merging  Cohesive Modules
Complexity   29              √           √               √         √             n/a                         _        _
Cohesion     2               √           √               _         √             _                           _        √
Coupling     1               √           √               _         √             n/a                         √        _
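Some of the properties in the table lend themselves to quick spot checks on sample data. The sketch below is only an illustration of that idea, using a simplified reading of three properties (null value, non-negativity, monotonicity). The helper names and the sample metric are hypothetical, and passing these checks on a few inputs is not a proof of the property.

```python
def check_measure_properties(metric, datasets):
    """Spot-check three measurement properties of a metric.

    metric: function mapping a list of triples to a number.
    datasets: list of triple lists used as test inputs.
    """
    return {
        # Null value: an empty dataset should score zero.
        "null_value": metric([]) == 0,
        # Non-negativity: the metric never goes below zero.
        "non_negativity": all(metric(d) >= 0 for d in datasets),
        # Monotonicity: adding one more triple never lowers the value.
        "monotonicity": all(
            metric(d[:k]) <= metric(d[: k + 1])
            for d in datasets
            for k in range(len(d))
        ),
    }


def distinct_properties(triples):
    """Hypothetical complexity-style metric: number of distinct predicates."""
    return len({predicate for _, predicate, _ in triples})
```

Calling `check_measure_properties(distinct_properties, samples)` returns one boolean per property, which makes it easy to see which property a candidate metric violates on the given samples.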

5. Empirical Evaluation

5.1 Selecting several real datasets from LOD
5.2 Calculation of the metrics values for datasets
5.3 Metrics interdependency study
5.4 Manipulating the quality of the datasets using heuristics
5.5 Comparing the trends of metrics over two observations
5.6 Collecting experts’ subjective perception on quality dimensions
5.7 Correlation study between metrics and quality dimensions


Dataset                           No. of triples  No. of instances  No. of classes  No. of properties
FAO Water Areas                   10,730          586               31              19
Water Economic Zones              29,193          1,074             113             127
Large Marine Ecosystems           12,012          716               21              31
Geopolitical Entities             22,725          312               88              101
ISSCAAP Species Classification    398,166         25,253            52              93
Species Taxonomic Classification  319,490         11,741            33              26
Commodities                       56,420          2,788             10              19
Vessels                           4,236           240               6               22


Result:
• Three pairs of metrics are correlated: {IFP, Im_DT}, {Im_DT, Sml_Cls}, {Inc_Prp_Vlu, IF}
• The others are independent
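An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on each dataset. The slides do not say which coefficient or cut-off was used, so the Pearson coefficient and the 0.8 threshold below are assumptions, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant metric carries no correlation signal
    return cov / (spread_x * spread_y)


def correlated_pairs(metric_values, threshold=0.8):
    """Return metric pairs whose absolute correlation reaches the threshold.

    metric_values: dict mapping a metric name to its values, one per dataset.
    threshold: assumed cut-off, not taken from the slides.
    """
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs
```

Metrics that never appear in a returned pair are treated as independent under the chosen threshold.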


Result:
• Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}
• The others are independent

6. Quality Prediction

• Applying the PCA method to select the highly correlated metrics
• Developing predictive models
• Assessing the quality of new datasets using the models
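The PCA step can be illustrated with a small dependency-free sketch. The slides do not describe the selection procedure, so everything here is an assumption: only the first principal component is used, it is found by power iteration on the correlation matrix of the metrics, and a metric is kept when its absolute loading reaches a hypothetical cut-off of 0.5. A full analysis would normally inspect several components.

```python
from math import sqrt


def standardize(columns):
    """Scale each metric column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = sqrt(sum((v - mean) ** 2 for v in col) / n)
        scaled.append([(v - mean) / std if std else 0.0 for v in col])
    return scaled


def correlation_matrix(columns):
    """Correlation matrix of the metric columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in z] for ci in z]


def first_component(matrix, iterations=200):
    """Power iteration: dominant eigenvector of a symmetric matrix."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return vec
        vec = [x / norm for x in nxt]
    return vec


def select_metrics(named_columns, min_loading=0.5):
    """Keep metrics with a large loading on the first component."""
    names = list(named_columns)
    matrix = correlation_matrix([named_columns[name] for name in names])
    loadings = first_component(matrix)
    return [name for name, w in zip(names, loadings) if abs(w) >= min_loading]
```

The selected metric columns would then serve as inputs to a predictive model; that training step is not sketched here.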

Result: 20 out of 32 metrics are selected

Using a neural network method: MultiLayerPerceptron

Dataset   No. of triples  No. of instances  Domain
Geonames  6,590           699               Geography
IMDB      866             291               Movie
Anatomy   6,449           6,449             Anatomy
Citeseer  948,770         173,963           Publication
FAO       248,731         28,098            Food Science

Conclusion on Metrics

Definable
• Proposed by GQM (32)
• Formally defined (32)

Valid
• Theoretically validated (32)

Practical
• Implemented (32)

Correlated with quality
• Experts (28)
• Correlation study (27)
• PCA (20)

Predictability
• MLP (20)

Thank you for your attention and comments