Linkset quality (LWDM 2013)

Assessing Linkset Quality For Complementing Third Party Datasets

Riccardo Albertoni (1,2), Asunción Gómez Pérez (1)

(1) Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid

(2) CNR-IMATI, Via De Marini, 6, Torre di Francia, 16149 Genova, Italy

3rd International Workshop on Linked Web Data Management (LWDM 2013), in conjunction with the 16th International Conference on Extending Database Technology (EDBT 2013)

March 22, 2013 - Genoa, Italy


Page 2: Linkset quality (LWDM 2013)

Motivations

Linked Data's promise: evolving the Web into a global data space. It should help to overcome the data silo effect.

So many bubbles there, that's so cool!

But can I exploit that third-party data for my own analyses?

Page 3: Linkset quality (LWDM 2013)

Motivation

[Figure: the SWDF and DBLP datasets connected by an arrow]

What does this arrow mean?

There is no common ground about what makes a linkset suitable for a target application.

There are well-founded works on quality for datasets, but linksets are not yet directly addressed!

Page 4: Linkset quality (LWDM 2013)

What is Linkset Quality for?

Linked Data Publishers can check if a linkset they have provided

• is good enough or needs to be improved;

• is still good enough after one of the two target datasets is updated.

Linked Data Consumers can

• figure out whether or not they can rely on a linkset;

• get a first guess at the next move they can take to improve the linkset;

• rank possible linkset alternatives.

Page 5: Linkset quality (LWDM 2013)

Complementing a Dataset X via a Linkset L

[Figure: dataset X (entities a', b', c', with foaf:member relations to affiliations), dataset Y (DBLP; entities a and b, with foaf:made relations to publications Pub1 to Pub4) and the linkset L between them: a owl:sameAs a', b owl:sameAs b'. X_L denotes X complemented via L.]

Complementation might introduce some "missing data".

The fewer "missing data" (like researcher c') are introduced, the more complete the linkset is.
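The complementation idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `complement`, the pair-based encoding of owl:sameAs links and the toy data are all assumptions made for the example.

```python
def complement(x_entities, same_as_links, y_facts):
    """Complement entities of X with facts from Y through owl:sameAs links.

    same_as_links: list of (y_entity, x_entity) pairs (hypothetical encoding).
    y_facts: maps a Y entity to the facts Y holds about it (toy data).
    Returns (complemented, missing): the facts imported for each linked
    X entity, and the X entities for which the linkset offers no link.
    """
    x_to_y = {x: y for (y, x) in same_as_links}
    complemented = {}
    missing = []
    for x in x_entities:
        y = x_to_y.get(x)
        if y is None:
            # No link: complementation leaves a gap for this entity.
            missing.append(x)
        else:
            complemented[x] = list(y_facts.get(y, []))
    return complemented, missing


# Toy data loosely mirroring the slide (assumed, not taken from real datasets).
x_entities = ["a'", "b'", "c'"]
same_as_links = [("a", "a'"), ("b", "b'")]
y_facts = {"a": ["Pub1", "Pub2"], "b": ["Pub3", "Pub4"]}

complemented, missing = complement(x_entities, same_as_links, y_facts)
print(complemented)
print(missing)
```

Here the unlinked entity ends up in `missing`, which is the kind of gap the completeness measures are meant to quantify.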

Page 6: Linkset quality (LWDM 2013)

What is a Linkset? (http://vocab.deri.ie/void)

Every linkset is a special kind of dataset!

Every linkset has two target datasets: the subject and the object datasets.

Every linkset should have only one linking property.

The focus here is on owl:sameAs linksets.
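A minimal in-memory model of these three constraints might look as follows. The class and its field names are hypothetical; they are loosely named after the VoID properties void:subjectsTarget, void:objectsTarget and void:linkPredicate, and no RDF library is involved.

```python
from dataclasses import dataclass, field

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


@dataclass
class Linkset:
    """Hypothetical stand-in for a void:Linkset description."""
    subjects_target: str                 # dataset providing the link subjects
    objects_target: str                  # dataset providing the link objects
    link_predicate: str = OWL_SAME_AS    # the single linking property
    links: list = field(default_factory=list)

    def add_link(self, subject, obj):
        # Every link uses the one linking property of the linkset.
        self.links.append((subject, self.link_predicate, obj))


linkset = Linkset(subjects_target="DBLP", objects_target="SWDF")
linkset.add_link("a", "a'")
linkset.add_link("b", "b'")
print(len(linkset.links))
```

Storing the predicate once on the linkset, rather than per link, is one simple way to enforce the "only one linking property" rule.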

Page 7: Linkset quality (LWDM 2013)

Defining quality measures

The terminology follows C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. J. Web Sem., 7(1):1-10, 2009.

What to define when providing a quality measure, and what is provided for linkset quality:

• Quality Indicator: an aspect of a data item or data set that may give an indication to the user of the suitability of the data for some intended use. Provided here: Entity Types, Number of Entities for Type, ...

• Scoring Function: a function evaluating quality indicators to measure the suitability of the data for some intended use. Provided here: Linkset Type Coverage, Linkset Type Completeness, Linkset Entity Coverage for Type.

• Aggregate Metric: a user-specified assessment metric built upon scoring functions. These aggregations produce new assessment values through the average, sum, max, min or threshold functions applied to the set of scoring functions. Provided here: interpretation tables, i.e. interpretations of the scoring functions that help in figuring out the next action to take.
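As a rough sketch of how an aggregate metric could combine scoring functions, the snippet below applies the aggregations named above to a dictionary of scores. The score values, the function name `aggregate` and the reading of "threshold" as "every score reaches the threshold" are assumptions for illustration only.

```python
from statistics import mean

# Aggregations named on the slide; "threshold" is handled separately.
AGGREGATORS = {"average": mean, "sum": sum, "max": max, "min": min}


def aggregate(scores, how="average", threshold=None):
    """Combine scoring-function values into one assessment value."""
    values = list(scores.values())
    if how == "threshold":
        if threshold is None:
            raise ValueError("a threshold value is required")
        # Assumed semantics: the check passes only if all scores reach it.
        return all(v >= threshold for v in values)
    return AGGREGATORS[how](values)


# Hypothetical scores, not results from the paper.
scores = {
    "type_coverage": 0.5,
    "type_completeness": 1.0,
    "entity_coverage_for_type": 0.25,
}

print(aggregate(scores, "min"))
print(aggregate(scores, "threshold", threshold=0.5))
```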


Page 9: Linkset quality (LWDM 2013)

INDICATORS: Examples on DBLP & SWDF

[Figure: the type sets Type(DBLP) and Type(SWDF). Types shown: foaf:Organization, foaf:Person, foaf:Agent, foaf:Document, ro:FullPaper, ro:ShortPaper, ro:PosterPaper, swr:Proceedings, swrc:Proceedings]

#E4Type(foaf:Agent, DBLP) = 1000000

#E4Type(foaf:Document, DBLP) = 1984087

#E4Type(swrc:Proceedings, DBLP) = 1108400

Page 10: Linkset quality (LWDM 2013)

INDICATORS: Examples on DBLP & SWDF

[Figure: the type sets Type(DBLP), Type(SWDF) and Type(L2) for a linkset L2 between DBLP and SWDF. Types shown: foaf:Organization, foaf:Person, foaf:Agent, foaf:Document, ro:FullPaper, ro:PosterPaper, swr:Proceedings, swrc:Proceedings]

#E4Type(foaf:Agent, L2) = 100

#E4Type(foaf:Person, L2) = 100

Page 11: Linkset quality (LWDM 2013)

Quality indicators: Types

Type: Dataset/Linkset -> power set of the possible user-defined types (e.g. owl:Class, owl:Restriction, skos:Concept, skos:ConceptScheme)

Returns the types of entities exposed in a dataset or a linkset.
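A possible reading of this indicator in code, assuming triples are plain `(subject, predicate, object)` tuples with prefixed names. The helper `types_of` is hypothetical, and treating the types of a linkset as the types of the entities it links is an assumption of this sketch.

```python
RDF_TYPE = "rdf:type"


def types_of(triples, entities=None):
    """Return the set of types asserted via rdf:type.

    If `entities` is given, only the types of those entities are returned,
    which is how this sketch approximates the types exposed by a linkset.
    """
    return {
        o
        for (s, p, o) in triples
        if p == RDF_TYPE and (entities is None or s in entities)
    }


# Toy dataset (assumed data).
dataset = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:alice", "foaf:member", "ex:acme"),
]
linked_entities = {"ex:alice"}  # entities appearing in some owl:sameAs link

dataset_types = types_of(dataset)
linkset_types = types_of(dataset, linked_entities)
print(sorted(dataset_types))
print(sorted(linkset_types))
```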

Page 12: Linkset quality (LWDM 2013)

Quality indicators: # of Entities for a Type

#E4Type: (one of the possible user-defined types, Dataset/Linkset) -> set of (positive) integers

Returns the number of entities exposed in a dataset/linkset for a given type.

Blank nodes are left out.
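The counting indicator can be sketched in the same spirit. The function name and the tuple encoding are assumptions, and blank nodes are recognised here only by the `_:` prefix used in N-Triples-style serialisations.

```python
RDF_TYPE = "rdf:type"


def count_entities_for_type(triples, rdf_class):
    """Count distinct non-blank entities typed with `rdf_class`."""
    entities = {
        s
        for (s, p, o) in triples
        if p == RDF_TYPE and o == rdf_class and not s.startswith("_:")
    }
    return len(entities)


# Toy data (assumed): one duplicate statement and one blank node.
triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("_:b0", RDF_TYPE, "foaf:Person"),
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
]

person_count = count_entities_for_type(triples, "foaf:Person")
print(person_count)
```

Collecting subjects in a set before counting means repeated type statements for the same entity are counted once.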


Page 14: Linkset quality (LWDM 2013)

SCORING FUNCTIONS: Linkset Type Coverage (1)

[Figure: Type(DBLP) and Type(SWDF) connected by the linkset L1. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings]

Complementing DBLP with L1, are we adding some new entities to DBLP?

DBLP_L1 (DBLP complemented via L1) "imports" organizations for the researchers (foaf:Agent) involved in the linkset.

Page 15: Linkset quality (LWDM 2013)

SCORING FUNCTIONS: Linkset Type Coverage (2)

[Figure: Type(DBLP) and Type(SWDF) connected by the linkset L2. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings, swr:Proceedings]

Complementing SWDF with L2, we don't add any new type of entity.

SWDF_L2 (SWDF complemented via L2) has exactly the same kinds of entities as SWDF.

Page 16: Linkset quality (LWDM 2013)

Definition of Linkset Type Coverage

[Formula: Linkset Type Coverage, defined over a linkset and a target dataset]

Considering a dataset X, what percentage of the types of X are also covered by the linkset?
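One way to read this question as a computation is the ratio below. It is a sketch of the verbal definition only, not the formula from the paper; the function name and the example type sets are hypothetical.

```python
from fractions import Fraction


def linkset_type_coverage(dataset_types, linkset_types):
    """Share of the types of dataset X that also appear in the linkset."""
    if not dataset_types:
        raise ValueError("the dataset exposes no types")
    covered = dataset_types & linkset_types
    return Fraction(len(covered), len(dataset_types))


# Hypothetical type sets for illustration.
types_x = {"foaf:Agent", "foaf:Document", "swrc:Proceedings"}
types_l = {"foaf:Agent"}

coverage = linkset_type_coverage(types_x, types_l)
print(coverage)
```

Using `Fraction` keeps the result exact, which makes values such as 1/3 easier to compare than floating-point approximations.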

Page 17: Linkset quality (LWDM 2013)

SCORING FUNCTION: Ideas behind Type Completeness (1)

[Figure: Type(DBLP) and Type(SWDF) connected by the linkset L1. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings]

L1 is type complete.

It does not make sense to run a procedure (e.g., SILK) to discover interlinks between the instances of swrc:Proceedings and foaf:Organization!

Page 18: Linkset quality (LWDM 2013)

SCORING FUNCTION: Ideas behind Type Completeness (2)

[Figure: Type(DBLP) and Type(SWDF) connected by the linkset L1, with an "alignment among classes" between the two type sets. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings, swr:Proceedings]

L1 is type incomplete.

We should run a procedure (e.g., SILK) to discover interlinks between the instances of swrc:Proceedings and swr:Proceedings!

Page 19: Linkset quality (LWDM 2013)

Formalization of Linkset Type Completeness

[Formula: LTCom, defined over a linkset, target dataset 1 and target dataset 2. Its annotations are listed below.]

• The types in the subject dataset that are not considered in the linkset.

• A function that returns the set of types of X that have an equivalent in Y according to an equivalence relation among classes.

A linkset is complete with respect to types if LTCom = 1; LTCom < 1 otherwise.

Page 20: Linkset quality (LWDM 2013)

Example on Type Completeness

[Figure: Type(DBLP) and Type(SWDF) connected by the linksets L1 and L2. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings, swr:Proceedings]

LTCom(L1, DBLP, SWDF) = 1 - (|{swrc:Proceedings}| / |{swrc:Proceedings, foaf:Person}|) = 1/2

LTCom(L2, DBLP, SWDF) = 1 - (|{}| / |{swr:Proceedings, foaf:Person}|) = 1
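The two values can be reproduced with a small function whose shape is inferred from the worked example: one minus the share of equivalent types that the linkset leaves out. The function name and the sets of linked types are assumptions, chosen so that the numerators match the ones shown.

```python
from fractions import Fraction


def linkset_type_completeness(equivalent_types, linked_types):
    """1 - |equivalent types not covered by the linkset| / |equivalent types|.

    equivalent_types: types of X that have an equivalent in Y.
    linked_types: types the linkset actually considers (assumed input).
    """
    if not equivalent_types:
        raise ValueError("no equivalent types to compare against")
    not_covered = equivalent_types - linked_types
    return 1 - Fraction(len(not_covered), len(equivalent_types))


# L1 leaves swrc:Proceedings out; L2 leaves nothing out (assumed inputs).
ltcom_l1 = linkset_type_completeness(
    {"swrc:Proceedings", "foaf:Person"}, {"foaf:Person"}
)
ltcom_l2 = linkset_type_completeness(
    {"swr:Proceedings", "foaf:Person"}, {"swr:Proceedings", "foaf:Person"}
)
print(ltcom_l1, ltcom_l2)
```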

Page 21: Linkset quality (LWDM 2013)

Linkset Entity Coverage for Type

[Figure: Type(DBLP) and Type(SWDF) connected by the linksets L1 and L2. Types shown: foaf:Organization, foaf:Person, foaf:Agent, swrc:Proceedings, swr:Proceedings]

L1 and L2 are indistinguishable from the point of view of types.

Which is the most interesting: L1, L2, or L1 ∪ L2?

How good is a linkset providing 100 owl:sameAs links?

Linkset Entity Coverage for Type = (number of entities of type T in the linkset L) / (number of entities of type T in the dataset X)
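To make the question concrete, the ratio can be evaluated with the indicator values quoted earlier in the talk: 100 entities of type foaf:Agent in L2 against 1000000 in DBLP. Pairing these two numbers is an assumption of this sketch, as is the function name.

```python
from fractions import Fraction


def entity_coverage_for_type(entities_in_linkset, entities_in_dataset):
    """Entities of type T in linkset L over entities of type T in dataset X."""
    if entities_in_dataset <= 0:
        raise ValueError("the dataset has no entity of this type")
    return Fraction(entities_in_linkset, entities_in_dataset)


# #E4Type(foaf:Agent, L2) = 100 and #E4Type(foaf:Agent, DBLP) = 1000000
coverage = entity_coverage_for_type(100, 1_000_000)
print(float(coverage))
```

Under these numbers the linkset touches one agent in ten thousand, which suggests why a raw count of links says little without the size of the dataset.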


Page 23: Linkset quality (LWDM 2013)

Aggregate Metrics: Interpretation upon the presented score functions

The interpretation is summed up as a "decision tree".

Page 24: Linkset quality (LWDM 2013)

Related work (extended discussion in the paper)

• WIQA, an Information Quality Assessment Framework: C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. J. Web Sem., 7(1):1-10, 2009.

It contributes a policy language, an engine for interpreting such policies, and an explanation of whether a piece of information satisfies a policy. Quality criteria are parameters of the system. It does not aim at proposing new quality measures.

• LOD2: P. N. Mendes, C. Bizer, J. H. Young, Z. Miklos, J.-P. Calbimonte, and A. Moraru. Conceptual model and best practices for high-quality metadata publishing. Technical report, PlanetData, Deliverable 2.1, 2012, http://planet-data-wiki.sti2.at/web/File:D2.1.pdf.

Reviews quality dimensions. No indicators or criteria for completeness.

• PlanetData: P. N. Mendes and C. Bizer. Survey report state of the art in mapping, quality assessment and data fusion. Technical report, LOD2 - Creating Knowledge out of Interlinked Data, Deliverable 4.3.1, 2011, http://static.lod2.eu/Deliverables

Intensional completeness: the schema contains all the necessary attributes. Extensional completeness: all required instances are present. LDS completeness: relevant properties have values.

• SIEVE: P. N. Mendes, H. Muhleisen, and C. Bizer. Sieve: linked data quality assessment and fusion. In D. Srivastava and I. Ari, editors, LWDM EDBT/ICDT Workshops, pp. 116-123. ACM, 2012.

SIEVE deploys some of the ideas developed in WIQA and LDS completeness.

None of these works explicitly addresses quality for linksets.

Page 25: Linkset quality (LWDM 2013)

Related work (extended discussion in the paper)

• Link-QA: C. Gueret, P. T. Groth, C. Stadler, and J. Lehmann. Assessing linked data mappings using network measures. In E. Simperl, P. Cimiano, A. Polleres, O. Corcho, and V. Presutti, editors, ESWC, volume 7295 of Lecture Notes in Computer Science, pp. 87-102. Springer, 2012.

A different approach: they apply classic network measures such as degree, centrality and clustering coefficient, plus open-sameAs chains and description richness, to determine whether a set of links improves the overall dataset quality.

Quality of interlinking, not of linksets: LINK-QA works on links independently of whether or not they are part of the same linkset.

LINK-QA addresses correctness; it does not deal with completeness.

LINK-QA is for ranking sets of links: it can be used to say that one linkset is better than another, but it does not suggest the next move a consumer should take to improve a linkset.

Page 26: Linkset quality (LWDM 2013)

Conclusions

Contribution: quality measures on linksets

• The only measure explicitly addressing linkset completeness for dataset complementation;

• Formalization of indicators, score functions and aggregation metrics;

• A first proof-of-concept prototype (Java, Jena).

Ongoing and future work

• Validation on the LOD: how many "incomplete" linksets can we detect in the LOD?

• Extension to linksets other than owl:sameAs (e.g., skos:exactMatch);

• Dimensions other than completeness (e.g., timeliness, availability, consistency).

Page 27: Linkset quality (LWDM 2013)

THANKS for your ATTENTION!