Information Content based Ranking Metric for Linked Open Vocabularies

Ghislain A. Atemezing (@gatemezing), Raphaël Troncy (@rtroncy)

description

This talk was presented at the SEMANTiCS 2014 conference in Leipzig in September 2014. It gives an overview of how Information Content metrics from information theory can be applied to the Semantic Web, and especially to vocabularies. The proposed ranking metrics can be applied in three areas: (i) vocabulary life-cycle management, (ii) Semantic Web visualizations and (iii) the interlinking process.

Transcript of Information Content based Ranking Metric for Linked Open Vocabularies

Page 1: Information Content based Ranking Metric for Linked Open Vocabularies

Ghislain A. Atemezing (@gatemezing)

Raphaël Troncy (@rtroncy)

Information Content based Ranking Metric for Linked Open Vocabularies

Page 2: Information Content based Ranking Metric for Linked Open Vocabularies

Goal and Agenda

Goal: Present a new ranking metric for reusing vocabularies

Motivation

Combine Information Theory with metadata information

Find new assessment metric for vocabularies

Current situation

A single popularity-based metric (e.g. prefix.cc or LODStats)

Only ONE dimension used for assessing vocabularies

Proposal: compute informativeness of LOV terms

Experiments and Results

Applications

2014/09/05 - SEMANTICS 2014 - Leipzig, Germany

Page 3: Information Content based Ranking Metric for Linked Open Vocabularies

Vocabulary Purpose

Model to understand a domain’s semantics

Vocabulary terms contain information

A term = Class, Object Property, Data Property

Essential for publishing data on the Web

How to quantify value of a term?

Informativeness of a term is inversely related to its probability
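
The inverse relation between informativeness and probability is usually written as self-information in information theory. As a minimal sketch, assuming p(t) denotes the probability of encountering term t (a symbol introduced here for illustration, not taken from the slide):

```latex
\[
  I(t) = -\log_2 p(t), \qquad 0 < p(t) \le 1
\]
```

Under this definition a term with p(t) = 1/2 carries 1 bit, while a rarer term with p(t) = 1/8 carries 3 bits.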


Page 4: Information Content based Ranking Metric for Linked Open Vocabularies

Existing catalogs of vocabularies


Some catalogs of vocabularies

Page 5: Information Content based Ranking Metric for Linked Open Vocabularies

Linked Open Vocabularies (LOV)

A curated list of vocabularies

More than 420 vocabularies

Each of them is described with the Vocabulary of a Friend (VOAF) schema

Track the (temporal) evolution of vocabularies

Some related services

SPARQL endpoint: http://lov.okfn.org/endpoint/lov

Search function: http://lov.okfn.org/dataset/lov/search

An Aggregator endpoint: http://lov.okfn.org/endpoint/lov_aggregator

An intelligent bot agent for updates: http://lov.okfn.org/dataset/lov/bot
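
As a rough illustration of how the SPARQL endpoint listed above could be addressed, the sketch below only builds a GET request URL following the standard SPARQL protocol (`?query=...`) and does not send it. The endpoint URL is the one given on the slide and may no longer be live. The query, the `voaf` namespace URI and the assumption that catalog entries are typed `voaf:Vocabulary` are illustrative guesses, not taken from the talk.

```python
from urllib.parse import urlencode

# Endpoint URL exactly as listed on the slide (may no longer be reachable).
LOV_ENDPOINT = "http://lov.okfn.org/endpoint/lov"

# Hypothetical query: list ten resources typed as voaf:Vocabulary.
# The namespace URI below is an assumption made for this sketch.
QUERY = """
PREFIX voaf: <http://purl.org/vocommons/voaf#>
SELECT ?vocab WHERE { ?vocab a voaf:Vocabulary } LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL-protocol GET URL of the form endpoint?query=<encoded>."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(LOV_ENDPOINT, QUERY)
print(url)
```

The resulting URL can be passed to any HTTP client; sending the request is left out on purpose so the sketch runs without network access.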

Page 6: Information Content based Ranking Metric for Linked Open Vocabularies

LOV DESCRIPTION: http://lov.okfn.org/dataset/lov/

LOV DESCRIPTION WITH THE FRAMEWORK OF [d’Aquin-Noy2012-Survey]

CORE FEATURES OF THE FRAMEWORK

Domain: General

Intended Use: Promote and facilitate the reuse of vocabularies in the linked data ecosystem.

Collection: Submitted by any user via LOV-Suggest tool.

Gatekeeping: Manual curation and automatic URI validation

Number of Ontologies: 450+

Dynamics: Growing

Search metadata: Yes, with visual depiction

Search within ontology: Yes

Search across ontologies: Keyword-based; structured search (query-based)

Navigation criteria: Ordered by prefix, namespace, title and visual links navigation

Metrics: Reuse popularity on the LOD Cloud

Comments and review: N/A - Only by the curators

Ranking: Metric-based

Web service access: API

SPARQL endpoint: Yes

Content available: Ontology metadata, URI

Read/Write: Read

Ontology directory: Yes

Ontology registry: Yes

Application platform: Yes


Page 7: Information Content based Ranking Metric for Linked Open Vocabularies

LOV Evolution since March, 2011

The glitch in 2012 corresponds to the migration to OKFN

Quasi-linear growth, starting from 75 vocabularies


Page 8: Information Content based Ranking Metric for Linked Open Vocabularies

Proposal: Metrics for Ranking LOV

Metrics

Information Content Metric (IC): value of information associated with a given entity

Partitioned Information Content Metric (PIC)

Propose a ranking based on IC and PIC

Method

Adapt the IC and PIC functions to Semantic Web vocabularies

Select candidate vocabularies in LOV catalog

Compute the scores


Page 9: Information Content based Ranking Metric for Linked Open Vocabularies

Information Content Metrics for LOV

Information Content

Formula:

N = maximum value of term occurrence in LOV

φ(t) = occurrence of term t in LOV

Partitioned IC

LOV is a semantic network of resources

Formula:

wf = weight for vocabulary f

+objectURI+ = owl:ObjectProperty/DatatypeProperty; rdf:Property
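
The formulas themselves are not part of the transcript text, so the following is only a sketch under stated assumptions: IC is taken in the common form IC(t) = log2(N / φ(t)), and PIC is assumed to be the vocabulary weight wf times the sum of the IC values of its terms. Both forms, the function names and the sample counts are assumptions for illustration and may differ from the authors' exact definitions.

```python
import math


def information_content(occurrences: int, max_occurrences: int) -> float:
    """IC of a term, ASSUMING IC(t) = log2(N / phi(t)).

    occurrences     -- phi(t), the number of occurrences of the term
    max_occurrences -- N, the maximum occurrence count over all terms
    """
    if occurrences <= 0 or max_occurrences < occurrences:
        raise ValueError("expected 0 < phi(t) <= N")
    return math.log2(max_occurrences / occurrences)


def partitioned_information_content(term_ics, weight: float) -> float:
    """PIC of a vocabulary, ASSUMING PIC(f) = w_f * sum of IC over its terms."""
    return weight * sum(term_ics)


# Hypothetical occurrence counts for three terms of one vocabulary.
phi = {"ex:a": 8, "ex:b": 2, "ex:c": 1}
n_max = max(phi.values())

ics = {term: information_content(count, n_max) for term, count in phi.items()}
pic = partitioned_information_content(ics.values(), weight=6)
print(ics, pic)
```

With these assumed forms, the most frequent term gets an IC of 0 and rarer terms get higher values, which matches the idea that informativeness decreases with probability.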


Page 10: Information Content based Ranking Metric for Linked Open Vocabularies

Information Content Metrics for LOV

(Light) weighting scheme

wf = 1 if the vocabulary reuses other vocabularies

wf = 2 if datasets are using the vocabulary

wf = 3 if the vocabulary is reused elsewhere
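
The weighting scheme can be sketched as a small function. It assumes the three weights are additive when several conditions hold, which is consistent with the running example's wf = 1 + 2 + 3 = 6; the function and parameter names are hypothetical.

```python
def vocabulary_weight(reuses_other_vocabularies: bool,
                      used_by_datasets: bool,
                      reused_elsewhere: bool) -> int:
    """Return w_f as the sum of the weights whose condition holds.

    Additivity is an assumption drawn from the running example (1 + 2 + 3 = 6).
    """
    weight = 0
    if reuses_other_vocabularies:
        weight += 1  # the vocabulary reuses other vocabularies
    if used_by_datasets:
        weight += 2  # datasets are using the vocabulary
    if reused_elsewhere:
        weight += 3  # the vocabulary is reused elsewhere
    return weight


print(vocabulary_weight(True, True, True))
```

A vocabulary that meets all three conditions gets the maximum weight of 6 under this reading.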


Page 11: Information Content based Ranking Metric for Linked Open Vocabularies

Ranking Algorithm

1- Candidate terms selection in LOV

2- Grouping terms by namespace & weight assignment

3- Compute IC score

4- Compute PIC score

Output ranking
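
The four steps and the final ranking can be sketched end to end as follows. This is an illustration, not the authors' implementation: it assumes IC(t) = log2(N / φ(t)) and PIC(f) = wf times the sum of IC over the vocabulary's terms, and the input structures, names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def rank_vocabularies(term_occurrences, weights):
    """Rank namespaces by PIC, highest first.

    term_occurrences -- {(namespace, local_name): occurrence count}
    weights          -- {namespace: w_f}
    """
    # Step 1: candidate selection, here simply terms that occur at least once.
    candidates = {t: n for t, n in term_occurrences.items() if n > 0}
    if not candidates:
        return []
    n_max = max(candidates.values())

    # Step 2: group candidate terms by namespace (weights are given as input).
    groups = defaultdict(list)
    for (namespace, _local_name), count in candidates.items():
        groups[namespace].append(count)

    # Step 3: IC score per term, ASSUMING IC(t) = log2(N / phi(t)).
    ic = {ns: [math.log2(n_max / c) for c in counts]
          for ns, counts in groups.items()}

    # Step 4: PIC score per vocabulary, ASSUMING PIC(f) = w_f * sum(IC).
    pic = {ns: weights[ns] * sum(values) for ns, values in ic.items()}

    # Output: vocabularies sorted by decreasing PIC.
    return sorted(pic.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two namespaces with two terms each.
occurrences = {("ex1:", "a"): 1, ("ex1:", "b"): 2,
               ("ex2:", "c"): 8, ("ex2:", "d"): 4}
ranking = rank_vocabularies(occurrences, {"ex1:": 1, "ex2:": 6})
print(ranking)
```

In this toy input the higher weight of `ex2:` outweighs the rarer terms of `ex1:`, which shows how the weight adds a second dimension to a purely frequency-based score.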


Page 12: Information Content based Ranking Metric for Linked Open Vocabularies

Running Example: dcterms vs foaf

dcterms: http://purl.org/dc/terms/

Candidate terms: 53 (39 properties + 14 classes)

wf = 1 + 2 + 3 = 6

PIC = 1724.844

foaf: http://xmlns.com/foaf/0.1/

Candidate terms: 35 (26 properties + 9 classes)

wf = 1 + 2 + 3 = 6

PIC = 1033.197

PIC(dcterms) > PIC(foaf)
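
As a quick arithmetic check on the figures above, using only the reported PIC values and candidate term counts (the per-term averages are derived here for illustration and are not reported in the talk):

```latex
\[
  \frac{\mathrm{PIC}(\mathrm{dcterms})}{\mathrm{PIC}(\mathrm{foaf})}
  = \frac{1724.844}{1033.197} \approx 1.67
\]
\[
  \frac{1724.844}{53} \approx 32.54,
  \qquad
  \frac{1033.197}{35} \approx 29.52
\]
```

Since both vocabularies have the same weight, the gap appears to come mostly from dcterms having more candidate terms, with a slightly higher average contribution per term.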


Page 13: Information Content based Ranking Metric for Linked Open Vocabularies

Results on Ranking

Top-15 terms (IC value) Top-15 vocabs (PIC value)


Page 14: Information Content based Ranking Metric for Linked Open Vocabularies

Comparison

Relatively stable position of foaf in the prefix.cc, vocab.cc and LODStats catalogues.

LOV-PIC/LODStats: skos and dcterms have a “relatively” stable ranking.

List of “most popular” vocabularies: foaf, skos, dcterms, time, dce, prov.


Page 15: Information Content based Ranking Metric for Linked Open Vocabularies

Applications of the Ranking Metrics

Vocabulary life-cycle management

Help assess the use of terms and vocabulary updates

Monitor the use of http://www.w3.org/2003/06/sw-vocab-status/ns#term_status or owl:deprecated

Semantic Web applications

Vocabularies with higher PIC could be proposed to the user preferentially, e.g. for choosing properties to display in a faceted browsing interface

Interlinking datasets

Generate sameAs links with data based on vocabulary terms with lower IC values


Page 16: Information Content based Ranking Metric for Linked Open Vocabularies

Conclusion and Future Work

We have presented new metrics for ranking vocabularies

By applying the Information Content concept to LOV

By taking more dimensions into account in the ranking metrics

The metrics can be applied to vocabulary reuse, ontology modelling and visualizations

Future work

Add equivalence axioms to the ranking model

Compare (P)IC with other graph-based rankings (e.g. PageRank)

Investigate the dependency ranking between vocabularies


Page 17: Information Content based Ranking Metric for Linked Open Vocabularies

Q/A Session

Thanks for your attention!