Information Content based Ranking Metric for Linked Open Vocabularies

Ghislain A. Atemezing (@gatemezing)

Raphaël Troncy (@rtroncy)

Goal and Agenda

Goal: Present a new ranking metric for reusing vocabularies

Motivation

Combine Information Theory with metadata information

Find new assessment metric for vocabularies

Current situation

Only popularity-based metrics are available (e.g. prefix.cc or LODStats)

Only ONE dimension used for assessing vocabularies

Proposal: compute informativeness of LOV terms

Experiments and Results

Applications

2014/09/05 - SEMANTICS 2014 - Leipzig, Germany

http://www.eurecom.fr/


http://prefix.cc/

http://stats.lod2.eu/

Vocabulary Purpose

Model to understand a domain’s semantics

Vocabulary terms contain information

A term = Class, Object Property, Data Property

Essential for publishing data on the Web

How to quantify value of a term?

Informativeness is inversely related to probability: the more often a term occurs, the less informative it is




Existing catalogs of vocabularies


Some catalogs of vocabularies



Linked Open Vocabularies (LOV)

A curated list of vocabularies

More than 420 vocabularies

Each of them is described with the Vocabulary of a Friend (VOAF) schema

Track the (temporal) evolution of vocabularies

Some related services

SPARQL endpoint: http://lov.okfn.org/endpoint/lov

Search function: http://lov.okfn.org/dataset/lov/search

An aggregator endpoint: http://lov.okfn.org/endpoint/lov_aggregator

An intelligent bot agent for updates: http://lov.okfn.org/dataset/lov/bot
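As a rough illustration of how the SPARQL endpoint listed above could be queried, the sketch below only builds a SPARQL-protocol GET URL and sends nothing over the network. The query text and the helper name build_query_url are assumptions made for this example; the endpoint URL is the one given on the slide and may no longer be live.

```python
from urllib.parse import urlencode

# Endpoint as listed on the slide; availability is not guaranteed.
LOV_ENDPOINT = "http://lov.okfn.org/endpoint/lov"

# Hypothetical query: list a few resources typed as voaf:Vocabulary.
QUERY = """
PREFIX voaf: <http://purl.org/vocommons/voaf#>
SELECT ?vocab WHERE { ?vocab a voaf:Vocabulary } LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(LOV_ENDPOINT, QUERY)
print(url)
```

The resulting URL can be handed to any HTTP client; the response format depends on the endpoint and on the Accept header sent with the request.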



LOV DESCRIPTION: http://lov.okfn.org/dataset/lov/

CORE FEATURES OF THE FRAMEWORK

Domain: General

Intended use: Promote and facilitate the reuse of vocabularies in the linked data ecosystem.

Collection: Submitted by any user via the LOV-Suggest tool.

Gatekeeping: Manual curation and automatic URI validation

Number of ontologies: 450+

Dynamics: Growing

Search metadata: Yes, with visual depiction

Search within ontology: Yes

Search across ontologies: Keyword-based; structured search (query-based)

Navigation criteria: Ordered by prefix, namespace, title and visual links navigation

Metrics: Reuse popularity on the LOD Cloud

Comments and review: N/A - Only by the curators

Ranking: Metric-based

Web service access: API

SPARQL endpoint: Yes

Content available: Ontology metadata, URI

Read/Write: Read

Ontology directory: Yes

Ontology registry: Yes

Application platform: Yes

LOV DESCRIPTION WITH THE FRAMEWORK OF [d’Aquin-Noy2012-Survey]




http://www.sciencedirect.com/science/article/pii/S157082681100076X

LOV Evolution since March, 2011

The glitch in 2012 corresponds to the migration to OKFN

Quasi-linear growth, starting from 75 vocabularies




Proposal: Metrics for Ranking LOV

Metrics

Information Content Metric (IC): value of information associated with a given entity

Partition Information Content Metric (PIC)

Propose a ranking based on IC and PIC

Method

Adapt the IC and PIC functions to vocabulary semantics

Select candidate vocabularies in LOV catalog

Compute the scores




Information Content Metrics for LOV

Information Content

Formula:

N = maximum number of occurrences of a term in LOV

φ(t) = number of occurrences of term t in LOV
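A minimal sketch of how an IC value could be computed from these two quantities, assuming the usual information-theoretic form IC(t) = -log2(φ(t)/N). The base-2 logarithm, the function name and the sample counts are assumptions made for illustration; none of them is taken from the slides.

```python
import math


def information_content(occurrences: int, max_occurrences: int) -> float:
    """IC(t) = -log2(phi(t) / N), written as log2(N / phi(t)); base 2 is an assumption."""
    if not 0 < occurrences <= max_occurrences:
        raise ValueError("expected 0 < occurrences <= max_occurrences")
    return math.log2(max_occurrences / occurrences)


# Hypothetical occurrence counts, for illustration only.
phi = {"ex:rareTerm": 1, "ex:commonTerm": 256, "ex:topTerm": 1024}
N = max(phi.values())  # N is the highest occurrence count among all terms

ic = {term: information_content(count, N) for term, count in phi.items()}
for term, value in ic.items():
    print(term, value)
```

Under this reading, the most frequent term has an IC of zero and rarer terms score higher, which matches the inverse relation between informativeness and probability.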

Partitioned IC

LOV is a semantic network of resources

Formula:

wf = weight for vocabulary f

+objectURI+ = owl:ObjectProperty / owl:DatatypeProperty; rdf:Property
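The formula itself does not survive in this transcript. One reading that is consistent with the symbols defined here, and with the running example later in the talk where wf = 1 + 2 + 3 = 6, is that the PIC of a vocabulary is the weighted sum of the IC values of its terms. This is an assumption rather than the authors' confirmed definition; T_f (the set of candidate terms of vocabulary f) and the base-2 logarithm are notation introduced here.

```latex
\mathrm{IC}(t) = -\log_2 \frac{\varphi(t)}{N},
\qquad
\mathrm{PIC}(f) = w_f \sum_{t \in T_f} \mathrm{IC}(t)
```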




Information Content Metrics for LOV

(Light) weighting scheme

wf = 1 if the vocabulary reuses other vocabularies

wf = 2 if datasets are using the vocabulary

wf = 3 if the vocabulary is reused elsewhere
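The running example in this talk gives wf = 1 + 2 + 3 = 6, which suggests that the weights of all conditions that hold are added together. A small sketch under that assumption; the function and parameter names are hypothetical.

```python
def vocabulary_weight(reuses_others: bool, used_by_datasets: bool,
                      reused_elsewhere: bool) -> int:
    """Sum the weights of the conditions that hold (additive reading, assumed)."""
    weight = 0
    if reuses_others:       # the vocabulary reuses other vocabularies
        weight += 1
    if used_by_datasets:    # datasets are using the vocabulary
        weight += 2
    if reused_elsewhere:    # the vocabulary is reused elsewhere
        weight += 3
    return weight


# A vocabulary that satisfies all three conditions.
print(vocabulary_weight(True, True, True))
```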




Ranking Algorithm

1- Candidate terms selection in LOV

2- Grouping terms by namespace & weight assignment

3- Compute IC score

4- Compute PIC score

Output ranking
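The four steps can be sketched end to end as follows. This is an illustration, not the authors' implementation: it assumes IC(t) = log2(N / φ(t)) and PIC = wf × sum of IC over a namespace's terms, derives the namespace by cutting the term URI at its last '#' or '/', and runs on made-up occurrence counts and weights.

```python
import math
from collections import defaultdict


def rank_vocabularies(term_occurrences, weights):
    """Return [(namespace, pic), ...] sorted by decreasing PIC."""
    n_max = max(term_occurrences.values())

    # Step 2: group candidate terms by namespace (URI up to the last '#' or '/').
    groups = defaultdict(list)
    for term, count in term_occurrences.items():
        cut = max(term.rfind("#"), term.rfind("/")) + 1
        groups[term[:cut]].append(count)

    # Steps 3 and 4: IC per term, then weighted sum per namespace.
    # A missing weight defaults to 1, which is a choice made for this sketch.
    pic = {}
    for namespace, counts in groups.items():
        ic_sum = sum(math.log2(n_max / count) for count in counts)
        pic[namespace] = weights.get(namespace, 1) * ic_sum

    return sorted(pic.items(), key=lambda item: item[1], reverse=True)


# Step 1: hypothetical candidate terms with their occurrence counts.
occurrences = {
    "http://example.org/a#Thing": 8,
    "http://example.org/a#name": 2,
    "http://example.org/b/Item": 4,
    "http://example.org/b/label": 8,
}
weights = {"http://example.org/a#": 6, "http://example.org/b/": 3}

ranking = rank_vocabularies(occurrences, weights)
for namespace, score in ranking:
    print(namespace, score)
```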




Running Example: dcterms vs foaf

dcterms:

http://purl.org/dc/terms/

Candidate terms: 53 (39 properties + 14 classes)

wf = 1 + 2 + 3 = 6

PIC = 1724.844

foaf:

http://xmlns.com/foaf/0.1/

Candidate terms: 35 (26 properties + 9 classes)

wf = 1 + 2 + 3 = 6

PIC = 1033.197

PIC(dcterms) > PIC(foaf)
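Both vocabularies receive the same weight, so the gap in PIC has to come from their terms. The check below uses only the figures reported above; the per-term average is a derived quantity introduced here for illustration and is not reported in the talk.

```python
# Figures as reported in the running example.
reported = {
    "dcterms": {"terms": 53, "wf": 6, "pic": 1724.844},
    "foaf": {"terms": 35, "wf": 6, "pic": 1033.197},
}

# Order the vocabularies by decreasing PIC.
ranking = sorted(reported, key=lambda name: reported[name]["pic"], reverse=True)

# Average PIC contribution per candidate term (derived, for illustration).
per_term = {name: round(entry["pic"] / entry["terms"], 3)
            for name, entry in reported.items()}

print(ranking)
print(per_term)
```

On these figures dcterms stays ahead of foaf even after dividing by the number of candidate terms, so its higher PIC is not only an effect of having more terms.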




Results on Ranking

[Figures: Top-15 terms (IC value); Top-15 vocabs (PIC value)]




Comparison

Relatively stable position of foaf in the prefix.cc, vocab.cc and LODStats catalogues.

LOV-PIC/LODStats: skos and dcterms have a “relatively” stable ranking.

List of “most popular” vocabularies: foaf, skos, dcterms, time, dce, prov.




Applications of the Ranking Metrics

Vocabulary life-cycle management

Help assess the use of terms and vocabulary updates

Monitor the use of http://www.w3.org/2003/06/sw-vocab-status/ns#term_status or owl:deprecated

Semantic Web applications

Vocabularies with a higher PIC might be preferentially proposed to a user, e.g. for choosing properties to display in a faceted browsing interface

Interlinking datasets

Generate sameAs links with data based on vocabulary terms with lower IC values




Conclusion and Future Work

We have presented new metrics for ranking vocabularies

By applying the Information Content concept to LOV

By taking more dimensions into account in the ranking metrics

The metrics can be applied to vocabulary reuse, ontology modelling and visualizations

Future work

Add equivalence axioms in the ranking model

Compare (P)IC with other graph-based ranking approaches (e.g. PageRank)

Investigate the dependency ranking between vocabularies




Q/A Session

Thanks for your attention!
