Data Interlinking


Data interlinking

Jérôme Euzenat

Montbonnot, France

June 10, 2015

Data interlinking

Similarity-based approach

Key-based interlinking

Ontology matching & data interlinking

Tools

The problem: RDF data interlinking

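〈http://data.bnf.fr/12144801/edgar_allan_poe_the_gold_bug/, dc:title, "The gold bug"〉

[Figure: two RDF graphs describing overlapping entities. In the first, a Work titled "The gold bug" (lang en) has a creator of type Writer, "E. Poe", described by firstname and lastname. In the second, resources b, a1 and a2 of types Book and Person are described by the properties name, orig, author and translator, with the values "Baudelaire", "Malarme" and "The raven".]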


Goal of the lecture

- Provide an overview of the problem of data interlinking
- Describe broad categories of solutions
- Point to useful tools for generating links

Mostly about generating links, not about finding how to generate them.


Outline

Data interlinking

Similarity-based approach

Key-based interlinking

Ontology matching & data interlinking

Tools


Data interlinking

I use (with the same meaning):

- instance matching
- entity linking
- data interlinking

I do not use:

- record linkage
- data deduplication
- entity reconciliation
- coreference resolution


The data interlinking problem

Data interlinking is the task of finding the same entities within different datasets (RDF graphs).

[Figure: data source 1 and data source 2 go through interlinking, which produces owl:sameAs links between them.]


The data interlinking process

[Figure: the interlinking process takes two data sources as input and outputs the resulting links. Sample links, parameters and resources also appear as inputs of the process.]


The data interlinking process (2)

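[Figure: interlinking split into two steps. Extraction derives a linkage specification from the data sources d and d′; generation then applies this specification to produce the link set l.]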


Approaches to data interlinking

There are two main approaches to data interlinking:

- similarity-based: resources are compared through a similarity measure and, if they are similar enough, they are considered the same;
- key-based: sufficient conditions for two resources to be the same are induced and used to find the same entities.


Classification of similarities

Data interlinking techniques may be based on:

- Data IDs (URIs);
- Data keys;
- External relations: (explicit or implicit) links to other resources;
- Data description (content).


Manual resource matching

[Figure: URI1 and URI2 are linked by owl:sameAs through manual observation.]

- This does not scale.
- But it may be good for a first sample or reference.
- Crowdsourcing?


URI matching

[Figure: URI1 and URI2 are linked by owl:sameAs through a URI transformation.]

http://dbpedia.org/resource/Johann_Sebastian_Bach owl:sameAs http://www.lastfm.fr/music/Johann+Sebastian+Bach

http://rdf.insee.fr/geo/regions-2011.rdf#REG_11 ? http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR10
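The first pair above can be produced by a purely syntactic rewriting of the URI. The following is a minimal sketch, assuming that the DBpedia local name separates words with underscores and that the target site separates them with plus signs; the helper name `dbpedia_to_lastfm` is hypothetical, and the result is only a candidate link that still has to be checked against the target dataset. The INSEE/NUTS pair shows why this does not always work: no such rewriting turns REG_11 into FR10.

```python
from urllib.parse import quote_plus, unquote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
LASTFM_PREFIX = "http://www.lastfm.fr/music/"  # prefix taken from the example above


def dbpedia_to_lastfm(uri: str) -> str:
    """Rewrite a DBpedia resource URI into a candidate last.fm URI (hypothetical helper)."""
    if not uri.startswith(DBPEDIA_PREFIX):
        raise ValueError("not a DBpedia resource URI: " + uri)
    # Decode percent-escapes, then turn the underscore-separated name into a label.
    local_name = unquote(uri[len(DBPEDIA_PREFIX):])
    label = local_name.replace("_", " ")
    # quote_plus encodes spaces as '+', which is the assumed target convention.
    return LASTFM_PREFIX + quote_plus(label)


candidate = dbpedia_to_lastfm("http://dbpedia.org/resource/Johann_Sebastian_Bach")
print(candidate)
```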


Id matching

[Figure: two resources carrying the same id are linked by owl:sameAs.]

Such identifiers include:

- Social security numbers
- ISBN, DOI, MAC addresses, etc.
- Authorities: ISO (countries, languages), IATA (airports)

Most databases are built on such identifiers... but they are often local to the database.


Context-based similarity

[Figure: URI1 and URI2 are both projected onto an external resource (here VIAF); their context-based "similarity" there yields an owl:sameAs link.]

Process:

- Project your data onto another resource (DBpedia, GeoNames, VIAF, etc.)
- Assess relations between the considered terms
- Import the relation into the dataset

This harnesses the power of links!


Content-based similarity

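[Figure: the two example graphs. On one side, a Work titled "The gold bug" whose creator is a Writer, "E. Poe" (firstname, lastname). On the other side, Books titled "Le corbeau" and "Le scarabée d'or", with orig, author and translator properties and Persons named "Baudelaire" and "Poe". Similarity is computed between resources of the two graphs to decide on owl:sameAs links.]

Two main approaches:

- bag of text
- structured similarity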


Term-based similarity

[Figure: the same two graphs, where only the textual content matters ("The gold bug", "E. Poe", "Baudelaire", "Poe", "Le corbeau", "Le scarabée d'or" and the property and type names); a "bag of words" similarity is computed to decide on owl:sameAs links.]

Various tools:

- Normalisation (stemmers, tokenizers)
- Use of linguistic resources (WordNet)
- Translation
- Many similarity measures, especially from information retrieval
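To make the "bag of words" idea concrete, here is a minimal sketch using raw term frequencies and cosine similarity, one of the information retrieval measures mentioned above. The three strings are illustrative flattenings of resources loosely inspired by the figure (they are assumptions, not the actual data), and the tokenizer is deliberately naive: no stemming, stop-word filtering or translation is applied.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens and count term frequencies."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative virtual texts: literals and type names attached to each resource.
doc1 = bag_of_words("The gold bug E. Poe Writer Work")
doc2 = bag_of_words("Le scarabee d'or Poe Person Book")
doc3 = bag_of_words("Le corbeau Baudelaire Person Book")

print(cosine(doc1, doc2))  # one shared term ("poe")
print(cosine(doc1, doc3))  # no shared term
```

Without translation, the only evidence linking the first two resources is the shared token "poe", which illustrates why normalisation and translation matter for this family of techniques.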


Structure similarity

[Figure: the same two graphs reduced to their structure (title, creator, firstname, lastname and type on one side; orig, name, title, author, translator and type on the other); structure similarity is computed to decide on owl:sameAs links.]

Techniques:

- Based on graph matching techniques
- Can be used to learn weights on properties (but needs matching)
- Problem: scalability


Cross-lingual RDF data interlinking

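[Figure: should these two resources be linked by owl:sameAs? On one side, http://a.org/Mus999, described in French with the properties nom, lieu, adresse, ville, rue and zip and the values "Musée du Louvre", "France", "Paris", "99, rue de Rivoli" and "75001". On the other side, http://bb.cn/盧浮宮, described in Chinese with the properties 稱號 and 位於 and the values "盧浮宮" and "法國巴黎".]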


Similarity-based data interlinking

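Hypothesis: the higher the similarity, the higher the probability that two resources denote the same object.

[Figure: four successive views of the comparison, each ending with the question "owl:sameAs ?":
- RESOURCE vs. RESOURCE, compared by SIMILARITY;
- DOCUMENT vs. DOCUMENT, compared by SIMILARITY (virtual documents, see the reference below);
- DOCUMENT(zh) vs. DOCUMENT(en): documents are translated so that similarity (SIM) is computed between documents in the same language;
- BabelNet(IDs) vs. BabelNet(IDs), compared by SIMILARITY.]

Yuzhong Qu, Wei Hu, Gong Cheng: Constructing virtual documents for ontology matching. WWW 2006: 23-31.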


General cross-lingual interlinking framework

1. Virtual documents
2. Language normalization
3. Similarity computation
4. Link generation


Building virtual documents by levels

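[Figure: the virtual document of http://dbpedia.org/resource/Charles_Perrault, built by levels.]

Level 1: the resource itself, with the label "Charles Perrault" and a link to dbpedia:France.

Level 2: the description of the linked resource: "France is a sovereign country in Western Europe that includes overseas regions and territories..."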


Machine translation: parameters

1. Virtual documents: Level 1 or Level 2
2.1 Machine translation: ZH→EN
2.2 NLP preprocessing: Lowercase + Tokenize; + Filter stop words; + Stemming (Porter); + Bigrams (terms)
3. Similarity computation: TF + cosine or TF*IDF + cosine
4. Link generation: Greedy or Hungarian

32 settings have been explored in total.
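The two link generation options differ in how they turn a similarity matrix into one-to-one links. The sketch below contrasts a greedy selection with an optimal assignment. To stay dependency-free, the optimal assignment is found by exhaustive search over permutations, which is only a stand-in for the Hungarian algorithm (it reaches the same optimum, but in exponential rather than polynomial time). The matrix values and function names are illustrative assumptions.

```python
from itertools import permutations


def greedy_links(sim, threshold=0.0):
    """Repeatedly pick the most similar remaining pair; each row/column is used once."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, links = set(), set(), []
    for s, i, j in pairs:
        if s > threshold and i not in used_rows and j not in used_cols:
            links.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(links)


def optimal_links(sim):
    """One-to-one assignment maximising total similarity (brute force, square matrix).

    The Hungarian algorithm computes the same optimum efficiently.
    """
    n = len(sim)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


# Rows: resources of the first dataset; columns: resources of the second one.
sim = [
    [0.90, 0.80],
    [0.85, 0.10],
]

print(greedy_links(sim))   # takes 0.90 first, then is left with 0.10
print(optimal_links(sim))  # prefers 0.80 + 0.85 over 0.90 + 0.10
```

On this matrix the greedy strategy locks in the single best pair and is then forced into a poor second link, whereas the optimal assignment accepts a slightly weaker first pair to obtain a better overall result.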


Lcase+Tokenization with TF*IDF at Level 1

[Figure: results plot with legend bins 0 - 0.11, 0.11 - 0.15, 0.15 - 0.25, 0.25 - 0.35, 0.35 - 0.45 and 0.45 - 1.]


Adding noise


BabelNet method: parameters

1. Virtual documents: Level 1 or Level 2
2. Multilingual KB mapping
3. Similarity computation: TF + cosine or TF*IDF + cosine
4. Link generation: Greedy or Hungarian


Database keys

- A key is a set of attributes which uniquely identifies the elements of a relation
- e.g., Book: isbn; People: firstname, lastname, birthplace, birthdate
- Keys are usually given and used to check integrity

They may be used for identifying the same entities across two databases. But they require alignments.


Example of interlinking with keys and alignments

Are the resources bnf:cb118949856 and bne:XX1721208 the same?

- if the BNF ontology states foaf:Person owl:hasKey {foaf:name, dc:dates}
- and we have the following alignment

[Figure: the two descriptions and the alignment between their vocabularies.

bnf:cb118949856, of type foaf:Person:
  foaf:name                            Albert Camus
  rda:dateOfBirth                      07-11-1913
  rda:dateOfDeath                      04-01-1960
  rda:biographicalInformation          Romancier, dramaturge et essayiste
  rda:countryAssociatedWithThePerson   http://id.loc.gov/vocabulary/countries/fr
  rda:placeOfBirth                     Mondovi (Algerie)
  dc:dates                             1913-1960

bne:XX1721208, of type frbrer:C1005:
  frber:P3039                          Camus, Albert
  frber:P3040                          1913-1960
  rda:sourceConsulted                  Aut [...]1980

The alignment relates foaf:Person to frbrer:C1005 and the properties of both vocabularies. Question: bnf:cb118949856 owl:sameAs bne:XX1721208 ?]


Key-based interlinking methods

Database keys allow for identifying entities: if they are aligned, this can be used for linking.

- Advantages
  - they are logically grounded
  - they minimize the number of properties to compare (if we use minimal keys)
- Drawbacks
  - they require an alignment between properties and classes
  - very few key axioms are available, and they are not necessarily useful for interlinking

We overcome these drawbacks by introducing link keys.


Link key

A link key

〈{〈p1, q1〉, ..., 〈pn, qn〉} {〈p′1, q′1〉, ..., 〈p′m, q′m〉} linkkey 〈c, d〉〉

holds iff, for all pairs of instances a and b belonging respectively to classes c and d of ontologies O and O′,

- if a and b share at least one value (object) for each pair of properties pi and qi respectively,
- and a and b share all their values (objects) for each pair of properties p′i and q′i respectively,

then they are the same (〈a, owl:sameAs, b〉).

Example:

〈{〈foaf:name, frbr:P3039〉} {〈dc:dates, frbr:P3040〉} linkkey 〈foaf:Person, frbr:C1005〉〉
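The definition above translates directly into a link generation procedure. The following is a minimal sketch under several assumptions: each dataset is a plain dictionary already restricted to the instances of classes c and d, property values are sets of already normalised strings, and an empty value set never satisfies the "share all values" condition. The function name and the second BNE resource (`bne:other`) are hypothetical and only serve to show a non-match.

```python
def generate_links(source, target, in_pairs, eq_pairs):
    """Return the set of (a, b) pairs satisfying the link key conditions.

    source / target: dict mapping an instance id to a dict {property: set of values}.
    in_pairs: property pairs for which a and b must share at least one value.
    eq_pairs: property pairs for which a and b must have exactly the same values.
    """
    links = set()
    for a, a_props in source.items():
        for b, b_props in target.items():
            shares_one = all(
                a_props.get(p, set()) & b_props.get(q, set()) for p, q in in_pairs
            )
            shares_all = all(
                # Assumption: two empty value sets do not count as a match.
                a_props.get(p, set()) and a_props.get(p, set()) == b_props.get(q, set())
                for p, q in eq_pairs
            )
            if shares_one and shares_all:
                links.add((a, b))
    return links


# Toy data: names are assumed to be normalised to the same form beforehand.
bnf = {
    "bnf:cb118949856": {"foaf:name": {"albert camus"}, "dc:dates": {"1913-1960"}},
}
bne = {
    "bne:XX1721208": {"frbr:P3039": {"albert camus"}, "frbr:P3040": {"1913-1960"}},
    "bne:other": {"frbr:P3039": {"albert camus"}, "frbr:P3040": {"1900-1950"}},
}

links = generate_links(
    bnf,
    bne,
    in_pairs=[("foaf:name", "frbr:P3039")],
    eq_pairs=[("dc:dates", "frbr:P3040")],
)
print(links)
```

The nested loop compares every pair of instances, which is fine for a sketch but quadratic; in practice the comparison would be delegated to an index or to a query engine.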


Link key extraction

Problem: How to induce such link keys from data?

The number of sets of property pairs is exponential.

Our approach:

- discover only candidate link keys;
- evaluate them in order to select only the "good" ones.


Candidate link key

A candidate link key is a set of property pairs {〈p1, q1〉, ..., 〈pk, qk〉} that

1. would generate at least one link if used as a link key, and
2. is maximal for at least one link, or is the intersection of several candidate link keys.


Supervised selection measures

If a sample of reference links is available:

- Positive examples (L+): a set of owl:sameAs links
- Negative examples (L−): a set of owl:differentFrom links

Idea: approximate precision and recall on that sample.

Definition (Relative precision and recall)

$$\mathrm{precision}(K, L^+, L^-) = \frac{|L^+ \cap L_{D,D'}(K)|}{|(L^+ \cup L^-) \cap L_{D,D'}(K)|}$$

$$\mathrm{recall}(K, L^+) = \frac{|L^+ \cap L_{D,D'}(K)|}{|L^+|}$$
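These two measures are easy to compute once the links generated by a candidate link key K are available as a set of pairs. The sketch below is a direct transcription of the definitions, with links represented as Python tuples; the toy link sets are illustrative assumptions. Note that a generated link that is neither in L+ nor in L− does not affect relative precision.

```python
def relative_precision(links, positive, negative):
    """|L+ ∩ L(K)| / |(L+ ∪ L-) ∩ L(K)|, or 0.0 if no generated link is judged."""
    judged = (positive | negative) & links
    return len(positive & links) / len(judged) if judged else 0.0


def relative_recall(links, positive):
    """|L+ ∩ L(K)| / |L+|, or 0.0 if there is no positive example."""
    return len(positive & links) / len(positive) if positive else 0.0


# Links generated by some candidate link key K (toy data).
generated = {("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a5", "b5")}
positive = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}  # owl:sameAs sample
negative = {("a3", "b9")}                               # owl:differentFrom sample

print(relative_precision(generated, positive, negative))  # 2 correct out of 3 judged
print(relative_recall(generated, positive))               # 2 found out of 3 expected
```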


Unsupervised selection measures

When no reference link is available.

Idea: measure how close the extracted links are to being one-to-one and total.

Definition (Discriminability)

$$\mathrm{disc}(K, D, D') = \frac{\min(|\{a : \langle a, b\rangle \in L_{D,D'}(K)\}|, |\{b : \langle a, b\rangle \in L_{D,D'}(K)\}|)}{|L_{D,D'}(K)|}$$

Definition (Coverage)

$$\mathrm{cov}(K, D, D') = \frac{|\{a : \langle a, b\rangle \in L_{D,D'}(K)\} \cup \{b : \langle a, b\rangle \in L_{D,D'}(K)\}|}{|\{a : c(a) \in D\} \cup \{b : d(b) \in D'\}|}$$
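Both unsupervised measures only need the generated links and the instances of the two classes. The sketch below transcribes the definitions, assuming that instance identifiers of the two datasets are distinct (so that the unions behave as in the formulas); the toy data is illustrative.

```python
def discriminability(links):
    """min(#linked sources, #linked targets) / #links; 1.0 means one-to-one."""
    if not links:
        return 0.0
    sources = {a for a, _ in links}
    targets = {b for _, b in links}
    return min(len(sources), len(targets)) / len(links)


def coverage(links, source_instances, target_instances):
    """Share of the instances of both classes that take part in at least one link."""
    linked = {a for a, _ in links} | {b for _, b in links}
    universe = set(source_instances) | set(target_instances)
    return len(linked) / len(universe) if universe else 0.0


# Toy data: b1 is linked twice, so the link set is not one-to-one.
generated = {("a1", "b1"), ("a2", "b1"), ("a3", "b3")}
source_instances = {"a1", "a2", "a3", "a4"}
target_instances = {"b1", "b2", "b3", "b4"}

print(discriminability(generated))                              # 2 / 3
print(coverage(generated, source_instances, target_instances))  # 5 / 8
```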


Experimental evaluation

These selection measures were evaluated on public datasets.

Finding links between French municipalities described in two different datasets:

- Insee dataset: 36700 instances;
- Geonames dataset: 36552 instances.

The reference link set is composed of:

- Positive links: 36552 owl:sameAs statements;
- Negative links: owl:differentFrom links derived from the owl:sameAs links (closed world assumption).


Evaluation

The algorithm extracted 11 candidate link keys:

{1} {2} {3, 4} {5, 6}
{7, 1} {2, 1} {3, 4, 1} {3, 2, 4}
{3, 7, 4, 1} {3, 2, 4, 1}
{3, 7, 2, 4, 1}

[Figure: the candidates plotted against coverage and discriminability and grouped into three classes: bad (F-measure ≈ 0), good (F-measure ≈ 0.89) and high (F-measure ≈ .99).]

1 = 〈nom, name〉
2 = 〈nom, alternateName〉
3 = 〈subdivisionDe, parentFeature〉
4 = 〈subdivisionDe, parentADM3〉
5 = 〈codeINSEE, population〉
6 = 〈codeCommune, population〉
7 = 〈nom, officialName〉


Evaluation

Correlation between the harmonic mean of discriminability and coverage and the F-measure:

[Figure: the same 11 candidate link keys plotted against coverage and discriminability. The three groups have h-mean(disc., cov) ≈ .99, ≈ .89 and ≈ 0, against F-measures of ≈ .99 (high), ≈ 0.89 (good) and ≈ 0 (bad).]


Why use ontologies?

Because it is obvious that we must compare the instances of equivalent classes based on equivalent properties.

More precisely:

- To reduce the search space for finding link keys and similarities
- To reduce the scope of linkage specifications
- Because the same linkage rules do not work for all classes
- Because classes and properties are, like other features, hints of the similarity between resources

Examples: with similarity and with keys.


Data interlinking through a common ontology

[Figure: URI1 and URI2 belong to datasets described by the same ontology o; resource matching between these datasets yields owl:sameAs links.]


Matching with a common ontology

+ Focus the search: only match instances of the same class;
– Not sufficient: it remains to identify corresponding entities:
  + If keys are defined (OWL 2), this is done;
  + At least we know which properties to compare;
  – Inferring secondary keys may be useful;
  – Correcting discrepancies: record linkage.


Record linkage

        Record 1      Record 2
Name    Johann        Johannes
Date    1665-03-21    31/03/1665
Place   Munchen       Monaco di Bavaria

Having a common ontology does not solve all problems.
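The two records above can be compared field by field once their values are normalised. The sketch below is a minimal illustration using only the Python standard library: dates are parsed from the two formats that appear in the example, and names are compared with `difflib.SequenceMatcher`. The list of date formats is an assumption about the data. The place names show what such syntactic measures cannot solve: recognising that two different names denote the same city needs external knowledge such as a gazetteer.

```python
from datetime import date, datetime
from difflib import SequenceMatcher


def parse_date(text: str) -> date:
    """Parse a date written either as YYYY-MM-DD or as DD/MM/YYYY."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unsupported date format: " + text)


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two case-normalised strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


record1 = {"name": "Johann", "date": "1665-03-21", "place": "Munchen"}
record2 = {"name": "Johannes", "date": "31/03/1665", "place": "Monaco di Bavaria"}

name_score = string_similarity(record1["name"], record2["name"])
day_gap = abs((parse_date(record1["date"]) - parse_date(record2["date"])).days)
place_score = string_similarity(record1["place"], record2["place"])

print(name_score)   # close variants of the same first name
print(day_gap)      # the two dates differ even after normalisation
print(place_score)  # low, although both names may denote the same city
```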


Different types of mismatch

- Different domains, connected (BIM, energy demand) ⇒ few correspondences, any type
- Same domain, different models (engineer, policy maker) ⇒ many correspondences, mostly equivalence
- Same domain, different granularity (city management, building design) ⇒ many correspondences, mostly subsumption


Data interlinking with different ontologies (implicit alignment)

[Figure: URI1 and URI2 belong to datasets described by different ontologies o and o′; resource matching between these datasets yields owl:sameAs links.]


Data interlinking with different ontologies (explicit alignment)

[Figure: URI1 and URI2 belong to datasets described by different ontologies o and o′, which are related by an alignment A; resource matching between these datasets yields owl:sameAs links.]


Ontology matching for data interlinking

[Figure: ontology matching between o and o′ produces the alignment A, which data interlinking then uses to produce owl:sameAs links between URI1 and URI2.]


Heterogeneity problem

Resources being expressed in different ways must be reconciled before being used. Mismatch between formalized knowledge can occur when:

- different languages are used (OWL vs. Topic Maps);
- different terminologies are used:
  - English vs. Chinese;
  - Book vs. Monograph.
- different models are used:
  - different classes: Autobiography vs. Paperback;
  - classes vs. property: Essay vs. literarygenre;
  - classes vs. instances: one physical book as an instance vs. one work as an instance.
- different scopes and granularity are used:
  - only books vs. cultural items vs. any product;
  - books detailed to the print and translation level vs. books as works.


Ontology alignment

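[Figure: two ontologies to be aligned. The first has the classes Item, DVD, Book, Paperback, Hardcover, CD and Person, the properties price, title, doi, creator, pp and author, and the datatypes integer, string and uri. The second has the classes Monograph, Essay, Literary critics, Politics, Biography, Autobiography, Literature, Human and Writer, and the properties pages, isbn, author, title and subject.]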


Expressive alignments (EDOAL)

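[Figure: two expressive correspondences. Volume restricted to size ≤ 14 is subsumed by Pocket; Book restricted to topic = author is equivalent to Autobiography.]

∀x, Pocket(x) ⇐ Volume(x) ∧ size(x, y) ∧ y ≤ 14

∀x, Book(x) ∧ author(x, y) ∧ topic(x, y) ≡ Autobiography(x)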


Example: INSEE dataset

Region table:

code  nom                 chef-lieu
11    Ile-de-France       75056
21    Champagne-Ardenne   51108
22    Picardie            80021

Sous-region table:

region  departement
11      75
11      77
11      78
11      91
11      92
11      93


Example: Administrative ontology

[Figure: the administrative ontology. Classes: Territoire_FR, Pays, Region, Departement, Arrondissement and Commune. Properties: code, nom, chef-lieu and subdivision. Datatypes: integer and string.]


Example: NUTS dataset

NUTSRegion table:

level  code   name            hasParentRegion
0      FR     FRANCE
1      FR1    ILE DE FRANCE   FR
2      FR10   Ile de France   FR1
3      FR101  Paris           FR10
3      FR104  Essonne         FR10


Example: Linking INSEE and NUTS

NUTS: Nomenclature of territorial units for statistics

#INSEE  INSEE name      NUTS level  #NUTS
1       Pays            0           34
                        1           142
26      Region          2           344
100     Departement     3           1488
342     Arrondissement
4036    Canton          4
52422   Commune         5


Example: Linking INSEE and NUTS

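[Figure: instances of the two datasets. INSEE (classes Territoire_FR, Pays, Region, Departement and Commune): PAYS_FR, REG_11, DEP_75, DEP_77, DEP_78 and COM_75056. NUTS (classes Region, Country, NUTSRegion and LAURegion): FR, UK, FR1, FR10, FR101, FR102 and FR103. Five owl:sameAs links connect instances of the two datasets.]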


Example: Linksets

Linksets are specific data sets containing links between URIs.

<http://www.example.org/linkset/INSEE-NUTS>
    a void:Linkset ;
    void:target <http://rdf.insee.fr/geo/regions-2011.rdf> ;
    void:target <http://nuts.psi.enakting.org/id/> .

insee:PAYS_FR owl:sameAs nuts:FR .
insee:REG_11 owl:sameAs nuts:FR10 .
insee:DEP_75 owl:sameAs nuts:FR101 .
insee:DEP_77 owl:sameAs nuts:FR102 .
insee:DEP_78 owl:sameAs nuts:FR103 .


Example: interesting sets

[Figure: datasets of interest and the links between them: nuts, ons, ordnance s., ign, insee, geonames, dbpedia and freebase.]


A simple algorithm

- Find matching concepts [concept matching];
- For each of them, determine matching properties based on the similarity between their values in both datasets [property matching];
- From them, find property combinations identifying corresponding entities [key extraction];
- Link corresponding entities [link generation].

For instance, nom/Region_INSEE ⊆ name/NUTSRegion_NUTS and, moreover, they are unambiguous.
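The property matching and key extraction steps can be illustrated on the nom/name example. The sketch below checks that the values of one property are included in those of the other, that neither property is ambiguous within its class, and then links instances sharing a value. It assumes that the two classes have already been matched and that labels have been normalised to the same form; all function names are hypothetical, and identifiers other than REG_11 and FR10 are illustrative.

```python
def all_values(values_by_instance):
    """Set of all values taken by a property over the instances of a class."""
    return {v for values in values_by_instance.values() for v in values}


def value_inclusion(source, target):
    """Fraction of the source property values also found for the target property."""
    src = all_values(source)
    return len(src & all_values(target)) / len(src) if src else 0.0


def is_unambiguous(values_by_instance):
    """True if no value is shared by two different instances."""
    flat = [v for values in values_by_instance.values() for v in set(values)]
    return len(flat) == len(set(flat))


def link_on_shared_value(source, target):
    """Link the instances whose property values intersect."""
    return sorted(
        (a, b)
        for a, va in source.items()
        for b, vb in target.items()
        if set(va) & set(vb)
    )


# insee:nom restricted to Region, nuts:name restricted to NUTSRegion (toy data).
insee_nom = {
    "insee:REG_11": {"ile de france"},
    "insee:REG_21": {"champagne-ardenne"},
    "insee:REG_22": {"picardie"},
}
nuts_name = {
    "nuts:FR10": {"ile de france"},
    "nuts:FR21": {"champagne-ardenne"},
    "nuts:FR22": {"picardie"},
    "nuts:FR30": {"nord - pas-de-calais"},
}

if value_inclusion(insee_nom, nuts_name) == 1.0 and is_unambiguous(insee_nom) and is_unambiguous(nuts_name):
    links = link_on_shared_value(insee_nom, nuts_name)
else:
    links = []
print(links)
```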


INSEE and NUTS: ontology alignment

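[Figure: the INSEE ontology (classes Territoire_FR, Pays, Region, Departement, Arrondissement, Canton and Commune; properties code, nom, chef-lieu and subdivision; datatypes integer and string) aligned with the NUTS ontology (classes Region, Country, NUTSRegion and LAURegion; properties name, level, code and hasSubRegion) through equivalence (=) correspondences.]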


Simple alignments are not sufficient

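[Figure: the INSEE classes Region, Departement and Commune (under Territoire_FR) and their property nom, related to the NUTS class NUTSRegion (under Region) and its property name through = and ≤ correspondences. At the instance level, DEP_75 and COM_75056 (through nom) and FR101 (through name) are shown around the literal "Paris".]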


Expressive alignments are necessary

[Figure: expressive correspondences between INSEE and NUTS. Region = NUTSRegion restricted to level = 2 and hasParentRegion = FR; subdivision = hasSubRegion; nom = name.]


What does this mean?

- Ontology alignments are schema-level expressions of correspondences;
- They are useful for focussing the search;
- Expressive alignments are necessary;
- They can be turned into SPARQL-based link generators.

But it is also necessary to express instance-level constraints:

- for converting data (e.g., mph vs. m/s);
- for expressing matching constraints on data (e.g., similarity).


Data interlinking and ontology matching

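[Figure: a matcher takes the ontologies o and o′ and produces an alignment A; a generator takes A together with the datasets d and d′ and produces the link set l.]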


Tools for data interlinking

             Linkage spec extraction   Generation
similarity   LIMES                     Silk, LIMES, OpenRefine
key          LinkKeyDisco              SPARQL


Silk

Silk is a robust piece of software for interlinking data sets.

It relies on an expressive specification of linking conditions:

- Declare data sources (DataSource);
- Circumscribe entities to compare (Source/TargetDataset);
- Describe how to compare them (LinkageRule):
  - Select properties to compare through paths (Input);
  - Compute distances between them (Compare+threshold);
  - Aggregate all comparisons (Aggregate);
- Select those pairs of entities to be linked (Filter);
- Generate links (Output+thresholds).


A Silk script

Consider a linking script between INSEE and NUTS:

<Silk>
  <Prefix id="nuts" namespace="http://ec.europa.eu/.../geographic.rdf#" />
  <Prefix id="insee" namespace="http://rdf.insee.fr/geo/" />

  <DataSource id="nuts2008" type="sparqlEndpoint">
    <Param name="endpointURI" value="http://localhost:9091/.../internal"/>
    <Param name="graph" value="http://localhost:9091/.../nuts2008-complete-1"/>
  </DataSource>

  <DataSource id="insee2010" type="sparqlEndpoint">
    <Param name="endpointURI" value="http://localhost:9091/.../internal"/>
    <Param name="graph" value="http://localhost:9091/.../source/regions-2010-1"/>
  </DataSource>

  <Thresholds accept="0.9" verify="0.7" />

  <Outputs>
    <Output type="sparul">
      <Param name="graphUri" value="http://localhost:9091/.../source/insee-nuts-silk"/>
      <Param name="uri" value="http://localhost:9091/.../lifted/"/>
      <Param name="parameter" value="update"/>
    </Output>
  </Outputs>

  <Interlinks>
    <Interlink id="linkingNUTS">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="nuts2008" var="s">
        <RestrictTo>?s rdf:type nuts:NUTSRegion.
                    ?s nuts:level 2.
        </RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="insee2010" var="ss">
        <RestrictTo>?ss rdf:type insee:Region</RestrictTo>
      </TargetDataset>
      <LinkageRule>
        <Aggregate type="max">
          <Compare metric="levenshteinDistance" threshold=".2">
            <Input path="?s/nuts:name"/>
            <Input path="?ss/insee:nom"/>
          </Compare>
        </Aggregate>
      </LinkageRule>
    </Interlink>
  </Interlinks>
</Silk>


Silk: prefix and sources

<Silk>
  <Prefix id="nuts" namespace="http://ec.europa.eu/.../geographic.rdf#" />
  <Prefix id="insee" namespace="http://rdf.insee.fr/geo/" />

  <DataSource id="nuts2008" type="sparqlEndpoint">
    <Param name="endpointURI" value="http://localhost:9091/.../internal"/>
    <Param name="graph" value="http://localhost:9091/.../nuts2008-complete-1"/>
  </DataSource>

  <DataSource id="id1" type="file">
    <Param name="file" value="/Skratch/TutoLinking/admin/regions-2010.rdf"/>
    <Param name="format" value="RDF/XML" />
  </DataSource>

Sources can be files or SPARQL endpoints.


Silk rules

  <Interlinks>
    <Interlink id="linkingNUTS">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="nuts2008" var="s">
        <RestrictTo>?s rdf:type nuts:NUTSRegion.
                    ?s nuts:level 2.
        </RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="insee2010" var="ss">
        <RestrictTo>?ss rdf:type insee:Region</RestrictTo>
      </TargetDataset>

      <Thresholds accept="0.9" verify="0.7" />

      <Outputs>
        <Output type="sparul">
          <Param name="graphUri" value="http://localhost:9091/.../source/insee-nuts-silk"/>
          <Param name="uri" value="http://localhost:9091/.../lifted/"/>
          <Param name="parameter" value="update"/>
        </Output>
      </Outputs>

Restrictions are given as SPARQL graph patterns.
Output can be a file (in various formats, including that of the Alignment API) or a SPARQL endpoint.
Outputs can be made dependent on thresholds.


Silk rules (cont'd)

      <LinkageRule>
        <Aggregate type="max">
          <Compare metric="levenshteinDistance" threshold=".2">
            <Input path="?s/nuts:name"/>
            <Input path="?ss/insee:nom"/>
          </Compare>
        </Aggregate>
      </LinkageRule>
    </Interlink>
  </Interlinks>
</Silk>

Linkage rules can:

- transform the data (lowercase, tokenize, to integers, etc.),
- use comparison metrics (equality, Levenshtein, Jaro-Winkler, etc.), and
- aggregate their values (average, min, max, etc.).
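To show what such a rule computes, here is a plain Python sketch of the same pipeline: a lowercase transformation, a Levenshtein comparison and a threshold. It is not Silk's implementation, and Silk's exact threshold semantics may differ; the sketch assumes the distance is normalised by the length of the longer string and that pairs at or below 0.2 are linked. The labels are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (ca != cb),  # substitution (free if equal)
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def generate_links(source_names, target_names, threshold=0.2):
    """Link every pair whose lowercased labels are within the distance threshold."""
    return [
        (s, t)
        for s, name in source_names.items()
        for t, nom in target_names.items()
        if normalized_distance(name.lower(), nom.lower()) <= threshold
    ]


nuts_names = {"nuts:FR10": "Ile de France"}
insee_names = {"insee:REG_11": "Ile-de-France", "insee:REG_22": "Picardie"}

print(generate_links(nuts_names, insee_names))
```

Here "Ile de France" and "Ile-de-France" differ by two substitutions over 13 characters, which is within the threshold, whereas "Picardie" is not.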


Silk workbench


EDOAL Alignments

<Cell>
  <entity1><e:Class rdf:about="&insee;Region"/></entity1>
  <entity2>
    <e:Class>
      <e:and rdf:parseType="Collection">
        <e:Class rdf:about="&nuts;NUTSRegion"/>
        <e:AttributeValueRestriction>
          <e:onAttribute><e:Property rdf:about="&nuts;level"/></e:onAttribute>
          <e:comparator rdf:resource="&edoal;equals"/>
          <e:value><e:Literal e:type="&xsd;integer" e:string="2" /></e:value>
        </e:AttributeValueRestriction>
        <e:AttributeValueRestriction>
          <e:onAttribute>
            <e:Relation rdf:about="&nuts;hasParentRegion" />
          </e:onAttribute>
          <e:comparator rdf:resource="&edoal;equals"/>
          <e:value><e:Instance rdf:about="&esdata;FR" /></e:value>
        </e:AttributeValueRestriction>
      </e:and>
    </e:Class>
  </entity2>
  <relation>equivalence</relation>
  <measure>1.0</measure>
  ...
</Cell>


Link keys in the Alignment API

<e:linkkey>
  <e:Linkkey>
    <e:binding>
      <e:Intersects>
        <e:property1><e:Property rdf:about="&insee;nom" /></e:property1>
        <e:property2><e:Property rdf:about="&nuts;name" /></e:property2>
      </e:Intersects>
      <e:Equals>
        <e:property1>
          <e:Property>
            <e:inverse><e:Property rdf:about="&insee;subdivision" /></e:inverse>
          </e:Property>
        </e:property1>
        <e:property2><e:Property rdf:about="&nuts;hasParentRegion" /></e:property2>
      </e:Equals>
    </e:binding>
  </e:Linkkey>
</e:linkkey>
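How such a link key selects pairs can be sketched as follows. The sketch assumes the usual reading of the two binding types: an Intersects binding holds when the two value sets share at least one value, and an Equals binding holds when the two value sets are identical and non-empty. The function names, the `parent` key (standing in for the inverse of `insee:subdivision`) and the sample identifiers are hypothetical and exist only for this example.

```python
def satisfies(left, right, intersects=(), equals=()):
    """Return True when every binding of the link key holds for one pair of entities.

    `left` and `right` map a property name to the set of its values.
    """
    for p, q in intersects:
        # Intersects: at least one shared value.
        if not (left.get(p, set()) & right.get(q, set())):
            return False
    for p, q in equals:
        # Equals: identical, non-empty value sets (assumed semantics).
        left_values, right_values = left.get(p, set()), right.get(q, set())
        if not left_values or left_values != right_values:
            return False
    return True


def generate_links(lefts, rights, intersects=(), equals=()):
    """Naive n x m scan returning the pairs selected by the link key."""
    return {
        (a, b)
        for a, a_values in lefts.items()
        for b, b_values in rights.items()
        if satisfies(a_values, b_values, intersects, equals)
    }


# Hypothetical data; identifiers and values are made up for illustration.
insee = {
    "insee:R42": {"nom": {"Alsace"}, "parent": {"FR"}},
    "insee:R53": {"nom": {"Bretagne"}, "parent": {"FR"}},
}
nuts = {
    "nuts:FR42": {"name": {"Alsace", "Elsass"}, "hasParentRegion": {"FR"}},
    "nuts:DE1": {"name": {"Baden-Württemberg"}, "hasParentRegion": {"DE"}},
}
links = generate_links(
    insee, nuts,
    intersects=[("nom", "name")],
    equals=[("parent", "hasParentRegion")],
)
```

In practice the values compared by the Equals binding are resources from two different datasets, so they would first have to be linked themselves; the sketch sidesteps this by using plain strings.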

Query generation

# Instances of insee:Region in the INSEE dataset
PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?r
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
WHERE {
  ?r rdf:type insee:Region .
}

# Instances of the corresponding class expression in the NUTS dataset
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?n
FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
WHERE {
  ?n rdf:type nuts:NUTSRegion .
  ?n nuts:level "2"^^xsd:integer .
  ?n nuts:hasParentRegion nuts:FR .
}

Data transformation

PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
  ?r rdf:type nuts:NUTSRegion .
  ?r nuts:level "2"^^xsd:integer .
  ?r nuts:hasParentRegion nuts:FR .
}
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
WHERE {
  ?r rdf:type insee:Region .
}

SameAs link generation

PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?r owl:sameAs ?n . }
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
WHERE {
  ?r rdf:type insee:Region .
  ?r insee:nom ?l .
  ?n rdf:type nuts:NUTSRegion .
  ?n nuts:name ?l .
  ?n nuts:level "2"^^xsd:integer .
  ?n nuts:hasParentRegion nuts:FR .
}
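The effect of the shared variable ?l in this query is an equality join on names. A minimal Python sketch of that join, emitting the links as N-Triples lines, could look as follows. It is an illustration only: the function name and the example IRIs are hypothetical, and it works on in-memory dictionaries rather than on a SPARQL endpoint.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_links(insee_names, nuts_names):
    """Join two {IRI: name} maps on exact name equality and return N-Triples lines."""
    # Index one side by name so that each lookup is a dictionary access.
    by_name = defaultdict(list)
    for iri, name in nuts_names.items():
        by_name[name].append(iri)
    return sorted(
        f"<{r}> <{OWL_SAME_AS}> <{n}> ."
        for r, name in insee_names.items()
        for n in by_name.get(name, [])
    )


# Hypothetical IRIs and names, made up for illustration.
insee_names = {
    "http://example.org/insee/R42": "Alsace",
    "http://example.org/insee/R53": "Bretagne",
}
nuts_names = {
    "http://example.org/nuts/FR42": "Alsace",
    "http://example.org/nuts/FR30": "Nord-Pas-de-Calais",
}
triples = same_as_links(insee_names, nuts_names)
```

Because the join is on exact string equality, names that differ only in case or accents produce no link; that is where the similarity-based comparison of the Silk rules comes in.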

Other issue: performance

n × m

n/3 × m/3 + n/3 × m/3 + n/3 × m/3

10 × 10 = 100

1000 × 1000 = 1000000

100000 × 100000 = 10000000000
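The counts above can be generalised under a simplifying assumption that is not stated on the slide: both datasets are split into k blocks of equal size, and entities are only compared within corresponding blocks. The number of comparisons then becomes:

```latex
\underbrace{\frac{n}{k}\times\frac{m}{k} + \dots + \frac{n}{k}\times\frac{m}{k}}_{k\ \text{terms}}
  \;=\; k \times \frac{n \times m}{k^{2}}
  \;=\; \frac{n \times m}{k}
```

With k = 3 and n = m = 100000, this gives roughly 3.3 × 10^9 comparisons instead of 10^10. Real blocks are rarely of equal size, so the actual gain depends on how the blocks are built.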

Other issue: performance

Blocking: index + cluster

[Figure: Dataset 1 and Dataset 2]

Other issue: performance

Blocks can be obtained from:

- clustering values in an index
- predefined blocks (based on equality)
- classes in an ontology (blocks are defined as class expressions)
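The second option, predefined blocks based on equality of a key, can be sketched as follows. This is an illustrative sketch, not the method of any particular tool: the function names and the blocking key (the lowercased first letter of a name) are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key):
    """Index records by a blocking key, given as a function of the record."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks


def candidate_pairs(left, right, key):
    """Yield only the pairs whose two records fall in the same block."""
    left_blocks, right_blocks = block_by(left, key), block_by(right, key)
    for shared in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[shared], right_blocks[shared])


def first_letter(name: str) -> str:
    """Hypothetical blocking key: lowercased first letter of the name."""
    return name[0].lower()


# Hypothetical name lists standing in for two datasets.
dataset1 = ["Alsace", "Aquitaine", "Bretagne", "Corse"]
dataset2 = ["ALSACE", "BRETAGNE", "Bourgogne", "Centre"]

pairs = list(candidate_pairs(dataset1, dataset2, first_letter))
all_pairs = len(dataset1) * len(dataset2)
```

Here `len(pairs)` is 5 while `all_pairs` is 16. The price of blocking is that two records that should be linked but end up in different blocks are never compared, so the choice of key affects recall.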

Other issue: evaluation

[Figure: interlinking takes datasets d and d′ and produces links l; evaluation compares l with reference links and yields precision, recall and F-measure.]
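The three measures can be computed directly from the set of generated links and the set of reference links. The sketch below uses the standard definitions; the function name, the convention of returning 0.0 when a denominator is zero, and the sample links are assumptions made for this example.

```python
def evaluate(found, reference):
    """Return (precision, recall, F-measure) of found links against reference links."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    # Precision: share of generated links that are correct.
    precision = correct / len(found) if found else 0.0
    # Recall: share of reference links that were generated.
    recall = correct / len(reference) if reference else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical links, written as (entity in d, entity in d') pairs.
found_links = {("a", "1"), ("b", "2"), ("c", "9")}
reference_links = {("a", "1"), ("b", "2"), ("d", "4"), ("e", "5")}

precision, recall, f_measure = evaluate(found_links, reference_links)
```

In this example two of the three generated links are correct and two of the four reference links are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.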

Other issue: learning

[Figure: datasets d and d′, training links, interlinking producing links l, evaluation.]

Conclusion

- Data interlinking is one of the most critical tasks in linked data
  - ... but not only there, e.g. in smart cities
- It faces many problems due to:
  - heterogeneity (formats, languages, conventions)
  - size
- Interlinking can be based on similarities or keys
- There is active work to infer such interlinking patterns

Further reading

- T. Heath, C. Bizer, Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool (US), 2011 http://linkeddatabook.com/
- J. Euzenat, P. Shvaiko, Ontology matching, 2nd ed., Springer, Heidelberg (DE), 2013 http://book.ontologymatching.org
- K. Stefanidis, V. Efthymiou, M. Herschel, V. Christophides, Entity Resolution in the Web of Data, Tutorial, WWW conference, Seoul (KR), 2014 http://www.csd.uoc.gr/~vefthym/er/

Silk http://silk-framework.com/

Alignment API http://alignapi.gforge.inria.fr

Al 4 SC http://al4sc.inrialpes.fr

Thanks

- To my colleagues Manuel Atencia, Jérôme David, Nicolas Guillouet and François Scharffe
- The Datalift and Lindicle projects
- The Ready4SmartCities project

http://exmo.inria.fr

Jerome.Euzenat@inria.fr