Identifying Relevant Sources for Data Linking using a Semantic Web Index

16
Identifying Relevant Sources for Data Linking using a Semantic Web Index Andriy Nikolov Mathieu d’Aquin Knowledge Media Institute The Open University, UK

description

 

Transcript of Identifying Relevant Sources for Data Linking using a Semantic Web Index

Page 1: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Identifying Relevant Sources for Data Linking using a Semantic Web Index

Andriy NikolovMathieu d’AquinKnowledge Media InstituteThe Open University, UK

Page 2: Identifying Relevant Sources for Data Linking using a Semantic Web Index

How to link a new dataset?

• What other repositories contain relevant data which I should link to?– Select the external repository

• How to select the relevant data instances to link?– Select the relevant classes within the chosen repository

TV programsmovies

pieces of music

LinkedMDB

DBPedia

Freebase

MusicBrainz

?

actors

composers bestbuy

Page 3: Identifying Relevant Sources for Data Linking using a Semantic Web Index

data.open.ac.uk

Selection criteria

• Additional information about local instances• Popularity• Degree of overlap

Publication data

DBPedia

DBLP

rae:RKBExplorer

Page 4: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Available information• Additional information about resources

– Schema ontology– Test examples

• Popularity– VoiD descriptors

• Linking repositories– Catalog of repositories (CKAN)

• Degree of overlap– VoiD descriptors (only topic relevance)– Relevant info hard to obtain on the client

side

Page 5: Identifying Relevant Sources for Data Linking using a Semantic Web Index

ApproachSearch for sources with potentially high degree of overlap

– Use a subset of entity labels from the original dataset as keywords for entity search

Page 6: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Approach

Aggregate results– Group instances occurring

in returned result sets by their source repositories

Page 7: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Approach

Rank sources – Sort by number of

individuals returned in search results

Page 8: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Approach

Select “most relevant” class – Select the class in each

source, which covers most of instances

Page 9: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Issues: imprecise results

• Main cause: ambiguous instance labels• Inclusion of irrelevant sources

– E.g., DBLP for movie score composers• Selection of inappropriate classes within

the selected source– Too generic: e.g., dbpedia:Person vs

dbpedia:MusicArtist– Irrelevant: e.g., akt:Publication-Reference

(journal volume) vs akt:Journal

Page 10: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Filtering resultsDetermine potentially irrelevant classes

– Use state-of-the-art schema matching to select relevant classes

Page 11: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Filtering resultsFilter out irrelevant search results

– Only consider search result instances belonging to “approved” classes

Page 12: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Preliminary experiments

• Datasets– ORO journals (data.open.ac.uk): 3110

instances– LinkedMDB films: 400 instances– LinkedMDB music contributors: 400

instances• External components

– Semantic index: Sig.ma– Ontology matching techniques: CIDER,

instance-based schema mappings retrieved from BTC2009 dataset

Page 13: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Preliminary experiments

Before filtering +/- After filtering +/-rae2001 (RKB) + rae2001 (RKB) +dotac (RKB) + DBPedia +DBPedia + dblp.l3s.de +oai (RKB) + Freebase +dblp.l3s.de + DBLP (RKB) +wordnet (RKB) - eprints (RKB) +bibsonomy -eprints (RKB) +Freebase +www.examiner.com -

• Performance measure:– Proportion of relevant sources among the top-10

returned results

Page 14: Identifying Relevant Sources for Data Linking using a Semantic Web Index

• Summary:– Top-ranked returned repositories are largely

relevant from the point of view of linking– Filtering using schema matching techniques

greatly improves precision (all remaining sources are relevant)

– … but at the expense of some recall

Preliminary experiments

Page 15: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Future work• Improving the quality of results

– E.g., estimating the potential loss of precision/recall for different filtering decisions

• Integrating with the data linking workflow– Automatically pre-configuring the data

linking algorithm• Repository search as a potentially useful

semantic search use case (in addition to entity and document search)

Page 16: Identifying Relevant Sources for Data Linking using a Semantic Web Index

Questions?

Thanks for your attention