Download - A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles

Transcript

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

A Scalable Approach for Efficiently GeneratingStructured Dataset Topic Profiles

Besnik Fetahu1, Stefan Dietze1, Bernardo Pereira Nunes2, MarcoAntonio Casanova2, Davide Taibi3, Wolfgang Nejdl1

1L3S Research Center, Leibniz Universitat Hannover2Department of Informatics - PUC-Rio

3Institute for Educational Technologies, CNR

May 29, 2014

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Introduction

• Increasing amount of Web Data

• Data heterogeneity: representation, language, quality anddomains

• Sparsely connected datasets

• Lack of descriptive metadata about datasets

• Exhaustive techniques for data analysis

• Efficiency heavily dependent on information need

• Ease of access and representation of datasets

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Why dataset profiling?

• Growing number of datasets: 227 datasets

• Data represented as triples: 31 billion triples

• Multi-lingual content: 18 languages

• Broad set of topics covered

• Inter-dataset links

Domain # Data. TriplesMedia 25 1,841,852,061Geographic 31 6,145,532,484Government 49 13,315,009,400Publications 87 2,950,720,693Cross-domain 41 4,184,635,715Life sciences 41 3,036,336,004User-generated 20 134,127,413

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Why dataset profiling?

Find datasets covering the domain of “Renewable Energy”?

• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain

topic coverage information.

• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs

to check all triples that containa specific keyword.

• Disambiguity: What are all the possible forms ofrenewable energy?

• solar energy, wind energy, geothermal. . .

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Why dataset profiling?

Find datasets covering the domain of “Renewable Energy”?

• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain

topic coverage information.

• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs

to check all triples that containa specific keyword.

• Disambiguity: What are all the possible forms ofrenewable energy?

• solar energy, wind energy, geothermal. . .

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Why dataset profiling?

Find datasets covering the domain of “Renewable Energy”?

• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain

topic coverage information.

• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs

to check all triples that containa specific keyword.

• Disambiguity: What are all the possible forms ofrenewable energy?

• solar energy, wind energy, geothermal. . .

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Profiling Overview

1 Metadata extraction

2 Resource sampling

3 Entity/topic extraction

4 Profile graphs

5 Profiles representation

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Profiling Overview

1 Metadata extraction

2 Resource sampling

3 Entity/topic extraction

4 Profile graphs

5 Profiles representation

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Profiling Overview

1 Metadata extraction

2 Resource sampling

3 Entity/topic extraction

4 Profile graphs

5 Profiles representation

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Profiling Overview

1 Metadata extraction

2 Resource sampling

3 Entity/topic extraction

4 Profile graphs

5 Profiles representation

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Profiling Overview

1 Metadata extraction

2 Resource sampling

3 Entity/topic extraction

4 Profile graphs

5 Profiles representation

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Dataset Profiling Example

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Dataset Profiling Example

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Dataset Profiling Example

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Dataset Profiling Example

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Resource Instance and Type Extraction

• Simple SPARQL SELECT queries

• Avg. indexing time 10% (7min) vs. 100% (4hrs).

• Approximately ∼300 million resource instances

10

100

1000

10000

100000

uriburner

bluk-bnb

bio2rdf-kegg-pathway

nomenclator-asturias

b3kat

lobid-resources

twc-ieeevis

educationalprogramss isvu

farmbio-chem

bl

world-bank-linked-data

event-media

eeaeunishungarian-national-library-catalog

bio2rdf-pubmed

linked-user-feedback

oecd-linked-data

bio2rdf-goa

pscs-catalogue

bio2rdf-genbank

linkedmdb

bfs-linked-data

bio2rdf-reactome

british-museum-collection

bio2rdf-ncbigene

datos-bcn-cl

l3s-dblp

bio2rdf-sgd

hellenic-fire-brigade

Log-s

cale

indexin

g t

ime 100%

10%

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Resource Sampling ApproachesEntity and Topic Extraction

Resource Sampling

• random: randomly select a resource instance for analysis

• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}

• centrality: weigh a resource by the number of typesused to describe it ck = |C ′

k |/|C |

Topic Extraction

• Resources as documents by combining all textual literals

• Perform NED1 and extract corresponding DBpedia entities

• Extract topics as DBpedia categories from entities viadcterms:subject

1DBpedia Spotlight

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Resource Sampling ApproachesEntity and Topic Extraction

Resource Sampling

• random: randomly select a resource instance for analysis

• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}

• centrality: weigh a resource by the number of typesused to describe it ck = |C ′

k |/|C |

Topic Extraction

• Resources as documents by combining all textual literals

• Perform NED1 and extract corresponding DBpedia entities

• Extract topics as DBpedia categories from entities viadcterms:subject

1DBpedia Spotlight

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Resource Sampling ApproachesEntity and Topic Extraction

Resource Sampling

• random: randomly select a resource instance for analysis

• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}

• centrality: weigh a resource by the number of typesused to describe it ck = |C ′

k |/|C |

Topic Extraction

• Resources as documents by combining all textual literals

• Perform NED1 and extract corresponding DBpedia entities

• Extract topics as DBpedia categories from entities viadcterms:subject

1DBpedia Spotlight

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Constructing profiles: Dataset-topic graph

1 Profile graph nodes: datasets,

resources, topics

2 Weighted graph edges: ∆〈D, t〉3 Edge weights: ∆〈Di , t〉 6= ∆〈Dj , t〉4 Compute ∆〈Di , t〉 by assessing the

importance of t given the resourcesof Di as prior knowledge

5 The given prior knowledge biasesthe importance of t in the profilegraph towards Di

2

6 Incrementally add datasets in theprofile graph, by simply computingthe weights ∆〈Dk , t〉

2Scott White and Padhraic Smyth. 2003. Algorithms for estimating relativeimportance in networks. In 9th ACM SIGKDD (KDD ’03).

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Topic Ranking ApproachesTopic filtering

Topic pre-filtering:

NTR(t,D) =Φ(·,D)

Φ(t,D)+

Φ(·, ·)Φ(t, ·)

• Filter noisy topics

• φ(·, ·) - number of entitiesassociated with topic t

• Closely related to the tf-idfweighting scheme

Topic Ranking

• PageRank with Priors (PRankP)

• HITS with Priors (HITSP)

• K-Step Markov (KStepM)

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Experimental Setup

Datasets and Ground-truth

• 129 dataset from lod-cloud3

• 6 ground-truth datasets with manually assigned topicindicators for their resources

Dataset Properties #Resources

yovistoskos:subject, dbp:{subject, class,

discipline, kategorie, tagline} 62879

oxpoints dcterms:subject,dc:subject 37258

socialsemweb-thesaurusskos:subject, tag:associatedTag,dcterms:subject

2243

semantic-web-dog-food dcterms:subject, dc:subject 20145lak-dataset dcterms:subject, dc:subject 1691

Evaluation Metrics

• NDCG@k (k=1, . . . , 1000)

• Compare the induced ranking by the graphical modelsagainst the ideal ranking

3At the time of experimentation only 129 dataset endpoints wereresponsive.

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Baselines

• tf-idf: Consider resources as documents. Extract foreach dataset the top {50, 100, 150, 200} terms.

• LDA: Consider dataset as documents4. Extract topweighted topic terms. For every dataset extract top {50,100, 150, 200} with a number of topics {10, 20, 30, 40,50}.

4In this case it does not matter if datasets are considered at theresource level or aggregated.

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

1 Introduction

2 Problem and Motivation

3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches

4 Experimental SetupBaselines

5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling

6 Conclusions

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Efficiency of Dataset Profiling

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

1 100 200 300 400 500 600 700 800 900 1000

ND

CG r

anki

ng s

core

NDCG rank

Profiling accuracy for all topic ranking approaches

K-Step Markov + NTRPageRank with priors + NTR

HITS with priors + NTR

LDAtf-idf

0.16

0.18

0.2

0.22

0.24

0.26

0.28

0.3

0.32

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

Sample Size

K-Step Markov profiling accuracy (Centrality Sampling)

KStepM + NTRKStepM

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Scalability of Dataset Profiling

100

1000

10000

0 20 40 60 80 100 0

0.05

0.1

0.15

0.2

0.25

0.3

Log-

scal

e tim

e pe

rfor

man

ce

ND

CG r

anki

ng s

core

Sample Size

Time Performance vs. Profiling Accuracy

HITS with priors timeHITS with priors ranking

K-Step Markov time

K-Step Markov rankingPageRank with priors time

PageRank with priors ranking

• 5% and 10% already providestable profiling accuracy

• Avg. 7mins for indexing 10%of resources per dataset vs.4hrs per dataset

• 2mins for ranking datasetprofiles with 10% of resourcesvs. 45mins for 100%

• NED runtime 10% vs. 100%?

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Motivation Example Revisited!

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Conclusions and Future Work

• Structured dataset profiles

• Scalable approach through sampling

• Efficient profiling through topic filtering and ranking

• Incremental generation of dataset profiles

• Dataset profiles as a set of links (entity and topic links)

• Provenance information of links (e.g. resources fromwhich an entity is extracted)

• Profiles for dataset recommendation, search, etc.

Resources

• Profiles Endpoint:http://data-observatory.org/lod-profiles/sparql

• Profiles Webpage:http://data-observatory.org/lod-profiles/

Introduction

Problem andMotivation

Approach

ResourceInstance andType Extraction

ResourceSamplingApproaches

Constructingprofiles:Dataset-topicgraph

Topic RankingApproaches

ExperimentalSetup

Baselines

EvaluationResults

Efficiency ofDataset Profiling

Scalability ofDataset Profiling

Conclusions

Thank you! Questions?#eswc2014Fetahu