KnowEscape workshop, OKCon 2013


Page 1: KnowEscape workshop, OKCon 2013

Motivation: Data on the Web

Some eye-catching opener illustrating growth and/or diversity of web data

Curation and profiling of Linked Data KnowEscape workshop, Open Knowledge Conference 2013 (OKCon2013)

Stefan Dietze (1), Besnik Fetahu (1), Mathieu d’Aquin (2); (1) L3S Research Center (Germany); (2) The Open University (UK)

http://linkedup-project.eu

http://purl.org/dietze @stefandietze

19/09/2013 1 Stefan Dietze

Page 2: KnowEscape workshop, OKCon 2013


Success models: data & applications

LinkedUp Challenge to identify innovative tools & applications

Evaluation methods and approaches

http://www.linkedup-challenge.org/

“LinkedUp” – Linking Web Data for Education

Data curation

Technology transfer & community-building

Collecting & exposing open data of educational relevance => LinkedUp Data Catalog

Profiling and linking of Web Data for education => educational data graph

Disseminating knowledge & building communities (educators, computer scientists, data engineers)

Gathering stakeholder feedback: use cases, and requirements

http://linkedup-challenge.org/#usecases

http://data.linkededucation.org

http://linkedup-project.eu/events

European project aimed at advancing take-up of open data and related technologies

http://linkedup-project.eu

Page 3: KnowEscape workshop, OKCon 2013

Problem: too many datasets, too little information


http://datahub.io/dataset/bbc

60,000,000 triples

Using/exploiting Linked Data in Education ?

Lack of reliable dataset metadata about

Resource types

Topics & disciplines

Quality, currentness & availability

Provenance

Lack of links and cross-dataset references

Lack of scalable query methods

LOD: 300+ datasets, 32++ billion distinct RDF statements

DataHub: 6000+ open datasets

Page 4: KnowEscape workshop, OKCon 2013

Goal: dataset metadata & search for data consumers

“LinkedUp/Linked Education cloud” as “expanded” subset of LOD cloud at The DataHub (http://datahub.io/groups/linked-education)

RDF (VoID) catalog of datasets = dataset of datasets (Linked Education Catalog): classification of datasets according to, e.g., represented types, disciplines/topics, data quality, accessibility

Links and coreferences => unified view on data => Linked Education Graph

Infrastructure, unified (SPARQL) endpoint & APIs for distributed/federated querying

Data curation and dataset profiling: LinkedUp approach

[Figure: Educational Datasets => automated processing => LinkedUp Catalog and LinkedUp Links]

Automated processing to generate:

Descriptive VoID/RDF dataset catalog

Data links

Page 5: KnowEscape workshop, OKCon 2013

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. [WEBSCI‘13]

Linked Data „Observatory“ for linking and profiling

[Figure: processing pipeline. Endpoint Retrieval & Graph Extraction => Schema Extraction and Mapping => Sample Graph Extraction (per dataset) => NER & NED (per resource) => Interlinking & Co-Resolution (cross-dataset) => Category Mapping, Normalisation, Filtering. Outputs: Dataset Catalog/Index; Links/Cross-references.]

Example: rdfs:label „…ECB….“ ? Candidate entities: dbpedia:England-Wales-Cricket-Board, dbpedia:European_Central_Bank. Candidate categories: dbpedia:Sports, dbpedia:Finance.

Dataset metadata (RDF/VoID):

Schema mappings (types, properties)

Entities & categories

Topic relevance scores

Availability, currentness data (tbc)

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013). [ESWC‘13]

Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review. [ISWC‘13]

Page 6: KnowEscape workshop, OKCon 2013

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

BBC Programme (po:Programme):

<po:Programme …>
<po:title>Secret Universe – The Life of the Cell</po:title>
</po:Programme …>

SlideShare Set (sioc:Item):

<sioc:Item …>
<title>Viral diseases & bacteria</title>
</sioc:Item …>

po:Programme vs. sioc:Item: how are these types related?

http://datahub.io/group/linked-education


Page 7: KnowEscape workshop, OKCon 2013

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)

Co-occurrence graph after mapping (201 frequent types mapped into 79 classes)

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

bibo:Slideshow

bibo:Film

bibo:Document

po:Programme

sioc:Item

Page 8: KnowEscape workshop, OKCon 2013

LinkedUp Data Catalog in a nutshell http://datahub.io/group/linked-education

http://data.linkededucation.org/linkedup/catalog/

VoID dataset catalog: browse, explore and query for datasets/types

Federated queries using type mappings


Page 9: KnowEscape workshop, OKCon 2013

Dataset topic profiling: data heterogeneity?

Video:

<yo:Video 8748720>
<dc:title>Pluto & the Dwarf Planets</dc:title>
</yo:Video 8748720>

Slideset:

<sioc:Item 2139393292>
<title>Planetary motion & gravity</title>
</sioc:Item 2139393292>

Programme:

<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Topics/categories addressed? Relatedness of resources/entities? (types, semantics)

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).

Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review.

Page 10: KnowEscape workshop, OKCon 2013

Data disambiguation, linking & profiling

Brian Cox?

Sun?

Pluto?


Page 11: KnowEscape workshop, OKCon 2013

Data disambiguation, linking & profiling

db:Pluto (Dwarf Planet)

db:Sun

db:Astronomical Objects

db:Astronomy

Page 12: KnowEscape workshop, OKCon 2013

db:Pluto (Dwarf Planet)

db:Astronomical Objects

db:Astronomy

Online Lecture:

<yov:Lecture8748720>
<title>Pluto & the Dwarf Planets</title>
</yov:Lecture8748720>

Computation of connectivity scores between resources/entities

Method: combination of

(i) a semantic (graph-based) connectivity score (SCS) with

(ii) a Web co-occurrence-based measure (CBM), similar to the Normalised Google Distance (NGD)

For (i): adaptation of the Katz index from social network analysis (SNA) to (linked) data graphs, considering the number and lengths of paths over transversal properties
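
The idea behind a Katz-style connectivity score can be sketched as follows. This is only a minimal illustration of the standard, truncated Katz index on a toy graph, not the adaptation described in [ESWC‘13]: the function name `katz_score`, the damping factor `beta`, the cut-off `max_len`, and the three-node graph are assumptions made for the example, and the sketch counts walks rather than filtering by property type.

```python
from typing import List


def katz_score(adj: List[List[int]], src: int, dst: int,
               beta: float = 0.5, max_len: int = 3) -> float:
    """Truncated Katz index between two nodes.

    Sums beta**l * (number of walks of length l from src to dst)
    for l = 1 .. max_len, so short connections weigh more than long ones.
    """
    n = len(adj)
    # walks[j] = number of walks of the current length from src to node j
    walks = [1 if j == src else 0 for j in range(n)]
    score = 0.0
    for length in range(1, max_len + 1):
        # Extend every walk by one edge (one step of a matrix-vector product).
        walks = [sum(walks[i] * adj[i][j] for i in range(n)) for j in range(n)]
        score += (beta ** length) * walks[dst]
    return score


# Hypothetical toy graph (undirected):
# 0 = db:Pluto, 1 = db:Astronomical Objects, 2 = db:Sun
TOY_ADJ = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

if __name__ == "__main__":
    # Pluto and Sun are connected only through the shared category node.
    print(katz_score(TOY_ADJ, 0, 2, beta=0.5, max_len=2))
```

A smaller `beta` penalises long connections more strongly; which value and path-length limit suit a given data graph is left open here.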

Data linking

Dataset categorisation: computation of normalised (DBpedia) category relevance scores for datasets

db:Sun

SCS = 0.32

CBM = 0.24

http://purl.org/vol/doc/

http://purl.org/vol/ns/


Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).

Data disambiguation, linking & profiling

Page 13: KnowEscape workshop, OKCon 2013

Dataset profiling

Programme (po:Programme519215) => db:Sun, db:Astronomical Objects, db:Astronomy

Goal: extracting representative metadata („topic profile“) for each dataset

Approach: computation of normalised (DBpedia) category relevance scores

Using representative sample resource sets per resource type & dataset

Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review.

[Figure: DBpedia category graph]

Page 14: KnowEscape workshop, OKCon 2013

Dataset profiling: topic extraction process (1/2)

[Figure: processing pipeline. Endpoint Retrieval & Graph Extraction => Schema Extraction and Mapping => Sample Graph Extraction (per dataset/type) => NER & NED (per resource) => Interlinking & Co-Resolution (cross-dataset) => Category Mapping, Normalisation, Filtering. Outputs: Dataset Catalog/Index; Links/Cross-references.]

Example: rdfs:label „…ECB….“ ? Candidate entities: dbpedia:England-Wales-Cricket-Board, dbpedia:European_Central_Bank. Candidate categories: dbpedia:Sports, dbpedia:Finance.

Dataset metadata (RDF/VoID): schema mappings (types, properties); entities & categories; topic relevance scores; availability, currentness data (tbc)

Step 1 – NER:

Online NER & NED vs. incremental similarity-based „NER“:

Online NER: DBpedia Spotlight

Incremental & similarity-based NER: compare (via the Jaccard index) the textual descriptions of already extracted entities with the literal values of a resource instance (assumption: entities are likely to recur within a dataset)
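
The incremental, similarity-based variant can be sketched as below. This is a rough illustration under stated assumptions, not the implementation from the cited work: the word-level tokenisation, the `threshold` of 0.5, the helper names, and the sample descriptions are all hypothetical. A literal that matches no known entity would be passed on to the online NER service.

```python
import re
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a literal or description (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_known_entity(literal: str, known: Dict[str, str],
                       threshold: float = 0.5) -> Optional[str]:
    """Return the URI of the most similar already-extracted entity, or None.

    `known` maps entity URIs to their textual descriptions.
    """
    literal_tokens = tokens(literal)
    best_uri, best_sim = None, 0.0
    for uri, description in known.items():
        sim = jaccard(literal_tokens, tokens(description))
        if sim > best_sim:
            best_uri, best_sim = uri, sim
    return best_uri if best_sim >= threshold else None


# Hypothetical cache of entities extracted earlier from the same dataset.
KNOWN = {
    "http://dbpedia.org/resource/European_Central_Bank": "European Central Bank",
    "http://dbpedia.org/resource/Sun": "Sun",
}

if __name__ == "__main__":
    print(match_known_entity("the European Central Bank", KNOWN))
    print(match_known_entity("Planetary motion and gravity", KNOWN))
```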

Page 15: KnowEscape workshop, OKCon 2013

Dataset profiling: topic extraction process (1/2)

Step 2 – Computation of profile (ranked categories)

Entities => DBpedia categories = “Topics”: extraction of topics from DBpedia entities via dcterms:subject

Expand the set of topics by leveraging hierarchical category organization (skos:broader)

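Normalised topic score:

$$\mathrm{score}(t, D) = \frac{\#\,\text{entities for } t \text{ in dataset } D}{\#\,\text{entities for } t \text{ for all datasets}} + \frac{\#\,\text{entities for dataset } D}{\#\,\text{entities for all datasets}}, \quad t \in \text{topics},\ D \in \text{datasets}$$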
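
A small worked example of such a score is sketched below. It assumes that the two ratios named on the slide (entities for topic t in dataset D over entities for t in all datasets, and entities in D over entities in all datasets) are simply added; the function name, the input layout, and the toy counts are hypothetical.

```python
from typing import Dict, Tuple


def normalised_topic_scores(
    topic_counts: Dict[str, Dict[str, int]],
    entity_counts: Dict[str, int],
) -> Dict[Tuple[str, str], float]:
    """Score every (dataset, topic) pair.

    topic_counts[D][t] = number of entities associated with topic t in dataset D
    entity_counts[D]   = number of entities extracted from dataset D
    ASSUMPTION: the two ratios are combined by addition.
    """
    total_entities = sum(entity_counts.values())

    # Number of entities for each topic across all datasets.
    topic_totals: Dict[str, int] = {}
    for counts in topic_counts.values():
        for topic, n in counts.items():
            topic_totals[topic] = topic_totals.get(topic, 0) + n

    scores: Dict[Tuple[str, str], float] = {}
    for dataset, counts in topic_counts.items():
        dataset_share = entity_counts[dataset] / total_entities
        for topic, n in counts.items():
            scores[(dataset, topic)] = n / topic_totals[topic] + dataset_share
    return scores


# Hypothetical toy counts for two datasets.
TOPIC_COUNTS = {
    "bbc": {"Category:Astronomy": 3, "Category:Finance": 1},
    "slides": {"Category:Astronomy": 1},
}
ENTITY_COUNTS = {"bbc": 6, "slides": 2}

if __name__ == "__main__":
    for key, value in sorted(normalised_topic_scores(TOPIC_COUNTS, ENTITY_COUNTS).items()):
        print(key, value)
```

Sorting the scores of one dataset in descending order would give its ranked topic profile.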

Page 16: KnowEscape workshop, OKCon 2013

http://data.linkededucation.org/linkedup/categories-explorer

http://data.linkededucation.org/

Dataset profile explorer http://data.linkededucation.org/request/pipeline/sparql

Page 17: KnowEscape workshop, OKCon 2013

LinkedUp Data Catalog – hands-on in a nutshell

http://data.linkededucation.org

http://data.linkededucation.org/linkedup/catalog/sparql

http://data.linkededucation.org/request/pipeline/sparql

Querying FOR datasets

• Retrieving datasets for categories:

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX vol: <http://purl.org/vol/ns/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?datasetname ?link ?score WHERE {
  ?linkset a void:Linkset .
  ?linkset vol:hasLink ?link .
  ?link vol:linksResource <http://dbpedia.org/resource/Category:Technology> .
  ?link vol:hasScore ?score .
  ?dataset a void:Dataset .
  ?linkset void:target ?dataset .
  ?dataset dcterms:title ?datasetname .
  FILTER (?score > 0.5)
}

• Retrieve datasets describing schools:

SELECT DISTINCT ?endpoint ?cl WHERE {
  ?ds void:sparqlEndpoint ?endpoint .
  {
    { ?ds void:classPartition [ void:class ?cl ] }
    UNION
    { ?ds void:subset [ void:classPartition [ void:class ?cl ] ] }
  }
  {
    { ?cl owl:equivalentClass aiiso:School }
    UNION
    { ?cl rdfs:subClassOf aiiso:School }
    UNION
    { FILTER ( str(?cl) = str(aiiso:School) ) }
  }
}

Querying THE datasets

• Federated queries using mappings between aiiso:School and other „school“ types:

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?endpoint ?school ?cl WHERE {
  … as above ….
  SERVICE SILENT ?endpoint { ?school a ?cl }
}
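
Queries like these can be sent to an endpoint over HTTP using the SPARQL Protocol, where a GET request carries the query text in the `query` parameter. The sketch below only builds the request URL and does not contact any server. Pairing this particular endpoint with this query is an assumption, the query is a simplified illustrative one rather than one from the slides, and the endpoint may no longer be available.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint URL as listed on the slide; whether it is still online is not checked here.
CATALOG_ENDPOINT = "http://data.linkededucation.org/linkedup/catalog/sparql"

# Simplified, illustrative query (an assumption, not taken from the slides):
# list datasets together with their SPARQL endpoints.
EXAMPLE_QUERY = """PREFIX void: <http://rdfs.org/ns/void#>
SELECT DISTINCT ?ds ?endpoint WHERE {
  ?ds void:sparqlEndpoint ?endpoint .
} LIMIT 10"""


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL: the query goes into the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    url = build_sparql_get_url(CATALOG_ENDPOINT, EXAMPLE_QUERY)
    print(url)
    # Decoding the URL again recovers the original query text.
    print(parse_qs(urlsplit(url).query)["query"][0] == EXAMPLE_QUERY)
```

The resulting URL could then be fetched with any HTTP client, typically with an `Accept` header asking for a SPARQL results format.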


type mappings!

topic profiles/scores!

query federation!

Page 18: KnowEscape workshop, OKCon 2013

Outlook in a nutshell

Merging the two VoID datasets

Datasets and type mappings (LinkedUp Catalog)

Category annotations (data.linkededucation.org)

Extracting statistical observations (RDF Data Cube)

Feeding data back into the DataHub

Application to entire LOD cloud group on DataHub

Consideration of additional profiling features

Quality aspects

Dataset and link dynamics

Temporal and spatial coverage (=> http://www.duraark.eu)



Page 19: KnowEscape workshop, OKCon 2013

LinkedUp Vidi Competition


Tools and demos that analyse or integrate open web data for educational purposes

• Wanted: applications and tools that address real educational needs

• Anyone can participate - researchers, students, developers, industry

• Challenging focused tracks with clear goals

• More data, more challenging, more support, more prizes

More info: http://linkedup-challenge.org/

Launch on 4 November 2013

Submission deadline is 14 February 2014

20,000 Euro prize money