KnowEscape workshop, OKCon 2013
Curation and profiling of Linked Data
KnowEscape workshop, Open Knowledge Conference 2013 (OKCon2013)
Stefan Dietze (1), Besnik Fetahu (1), Mathieu d’Aquin (2)
(1) L3S Research Center (Germany); (2) The Open University (UK)
http://linkedup-project.eu
http://purl.org/dietze @stefandietze
19/09/2013 1 Stefan Dietze
“LinkedUp” – Linking Web Data for Education
European project aimed at advancing take-up of open data and related technologies
http://linkedup-project.eu
Data curation
• Collecting & exposing open data of educational relevance => LinkedUp Data Catalog
• Profiling and linking of Web Data for education => educational data graph
• http://data.linkededucation.org
Success models: data & applications
• LinkedUp Challenge to identify innovative tools & applications
• Evaluation methods and approaches
• http://www.linkedup-challenge.org/
Technology transfer & community-building
• Disseminating knowledge & building communities (educators, computer scientists, data engineers)
• Gathering stakeholder feedback: use cases and requirements
• http://linkedup-challenge.org/#usecases
• http://linkedup-project.eu/events
Problem: too many datasets, too little information
Example: http://datahub.io/dataset/bbc (60,000,000 triples)
Using/exploiting Linked Data in Education?
• Lack of reliable dataset metadata about: resource types; topics & disciplines; quality, currentness & availability; provenance
• Lack of links and cross-dataset references
• Lack of scalable query methods
LOD: 300+ datasets, 32++ billion distinct RDF statements
DataHub: 6000+ open datasets
Goal: dataset metadata & search for data consumers
“LinkedUp/Linked Education cloud” as “expanded” subset of LOD cloud at The DataHub (http://datahub.io/groups/linked-education)
RDF (VoID) catalog of datasets = dataset of datasets (Linked Education Catalog): classification of datasets according to, e.g., represented types, disciplines/topics, data quality, accessibility
Links and coreferences => unified view on data => Linked Education Graph
Infrastructure, unified (SPARQL) endpoint & APIs for distributed/federated querying
Data curation and dataset profiling: LinkedUp approach
[Diagram: Educational Datasets → automated processing → LinkedUp Catalog (descriptive VoID/RDF dataset catalog) and LinkedUp Links (data links)]
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. [WEBSCI‘13]
Linked Data “Observatory” for linking and profiling
Pipeline stages (as laid out on the slide):
• Endpoint Retrieval & Graph Extraction
• Schema Extraction and Mapping
• Sample Graph Extraction (per dataset)
• NER & NED (per resource)
• Interlinking & Co-Resolution (cross-dataset)
• Category Mapping, Normalisation, Filtering
Outputs: Dataset Catalog/Index; Links/Cross-references
Dataset metadata (RDF/VoID): schema mappings (types, properties); entities & categories; topic relevance scores; availability, currentness data (tbc)
Example: rdfs:label “…ECB…” ? Candidate entities: dbpedia:European_Central_Bank, dbpedia:England-Wales-Cricket-Board; candidate categories: dbpedia:Finance, dbpedia:Sports
Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013). [ESWC‘13]
Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review. [ISWC‘13]
Schema assessment and mapping
Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
[WEBSCI‘13]
<po:Programme …>
  <po:title>Secret Universe – The Life of the Cell</po:title>
  …
</po:Programme>
BBC Programme
<sioc:Item …>
  <title>Viral diseases & bacteria</title>
  …
</sioc:Item>
SlideShare Set
po:Programme ? sioc:Item
http://datahub.io/group/linked-education
Co-occurrence graph after mapping
(201 frequent types mapped into 79 classes)
[WEBSCI‘13]
bibo:Slideshow
bibo:Film
bibo:Document
po:Programme
sioc:Item
LinkedUp Data Catalog in a nutshell
http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
VoID dataset catalog: browse, explore and query for datasets/types
Federated queries using type mappings
Dataset topic profiling: data heterogeneity?
<yo:Video 8748720>
  <dc:title>Pluto & the Dwarf Planets</dc:title>
  …
</yo:Video 8748720>
Video
<sioc:Item 2139393292>
  <title>Planetary motion & gravity</title>
  …
</sioc:Item 2139393292>
Slideset
<po:Programme519215>
  <po:Series>Wonders of the Solar System</po:Series>
  <po:Episode>Emp. of the Sun</po:Episode>
  <po:Actor>Brian Cox</po:Actor>
</po:Programme519215>
Programme
Topics/categories addressed? Relatedness of resources/entities? (types, semantics)
[ESWC‘13] [ISWC‘13]
Data disambiguation, linking & profiling
Brian Cox?
Sun?
Pluto?
Linked entities/categories: db:Pluto (Dwarf Planet), db:Sun, db:Astronomical Objects, db:Astronomy
<yov:Lecture8748720>
  <title>Pluto & the Dwarf Planets</title>
  …
</yov:Lecture8748720>
Online Lecture
db:Astronomy
Computation of connectivity scores between resources/entities
Method: combination of
(i) a semantic (graph-based) connectivity score (SCS) with
(ii) a Web co-occurrence-based measure (CBM) (similar to NGD, the Normalised Google Distance)
For (i): adaptation of the Katz index from social network analysis (SNA) for (linked) data graphs (considering the number of paths and their lengths across transversal properties)
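The two measures named above can be illustrated with a minimal sketch. It is not the authors' implementation: `katz_score` is a plain truncated Katz index (damped count of walks between two nodes), and `ngd` is the standard Normalised Google Distance, which the slide only says the CBM is similar to. The toy graph, the damping factor `beta`, the cut-off `max_len` and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def katz_score(graph, source, target, beta=0.5, max_len=4):
    """Truncated Katz index: sum over l of beta**l * (#walks of length l).

    `graph` maps a node to its neighbours. Walks (not simple paths) are
    counted, as in the classic Katz index; this is an assumption here.
    """
    counts = {source: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, n_walks in counts.items():
            for neighbour in graph.get(node, ()):
                nxt[neighbour] += n_walks
        counts = nxt
        score += (beta ** length) * counts.get(target, 0)
    return score


def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from hit counts fx, fy, joint count fxy, index size n."""
    if fxy == 0:
        return float("inf")  # the two terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical toy graph: two entities sharing one category node.
TOY_GRAPH = {
    "db:Pluto": ["db:Astronomical_objects"],
    "db:Sun": ["db:Astronomical_objects"],
    "db:Astronomical_objects": ["db:Pluto", "db:Sun"],
}

if __name__ == "__main__":
    print(katz_score(TOY_GRAPH, "db:Pluto", "db:Sun", beta=0.5, max_len=2))
    print(ngd(1000, 1000, 1000, 10**6))
```

Shorter connections contribute more because each extra hop multiplies the contribution by `beta`; how SCS and CBM are weighted against each other is not stated on the slide and is left out here.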
Data linking
Dataset categorisation: computation of normalised (DBpedia) category relevance scores for datasets
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
[ESWC‘13]
Dataset profiling
Goal: extracting representative metadata (“topic profile”) for each dataset
Approach: computation of normalised (DBpedia) category relevance scores
Using representative sample resource sets per resource type & dataset
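Drawing a sample of resources per resource type could look like the sketch below. The slide does not say how the samples are chosen, so uniform random sampling with a fixed seed is an assumption, as are the function name `sample_per_type` and the example URIs.

```python
import random
from collections import defaultdict


def sample_per_type(resources, k, seed=0):
    """Return up to k randomly chosen resource URIs for each resource type.

    `resources` is an iterable of (uri, rdf_type) pairs from one dataset.
    Uniform sampling is an assumption; the slide only asks for
    "representative" samples.
    """
    by_type = defaultdict(list)
    for uri, rdf_type in resources:
        by_type[rdf_type].append(uri)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        rdf_type: rng.sample(uris, min(k, len(uris)))
        for rdf_type, uris in sorted(by_type.items())
    }


# Hypothetical dataset: five programmes and two slide sets.
EXAMPLE_RESOURCES = [(f"ex:programme/{i}", "po:Programme") for i in range(5)] + [
    (f"ex:slides/{i}", "sioc:Item") for i in range(2)
]

if __name__ == "__main__":
    print(sample_per_type(EXAMPLE_RESOURCES, k=3))
```

Sampling per type rather than over the whole dataset keeps rare resource types from being crowded out by frequent ones.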
[ISWC‘13]
DBpedia category graph
Dataset profiling: topic extraction process (1/2)
Step 1 – NER:
Online NER & NED vs. incremental similarity-based “NER”:
• Online NER: DBpedia Spotlight
• Incremental & similarity-based NER: compare (via Jaccard index) the textual descriptions of already extracted entities with the literal values of a resource instance (assumption: entities are likely to recur within a dataset)
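The similarity-based variant in Step 1 can be sketched as a Jaccard comparison between a resource's literal value and the descriptions of entities found so far. This is only an illustration: the tokenisation, the threshold of 0.3, the function names and the two example descriptions are assumptions, not details from the slides.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a literal value or entity description."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_known_entities(literal, known_entities, threshold=0.3):
    """Return URIs of already extracted entities whose description resembles `literal`.

    `known_entities` maps an entity URI to its textual description.
    The threshold is a hypothetical value chosen for this example.
    """
    literal_tokens = tokens(literal)
    return [
        uri
        for uri, description in known_entities.items()
        if jaccard(literal_tokens, tokens(description)) >= threshold
    ]


# Hypothetical descriptions of entities extracted earlier from the same dataset.
KNOWN = {
    "dbpedia:Pluto": "Pluto dwarf planet",
    "dbpedia:Sun": "Sun star solar system",
}

if __name__ == "__main__":
    print(match_known_entities("Pluto & the Dwarf Planets", KNOWN))
```

Reusing entities that were already found avoids a call to the online NER service for every resource, which is where the recurring-entities assumption pays off.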
Step 2 – Computation of profile (ranked categories)
Entities => DBpedia categories = “Topics”: extraction of topics from DBpedia entities via dcterms:subject
Expand the set of topics by leveraging hierarchical category organization (skos:broader)
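Normalised topic score (defined for each topic t and dataset D, over all topics and datasets); the formula itself is not preserved in this transcript. Terms labelled on the slide:
• # of entities for dataset D
• # of entities for all datasets
• # of entities for t in dataset D
• # of entities for t for all datasets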
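The topic extraction and expansion of Step 2 can be sketched as follows: each entity contributes its `dcterms:subject` categories, and the set is widened by following `skos:broader` links. The dictionaries standing in for DBpedia, the category names, the depth limit `max_depth` and the function name are assumptions for illustration. The result gives the number of entities per topic for one dataset, which is one of the counts the normalised score is built from.

```python
from collections import defaultdict


def topics_for_entities(entities, subjects, broader, max_depth=1):
    """Map each topic (category) to the set of entities associated with it.

    subjects: entity URI -> categories (stand-in for dcterms:subject)
    broader:  category   -> broader categories (stand-in for skos:broader)
    max_depth: how many skos:broader hops to follow (an assumed cut-off)
    """
    topic_entities = defaultdict(set)
    for entity in entities:
        frontier = set(subjects.get(entity, ()))
        seen = set()
        for _ in range(max_depth + 1):
            for topic in frontier:
                topic_entities[topic].add(entity)
            seen |= frontier
            # Move one level up the category hierarchy, skipping visited topics.
            frontier = {b for t in frontier for b in broader.get(t, ())} - seen
            if not frontier:
                break
    return dict(topic_entities)


# Hypothetical excerpt of the category graph.
SUBJECTS = {
    "dbpedia:Pluto": ["Category:Dwarf_planets"],
    "dbpedia:Sun": ["Category:Stars"],
}
BROADER = {
    "Category:Dwarf_planets": ["Category:Astronomical_objects"],
    "Category:Stars": ["Category:Astronomical_objects"],
    "Category:Astronomical_objects": ["Category:Astronomy"],
}

if __name__ == "__main__":
    profile = topics_for_entities(["dbpedia:Pluto", "dbpedia:Sun"], SUBJECTS, BROADER)
    for topic, ents in sorted(profile.items()):
        print(topic, len(ents))
```

Limiting the number of hops matters because very broad categories sit near the top of the hierarchy and would otherwise be attached to almost every entity.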
Dataset profile explorer: http://data.linkededucation.org/linkedup/categories-explorer
LinkedUp Data Catalog – hands-on in a nutshell
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/sparql
http://data.linkededucation.org/request/pipeline/sparql
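Querying FOR datasets
• Retrieving datasets for categories
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
# vol: namespace URI as listed on the "Data linking" slide
PREFIX vol: <http://purl.org/vol/ns/>
SELECT ?datasetname ?link ?score WHERE {
  ?linkset a void:Linkset .
  ?linkset vol:hasLink ?link .
  ?link vol:linksResource <http://dbpedia.org/resource/Category:Technology> .
  ?link vol:hasScore ?score .
  ?dataset a void:Dataset .
  ?linkset void:target ?dataset .
  ?dataset dcterms:title ?datasetname .
  FILTER (?score > 0.5)
}
• Retrieve datasets describing schools
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?endpoint ?cl WHERE {
  ?ds void:sparqlEndpoint ?endpoint .
  { { ?ds void:classPartition [ void:class ?cl ] }
    UNION
    { ?ds void:subset [ void:classPartition [ void:class ?cl ] ] } }
  { { ?cl owl:equivalentClass aiiso:School }
    UNION
    { ?cl rdfs:subClassOf aiiso:School }
    UNION
    # matches aiiso:School itself; a FILTER alone in this group would not see ?cl
    { VALUES ?cl { aiiso:School } } }
}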
Querying THE datasets
• Federated queries using mappings between aiiso:School and other “school” types
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?endpoint ?school ?cl WHERE {
  … as above …
  SERVICE SILENT ?endpoint { ?school a ?cl }
}
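To make the relation between the catalog query and the federated query explicit, the sketch below assembles both from one shared graph pattern. It only builds query strings and sends nothing to an endpoint. The function and constant names are hypothetical, and the branch that matches the target class itself is written with VALUES, which is an assumption about the intended behaviour.

```python
from string import Template

PREFIXES = """\
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""

# Shared pattern: datasets (with endpoints) whose classes map to the target class.
DATASET_PATTERN = Template("""\
  ?ds void:sparqlEndpoint ?endpoint .
  { { ?ds void:classPartition [ void:class ?cl ] }
    UNION
    { ?ds void:subset [ void:classPartition [ void:class ?cl ] ] } }
  { { ?cl owl:equivalentClass $cls }
    UNION
    { ?cl rdfs:subClassOf $cls }
    UNION
    { VALUES ?cl { $cls } } }
""")


def build_catalog_query(target_class="aiiso:School"):
    """Query FOR datasets: which endpoints hold instances of a mapped class."""
    pattern = DATASET_PATTERN.substitute(cls=target_class)
    return PREFIXES + "SELECT DISTINCT ?endpoint ?cl WHERE {\n" + pattern + "}\n"


def build_federated_query(target_class="aiiso:School", var="school"):
    """Query THE datasets: same pattern plus a SERVICE call to each endpoint."""
    pattern = DATASET_PATTERN.substitute(cls=target_class)
    return (
        PREFIXES
        + f"SELECT DISTINCT ?endpoint ?{var} ?cl WHERE {{\n"
        + pattern
        + f"  SERVICE SILENT ?endpoint {{ ?{var} a ?cl }}\n"
        + "}\n"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

SILENT keeps one unreachable endpoint from failing the whole federated query, which matters when the endpoint list comes from a catalog of independently hosted datasets.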
type mappings!
topic profiles/scores!
query federation!
Outlook in a nutshell
Merging the two VoID datasets
Datasets and type mappings (LinkedUp Catalog)
Category annotations (data.linkededucation.org)
Extracting statistical observations (RDF Data Cube)
Feeding data back into the DataHub
Application to entire LOD cloud group on DataHub
Consideration of additional profiling features
Quality aspects
Dataset and link dynamics
Temporal and spatial coverage (=> http://www.duraark.eu)
LinkedUp Vidi Competition
Tools and demos that analyse or integrate open web data for educational purposes
• Wanted: applications and tools that address real educational needs
• Anyone can participate - researchers, students, developers, industry
• Challenging focused tracks with clear goals
• More data, more challenging, more support, more prizes
More info: http://linkedup-challenge.org/
Launch on 4 November 2013
Submission deadline is 14 February 2014
20,000 Euro prize money
Thank you!
Contact http://purl.org/dietze | @stefandietze
See also (data)
http://datahub.io/group/linked-education
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/
http://lak.linkededucation.org
See also (general)
http://linkedup-project.eu
http://linkedup-challenge.org
http://linkededucation.org
http://linkeduniversities.org