Knowledge graph

Let’s talk Linked Data session Open Belgium 2017 Brecht Van de Vyvere | @brechtvdv Building a knowledge graph of the Belgium War Press


Page 1: Knowledge graph

Let’s talk Linked Data session, Open Belgium 2017
Brecht Van de Vyvere | @brechtvdv

Building a knowledge graph of the Belgium War Press

Page 2: Knowledge graph

Can I easily link historic papers with other data sources?

Page 3: Knowledge graph

Agenda

• hetarchief.be
• Knowledge graph
• 5-star Open Data plan
• Adding context
• Linked Data as a Service
• Future Work

Page 4: Knowledge graph

Dataset

Page 5: Knowledge graph

hetarchief.be: “News from the Great War”
• Newspapers 1914-1918
• 10+ content partners
• Early 2015: site launched
• Functionality
  • Search by keyword
  • Map with place of publication
  • Collections

1k titles

55k newspapers

300k pages

Page 6: Knowledge graph

Human-readable interface

Page 7: Knowledge graph

Policy
1. Metadata
• No restrictions → CC0
2. OCR, documents
• Pictures, short stories…
• Uncertain copyright status
• No license or “terms of use” that minimises restrictions for re-use
• Disclaimer

Page 8: Knowledge graph

hetarchief.be
• One of the biggest databases online
• No raw data?
• Title
• Description → OCR from ALTO
• Date created
• Owner
• IDs (carrier, Abraham, VIAA)
• URL image

Page 9: Knowledge graph


5-star Open Data Plan

Page 10: Knowledge graph

First 3 Stars
• Open License
• Structured
• Non-proprietary

[Pipeline diagram. Components: VIAA DB, VIAA API, NodeJS (→ github.com/viaacode/hetarchief2lod). Labels: IDs, Metadata, Transform, CSV]

Page 11: Knowledge graph

Step 4: URIs for everything
• Map VIAA's internal ID to URI:
  • http://data.viaa.be/noid/{id}
• Use ontologies
  • BBC → Creative Work Ontology
  • schema.org
  • Hydra → collections
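The ID-to-URI mapping on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `mint_uri` and the percent-encoding step are assumptions, and only the URI pattern itself comes from the slide.

```python
from urllib.parse import quote

# URI pattern taken from the slide: http://data.viaa.be/noid/{id}
BASE = "http://data.viaa.be/noid/"


def mint_uri(internal_id: str) -> str:
    """Map an internal identifier to a URI (hypothetical helper)."""
    internal_id = internal_id.strip()
    if not internal_id:
        raise ValueError("empty identifier")
    # Percent-encode characters that are not safe in a URI path segment.
    return BASE + quote(internal_id, safe="")


print(mint_uri("tm71v5c76q_191510XX"))
```

Keeping the mapping in one function means every triple about a newspaper uses the same subject URI.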

Page 12: Knowledge graph

Knowledge graph
• Semantic network
  • Concepts
  • Relations
• Linked Data
  • URIs
  • RDF

Page 13: Knowledge graph

<http://data.viaa.be/noid/example> <http://www.bbc.co.uk/ontologies/creativework#tag> <http://dbpedia.org/page/Albert_I_of_Belgium>

<http://dbpedia.org/page/Albert_I_of_Belgium> rdf:type <http://xmlns.com/foaf/0.1/Person>

Page 14: Knowledge graph

5-star: link to other sources
• ABRAHAM: catalogue of newspapers in Belgium

<http://data.viaa.be/noid/tm71v5c76q_191510XX> owl:sameAs <http://anet.be/record/abraham/opacbnc/c:bnc:26>

Page 15: Knowledge graph

Basic information triples

http://data.viaa.be/noid/tm71v5c76q_191510XX
• rdf:type → cwork:CreativeWork
• cwork:title → L’illustration
• cwork:dateCreated → “1915-10-XX”
• cwork:content → On dit que c'est notre imagination et….
• schema:copyrightHolder → UGENT
• schema:inLanguage → en
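As a rough illustration, the basic triples for one newspaper could be written out as N-Triples like this. The sketch assumes that title, date, copyright holder and language are plain string literals (the slide does not say whether any of them are URIs), and the helper names are invented for the example. The namespaces follow the URIs used elsewhere in these slides.

```python
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Serialize a plain string literal with N-Triples escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def basic_triples(uri, title, date_created, holder, language):
    """Return the basic (non-OCR) triples for one newspaper as N-Triples lines."""
    subject = iri(uri)
    pairs = [
        (RDF_TYPE, iri(CWORK + "CreativeWork")),
        (CWORK + "title", literal(title)),
        (CWORK + "dateCreated", literal(date_created)),
        (SCHEMA + "copyrightHolder", literal(holder)),  # assumed to be a literal
        (SCHEMA + "inLanguage", literal(language)),
    ]
    return [f"{subject} {iri(p)} {o} ." for p, o in pairs]


for line in basic_triples(
    "http://data.viaa.be/noid/tm71v5c76q_191510XX",
    "L’illustration", "1915-10-XX", "UGENT", "en",
):
    print(line)
```

The OCR text (cwork:content) is left out on purpose, since it is by far the largest value per newspaper.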

Page 16: Knowledge graph

[Diagram: Hydra collection]
http://data.viaa.be/noid/tm71v5c76q (totalItems: 3)
• http://data.viaa.be/noid/tm71v5c76q_191804xx_0001 (first)
• http://data.viaa.be/noid/tm71v5c76q_191804xx_0002
• http://data.viaa.be/noid/tm71v5c76q_191804xx_0003 (last)
Edge labels: memberOf, first, last, previous/next
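The link structure in this diagram can be sketched as a small function that derives the navigation links from an ordered list of pages. The relation names are the plain labels from the slide, not full ontology URIs, and the function name is made up for the example.

```python
def collection_links(collection, pages):
    """Build (subject, relation, object) tuples for an ordered page list."""
    links = [(collection, "totalItems", len(pages))]
    if pages:
        links.append((collection, "first", pages[0]))
        links.append((collection, "last", pages[-1]))
    for index, page in enumerate(pages):
        links.append((page, "memberOf", collection))
        if index > 0:
            links.append((page, "previous", pages[index - 1]))
        if index < len(pages) - 1:
            links.append((page, "next", pages[index + 1]))
    return links


base = "http://data.viaa.be/noid/tm71v5c76q"
pages = [f"{base}_191804xx_000{n}" for n in (1, 2, 3)]
for link in collection_links(base, pages):
    print(link)
```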

Page 17: Knowledge graph

Problems
• Node limited to 1.7 GB memory
• OCR too big
• Turtle file: 475 MB max (32k newspapers)
• Compressed to HDT: 388 MB
• Basic triples with HDT:
  • 54k newspapers → 8.2 MB
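One common way around a memory ceiling like this is to stream: read one CSV row, write its triples, and move on, so that the full dataset is never held in memory. The sketch below shows that general technique in Python. It is not the approach described on the slide (which compresses to HDT), and the column names `id` and `title` are assumptions.

```python
import csv
import io

TITLE = "http://www.bbc.co.uk/ontologies/creativework#title"


def stream_triples(csv_file, out):
    """Write one title triple per CSV row; memory use stays constant."""
    count = 0
    for row in csv.DictReader(csv_file):
        subject = f"<http://data.viaa.be/noid/{row['id']}>"
        title = (
            row["title"].replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        out.write(f'{subject} <{TITLE}> "{title}" .\n')
        count += 1
    return count


source = io.StringIO("id,title\nabc,De Stem\n")  # made-up sample row
target = io.StringIO()
print(stream_triples(source, target), "row(s) written")
```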

Page 18: Knowledge graph

Adding context

Page 19: Knowledge graph

Connect with other data sources

• Cfr. Europeana, delpher.nl, lab.kbresearch.nl

Page 20: Knowledge graph

Stanford NER
• 4 types: Location, Organisation, Person and Other
• Train your model: golden corpus
• Write code that fits your needs
  • SPARQL query that matches strings
  • REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique
• Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)

Page 21: Knowledge graph

DBpedia Spotlight
• Proof of concept
• Models for all languages (nl, en, fr, de)

[Diagram labels: NL/FR/EN/DE trained model, DBpedia matcher, Stanford NER]

Page 22: Knowledge graph

Results?
• Filter on OCR quality, e.g. <90% confidence in ALTO
• Wrong time period, e.g. geonames
• Standard models, should be trained
• Use DBpedia knowledge later to filter “impossible” tags
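A quality filter like the one in the first bullet could look roughly like this in Python. ALTO records a word confidence in the `WC` attribute of each `String` element, as a value between 0 and 1. Averaging it per page and comparing against 0.90 is an assumption about how the threshold is applied, and the function names are invented for the sketch.

```python
import xml.etree.ElementTree as ET


def mean_word_confidence(alto_xml: str):
    """Average the WC attribute over all String elements, or None if absent."""
    root = ET.fromstring(alto_xml)
    scores = []
    for element in root.iter():
        # Drop the namespace so any ALTO schema version matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "WC" in element.attrib:
            scores.append(float(element.attrib["WC"]))
    return sum(scores) / len(scores) if scores else None


def passes_quality_filter(alto_xml: str, threshold: float = 0.90) -> bool:
    """True when the page's mean word confidence reaches the threshold."""
    mean = mean_word_confidence(alto_xml)
    return mean is not None and mean >= threshold


# Made-up two-word page for illustration.
SAMPLE = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"><Layout><Page>'
    '<PrintSpace><TextBlock><TextLine>'
    '<String CONTENT="On" WC="1.0"/><String CONTENT="dit" WC="0.5"/>'
    '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
)
print(mean_word_confidence(SAMPLE), passes_quality_filter(SAMPLE))
```

Pages without any confidence values are rejected here. Whether that is the right default depends on the collection.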

Page 23: Knowledge graph

DBpedia Spotlight
• Running your own endpoint is easy:
  • java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest
• Or with Docker:
  • docker build -f Dockerfile -t dutch_spotlight .
  • docker run -i -p 2223:80 dutch_spotlight spotlight.sh
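Once an endpoint like this is running, it can be queried over HTTP. The Python sketch below assumes the annotate service is reachable at `/annotate` under the URL given above, and that it returns the usual Spotlight JSON with a `Resources` list carrying `@surfaceForm` and `@URI`. The helper names and the confidence value are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of the annotate service for the local Dutch endpoint.
ENDPOINT = "http://localhost:2223/nl/rest/annotate"


def build_request(text: str, confidence: float = 0.5) -> Request:
    """Prepare a GET request that asks Spotlight for JSON output."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{ENDPOINT}?{query}", headers={"Accept": "application/json"})


def extract_entities(payload: dict):
    """Return (surface form, DBpedia URI) pairs from a Spotlight response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


def annotate(text: str, confidence: float = 0.5):
    """Call the endpoint; requires a running Spotlight server."""
    with urlopen(build_request(text, confidence)) as response:
        return extract_entities(json.load(response))
```

With the server running, `annotate("Koning Albert bezocht het front")` would return whatever entities the Dutch model recognises in that sentence.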

Page 24: Knowledge graph

Linked Data as a Service
• Allow federated queries
• Low server cost
• Be reliable
• Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web

Page 25: Knowledge graph

Linked Data Fragments querying
• VIAA is part of the family!

[Diagram: your browser runs a client-side algorithm that GETs fragments from:]
• http://data.viaa.be/ldf
• https://query.wikidata.org/bigdata/ldf
• http://data.linkeddatafragments.org/linkedgeodata
• http://data.linkeddatafragments.org/dbpedia2014
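Each "GET fragments" step asks a server for the triples matching a single triple pattern. A minimal Python sketch of building such a request URL follows. It assumes the `subject`, `predicate` and `object` query parameters that Linked Data Fragments servers commonly expose; a real client should read the parameter names from the hypermedia controls in each fragment. Treating http://data.viaa.be/ldf as the fragment URL is also an assumption.

```python
from urllib.parse import urlencode


def fragment_url(interface, subject=None, predicate=None, obj=None):
    """Build the URL of the fragment matching one triple pattern."""
    candidates = (("subject", subject), ("predicate", predicate), ("object", obj))
    # Unbound positions (variables) are simply left out of the query string.
    params = {name: value for name, value in candidates if value is not None}
    return f"{interface}?{urlencode(params)}" if params else interface


print(fragment_url(
    "http://data.viaa.be/ldf",
    predicate="http://www.bbc.co.uk/ontologies/creativework#title",
))
```

The client splits a SPARQL query into patterns like this one, fetches the fragments page by page, and joins the results locally. That is what keeps the server cost low.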

Page 26: Knowledge graph

Demo time!

Page 27: Knowledge graph

Demo

• Retrieve all newspaper titles:

SELECT DISTINCT ?title
WHERE {
  ?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title
}

Page 28: Knowledge graph

Demo
• Retrieve more info from corresponding DBpedia URI:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label ?comment
WHERE {
  <http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?db owl:sameAs ?tag .
  ?db rdfs:label ?label .
  ?db rdfs:comment ?comment
}

Page 29: Knowledge graph

Battle of the Somme
• Pages that mention military leaders from the Battle of the Somme, plus thumbnail:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?paper ?o ?thumbnail
WHERE {
  <http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o .
  ?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag .
  ?o owl:sameAs ?ctag .
  ?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail .
}

Page 30: Knowledge graph

Frontpainters
• Semi-automatic generation of collections, e.g. about front painters (war artists):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?newspaper ?artist ?tag ?hetarchief
WHERE {
  ?artist dc:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> .
  ?artist owl:sameAs ?tag .
  ?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief
}

Page 31: Knowledge graph

Conclusion

• Extra search method for our researchers
• NER versus OCR: enhanced findability
• Adding extra information (cfr. Abraham) requires effort, we need more TPF interfaces

Page 32: Knowledge graph

Future work
• Dereferenceable URIs
  • http://data.viaa.be/noid/{id}
• Content negotiation
  • HTML
  • JSON
  • RDF
• Save location with OLR
• Suggestions are welcome!
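Content negotiation means the same URI returns HTML, JSON or RDF depending on the request's Accept header. A minimal Python sketch of the selection step is below. Mapping "RDF" to `text/turtle` is an assumption, the function name is made up, and wildcards such as `*/*` are deliberately not handled and fall back to the default.

```python
# Supported representations; the RDF media type is an assumed choice.
FORMATS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "text/turtle": "RDF",
}


def negotiate(accept_header: str, default: str = "text/html") -> str:
    """Pick the supported media type with the highest q-value."""
    best, best_q = None, 0.0
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # q defaults to 1 when not given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if media_type in FORMATS and quality > best_q:
            best, best_q = media_type, quality
    return best or default


print(negotiate("text/turtle;q=0.9, text/html;q=0.5"))
```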

Page 33: Knowledge graph

Q&A

Brecht Van de Vyvere | @brechtvdv