Using Wikidata properties to improve search in Dutch historical newspapers Theo van Veen, SEA, 18-11-2016

Page 1: Using Wikidata properties to improve search in …files.meetup.com/17921502/SEA-11-2016-Industrial.pdfTheo van Veen, SEA, 18-11-2016 Additional motivation Theo van Veen, SEA, 18-11-2016

Using Wikidata properties to improve search in Dutch historical newspapers Theo van Veen, SEA, 18-11-2016

Page 2:

Content enrichment: purpose and approach

•  Making content easier to find and use, especially newspapers

•  By enriching the text, parts of the text and names in the text with, among other things, links to related information

•  This related information is in most cases linked data (Wikipedia, Polygoon newsreels)

•  Linked data is used to improve the usability of content by adding related information to the presentation

•  Linked data is used to improve the disclosure of content by adding related information to the search index

•  But … we want to hide SPARQL from the user

Page 3:

How will access and usability be improved?

1.  Because “things” are identified we can make a better distinction between things (thesaurus function)

2.  Because the identifiers are links to resource descriptions it is possible to present the content with context information about “things” in the content

3.  Relevant context information can be indexed as part of a “thing” so it can be used for searching

4.  Because the content is enriched with identified “things”, semantic search becomes possible using properties from external descriptions

1.  Identification

2.  Context

3.  Indexing

4.  Semantic search

Page 4:

Additional motivation

•  Libraries are more and more part of the outside world. Improving disclosure and usability requires intelligently connecting content with the outside world.

•  Content contains “knowledge” that cannot easily be found by means of conventional search. This requires intelligent preprocessing.

•  This knowledge should not first have to be searched for but should be offered on request after alerting the user.

•  Our software should have read and analyzed our content in full before the user does!

Page 5:

How to identify names in text?

•  By recognizing names (named entity recognition).

•  Those names then have to be identified.

•  How? By searching for them in Wikipedia/DBpedia and then linking them to the Wikipedia/DBpedia descriptions.

•  But: those names are ambiguous. Does Einstein link to Albert Einstein or Alfred Einstein?

•  So: we have to create software to improve the accuracy of links. Conventional “if then else” software is not fit for this job; we need machine learning techniques.

•  But: many false links and missing links still remain, and DBpedia does not contain everything.

•  So: we need user feedback for correction, for adding links for unrecognized names and for additional training of the software.
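
The Einstein ambiguity can be made concrete with a small sketch. It is not the machine learning approach used in the project: it swaps in a plain word-overlap heuristic, and the candidate table with its short descriptions is a hypothetical stand-in for what a DBpedia lookup might return.

```python
import re

# Hypothetical candidate list; the short descriptions are illustrative only
# and stand in for what a DBpedia lookup of "Einstein" might return.
CANDIDATES = {
    "Albert Einstein": "physicist theory of relativity nobel prize physics",
    "Alfred Einstein": "musicologist music critic mozart biographer",
}

def tokenize(text):
    """Lower-case a string and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def best_candidate(context, candidates):
    """Return (name, score) for the candidate sharing most words with the context."""
    context_words = tokenize(context)
    scored = [
        (len(context_words & tokenize(description)), name)
        for name, description in candidates.items()
    ]
    score, name = max(scored)
    return name, score

sentence = "Einstein received the Nobel Prize for his work in physics."
print(best_candidate(sentence, CANDIDATES))
```

A heuristic like this fails as soon as the context shares no words with any description, which is one reason to prefer a trained classifier over hand-written rules.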

Page 6:

Enrichment types

•  Newspaper articles and radio bulletins linked to Polygoon newsreels

•  Named entities linked to DBpedia (and VIAF, Wikidata etc.)

•  Place-street combinations in newspaper articles linked to latitude and longitude

•  Newspaper articles linked to images from Memory of the Netherlands

•  Linked NE’s: DBpedia, Wikidata, VIAF, Geonames, etc.

•  Geodata: street, place, lat., long.; place, lat., long.

•  Links: web pages, video, images, sound

•  Extracted features: classification, sentiment, relevance, interestingness

•  User annotation: tags, stories

•  Image enrichment: face recognition, emotion detection

(Slide legend: now available)

Page 7:

Steps in machine learning

1.  Matching Polygoon newsreels to articles on the basis of features such as named entity matching, string matching and date matching, using linear classification

2.  Linking named entities in news articles to DBpedia titles using linear classification with an SVM

3.  Classification using a neural network
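
As a rough illustration of step 1, the sketch below turns an article and a newsreel description into three features and applies a linear decision rule. The feature definitions, the weights, the bias and the sample records are all assumptions made for this example; in a real system the weights would be learned from labelled pairs rather than set by hand.

```python
from datetime import date
from difflib import SequenceMatcher

# Hypothetical hand-set weights and bias; a trained linear classifier
# would learn these from labelled article/newsreel pairs.
WEIGHTS = {"entity_overlap": 2.0, "title_similarity": 1.5, "date_closeness": 1.0}
BIAS = -2.0

def features(article, newsreel):
    """Compute three illustrative features for an article/newsreel pair."""
    a_entities, n_entities = set(article["entities"]), set(newsreel["entities"])
    union = a_entities | n_entities
    days_apart = abs((article["date"] - newsreel["date"]).days)
    return {
        # Share of named entities the two records have in common.
        "entity_overlap": len(a_entities & n_entities) / len(union) if union else 0.0,
        # String similarity of the two titles, between 0 and 1.
        "title_similarity": SequenceMatcher(
            None, article["title"].lower(), newsreel["title"].lower()
        ).ratio(),
        # 1.0 for the same day, falling towards 0 as the dates drift apart.
        "date_closeness": 1.0 / (1.0 + days_apart),
    }

def is_match(article, newsreel):
    """Linear decision rule: weighted feature sum plus bias, thresholded at zero."""
    x = features(article, newsreel)
    score = sum(WEIGHTS[name] * value for name, value in x.items()) + BIAS
    return score > 0

# Invented sample records, for illustration only.
article = {"title": "Queen opens new bridge",
           "entities": ["Juliana", "Rotterdam"], "date": date(1955, 5, 2)}
newsreel = {"title": "Queen opens new bridge in Rotterdam",
            "entities": ["Juliana", "Rotterdam"], "date": date(1955, 5, 3)}
unrelated = {"title": "Football final",
             "entities": ["Ajax"], "date": date(1960, 1, 1)}

print(is_match(article, newsreel), is_match(article, unrelated))
```

Each pair becomes a point in a feature space with one axis per feature, and the linear rule is a plane separating matches from non-matches.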

Page 8:

Machine learning for matching newspaper articles and Polygoon news reels

Page 9:

Matching newspaper articles by means of title, description and date of Polygoon videos

Page 10:

Matching by means of different features

[Figure legend: match, no match]

Page 11:

3-D feature space

Page 12:

Machine learning for entity linking

Page 13:

Named Entity Linking

[Diagram. Labels: process article; named entity recognition; get entities; search entity; DBpedia Solr index; DBpedia; list with Einsteins; find the best candidate; store article id + resource ids; enrichment database; enrichment and training; VIAF; Wikidata; etc.]

Page 14:

Index and use of resource identifiers

[Diagram. Labels: newspaper index (text + VIAF id + Wikidata id etc.); enrichment database; indexing; get text for article X; get enrichments for article X; search Wikidata ids; Wikidata; semantic search providing Wikidata ids; search]
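
The idea of indexing resource identifiers next to the article text can be sketched with a small in-memory index. This is only an illustration under stated assumptions: the real newspaper index is a Solr index, and the article ids, texts and function names here are invented. The Wikidata ids Q1203 and Q937 are used as example identifiers.

```python
from collections import defaultdict

def build_index(articles, enrichments):
    """Index each article under its words and under the resource ids linked to it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.lower().split():
            index[word].add(article_id)
        # Co-index the identifiers stored in the (hypothetical) enrichment database.
        for resource_id in enrichments.get(article_id, []):
            index[resource_id].add(article_id)
    return index

def search_ids(index, resource_ids):
    """Return the sorted ids of articles linked to any of the given resource ids."""
    hits = set()
    for resource_id in resource_ids:
        hits |= index.get(resource_id, set())
    return sorted(hits)

# Invented sample data, for illustration only.
articles = {
    "a1": "Lennon gives concert in Amsterdam",
    "a2": "Einstein lectures in Leiden",
}
enrichments = {"a1": ["Q1203"], "a2": ["Q937"]}

index = build_index(articles, enrichments)
print(search_ids(index, ["Q1203", "Q2599"]))
```

A semantic search step only has to deliver a list of identifiers; the index lookup itself stays an ordinary search.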

Page 15:

Timeline for enriching the newspapers

[Figure: quality/confidence (%), from 0 to 100, plotted against article number, from 1 to 108 million]

4 phases:

•  All DBpedia titles searched in news articles

•  Named entities searched in DBpedia

•  Speed-up by processing capacity of SURFsara

•  Using context and machine learning

Page 16:

From conventional entity linking to deep learning and beyond

                                  accuracy   link recall   link precision   link F-measure
conventional                      .76        .76           .65              .70
SVM                               .85        .76           .84              .80
SVM (balanced)                    .83        .81           .76              .79
neural network                    .83        .75           .84              .79
features, features and features   ?          ?             ?                ?
crowdsourcing                     ?          ?             ?                ?
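
The link F-measure column can be checked against the link recall and link precision columns, assuming it is the balanced F1 score (the harmonic mean of precision and recall); the slide does not say which variant was used.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (link recall, link precision) pairs as reported on the slide.
reported = {
    "conventional": (0.76, 0.65),
    "svm": (0.76, 0.84),
    "neural network": (0.75, 0.84),
}

for name, (recall, precision) in reported.items():
    print(name, round(f_measure(precision, recall), 2))
```

Under that assumption the three rows above reproduce the reported values. The balanced SVM row comes out at about .78 from the rounded inputs against a reported .79, which may simply reflect rounding of the recall and precision figures.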

Page 17:

How to present enrichments in Delpher, the main portal to books, newspapers and serials?

•  Links to Wikipedia?

•  Adding images from Wikipedia to the text?

•  Showing the abstract from Wikipedia on mouse-over?

•  Letting users decide for themselves?

Page 20:

How to present enrichments in Delpher?

•  Links to Wikipedia?

•  Adding images from Wikipedia to the text?

•  Showing the abstract from Wikipedia on mouse-over?

•  Letting users decide for themselves?

For the time being we use a research portal (xportal) to show enriched search and a browser extension to add enriched information to Delpher.

Pages 21-25:

1.  Identification

2.  Context

3.  Co-indexing

4.  Semantic search

[member of The Beatles]

Page 26:

Hiding SPARQL from end users: the term between square brackets is expanded in several ways by querying Wikidata via SPARQL.
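
How the expansion is implemented is not described in the slides. The sketch below shows one way a bracketed term such as [member of The Beatles] could be turned into a SPARQL query for the Wikidata query service. The two lookup tables and both function names are hypothetical; P463 ("member of") and Q1299 (The Beatles) are existing Wikidata identifiers.

```python
import re

# Hypothetical lookup tables from phrases to Wikidata identifiers.
# A real system would resolve these by searching Wikidata itself.
RELATIONS = {"member of": "P463"}
ITEMS = {"the beatles": "Q1299"}

def bracket_terms(query):
    """Return the terms written between square brackets in a user query."""
    return re.findall(r"\[([^\]]+)\]", query)

def to_sparql(term):
    """Translate a 'relation item' term into a SPARQL query, or None if unknown."""
    lowered = term.lower()
    for phrase, prop in RELATIONS.items():
        if lowered.startswith(phrase + " "):
            item = ITEMS.get(lowered[len(phrase):].strip())
            if item:
                # wdt: and wd: are predefined prefixes on the Wikidata query service.
                return f"SELECT ?item WHERE {{ ?item wdt:{prop} wd:{item} . }}"
    return None

for term in bracket_terms("[member of The Beatles] concert"):
    print(to_sparql(term))
```

The identifiers returned by such a query could then be searched in the newspaper index, so the user only ever types the bracketed term.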

Page 27:

“Heel Holland verrijkt” (“All of Holland enriches”), starting at KB!

To improve the automatically generated enrichments and add new enrichments we need user feedback. This feedback can also be used for additional training of our disambiguation software.

Page 29:

Next steps

•  Improving accuracy by changing from linear classification to a neural network

•  Crowdsourcing by KB employees before broadening the audience

•  Use of non-Wikidata identifiers when a resource is not in Wikidata

•  Combining Solr with RDF and SPARQL to remove the limit on the number of Wikidata identifiers in a Solr query
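
To make the last point concrete: when a semantic search yields many Wikidata identifiers, they all have to go into the Solr query as one long boolean expression. The sketch below builds such query strings and splits the identifiers into batches as a simple workaround. It is not the Solr plus RDF/SPARQL combination planned above, and the field name `wd_id` and the batch size are assumptions. Solr does have a configurable limit on boolean clauses (`maxBooleanClauses`), but whether that is the limitation meant here is not stated.

```python
def solr_id_queries(wikidata_ids, field="wd_id", max_ids=1000):
    """Build Solr query strings that OR together at most max_ids identifiers each."""
    ids = list(wikidata_ids)
    return [
        f"{field}:({' OR '.join(ids[start:start + max_ids])})"
        for start in range(0, len(ids), max_ids)
    ]

# Three identifiers with a batch size of two give two separate queries.
for query in solr_id_queries(["Q1203", "Q2599", "Q2632"], max_ids=2):
    print(query)
```

Batching multiplies the number of requests and leaves the merging of results to the caller, which suggests why a tighter integration of the two systems is attractive.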

Page 30:

The higher goal

Our software should have read and analyzed our content completely!

Page 31:

Any questions?