British Library Labs - Overview Talk 2017

Context and collections, and the British Library Ben O’Steen, British Library Labs @benosteen

Transcript of British Library Labs - Overview Talk 2017

Page 1: British Library Labs - Overview Talk 2017

Context and collections, and the British Library

Ben O’Steen, British Library Labs

@benosteen

Page 2: British Library Labs - Overview Talk 2017
Page 3: British Library Labs - Overview Talk 2017

The British Library

St Pancras, London, UK

Legal Deposit Library – Reference only

Many books are stored 4 storeys below the building

Inside the British Library

Space for 1200 readers, around 400,000 visitors per year

Document Supply and Storage at Boston Spa

Uses low oxygen and robots

Reading room and delivery to London

Stockton-on-Tees

Author right to payment each time their books are borrowed from public libraries.

Page 4: British Library Labs - Overview Talk 2017

Living Knowledge Vision (2015 – 2023)

Custodianship Research Business

Culture Learning International

Document: http://goo.gl/h41wW7

Speech: https://goo.gl/Py9uHK

Roly Keating (Chief Executive Officer of the British Library)

To make our intellectual heritage accessible to everyone, for research, inspiration and enjoyment and be the most open, creative and innovative institution of its kind by 2023.

Page 5: British Library Labs - Overview Talk 2017

Collections – not just books!

> 180* million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 3* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents

King’s Library

*Estimates

Page 6: British Library Labs - Overview Talk 2017

Wider…not just Researchers

Researchers https://goo.gl/WutNyi

Artists http://goo.gl/nNKhQ2

Librarians

Curators

https://goo.gl/9NWZUW

Software Developers https://goo.gl/7QQ5Tf

Archivists https://goo.gl/x7b4tg

Educators

https://goo.gl/qh01Mi

Page 7: British Library Labs - Overview Talk 2017
Page 8: British Library Labs - Overview Talk 2017
Page 9: British Library Labs - Overview Talk 2017
Page 10: British Library Labs - Overview Talk 2017
Page 11: British Library Labs - Overview Talk 2017

Digital research methods

Visualisations

Application Programming Interfaces for datasets e.g. Metadata, Images

Annotation

Location based searching & Geo-tagging

Crowdsourcing

Human Computation

Page 12: British Library Labs - Overview Talk 2017

How did we do this?

Page 13: British Library Labs - Overview Talk 2017

Competitions

Tell us your ideas of what to do with our digital content

Awards

Show us what you have already done with our digital content in research, artistic, commercial and learning and teaching categories

Projects

Talk to us about working on collaborative projects

Page 14: British Library Labs - Overview Talk 2017

Getting to the heart of it

British Library Labs works with researchers on their specific problems, trying to assess how widely this problem is felt.

With their help, we talk to communities of researchers and try to pinpoint what they need as opposed to what they think they need to ask us.

Page 15: British Library Labs - Overview Talk 2017

Researchers often ask for all the content we have.

What does that mean for digitised items in practice?

Page 16: British Library Labs - Overview Talk 2017

Taking a peek at our Open Data

A digitised book…

Page 17: British Library Labs - Overview Talk 2017

002819694

Page 18: British Library Labs - Overview Talk 2017
Page 19: British Library Labs - Overview Talk 2017
Page 20: British Library Labs - Overview Talk 2017

OCR XML generated by ABBYY FineReader

Page 21: British Library Labs - Overview Talk 2017

Could Labs provide other ways to understand this book?

Page 22: British Library Labs - Overview Talk 2017
Page 24: British Library Labs - Overview Talk 2017

Optical Character Recognition (OCR) generated text

Scanned Page

Image on Flickr Commons

https://goo.gl/AC43vs

Page 26: British Library Labs - Overview Talk 2017
Page 27: British Library Labs - Overview Talk 2017

Tagging, Tagging, Tagging…

Page 28: British Library Labs - Overview Talk 2017

Iterative crowdsourcing?

(The term is borrowed from Mia Ridge.)

1. Crowdsource broad facts and subcollections of related items emerge.

2. No 'one-size-fits-all': Subcollections allow for more focussed curation.

GOTO 1

Page 30: British Library Labs - Overview Talk 2017
Page 31: British Library Labs - Overview Talk 2017

SherlockNet: Competition Winner 2016

Karen Wang, Luda Zhao and Brian Do

Using Convolutional Neural Networks to Automatically Tag and Caption the British Library Flickr Commons 1 million Image Collection

12 categories

>20 million tags added >100,000 captions

bit.ly/sherlocknet

Pooled surrounding OCR text on page from similar images

Used Microsoft COCO (photographs) & British Museum Prints and Drawings collections as training sets.

Tags Captions

Page 32: British Library Labs - Overview Talk 2017

Artistic / Creative Works

http://goo.gl/dM8ieA

Mario Klingemann (2015)

David Normal 2014 and 2015

Kris Hoffman (2016)

https://goo.gl/QilqqT

Jiayi Chong 2016 Ling Low 2016

https://www.youtube.com/watch?v=bcOP1E5bRE0

https://www.facebook.com/RealmlandStory/ Paul Rand Pierce 2016

A Hat on the Ground Spells trouble

Tragic Looking Women

44 Men who Look 44 (Notice the direction faces)

Page 33: British Library Labs - Overview Talk 2017

Imaginary Cities – BL Labs Project 16-17

Michael Takeo Magruder

https://goo.gl/4ARwTy

An artistic exploration seeking to create provocative fictional cityscapes for the Information Age from the British Library’s digital collection of historic urban maps

Page 34: British Library Labs - Overview Talk 2017
Page 35: British Library Labs - Overview Talk 2017

Mario Klingemann 2016

https://www.youtube.com/watch?v=xgnxnmqnR7Y

Google Arts and Culture Lab – Experiments with Machine Learning

https://artsexperiments.withgoogle.com/

Page 36: British Library Labs - Overview Talk 2017
Page 37: British Library Labs - Overview Talk 2017
Page 40: British Library Labs - Overview Talk 2017

MIT Moral Machine survey: http://moralmachine.mit.edu/

Page 41: British Library Labs - Overview Talk 2017

Presentation shapes perception

Page 42: British Library Labs - Overview Talk 2017
Page 43: British Library Labs - Overview Talk 2017
Page 44: British Library Labs - Overview Talk 2017
Page 45: British Library Labs - Overview Talk 2017
Page 47: British Library Labs - Overview Talk 2017

David Normal

http://www.davidnormal.com/

Page 48: British Library Labs - Overview Talk 2017
Page 49: British Library Labs - Overview Talk 2017
Page 50: British Library Labs - Overview Talk 2017

Burning Man Festival

David Normal created light boxes around the Burning Man, using the British Library’s Flickr images

Page 51: British Library Labs - Overview Talk 2017

“Crossroads of Curiosity” (20 June to November 2015)

Page 52: British Library Labs - Overview Talk 2017
Page 53: British Library Labs - Overview Talk 2017

But how can anyone find anything useful?

Page 54: British Library Labs - Overview Talk 2017

John Cooper, https://www.flickr.com/photos/atomicshed/2436324958 CC-BY-NC-ND 2.0

Page 55: British Library Labs - Overview Talk 2017
Page 56: British Library Labs - Overview Talk 2017

Infancy of understanding

Large-scale analysis of text is evolving but young.

Exasperating situation where ‘black boxes’ of algorithms are used to draw conclusions.

http://www.scottbot.net/HIAL/?p=41271

Page 57: British Library Labs - Overview Talk 2017

“Black Boxes”: a misnomer

It is legitimate and useful to use code that you could not write.

It is not legitimate to simply believe the ‘label’ on the side of the box.

E.g. “Sentiment Analysis” is often nothing of the sort.

Page 58: British Library Labs - Overview Talk 2017

Quoting Scott Weingart: (emphasis mine)

● Do sentiment analysis algorithms agree with one another enough to be considered valid?

● Do sentiment analysis results agree with humans performing the same task enough to be considered valid?

● Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?

● Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?

● If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?

● Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?

(again from http://www.scottbot.net/HIAL/?p=41271)
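The first two questions are about agreement, which can be measured. One common measure is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the two label lists are invented, and nothing in the talk says this statistic was used by Weingart or by Labs.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two labellers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeller assigned labels at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented outputs of two hypothetical sentiment tools on the same six passages.
tool_one = ["pos", "neg", "neg", "pos", "neu", "pos"]
tool_two = ["pos", "neg", "pos", "pos", "neg", "pos"]
print(round(cohen_kappa(tool_one, tool_two), 3))
```

The same function works for tool-versus-human comparisons: pass the human labels as one of the two lists.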

Page 59: British Library Labs - Overview Talk 2017
Page 61: British Library Labs - Overview Talk 2017

* (2012) https://ariddell.org/where-are-the-novels.html

Page 62: British Library Labs - Overview Talk 2017

Digitisation

Often through Partnerships with Commercial & Other Organisations

Bias in digitisation

http://goo.gl/bR9UJL

Sample Generator

Page 63: British Library Labs - Overview Talk 2017

Open Licensed Digital Content?

15% Openly Licensed

Around 10%* available online

Working through

Breakdown by collection*

Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%

*Based on digitisation projects

Largest proportion of funding

Public / Private Partnership

15%* Openly Licensed

85%* Available onsite

*Estimates

Page 64: British Library Labs - Overview Talk 2017

Accessing digital collections onsite

OPEN £

• Have to be ‘onsite’

• Need to be security cleared for some collections – hence ‘Researcher in Residence Model’

• Permission required (depending on ‘story’ of collection)

• Content on various media formats

• 20% re-use of material for non-commercial research for some collections

• We are learning the ‘pathways’ so that providing onsite access becomes ‘everyday’ in the future

Page 65: British Library Labs - Overview Talk 2017

Typical pattern of research for Labs

• Finding invisible things in ‘messy’ historical data

• Unearthing / unlocking hidden histories and data to stimulate new research

• Celebrating hidden histories / data creatively through events, art and performance

Page 66: British Library Labs - Overview Talk 2017

Finding things in messy OCR text

Mrs Folly

• Clean up some manually
• Get human ‘ground truth’
• Write code to find things reliably in it automatically
• Try code on messy content
• Tweak if necessary
• Digital ‘lasso’ around content
• Human sift through
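The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code Labs used: the pattern, the sample lines and the hand labels are all invented, with only the target name "Mrs Folly" taken from the slide.

```python
import re

# Hypothetical OCR-tolerant pattern for "Mrs Folly". It allows a few common
# confusions: long s read as 'f', 'l' read as '1' or 'i', 'o' read as '0'.
PATTERN = re.compile(r"M[rn][sf5]\.?\s*[FP][o0][l1i]{2}y", re.IGNORECASE)

def lasso(lines, pattern=PATTERN):
    """Return the indices of lines the pattern flags for a human to sift."""
    return {i for i, line in enumerate(lines) if pattern.search(line)}

def precision_recall(found, truth):
    """Compare flagged lines against the hand-labelled 'ground truth'."""
    hits = len(found & truth)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Invented sample lines; a human has marked which really mention Mrs Folly.
sample = [
    "MRS FOLLY will appear this evening",   # 0: relevant
    "the part of Mrf Fo1ly by a lady",      # 1: relevant
    "a folly of the highest order",         # 2: not relevant
    "Mrs. Folley's benefit night",          # 3: relevant, variant spelling
    "Mr Foley and company",                 # 4: not relevant
]
ground_truth = {0, 1, 3}

found = lasso(sample)
precision, recall = precision_recall(found, ground_truth)
print(sorted(found), precision, round(recall, 2))
```

On this sample the pattern misses the "Folley" spelling, so recall falls below 1. A miss like that is what the "tweak if necessary" step responds to, and the remaining flagged lines still go to a human to sift.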

Page 67: British Library Labs - Overview Talk 2017

Code: Machine Learning / Reading

• Analogies to how humans read / learn

• Machines acquire ‘knowledge’ / data and use that knowledge / data to make sense / identify patterns

• Labs doing this on a case by case basis so methods can vary

• Need computational AND human effort

• Legalities of this process being ‘ironed’ out with publishers

• Often a misunderstood area…

• Computers look for ‘patterns’ or the ‘essence’ of something

Page 68: British Library Labs - Overview Talk 2017
Page 69: British Library Labs - Overview Talk 2017
Page 70: British Library Labs - Overview Talk 2017
Page 72: British Library Labs - Overview Talk 2017

Katrina Navickas (2015) Political Meetings Mapper

http://politicalmeetingsmapper.co.uk https://goo.gl/Qq78Oa

Labs Symposium 2015

https://goo.gl/BSA3be

Interview 2015

The Chartist Newspaper

http://goo.gl/vOLSnH

Chartist Monster Meeting

Chartists Walking Tour and Re-enactment London

Page 73: British Library Labs - Overview Talk 2017

Working with Newspaper Collections

Using Jupyter Notebooks

Page 74: British Library Labs - Overview Talk 2017

Virtual Infrastructure for OCR text

OCR text scraped from digitised newspapers and held in the cloud

Jupyter notebook

Write Python code and see results in browser

http://jupyter.org

Access available for researchers ‘in residence’

Page 75: British Library Labs - Overview Talk 2017

Black Abolitionists in the UK

Researcher: Hannah-Rose Murray

Page 76: British Library Labs - Overview Talk 2017

Black Abolitionist Performances & their Presence in Britain (2016) – Hannah-Rose Murray

Aberdeen Journal, 5 February 1851 “Fugitive Slaves”

Aberdeen Journal, 14 April 1847 “Frederick Douglass, The Emancipated Slave”

Frederick Douglass

Ellen Craft

Josiah Henson

Ida B Wells

A Performance by Joe Williams & Martelle Edinborough

http://frederickdouglassinbritain.com/

Page 77: British Library Labs - Overview Talk 2017
Page 78: British Library Labs - Overview Talk 2017

Use of Overproof / OCR Correction?

Re-OCR with ABBYY FineReader?

https://www.abbyy.com/en-gb/

http://overproof.projectcomputing.com/

Page 79: British Library Labs - Overview Talk 2017
Page 80: British Library Labs - Overview Talk 2017

Surveyed a set portion of the collection for words we were interested in, and those 1 and 2 ‘distant’ from these (Levenshtein distance).
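As an illustration of that survey, the sketch below computes Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions) and groups tokens by how far they sit from a word of interest. The target word and the token list are invented for the example, and `near_variants` is a hypothetical helper name, not something from the project.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, using row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def near_variants(target, tokens, max_distance=2):
    """Group lower-cased tokens by their distance from target, up to max_distance."""
    groups = {d: set() for d in range(max_distance + 1)}
    for token in tokens:
        token = token.lower()
        d = levenshtein(target, token)
        if d <= max_distance:
            groups[d].add(token)
    return groups

# Invented tokens, mixing plausible OCR variants with unrelated words.
tokens = ["slave", "flave", "slavc", "flavc", "stave", "shaven", "glove", "trade"]
for distance, words in near_variants("slave", tokens).items():
    print(distance, sorted(words))
```

Even in this toy list, distance 1 and 2 pull in real but unrelated words ("stave", "glove") alongside OCR variants, so the output is a candidate list to review rather than a final answer.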

Page 81: British Library Labs - Overview Talk 2017
Page 82: British Library Labs - Overview Talk 2017

Naive-Bayes Classifier:

Page 83: British Library Labs - Overview Talk 2017

Classifiers allowed us to prioritise relevant articles without reading them:
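A minimal sketch of the idea, assuming a two-class multinomial Naive Bayes model with add-one smoothing written in plain Python. The class names, training snippets and unread articles are all invented; the talk does not say which implementation or features were used.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / (total + len(self.vocab)))
        return score

    def relevance(self, text):
        """Log-odds of 'relevant' over 'other'; higher means read it sooner."""
        return self.log_score(text, "relevant") - self.log_score(text, "other")

nb = NaiveBayes()
nb.train("fugitive slave lecture at the town hall", "relevant")
nb.train("anti slavery meeting addressed by a fugitive slave", "relevant")
nb.train("price of corn at the market", "other")
nb.train("shipping news arrivals at the port", "other")

unread = [
    "corn market prices and shipping arrivals",
    "lecture by an escaped slave at the chapel",
]
ranked = sorted(unread, key=nb.relevance, reverse=True)
for article in ranked:
    print(round(nb.relevance(article), 2), article)
```

Sorting by the score gives a reading order rather than a verdict: the articles most likely to be relevant come first, and a human still decides what they mean.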

Page 84: British Library Labs - Overview Talk 2017

Data-mining verse in 18th Century newspapers

BL Labs Project 16-17, Jennifer Batt

https://goo.gl/5Akthd

Slides courtesy Jennifer Batt

Jennifer Batt @ the BL on World Poetry Day

Page 85: British Library Labs - Overview Talk 2017

What thoj' among ourrelves, with too much Heat, or t W: fweutimes.wongle, wvhen we Ihould debate, W – (A confequential Ill which Freedom drawvs, fl t A bad Efficf, but from a noble Caufe) t We can with univeifal Zcal advance, to To cutb the faithlefs Arrogancccof V rance. hi

Dublin Journal 10-14 September, 1745

Slides courtesy Jennifer Batt

Page 86: British Library Labs - Overview Talk 2017

Verse: 81% lines begin with initial capital

Prose: 52% lines begin with initial capital

Westminster Journal 3 March 1745

Slides courtesy Jennifer Batt
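The initial-capital observation suggests a simple feature for separating verse from prose. The sketch below measures the share of lines that begin with a capital letter. The 0.7 threshold is an assumption, chosen only because it falls between the 52% and 81% figures above, and both sample passages are invented.

```python
def capital_line_fraction(text):
    """Share of non-empty lines whose first character is an upper-case letter."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if line[0].isupper()) / len(lines)

def looks_like_verse(text, threshold=0.7):
    """Crude guess: verse if most lines open with a capital (threshold is assumed)."""
    return capital_line_fraction(text) >= threshold

# Invented samples standing in for a column of verse and a column of prose.
verse_sample = """The morning bell rings out the hour,
And shadows flee the ancient tower,
While in the street the criers call,
And news is read to one and all."""

prose_sample = """Yesterday arrived the mail from Holland, which
brings advice that the fleet has sailed for the
Baltic, and the merchants of this city met on
the quay to consider the state of trade."""

print(capital_line_fraction(verse_sample), looks_like_verse(verse_sample))
print(capital_line_fraction(prose_sample), looks_like_verse(prose_sample))
```

On real OCR the first character of a line is often misread, so a feature like this is better used to shortlist passages for a human reader than to classify them outright.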

Page 87: British Library Labs - Overview Talk 2017
Page 88: British Library Labs - Overview Talk 2017

http://varianceexplained.org/r/kmeans-free-lunch/

Page 89: British Library Labs - Overview Talk 2017
Page 90: British Library Labs - Overview Talk 2017

In Summary:

- Context about how a digitised image came to be and why it was scanned is both crucial to understand and sometimes crucial to hide.

- aka Opening up large collections brings its own issues.

- Presentation shapes perception.

- Too much trust in black box algorithms, like search engines or social feed suggestions.

- So little of our history is online that there is a natural bias. The gaps are being filled in with less credible sources.

- It still might have happened even if you cannot google it, and vice versa!