Enabling complex analysis of large scale digital collections


Transcript of Enabling complex analysis of large scale digital collections

Page 1: Enabling complex analysis of large scale digital collections

Research data spring

Enabling Complex Analysis of Large Scale Digital Collections

14/7/2015

Lots of money has been spent digitising heritage collections. Digitised heritage collections are data. But non-computationally trained scholars don't know what to ask of large quantities of data. Often they do not have access to high performance computing facilities and they don’t know how to use them.

We have addressed this fundamental problem by extending research data management processes in order to enable novel research in the arts, humanities, and social and historical sciences and a deeper understanding of emerging research needs. In our first phase, we have successfully implemented large scale, complex search of a digitised collection: now we scale up…

Page 2: Enabling complex analysis of large scale digital collections

More & more digitised content is in the public domain

Page 3: Enabling complex analysis of large scale digital collections

UK eScience infrastructure not used in A+H or SHS

Page 4: Enabling complex analysis of large scale digital collections

Phase 1: take 64,000 British Library digitised books

Page 5: Enabling complex analysis of large scale digital collections

See how we can analyse them using UCL’s HPC

Page 6: Enabling complex analysis of large scale digital collections

Moving beyond restrictive basic searches

Page 7: Enabling complex analysis of large scale digital collections

Team

James Hetherington, Research Software Engineer

Page 8: Enabling complex analysis of large scale digital collections

Work with researchers 1: detect trends

Anne Welsh, Lecturer in Library and Information Studies, UCL

Interested in the growth of professions in the Victorian era. Needs to be able to run AND, OR, NOT, and AND NOT Boolean queries, which are beyond the capabilities of current large-scale digitisation search functions.
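A minimal sketch of what such a Boolean query could look like in Python. The miniature corpus, the book identifiers, and the `matches` helper are hypothetical illustrations, not the project's actual code or data.

```python
import re

def tokens(text):
    """Return the set of lower-cased words in one document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean filter: AND over all_of, OR over any_of, NOT over none_of."""
    words = tokens(text)
    if not all(term in words for term in all_of):
        return False
    if any_of and not any(term in words for term in any_of):
        return False
    return not any(term in words for term in none_of)

# Hypothetical three-book corpus standing in for the digitised collection.
books = {
    "book-1": "The engineer and the surveyor formed a new profession.",
    "book-2": "A surveyor measured the estate.",
    "book-3": "The engineer, a soldier by training, built the bridge.",
}

# engineer AND (profession OR bridge) AND NOT soldier
hits = [
    book_id
    for book_id, text in books.items()
    if matches(
        text,
        all_of=["engineer"],
        any_of=["profession", "bridge"],
        none_of=["soldier"],
    )
]
print(hits)
```

On a real collection the same filter would run over each book's text in parallel rather than over an in-memory dictionary.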

Page 9: Enabling complex analysis of large scale digital collections

Work with researchers 2: compare data sources

Oliver Duke-Williams, Lecturer in Digital Information Studies, UCL

Interested in history of demographics and health data. Can we track the prevalence of diseases in the corpus, and do they relate to known epidemics, using existing data?
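One way to approach this question, sketched in Python: count how often each disease term appears per publication year, normalised per 1,000 words so that years with more digitised text are not over-represented. The `(year, text)` record format and the sample sentences are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

def term_rate_by_year(records, terms):
    """Return {year: {term: occurrences per 1,000 words}}.

    records: iterable of (publication_year, full_text) pairs (assumed format).
    terms:   set of lower-case disease names to look for.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for year, text in records:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        for word in words:
            if word in terms:
                counts[year][word] += 1
    return {
        year: {term: 1000 * counts[year][term] / totals[year] for term in terms}
        for year in sorted(totals)
        if totals[year]  # skip years with no text to avoid dividing by zero
    }

# Hypothetical records; real input would come from the digitised books.
sample = [
    (1832, "cholera cholera fever in the town"),
    (1840, "measles among the children"),
]
rates = term_rate_by_year(sample, {"cholera", "measles"})
for year, by_term in rates.items():
    print(year, by_term)
```

The resulting yearly rates could then be set alongside recorded death counts from known epidemics to see whether peaks coincide.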

Page 10: Enabling complex analysis of large scale digital collections

Cholera

First outbreak in UK, 1831-2: c. 55,000 deaths
1848-49: 53,293 deaths (England)
1853-54: c. 11,000 UK deaths ('John Snow / Broad Street pump' epidemic)
1863, East London: c. 6,000 deaths

Deaths in England    1838      1839
Measles              6,514     10,937
Whooping cough       9,107     8,165
Consumption          59,025    59,559

Page 11: Enabling complex analysis of large scale digital collections

Work with researchers 3: visualise content

Will Finley, PhD Student, History, University of Sheffield

Interested in the history of printed book illustration, 1750-1850. How can we analyse and visualise how the size and placing of illustrations in the corpus change over time?

Page 12: Enabling complex analysis of large scale digital collections

All outputs documented on GitHub

»https://github.com/UCL-dataspring

Page 13: Enabling complex analysis of large scale digital collections

Including all code, recipes, & visualisations

Page 14: Enabling complex analysis of large scale digital collections

Explained in a series of blog posts

»http://britishlibrary.typepad.co.uk/digital-scholarship/2015/07/turning-research-questions-into-computational-queries.html

»http://bit.ly/dataspring

Page 15: Enabling complex analysis of large scale digital collections

Overview

»Not a Research Project

»Not an API

»Not replicating existing search facilities

»How can we provide access to data and compute?

»What are the technical issues in using eScience infrastructure for cultural and heritage datasets?

»How can we train people in the A+H, and Libraries, to use this?

»How can we scale this up across the arts and humanities, and social and historical sciences?

Page 16: Enabling complex analysis of large scale digital collections

Scaling Up 1: more, different data

• 25,000 texts from the first phase of EEBO-TCP

• 1473 to 1700, 2m pages, 1b words, public domain

• Little overlap with BL data

• We have global search of the BL data working. Adding EEBO-TCP will allow us to compare different ingest issues

• Inform data service providers about issues in using different textual data sets

Page 17: Enabling complex analysis of large scale digital collections

Scaling Up 2: More researchers, understanding needs

Page 18: Enabling complex analysis of large scale digital collections

Scaling Up 3: making researchers into independent users

• Moving away from the “tame programmer in the room”

• Building a set of reusable recipes

• Training A+H, SHS researchers and Librarians to be able to run queries themselves

• Core set of fundamental queries that can be tweaked by individual researchers to search for unique terms

• By end of Phase 2: Have researchers searching successfully without the help of programmers or data scientists

• In preparation for Phase 3, where we train others from the UK in the setup and querying of textual data using existing HPC facilities.
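The idea of a core query that individual researchers tweak could be sketched as a small template in Python: the recipe stays fixed and only the search terms change. The `make_recipe` helper, the `(year, text)` record format, and the sample titles are hypothetical and are not taken from the project's repository.

```python
import re
from collections import Counter

def make_recipe(terms):
    """Build a reusable query: count books per year mentioning any term."""
    if not terms:
        raise ValueError("at least one search term is required")
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def run(records):
        """records: iterable of (publication_year, text) pairs (assumed format)."""
        hits = Counter()
        for year, text in records:
            if pattern.search(text):
                hits[year] += 1
        return dict(sorted(hits.items()))

    return run

# Hypothetical records standing in for the ingested corpus.
records = [
    (1851, "The Great Exhibition opened"),
    (1851, "A treatise on railways"),
    (1860, "Railway timetables"),
]

# A researcher only changes the term list; the recipe itself is untouched.
railway_query = make_recipe(["railway", "railways"])
railway_hits = railway_query(records)
print(railway_hits)
```

Keeping the terms as the only parameter means a researcher can adapt the query without editing the counting logic.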

Page 19: Enabling complex analysis of large scale digital collections

Plan & Outputs

» Month 1: identify researchers. Ingest EEBO-TCP. Stress test existing queries, develop search templates

» Month 2: Training with a core set of researchers to adopt and implement queries. Documentation and development of training.

» Month 3: Independent Search workshops – software carpentry for A+H research computing

» Month 4: Reflection, write up, preparation of public facing materials that tell others how to do this.

» Fully documented on GitHub repo
› https://github.com/UCL-dataspring
› Cluster code
› Raw results
› Visualisations
› User guides

» Publicly presented (will also set up dedicated blog, social media channels, etc in Phase 2)

» Submission of an academic paper on the project to a leading conference

Page 20: Enabling complex analysis of large scale digital collections

Funding

»Pitching for the whole £40,000

»We need adequate funding to pay for a research programmer to:
› Set up the infrastructure for training
› Prepare training materials
› Ingest new data set

»Also, other staff time, data preservation costs, travel between sites

»Full support from UCL in FEC

Page 21: Enabling complex analysis of large scale digital collections

Phase 2: Make digitised books truly searchable

Page 22: Enabling complex analysis of large scale digital collections

Not for the pitch, but please fill in

»Contact person: still Melissa Terras

»Social media presence: @melissaterras and @j_w_baker
