NHM Data Portal: first steps toward the Graph-of-Life

NHM Data Portal: first steps toward the Graph-of-Life. Vince Smith, Ben Scott & Ed Baker, Informatics & Digital Collections Group, NHM London. SPNHC, Berlin, 23 June 2016


Page 1: NHM Data Portal: first steps toward the Graph-of-Life

NHM Data Portal: first steps toward the Graph-of-Life

Vince Smith, Ben Scott & Ed Baker
Informatics & Digital Collections Group, NHM London

SPNHC, Berlin, 23 June 2016

Page 2: NHM Data Portal: first steps toward the Graph-of-Life

NHM Collection

Collection area     No of objects  No of type specimens  Physical register  Digital data
Palaeontology           6,919,207                43,146          2,364,232       340,636
Mineralogy                423,563                   615            425,000       402,727
Botany                  5,863,000               172,750            127,200       645,222
Entomology             33,753,257               612,796             57,197       255,000
Zoology                27,501,350               325,000          1,986,000     1,160,216
Library & archives      5,460,000                     -                  -             -
TOTAL                  79,920,377             1,154,307          4,959,629     2,803,801

<3% of NHM specimens are digitised, & even fewer are ‘computable’

Page 3: NHM Data Portal: first steps toward the Graph-of-Life

Digital Science at the NHM

• Citizen science
• Big, open, linked data
• High-throughput digitisation
• Data portal and tools
• Text mining
• Robotics


Page 5: NHM Data Portal: first steps toward the Graph-of-Life

NHM Digital Collections Access, pre-2015

• Developed with the best of intentions, but…
• 23 separate interfaces
• Hard to find, cite, access and integrate
• No maps, few images, slow, no statistics, no export, few updates, no authors, no citation mechanisms, no GBIF connection

Page 6: NHM Data Portal: first steps toward the Graph-of-Life
Page 7: NHM Data Portal: first steps toward the Graph-of-Life

NHM Data Portal

• Discovery of NHM collections & research data
• Easy access & reuse to promote collaboration (website, API, R-package, RDF & direct download)
• 3.7m records, >1m images (+sound, video & 3D)
• Integrates with our collection management system (weekly) & DAM system (for images)
• Traffic light data quality indicators
• Stable, citable (DataCite) identifiers on datasets & GUIDs on records to measure impact
• Technically sustainable & scalable
• Default open licensing (CC-Zero, CC-BY, CC-BY-NC)

http://data.nhm.ac.uk

Page 8: NHM Data Portal: first steps toward the Graph-of-Life

CKAN – the technical foundation for the portal

• Enterprise, open source data portal platform
• Developed by Open Knowledge Foundation
• Used by 31 national governments, 74 regional authorities, academia & large commercial organisations
• Key features
  o Publish & find datasets
  o Store & manage large data
  o Robust API
  o Customise & extend
  o Sustainable

http://ckan.org/ e.g. http://data.gov.uk/
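As a rough illustration of the "Robust API" point, the sketch below builds a CKAN Action API (v3) request URL and unwraps CKAN's standard reply envelope. The helper names `action_url` and `unwrap` are hypothetical, the dataset name in the canned reply is invented, and no network call is made; it only shows the general shape of a CKAN request, not how the NHM Portal itself is configured.

```python
import json
from urllib.parse import urlencode


def action_url(base_url, action, **params):
    """Build a CKAN Action API (v3) URL, e.g. .../api/3/action/package_search?q=..."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    return f"{url}?{query}" if query else url


def unwrap(response_text):
    """CKAN wraps replies as {"success": bool, "result": ...}; return the result part."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "CKAN request failed"))
    return payload["result"]


# package_search with q/rows is a standard CKAN action; the query term is illustrative.
url = action_url("http://data.nhm.ac.uk", "package_search", q="specimens", rows=5)

# A canned reply (invented content) stands in for the HTTP response so this runs offline.
sample = '{"success": true, "result": {"count": 1, "results": [{"name": "example-dataset"}]}}'
result = unwrap(sample)

print(url)
print(result["count"], [d["name"] for d in result["results"]])
```

In real use the URL would be fetched with an HTTP client and the response body passed to `unwrap`.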

Page 9: NHM Data Portal: first steps toward the Graph-of-Life

Primary views of each NHM dataset

• Point map
• Grid map
• Heat map
• Statistical overview
• Filterable table

Page 10: NHM Data Portal: first steps toward the Graph-of-Life

Dataset & data record citation

• DataCite DOIs on every dataset
• Stable URI (UUID) on every record
• Prior identifiers aliased & disambiguated
• Citation encouraged with clear statements at dataset & record level
• Allows us to track cited usage
• Dynamic DOIs on subsets coming soon

Dataset DOI, Specimen URI
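The dataset-level and record-level identifiers above could be combined into a citation string along these lines. This is a minimal sketch: the record URI pattern (`http://data.nhm.ac.uk/object/<uuid>`), the citation layout, the DOI `10.1234/example` and the UUID are all placeholders assumed for illustration, not values taken from the Portal.

```python
import uuid
from datetime import date


def record_uri(record_id, base="http://data.nhm.ac.uk/object/"):
    """Return a record URI; the base pattern is an assumption for this sketch."""
    # uuid.UUID raises ValueError when the identifier is not a well-formed UUID.
    return base + str(uuid.UUID(record_id))


def cite(creator, year, title, doi, record_id=None, accessed=None):
    """Format a dataset citation, optionally pointing at a single record."""
    text = f"{creator} ({year}). {title}. https://doi.org/{doi}"
    if record_id:
        text += f" Record: {record_uri(record_id)}"
    if accessed:
        text += f" (accessed {accessed.isoformat()})"
    return text


# Placeholder DOI and UUID, for illustration only.
citation = cite(
    "Natural History Museum",
    2016,
    "Example dataset",
    "10.1234/example",
    record_id="12345678-1234-5678-1234-567812345678",
    accessed=date(2016, 6, 23),
)
print(citation)
```

Keeping the dataset DOI and the record URI as separate parts means usage can be tracked at either level.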

Page 11: NHM Data Portal: first steps toward the Graph-of-Life

Traffic-light data quality indicators (via GBIF)

Via GBIF API

Major errors

Minor errors

No errors

NB: similar services are offered by CRIA for Brazilian data
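One way to turn GBIF issue flags into the three levels above is sketched here. The flag names are real GBIF occurrence issue codes, but which flags count as "major" and the red/amber/green colour mapping are assumptions made for this example; the Portal's actual rules are not given on the slide.

```python
# Assumed split: these flags are treated as major errors in this sketch only.
MAJOR_ISSUES = {
    "ZERO_COORDINATE",
    "COORDINATE_OUT_OF_RANGE",
    "COUNTRY_COORDINATE_MISMATCH",
    "TAXON_MATCH_NONE",
}

# Assumed colour for each level.
COLOURS = {"major": "red", "minor": "amber", "none": "green"}


def quality_level(issues):
    """Classify a record by its list of GBIF issue flags."""
    issues = set(issues)
    if issues & MAJOR_ISSUES:
        return "major"
    if issues:
        return "minor"
    return "none"


records = {
    "a": ["COUNTRY_COORDINATE_MISMATCH", "COORDINATE_ROUNDED"],
    "b": ["COORDINATE_ROUNDED", "GEODETIC_DATUM_ASSUMED_WGS84"],
    "c": [],
}
lights = {key: COLOURS[quality_level(flags)] for key, flags in records.items()}
print(lights)
```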

Page 12: NHM Data Portal: first steps toward the Graph-of-Life

Potential errors highlighted & “corrected”

Page 13: NHM Data Portal: first steps toward the Graph-of-Life

Supports deposition of other research datasets

• Assembly video
• Step-by-step instructions

doi: 10.3897/zookeys.481.8788

Page 14: NHM Data Portal: first steps toward the Graph-of-Life

Easy addition of new datasets (rapid & semi-automated)

1. Name the dataset

2. Upload / link the data file

3. Describe the data file

4. Theme & tag

5. Add additional resources

6. Temporal coverage

7. Geographic coverage

8. Save & finish
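The steps above could map onto a CKAN-style dataset description roughly as follows. The `name`, `title`, `tags`, `resources` and `extras` fields follow CKAN's general dataset structure, but the extras keys `temporal_coverage` and `spatial_bbox`, the helper names and the example values are hypothetical; the Portal's own form fields may differ.

```python
import json
import re


def slugify(title):
    """Derive a URL-safe dataset name (lowercase letters, digits and hyphens)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:100]


def build_dataset(title, file_url, description, tags,
                  temporal=None, bbox=None, extra_resources=()):
    """Assemble a dataset description following steps 1 to 7 of the slide."""
    resources = [{"url": file_url, "description": description}]   # steps 2 and 3
    resources += [{"url": url} for url in extra_resources]        # step 5
    extras = []
    if temporal:                                                  # step 6 (assumed key)
        extras.append({"key": "temporal_coverage", "value": "/".join(temporal)})
    if bbox:                                                      # step 7 (assumed key)
        extras.append({"key": "spatial_bbox", "value": json.dumps(bbox)})
    return {
        "name": slugify(title),                                   # step 1
        "title": title,
        "tags": [{"name": tag} for tag in tags],                  # step 4
        "resources": resources,
        "extras": extras,
    }


# Illustrative values only.
dataset = build_dataset(
    "Louse-host associations 2016",
    "http://example.org/louse-host.csv",
    "One row per louse-host association",
    tags=["interactions", "Phthiraptera"],
    temporal=("1900", "2016"),
    bbox=[-180, -90, 180, 90],
)
print(json.dumps(dataset, indent=2))
```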

Page 15: NHM Data Portal: first steps toward the Graph-of-Life

Data access & feedback

• Extensive API
• R integration
• Link to data curator team
• DwCA downloads
• RDF (Linked Open Data)
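For readers unfamiliar with Darwin Core Archive (DwCA) downloads, the sketch below reads the core data file of a small archive built in memory. A DwCA is a zip file whose `meta.xml` names the core file and maps column positions to Darwin Core terms. The sample record is invented, and the reader handles only the simple case (one core file, single-character delimiter, no extensions).

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1" encoding="UTF-8">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>"""


def make_sample_archive():
    """Build a tiny archive in memory (invented data) so the sketch runs offline."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("meta.xml", META)
        zf.writestr("occurrence.csv",
                    "id,scientificName,country\n1,Pediculus humanus,United Kingdom\n")
    return buffer.getvalue()


def read_core(archive_bytes):
    """Return the core rows as dicts keyed by the short Darwin Core term name."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        core = ET.fromstring(zf.read("meta.xml")).find("dwc:core", NS)
        location = core.find("dwc:files/dwc:location", NS).text
        columns = {int(core.find("dwc:id", NS).get("index")): "id"}
        for field in core.findall("dwc:field", NS):
            columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
        skip = int(core.get("ignoreHeaderLines", "0"))
        # meta.xml often writes a tab as the two characters "\t"; convert it.
        delimiter = core.get("fieldsTerminatedBy", ",").replace("\\t", "\t")
        text = zf.read(location).decode(core.get("encoding", "utf-8"))
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))[skip:]
    return [{name: row[i] for i, name in columns.items()} for row in rows]


records = read_core(make_sample_archive())
print(records)
```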

Page 16: NHM Data Portal: first steps toward the Graph-of-Life

Serving external data aggregators

GBIF, iDigBio, EOL, VertNet, CRIA

Page 18: NHM Data Portal: first steps toward the Graph-of-Life

500,000,000 records downloaded
(since Feb. 2015, excluding major aggregators)


Page 20: NHM Data Portal: first steps toward the Graph-of-Life

What does a 5-star Data Portal mean?

Tim Berners-Lee, the inventor of the Web and Linked Data initiator, suggested a 5-star deployment scheme for Open Data:

1 star: Available
2 stars: Structured
3 stars: Non-proprietary
4 stars: URIs
5 stars: Linked (LOD)

Page 21: NHM Data Portal: first steps toward the Graph-of-Life

LOD gives us the means to connect our data (i.e. graph queries across distributed datasets)
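A toy version of such a graph query is sketched below: two small sets of triples, standing in for a specimen dataset and an interaction dataset, are merged and queried with simple pattern matching. The identifiers, predicates and data are invented for illustration, and the matcher is a hand-rolled stand-in for a real RDF store and SPARQL.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern; '?x' terms are variables."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several patterns, carrying bindings from one pattern to the next."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for solution in solutions:
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                extended.append({**solution, **binding})
        solutions = extended
    return solutions


# Invented data: one "museum" dataset and one "interactions" dataset.
specimens = [
    ("ex:specimen/1", "dwc:scientificName", "Pediculus humanus"),
    ("ex:specimen/2", "dwc:scientificName", "Apis mellifera"),
]
interactions = [
    ("Pediculus humanus", "ex:parasiteOf", "Homo sapiens"),
    ("Apis mellifera", "ex:pollinates", "Trifolium repens"),
]

# Which specimens belong to taxa recorded as parasites of Homo sapiens?
answers = query(specimens + interactions, [
    ("?specimen", "dwc:scientificName", "?taxon"),
    ("?taxon", "ex:parasiteOf", "Homo sapiens"),
])
print(answers)
```

The join works only because both datasets use the same key for the taxon, which is the role shared URIs play in Linked Open Data.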

Page 22: NHM Data Portal: first steps toward the Graph-of-Life

Example 1: “what data are we publishing”

Top 200 collection-holding institutions contributing specimen records to GBIF

• What proportion of our collections are accessible / digitised?
• What biases exist in our digitised collections?
• How much taxonomic redundancy exists in our collections?

Useful for policy setting:
- Planning digitisation strategies (why should we all be digitising the same taxa first?)
- Identifying institutional collection strengths (outside our community these are often not known)
- What is ‘unique’ in our collections (taxonomically, geospatially, temporally)
- Disaster planning (how many institutions hold the same material)
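The redundancy and uniqueness questions above reduce to simple set arithmetic once each institution's digitised taxa are known. The sketch below uses invented holdings for three placeholder institutions; the function names are hypothetical and real input would come from aggregated GBIF records.

```python
from collections import Counter


def holders_per_taxon(holdings):
    """Count how many institutions hold each taxon."""
    return Counter(taxon for taxa in holdings.values() for taxon in set(taxa))


def unique_taxa(holdings):
    """List, per institution, the taxa held by no other institution."""
    counts = holders_per_taxon(holdings)
    return {inst: sorted(t for t in set(taxa) if counts[t] == 1)
            for inst, taxa in holdings.items()}


def redundancy(holdings):
    """Share of taxa that are held by more than one institution."""
    counts = holders_per_taxon(holdings)
    return sum(1 for n in counts.values() if n > 1) / len(counts)


# Invented holdings, for illustration only.
holdings = {
    "Institution A": ["Pediculus humanus", "Apis mellifera"],
    "Institution B": ["Apis mellifera", "Bombus terrestris"],
    "Institution C": ["Apis mellifera"],
}
counts = holders_per_taxon(holdings)
uniques = unique_taxa(holdings)
share = redundancy(holdings)
print(counts.most_common(1), uniques, round(share, 2))
```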

Page 23: NHM Data Portal: first steps toward the Graph-of-Life
Page 24: NHM Data Portal: first steps toward the Graph-of-Life

What collections are held globally? Where are these specimens from?

There are huge gaps and biases in what our collections hold & where these collections are from

Top 200 collections (scaled by size)

Specimen country origin (darker is more)

Page 25: NHM Data Portal: first steps toward the Graph-of-Life

Our results are very incomplete, constrained by what we’ve digitised

[Figure: size of collection and proportion digitised for RBGE, RBGK, NHM, MNHN, RMCA and RBINS]

• Very small proportions of our collections are digitally accessible
• We don’t publish the overall size of our collections in a machine-readable way

Page 26: NHM Data Portal: first steps toward the Graph-of-Life

Example 2: exploring ecological interactions

• Specimen data is one dimension of our collections
• We need to know how organisms interact, e.g. predator-prey, pollinator-pollinated, host-parasite
• Museums have lots of this data

NHM Interactions data:
• Louse-host (12,000+)
• Helminth host-parasite (250,000+)
• Also large datasets: Coleoptera feeding on dipterocarp seeds, butterfly host-plants, British mammal-flea associations, bee flower pollinators, several parasitic wasp datasets, …

Increasingly published as RDF via NHM Data Portal

Page 27: NHM Data Portal: first steps toward the Graph-of-Life

Global Biotic Interactions (GloBI) Database

• By Jorrit Poelen & colleagues
• Collates interaction datasets
• Currently >1.9M interactions
• EOL pulls these into Species Pages
• NHM Portal creates a combined dataset to feed GloBI
• Produces Linked Open Data
  – Create beautiful visualisations

http://www.globalbioticinteractions.org/
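Creating a combined dataset from differently structured interaction tables might look like the sketch below, which normalises each source into (source taxon, interaction, target taxon) rows, removes duplicates and counts interactions by type. The column names, interaction labels and records are assumptions for illustration; they are not the NHM datasets' real schema or GloBI's required format.

```python
from collections import Counter


def normalise(rows, source_field, target_field, interaction):
    """Map one dataset's rows to (source taxon, interaction, target taxon) tuples."""
    out = []
    for row in rows:
        source = (row.get(source_field) or "").strip()
        target = (row.get(target_field) or "").strip()
        if source and target:  # skip rows missing either partner
            out.append((source, interaction, target))
    return out


def combine(*datasets):
    """Merge normalised datasets, dropping exact duplicates."""
    return sorted(set().union(*datasets))


# Invented rows with hypothetical column names.
louse_rows = [
    {"louse": "Pediculus humanus", "host": "Homo sapiens"},
    {"louse": "Pediculus humanus ", "host": "Homo sapiens"},  # duplicate after trimming
]
helminth_rows = [
    {"parasite": "Ascaris lumbricoides", "host": "Homo sapiens"},
    {"parasite": "", "host": "Sus scrofa"},  # incomplete row, dropped
]
bee_rows = [{"bee": "Apis mellifera", "flower": "Trifolium repens"}]

combined = combine(
    normalise(louse_rows, "louse", "host", "parasiteOf"),
    normalise(helminth_rows, "parasite", "host", "parasiteOf"),
    normalise(bee_rows, "bee", "flower", "pollinates"),
)
by_type = Counter(interaction for _, interaction, _ in combined)
print(combined)
print(by_type)
```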

Page 28: NHM Data Portal: first steps toward the Graph-of-Life

GloBI’s Interaction Browser

• Predatory interactions for Eurythenes gryllus
• Visualisations highlight number, frequency & type of interaction

https://blog.globalbioticinteractions.org/2014/03/21/exploring-antarctic-interactions-using-globis-interaction-browser/

Page 29: NHM Data Portal: first steps toward the Graph-of-Life

Create beautiful visualisations with custom R scripts and existing libraries (e.g., igraph, Reol, rgdal)

https://blog.globalbioticinteractions.org/2014/06/06/a-food-web-map-of-the-world/

Page 30: NHM Data Portal: first steps toward the Graph-of-Life

Conclusions

• Data portals like the NHM Portal allow us to contribute and reflect our data through the lens of specialist aggregators
• GBIF & GloBI are specialist aggregators serving LOD
• LOD allows us to combine big datasets to address new questions
  – Tracking interactions & distribution of disease vectors
  – Predicting crop pests, via the distribution and interactions of pests of crop wild relatives

Next Steps

• Continue Portal development & encourage institutional adoption
• Consolidate NHM ecological interaction datasets
• Publish combined dataset on the NHM Data Portal
• GloBI to harvest the dataset and publish linked open data
• Develop visualisations for key NHM datasets

Page 31: NHM Data Portal: first steps toward the Graph-of-Life

Acknowledgements

Ben Scott – Portal Engineer & Architect

Ed Baker – Data Researcher

Laurence Livermore – Project Management

Matt Woodburn – Data Architect

Vince Smith – SRO / Coordinator