Persistent identifiers for museum specimens, NeIC workshop, August 2015
Museum impact: linking-up specimens with research published on them
![Page 1: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/1.jpg)
Museum Impact
Linking-up our specimens with research published on them
Dr Ross Mounce
@rmounce
![Page 2: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/2.jpg)
Talk Structure
● Background: the collections, the research literature
● Interesting things you should know about access to research
○ The costs of knowledge $$$
● Examples of content mining
○ Including a video demo!
● My work (in progress) on finding NHM specimens in recent literature
![Page 3: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/3.jpg)
Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
![Page 4: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/4.jpg)
● New
● Open Data
● Easy-to-use
● Quick
● Images
● Audio
● Interactive Maps
● Citable
● API access
● Open Source Infrastructure
It’s not KE EMu :)
![Page 5: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/5.jpg)
What I want to do: link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier): http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier): http://dx.doi.org/10.1371/journal.pone.0061998
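The link described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the actual mining pipeline: the regular expression is an assumed pattern for codes such as BMNH.76.3.15.14, and the function names `find_specimens` and `link_records` are hypothetical.

```python
import re

# Assumed pattern: "BMNH" or "NHMUK", an optional space or dot, then a
# dot-separated run of digits (e.g. BMNH.76.3.15.14 or BMNH 37001).
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)[ .]?(\d+(?:\.\d+)*)")

def find_specimens(text):
    """Return the specimen codes mentioned in text, written as 'PREFIX number'."""
    return [f"{prefix} {number}" for prefix, number in SPECIMEN_RE.findall(text)]

def link_records(doi, text):
    """Pair every specimen code found in text with the article DOI."""
    return [{"specimen": code, "doi": doi} for code in find_specimens(text)]

caption = "species: (A) Pteropus rodricensis (BMNH.76.3.15.14); ..."
links = link_records("10.1371/journal.pone.0061998", caption)
print(links)
```

Real captions use many more spellings of a specimen code than this pattern covers, so any such pattern would need checking against hand-verified examples.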
![Page 6: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/6.jpg)
114,000,000 scholarly papers available online
36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
![Page 7: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/7.jpg)
Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!
![Page 8: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/8.jpg)
Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co
We rent access to knowledge. Companies profiteer from it
2004/05 £357,197.79
2005/06 £383,214.29
2006/07 £340,690.33
2007/08 £381,526.57
2008/09 £441,706.36
2009/10 £437,539.71
2010/11 £430,105.08
2011/12 £449,515.12
2012/13 £469,007.50
2013/14 £494,913.01
10-year-total: £4,185,415.76
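As a quick check, the 10-year total can be recomputed from the ten annual figures on this slide. The sketch uses `Decimal` rather than floats so that the pence add up exactly; the variable names are illustrative only.

```python
from decimal import Decimal

# Annual figures copied from the slide (2004/05 to 2013/14).
annual_spend = [
    "357197.79", "383214.29", "340690.33", "381526.57", "441706.36",
    "437539.71", "430105.08", "449515.12", "469007.50", "494913.01",
]

total = sum(Decimal(amount) for amount in annual_spend)
print(f"£{total:,}")  # prints £4,185,415.76
```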
| Tax Year | Revenue | Profit | Profit Margin |
|---|---|---|---|
| 2004 | £1363m | £460m | 33.75% |
| 2005 | £1436m | £449m | 31.25% |
| 2006 | £1521m | £465m | 30.57% |
| 2007 | £1507m | £477m | 31.65% |
| 2008 | £1700m | £568m | 33.41% |
| 2009 | £1985m | £693m | 34.91% |
| 2010 | £2026m | £724m | 35.74% |
| 2011 | £2058m | £768m | 37.30% |
| 2012 | £2063m | £780m | 37.81% |
| 2013 | £2126m | £826m | 38.85% |
Source: RELX Group (Parent company of Elsevier) Company Reports
![Page 9: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/9.jpg)
Actually, the NHM’s annual bill isn’t bad compared to others
Source: Lawson S and Meghreblian B. (2015) Journal subscription expenditure of UK higher education institutions. F1000Research
http://shiny.retr0.me/journal_costs/
![Page 10: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/10.jpg)
Content Mining provides more bang for your buck
Making fuller use of our expensively provisioned access
● If the NHM is going to pay £500,000 per year to rent journals, why not use the access to this resource to its fullest?
● I can’t read everything with my human eyes but… computers can!
● If you can process one document with a computer, you can process a million: content mining
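The "one document, then a million" idea can be sketched as follows: write a function that handles a single text, then apply it unchanged to every file in a folder. The function names and the choice of counting the string "BMNH" are assumptions made for illustration.

```python
from pathlib import Path

def count_mentions(text, term="BMNH"):
    """Process one document: count occurrences of a search term."""
    return text.count(term)

def mine_folder(folder, term="BMNH"):
    """Process every .txt file in a folder with the same function."""
    return {
        path.name: count_mentions(path.read_text(encoding="utf-8"), term)
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

Calling `mine_folder("papers/")` on a (hypothetical) folder of plain-text articles would return a dictionary of file name to mention count, whether the folder holds ten files or a million.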
![Page 11: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/11.jpg)
Recent examples of Content Mining
Fig. 6 from the paper
Brachiopod body-size estimates
Red line: humans
Grey bars: machines (PaleoDeepDive)
Better than PaleoDB? I think so. PDD is more clearly linked to evidence than PDB.
Provenance matters.
![Page 12: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/12.jpg)
Recent examples of Content Mining (Images)
3-second image analysis
source: 10.1099/ijs.0.65212-0
(Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42,(Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80,(Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61,(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293,((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116,(Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236):79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173,Halomonas_muralis:190):70):30):110):187);
outputs re-usable Newick & NeXML
no manual input required
Can replot data, re-analyse, combine many to make a supertree!
PLUTo Project Mounce, Murray-Rust, Wills (in prep.)
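To show why Newick output is re-usable, here is a small sketch that pulls the taxon labels out of a Newick string with a regular expression. It is not part of the PLUTo tooling; the function name `leaf_names` is hypothetical, and the example string is a pruned fragment of the tree above, so its branch lengths are not meaningful on their own.

```python
import re

def leaf_names(newick):
    """Return the leaf (taxon) labels of a Newick string, in order of appearance."""
    # A leaf label directly follows "(" or "," and contains none of ( ) : , ;
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

# Pruned fragment of the tree on this slide, for illustration only.
fragment = (
    "(Zymobacter_palmae:261,(Halomonas_desiderata:187,"
    "(Halomonas_pantelleriensis:173,Halomonas_muralis:190):70):30);"
)
print(leaf_names(fragment))
```

A dedicated phylogenetics library would be the safer choice for quoted labels or comments, which this pattern ignores.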
![Page 13: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/13.jpg)
How to get a sufficient volume of journal articles?
● The ContentMine (CM) team are actively developing new tools & training workshops to help researchers get into content mining: be it text, data, or image mining
● CM are a not-for-profit Shuttleworth-funded project led by Peter Murray-Rust
● All the software tools are open source and available on github: https://github.com/ContentMine/
● I’m a Scientific Advisor with the ContentMine
● Try getpapers OR quickscrape to get journal content en masse
https://github.com/ContentMine/getpapers
https://github.com/ContentMine/quickscrape
http://contentmine.org/
![Page 15: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/15.jpg)
Want to download more than a million (OA) papers?
● No problem. PMC to the rescue!
● PMC has a full text Open Access-only subset which you can download easily for free
● >1,100,000 full texts in XML (compressed) is just 16.6GB
Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828
http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
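Once the XML is downloaded, each article can be read with the Python standard library. The sketch below assumes the usual JATS layout of PMC full texts (a `<body>` element and an `<article-id pub-id-type="doi">` element); the sample document is a made-up minimal stand-in, not a real PMC file, and `doi_and_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def doi_and_text(xml_string):
    """Return (doi, body_text) for one JATS-style article; doi is None if absent."""
    root = ET.fromstring(xml_string)
    doi_el = root.find(".//article-id[@pub-id-type='doi']")
    doi = doi_el.text if doi_el is not None else None
    body = root.find(".//body")
    # Join all text nodes under <body> and collapse runs of whitespace.
    text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    return doi, text

# Minimal stand-in for a PMC article, for illustration only.
sample = """<article>
  <front><article-meta>
    <article-id pub-id-type="doi">10.1371/journal.pone.0061998</article-id>
  </article-meta></front>
  <body><p>Scans are from (A) <italic>Pteropus rodricensis</italic>
  (BMNH.76.3.15.14).</p></body>
</article>"""

doi, text = doi_and_text(sample)
print(doi, text)
```

Real files carry a DOCTYPE declaration and far more markup, so some may need a more tolerant parser than this.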
![Page 16: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/16.jpg)
Are there NHM specimens in the PMC OA subset?
PMC is medically focused, so one wouldn’t expect it to be rich in organismal biology; however, it does hold some relevant content
ALL of PLOS ONE is in the PMC OA subset. Over 100,000 articles in that journal alone!
![Page 17: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/17.jpg)
https://github.com/rossmounce/NHM-specimens
Version-controlled data on github
open for scrutiny & collaboration
![Page 18: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/18.jpg)
Searching ALL full texts is not enough!!!
A significant number of specimens are probably ‘hiding-out’ in supplementary data files of all sorts of formats.
Google Scholar does not index supplementary information (SI)
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary data files are the ‘darkest corners’ of science
“Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)” 10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
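As a rough sketch of what deep indexing of supplementary files could involve, the function below opens a zip archive of supplementary material and searches its plain-text members for specimen codes. The regular expression, the list of suffixes and the name `specimens_in_zip` are all assumptions for illustration; binary formats such as PDF or spreadsheets would each need their own reader.

```python
import io
import re
import zipfile

# Assumed pattern for codes such as BMNH 37001 or BMNH.76.3.15.14.
SPECIMEN_RE = re.compile(r"\b(?:BMNH|NHMUK)[ .]?\d+(?:\.\d+)*")
TEXT_SUFFIXES = (".txt", ".csv", ".tsv")

def specimens_in_zip(zip_bytes):
    """Map each plain-text member of a zip archive to the specimen codes found in it."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue  # skip formats this sketch cannot read
            text = archive.read(name).decode("utf-8", errors="replace")
            found = SPECIMEN_RE.findall(text)
            if found:
                hits[name] = found
    return hits
```

The return value keeps the file name alongside each hit, so a match can be traced back to the exact supplementary table it came from.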
![Page 19: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/19.jpg)
I don’t just find in-text mentions.
I’m trying to match them up to our NHM Data Portal records too!
Specimens in RED do not appear to be on the Data Portal ...yet
Blue globe represents a PLOS ONE paper
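Matching a mention to a portal record mostly comes down to normalising the code before looking it up. The sketch below uses a tiny in-memory dictionary as a hypothetical stand-in for the Data Portal; the second code is a made-up placeholder, and the function names are illustrative.

```python
import re

def normalise(code):
    """Strip the collection prefix, e.g. 'BMNH.76.3.15.14' -> '76.3.15.14'."""
    return re.sub(r"^(?:BMNH|NHMUK)[ .]?", "", code.strip())

# Hypothetical stand-in for a portal lookup: catalogue number -> record URL.
portal = {
    "76.3.15.14": "http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408",
}

def match_mentions(mentions, records):
    """Split mentions into those with a portal record and those without."""
    matched, unmatched = {}, []
    for code in mentions:
        url = records.get(normalise(code))
        if url:
            matched[code] = url
        else:
            unmatched.append(code)
    return matched, unmatched

# "BMNH 0000.0.0.0" is a made-up placeholder code, absent from the toy dictionary.
matched, unmatched = match_mentions(["BMNH.76.3.15.14", "BMNH 0000.0.0.0"], portal)
print(matched, unmatched)
```

The unmatched list corresponds to the specimens drawn in red: mentioned in a paper but with no portal record found.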
![Page 20: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/20.jpg)
Blue globe represents a PLOS ONE paper
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is?
Hint: it’s famous!
Grey represents an NHMUK specimen
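Finding the few specimens that occur in more than one paper is a simple grouping step once the specimen-to-paper links exist. A minimal sketch, using placeholder DOIs that do not refer to real articles:

```python
from collections import defaultdict

def papers_per_specimen(links):
    """Group (specimen, doi) pairs into specimen -> set of DOIs."""
    index = defaultdict(set)
    for specimen, doi in links:
        index[specimen].add(doi)
    return index

def shared_specimens(links):
    """Return only the specimens mentioned in more than one paper."""
    return {
        specimen: sorted(dois)
        for specimen, dois in papers_per_specimen(links).items()
        if len(dois) > 1
    }

# Placeholder DOIs, for illustration only.
example_links = [
    ("BMNH 37001", "10.0000/example.1"),
    ("BMNH 37001", "10.0000/example.2"),
    ("BMNH 76.3.15.14", "10.0000/example.1"),
]
print(shared_specimens(example_links))
```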
![Page 21: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/21.jpg)
Mining over 200 subscription access / non-PMC journals from 2000 to 2015 inclusive
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
![Page 22: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/22.jpg)
Thanks to a recent change in UK copyright law:
text and data mining for non-commercial research purposes is legal (in the UK),
(provided that you have legitimate access to the resource you want to mine e.g. a paid-for institutional subscription)
http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
![Page 23: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/23.jpg)
Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
![Page 24: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/24.jpg)
So far… (very much still in progress)
![Page 25: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/25.jpg)
Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years’ worth of full text research in Nature & Science examined.
Science: only 11 NHM specimens found in ~39,600 texts.
Nature: similar story. <30 specimens in 14,132 ‘full’ texts.
Clearly there are more, but it’s all buried in supplementary materials :(
![Page 26: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/26.jpg)
Shoving all the research details into non-searchable supplementary materials is bad for science
● For the avoidance of doubt, this is not a criticism of authors. This is squarely aimed at journals that artificially restrict the ‘length’ of research articles online.
e.g. Prüfer, K. et al. 2014. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43-49.
7 pages (in print), 12 pages (in PDF, with extra data tables & figures)
The supplementary data file? 249 pages!
![Page 27: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/27.jpg)
Someone needs to build a searchable index of supplementary data. ASAP
![Page 29: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/29.jpg)
“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier): http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier): http://dx.doi.org/10.1371/journal.pone.0061998
Huge potential to go beyond mere linking-up of identifiers.
This specimen & others have been CT scanned in the PLOS ONE paper.
We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
![Page 30: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/30.jpg)
Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8(4): e61998. doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue. A one-stop shop for information.
Beyond linking: repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”.
Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
![Page 31: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/31.jpg)
Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called: Petrochromis horii n. sp. , according to the paper.
What mechanisms are there to update newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
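One way this could work in practice is to compare the name held on the portal with the name used in the mined paper, and flag any difference for a curator to review. The sketch below uses the Petrochromis example from this slide; the data structures and the function name `name_updates` are hypothetical, and nothing here writes to the real portal.

```python
def name_updates(portal_names, paper_names):
    """List specimens whose published name differs from the name on the portal."""
    updates = []
    for specimen, (published, source) in paper_names.items():
        current = portal_names.get(specimen)
        if current is not None and current != published:
            updates.append({
                "specimen": specimen,
                "portal": current,
                "published": published,
                "source": source,
            })
    return updates

# Values taken from the slide; the dictionaries stand in for real lookups.
portal_names = {"BMNH 2013.2.13.3": "Petrochromis nov.sp. Takahashi"}
paper_names = {
    "BMNH 2013.2.13.3": (
        "Petrochromis horii",
        "http://dx.doi.org/10.1007/s10228-014-0396-9",
    ),
}
print(name_updates(portal_names, paper_names))
```

Keeping the source DOI with each suggested change means the curator can check the evidence before accepting it.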
![Page 32: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/32.jpg)
Acknowledgements
● Sincere thanks to:
○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
○ Nancy Chillingsworth (IPR, NHM London)
○ Mark Wilkinson (Life Sciences, NHM London)
○ Peter Murray-Rust & the ContentMine team
○ Vince Smith (Life Sciences, NHM London)
○ Ben Scott (NHM Data Portal Lead Architect)
○ Rod Page (University of Glasgow)
○ All of the Biodiversity Informatics team
http://contentmine.org/
![Page 33: Museum impact: linking-up specimens with research published on them](https://reader031.fdocuments.in/reader031/viewer/2022021917/589eb2001a28ab38288b706f/html5/thumbnails/33.jpg)
Please ask me questions!
Feedback appreciated :)