Using optical character recognition (OCR) output in digitization:

Page 1: Using optical character recognition (OCR) output in digitization:

iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Using optical character recognition (OCR) output in digitization:

SPNHC, June 26, 2014 Symposium: Progress in Natural History Collections Digitisation

Canolfan Mileniwm Cymru \ Wales Millennium Centre, Cardiff Bay

Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, Elspeth Haston

find Deb on Twitter @idbdeb @iDigBio

See your data before it's in the database and after

#spnhc2014 #digitization #collections

Page 2: Using optical character recognition (OCR) output in digitization:

What is iDigBio?
NIBA - NSF - ADBC - iDigBio - TCN - PEN

facilitate use of biodiversity data
enable digitisation
portal access
sustainability – community collaboration

Page 3: Using optical character recognition (OCR) output in digitization:

Minimal Data Capture
“filed as” name
higher geography
barcode image

all sheets in folder get the same initial data

only the barcode differs

Biological collection data capture: a rapid approach using curatorial data

Trend

filed as name
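The folder-level approach above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual software: the `SkeletonRecord` class, the `records_for_folder` function, and the sample name, geography, and barcodes are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class SkeletonRecord:
    """Minimal record: hypothetical field names, not a real database schema."""
    barcode: str
    filed_as_name: str
    higher_geography: str


def records_for_folder(filed_as_name, higher_geography, barcodes):
    """Create one record per sheet; every sheet shares the folder-level data."""
    return [
        SkeletonRecord(barcode, filed_as_name, higher_geography)
        for barcode in barcodes
    ]


# Placeholder values for illustration only.
records = records_for_folder(
    filed_as_name="Salix arctica",
    higher_geography="Arctic Canada",
    barcodes=["X0000001", "X0000002", "X0000003"],
)
for record in records:
    print(record)
```

Only the barcode differs between the resulting records, which is what makes this kind of capture fast.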

Page 4: Using optical character recognition (OCR) output in digitization:

Would you like to…?
enter records faster?
use the ditto feature often?
find duplicates quickly?
find the labels
find the labels with lots of handwriting?
create your own record sets to transcribe?
    by collector
    by country or county
    by your Great Aunt Penelope
    by taxon
    by language
create cogent sets to speed up validation and database updates?
make transcribers' / validators' jobs easier (paid and volunteer)?
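One way to picture "create your own record sets" is a plain text search over the OCR output. The sketch below assumes the OCR text is already available as a mapping from barcode to text; the function name, the barcodes, and the second sample label are invented for illustration.

```python
from collections import defaultdict


def build_record_sets(ocr_texts, search_terms):
    """Map each search term to the barcodes whose OCR text contains it.

    Matching is a simple case-insensitive substring test, so OCR errors
    inside a term will cause misses.
    """
    record_sets = defaultdict(list)
    for barcode, text in ocr_texts.items():
        lowered = text.lower()
        for term in search_terms:
            if term.lower() in lowered:
                record_sets[term].append(barcode)
    return dict(record_sets)


# Made-up barcodes and partly made-up label text, for illustration only.
ocr_texts = {
    "B001": "National Herbarium of Canada Collector, A. E. Porsild July 23-25, 1934",
    "B002": "Flora of Texas Coll. J. Smith 1950",
    "B003": "Arctic Coast west of Mackenzie River delta A. E. Porsild",
}
record_sets = build_record_sets(ocr_texts, ["Porsild", "Canada"])
print(record_sets)
```

Each resulting set (by collector, by country, and so on) could then be handed to a transcriber as one batch.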

Page 5: Using optical character recognition (OCR) output in digitization:

Got Text?

Got Handwriting?

Page 6: Using optical character recognition (OCR) output in digitization:

Next imagine output from 1000s of labels or notebooks or text files!

No. ....2L31.
National Herbarium of Canada
FLORA OF’T TERRITORIES.
Hab. and Loc., Arctic Coast west of Mackenzie River delta:
Between King Pt. and Kay Pt., 69° 12’ N., and 138° to
138° 30’ W.. .
Collector, A. E. Porsild July 23-25, 1934

OCR

Label
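Even raw OCR text like the sample above can be mined with simple patterns. The following sketch pulls the collector and date out of that label with a regular expression. The pattern and function name are assumptions made for this example and fit only labels printed as "Collector, <name> <Month day(s), year>"; real labels vary far more.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical pattern tuned to the sample label only.
PATTERN = re.compile(
    rf"Collector,\s*(?P<collector>.+?)\s+"
    rf"(?P<date>(?:{MONTHS})\s+\d{{1,2}}(?:-\d{{1,2}})?,\s*\d{{4}})"
)


def parse_collector_and_date(ocr_text):
    """Return collector and date strings, or None if the pattern is absent."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None
    return {"collector": match.group("collector"), "date": match.group("date")}


ocr_text = (
    "Between King Pt. and Kay Pt., 69° 12’ N., and 138° to "
    "138° 30’ W.. .Collector, A. E. Porsild July 23-25, 1934"
)
result = parse_collector_and_date(ocr_text)
print(result)  # {'collector': 'A. E. Porsild', 'date': 'July 23-25, 1934'}
```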

Page 8: Using optical character recognition (OCR) output in digitization:

OCR text

Page 9: Using optical character recognition (OCR) output in digitization:

Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013.

Seeing the dark data…

Page 10: Using optical character recognition (OCR) output in digitization:

It’s surprising what can be used to help filter specimens – the black art of search terms!

Page 12: Using optical character recognition (OCR) output in digitization:

Inside the 1899 Harriman Expedition

Page 13: Using optical character recognition (OCR) output in digitization:

Page 14: Using optical character recognition (OCR) output in digitization:

Overall Word Cloud Workflow

[Workflow diagram. Components: Images; OCR Engine; OCR Output; Crowdsourcing (BVP); DwC Parsed Output; Index (Solr); OCR confidence (n-gram); Word Cloud (Jason Davies); Cluster (carrot2); Histogram (Google Charts, Facet Explorer); Web Service]

Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
carrot2: http://project.carrot2.org/

Some work from the recent iDigBio CITSCribe Hackathon
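To give a feel for what an n-gram based OCR confidence score might look like, here is a rough sketch. It is not the hackathon's OCR-Error-Estimation code: the scoring rule (share of character trigrams also seen in known-good text), the function names, and the tiny reference text are assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def ngram_confidence(text, known_ngrams, n=3):
    """Fraction of the text's n-grams that also occur in the reference set."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in known_ngrams)
    return hits / len(grams)


# A real reference set would come from a large body of clean label text.
reference = "national herbarium of canada collector arctic coast river delta"
known = set(char_ngrams(reference))

clean = ngram_confidence("Herbarium of Canada", known)
noisy = ngram_confidence("H3rb@r1um 0f C4n4d4", known)
print(f"clean={clean:.2f} noisy={noisy:.2f}")
```

Low-scoring labels would be candidates for handwriting or poor OCR, and could be routed to human transcription first.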

Page 15: Using optical character recognition (OCR) output in digitization:

Word Clouds using N-gram Scoring, Faceting, Solr + Carrot2
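The input to a word cloud is essentially a table of word counts. A minimal sketch of producing those counts from OCR text follows; the stopword list, minimum word length, and sample texts are illustrative assumptions, and the rendering step (for example with the word cloud tool linked earlier) is left out.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"of", "the", "and", "to", "in"}


def word_frequencies(texts, min_length=3):
    """Count words across OCR texts, skipping short words and stopwords.

    Only runs of ASCII letters are counted, so digits, punctuation, and
    accented characters are dropped in this simplified version.
    """
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length and word not in STOPWORDS:
                counts[word] += 1
    return counts


texts = [
    "National Herbarium of Canada",
    "Herbarium of Canada, Arctic Coast",
    "Arctic Coast west of Mackenzie River delta",
]
freq = word_frequencies(texts)
print(freq.most_common(4))
```

The same counts could also drive a histogram or facet view for an initial sort.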

Page 16: Using optical character recognition (OCR) output in digitization:

Use for initial sort or validation

Imagine Integration with current software

Page 17: Using optical character recognition (OCR) output in digitization:

Page 18: Using optical character recognition (OCR) output in digitization:

Working Group Collaboration - Workflows
Setting up OCR
Running OCR
Machine Learning
Natural Language Processing

Page 19: Using optical character recognition (OCR) output in digitization:

Sample Workflows with OCR integrated
New workflow sample OCR protocols

Got one?
Got a resource for these?
Got new ideas for how to use the text data to improve the data?
Let’s share!

Page 20: Using optical character recognition (OCR) output in digitization:

Managing your crowdsourcing data behind the scenes
OCR too!

Page 21: Using optical character recognition (OCR) output in digitization:

OCR use, a bit more…
aOCR WG, JRA Synthesys3, …
user-interface interest group
exemplar ML and NLP workflows
combining with voice recognition software (Macroalgal TCN)

Got Text?

Got Handwriting?

Page 22: Using optical character recognition (OCR) output in digitization:

Diolch yn fawr! (Thank you very much!)

Work presented here made possible by many, and especially…

Andrea Matsunaga, Researcher, iDigBio
Miao Chen, Indiana University, Data to Insight Center
Jason Best, Botanical Research Institute of Texas
Sylvia Orli, IT Head, Smithsonian Botany Department
William Ulate, Technical Director, BHL
Reed Beaman, Informatics Specialist, iDigBio
Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE)
iDigBio Augmenting Optical Character Recognition WG
MaCC TCN
SALIX