Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing Chris Freeland Technical Director, Biodiversity Heritage Library BioSystematics Berlin 2011 22 Feb 2011 http://biodiversitylibrary.org/page/330

Page 1: Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing

Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing

Chris Freeland
Technical Director, Biodiversity Heritage Library

BioSystematics Berlin 2011
22 Feb 2011

http://biodiversitylibrary.org/page/33061402

Page 2

Digitization

http://biodiversitylibrary.org/page/6165462

Page 3

Workflow

• Selection
• Preparation
• Conservation
• Digitization
• Post Production
• (Re)publication

Page 4

Scanning

Master
• XML
• JP2

Derivatives
• PDF
• JPG
• TXT
• DjVu

Diagram labels: PDF, OCR, XML, JP2

Storage

Files are stored & sync’d across BHL clusters

Page 5

Optical Character Recognition (OCR)

http://biodiversitylibrary.org/page/2836705

Page 6

OCR is a *BIG* challenge

• All book / literature digitization projects affected, not just BHL

• Especially problematic in BHL
– More than 50 languages represented in BHL
– Dates of publication from 1400s to 2000s
– Irregular typeface / typesetting
– Multiple languages on one page

• Botanical descriptions in Latin

Page 7

Abbild ungen und Beschreibungen der

Fische Syriens, nebst

einer neuen Classification und Characteristik sämmtlicher Gattungen

der i

JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in

Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.

STUTTGART. E. Schweizerbart' sehe Verlagshandlung,

1843.

Page 8

*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �

', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �

r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

Page 9

2007 Name Finding Study

>35% OCR error rate for names only

35.16%: Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.

Top OCR errors

 1  Insert Space     8  n->v
 2  Omit Space       9  l->i
 3  e->c            10  r->i
 4  u->I            11  u->ii
 5  u->n            12  h->l
 6  i->l            13  h->ii
 7  c->e            14  e->o

Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
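
One way to put the character confusions above to work is to undo them and check the result against a list of known names. The sketch below is only an illustration, not the method used in the study. It assumes each pair on the slide reads as "true character -> OCR output", it skips the two spacing errors, and the `lexicon` of valid names is a hypothetical stand-in supplied by the caller.

```python
from collections import deque

# Character confusions from the slide, read as (true, ocr_output).
# That reading direction is an assumption made for this sketch.
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"),
    ("n", "v"), ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"),
    ("h", "ii"), ("e", "o"),
]


def candidates(token, max_edits=2):
    """Return every string reachable by undoing up to max_edits confusions."""
    seen = {token}
    queue = deque([(token, 0)])
    while queue:
        text, depth = queue.popleft()
        if depth == max_edits:
            continue
        for true, ocr in CONFUSIONS:
            start = text.find(ocr)
            while start != -1:
                fixed = text[:start] + true + text[start + len(ocr):]
                if fixed not in seen:
                    seen.add(fixed)
                    queue.append((fixed, depth + 1))
                start = text.find(ocr, start + 1)
    return seen


def correct(token, lexicon, max_edits=2):
    """Return the candidates that appear in the (hypothetical) lexicon, sorted."""
    return sorted(candidates(token, max_edits) & set(lexicon))


if __name__ == "__main__":
    names = {"Quercus alba", "Pinus strobus"}
    print(correct("Qnercus alba", names))
```

The candidate set grows quickly with `max_edits`, so a small limit and a lexicon check keep the search manageable.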

Page 10

Manual techniques for text correction

• WikiSource
• Trove - National Library of Australia

Page 11

WikiSource Example

http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293

Page 12

Goal: Semi-automated text correction

• OCR + Machine Learning + Users
– Let machines do raw processing
– Develop algorithms for natural language processing & machine learning
– Build a community of (human) users to help

• reCAPTCHA as an example
– Why not just use reCAPTCHA?
• Google bought it

*More work needed here*
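
The split between machines and human users could look roughly like the sketch below. It is not a BHL system: the per-word confidence scores, the thresholds, and the function names `triage` and `consensus` are all assumptions made for illustration. Machine output above a confidence threshold is accepted, the rest is queued for people, and a human transcription is accepted only once enough users agree.

```python
from collections import Counter


def triage(words, threshold=0.9):
    """Split (text, confidence) pairs into machine-accepted and needs-review lists.

    The confidence values are assumed to come from the OCR engine.
    """
    accepted = [text for text, conf in words if conf >= threshold]
    review = [text for text, conf in words if conf < threshold]
    return accepted, review


def consensus(votes, min_votes=3, min_share=0.6):
    """Return the transcription most users agree on, or None if agreement is weak."""
    if len(votes) < min_votes:
        return None
    text, count = Counter(votes).most_common(1)[0]
    return text if count / len(votes) >= min_share else None


if __name__ == "__main__":
    ocr_words = [("Fische", 0.98), ("Inipectoi", 0.41)]
    print(triage(ocr_words))
    print(consensus(["Inspector", "Inspector", "Inspectoi"]))
```

Both thresholds are arbitrary here; in practice they would need tuning against corrected sample pages.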

Page 13

Scientific names mapping

http://biodiversitylibrary.org/page/27782237

Page 14

Name Finding in action with uBio’s TaxonFinder…

• Image from Scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• TaxonFinder API response
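
To make the name-finding step concrete, here is a deliberately naive sketch. It is not TaxonFinder and does not call uBio or NameBank. It assumes a binomial looks like a capitalized word followed by a lowercase word, and it uses a tiny hard-coded genus set (`KNOWN_GENERA`, hypothetical) where the real pipeline would consult a names authority.

```python
import re

# Hypothetical stand-in for a names authority such as NameBank.
KNOWN_GENERA = {"Quercus", "Pinus"}

# Capitalized word followed by a lowercase word of three or more letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_names(ocr_text, known_genera=KNOWN_GENERA):
    """Return binomial-looking strings whose genus is in known_genera."""
    names = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        if genus in known_genera:
            names.append(f"{genus} {epithet}")
    return names


if __name__ == "__main__":
    page = ("Specimens of Quercus alba and Pinus strobus were collected. "
            "The river was dry.")
    print(find_names(page))
```

The genus check is what filters out ordinary sentence openers such as "The river", which the pattern alone would accept. OCR errors in a name defeat this exact-match lookup, which is why the error rates on Page 9 matter for name finding.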

Page 18

Crowdsourcing

http://biodiversitylibrary.org/page/20965795

Page 28

CiteBank: http://citebank.org

• New search index to BHL content

• Platform for journals/publishers/societies in need of tools to store & share their digitized content

• Access to “crowdsourced” articles from BHL scans

Page 34

Crowdsourcing Statistics & Analysis

• Analysis
– http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html
– At that time, more than 80% of the PDFs created had metadata attached by users
• More than 50% contributed accurate article-level information

• New analysis over more data this summer / fall
– Now have more than 58,000 PDFs to analyze

Page 35

Open Data = More Use

• Scholars
– Rod Page
• iPhylo
• BioGUID
• BioStor
– Ryan Schenk

• Other Apps
– EarthCape
– ZipcodeZoo

Page 36

Conclusion

• BHL is a massive dataset useful for multidisciplinary research
– Systematics
– Natural Language Processing
– Humanities

• BHL is open
– Free to use at http://biodiversitylibrary.org
– Open access data for scholarly use & reuse

• BHL has APIs and data exports to enable reuse
– BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others)
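
As a small illustration of reuse, the page links cited throughout this deck share the form http://biodiversitylibrary.org/page/<number>. The sketch below pulls those numbers out of free text so they can be used as keys elsewhere. It assumes the trailing number identifies a single scanned page, and it does not call any BHL API; the function name `page_ids` is made up for this example.

```python
import re

# Matches BHL page links of the form seen in this deck.
PAGE_URL = re.compile(r"https?://(?:www\.)?biodiversitylibrary\.org/page/(\d+)")


def page_ids(text):
    """Return the numeric page identifiers found in text, in order of appearance."""
    return [int(number) for number in PAGE_URL.findall(text)]


if __name__ == "__main__":
    notes = ("Title slide: http://biodiversitylibrary.org/page/33061402 "
             "Digitization: http://biodiversitylibrary.org/page/6165462")
    print(page_ids(notes))
```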

Page 37

Questions?

Chris Freeland
Technical Director, Biodiversity Heritage Library
Director, Center for Biodiversity Informatics, Missouri Botanical Garden

Missouri Botanical Garden
4344 Shaw Blvd.
St. Louis, MO 63110 USA

BioSystematics Berlin 2011
22 Feb 2011

Email: [email protected]

Twitter: @chrisfreeland

Blog / info: chrisfreeland.com