Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing
Chris Freeland
Technical Director, Biodiversity Heritage Library
BioSystematics Berlin 2011
22 Feb 2011
http://biodiversitylibrary.org/page/33061402
Digitization
http://biodiversitylibrary.org/page/6165462
Workflow
Selection
Preparation
Post Production
(Re)publication
Digitization
Conservation
Scanning Derivatives
• XML
• JP2

• PDF
• JPG
• TXT
• DjVu
Master Derivatives
OCR
XML
JP2
Storage
Files are stored & sync’d across
BHL clusters
Optical Character Recognition (OCR)
http://biodiversitylibrary.org/page/2836705
OCR is a *BIG* challenge
• All book / literature digitization projects affected, not just BHL
• Especially problematic in BHL
  – More than 50 languages represented in BHL
  – Dates of publication from the 1400s to the 2000s
  – Irregular typeface / typesetting
  – Multiple languages on one page
• Botanical descriptions in Latin
Abbild ungen und Beschreibungen der
Fische Syriens, nebst
einer neuen Classification und Characteristik sämmtlicher Gattungen
der i
JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in
Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
STUTTGART. E. Schweizerbart' sehe Verlagshandlung,
1843.
*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �
', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �
r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
2007 Name Finding Study
>35% OCR error rate for names only
Top OCR errors
 1  Insert Space
 2  Omit Space
 3  e->c
 4  u->I
 5  u->n
 6  i->l
 7  c->e
 8  n->v
 9  l->i
10  r->i
11  u->ii
12  h->l
13  h->ii
14  e->o
35.16%
Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
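The error list above suggests one simple correction strategy: undo a single common confusion and check whether the result is a known name. The Python sketch below illustrates that idea only; it is not the method used in the cited study. It assumes each pair in the list reads as "intended -> OCR output", and the function names and the tiny set of known names are hypothetical. The space errors (ranks 1 and 2) are not handled.

```python
# Confusion pairs from the "Top OCR errors" list, read as
# (intended characters, what OCR produced). The direction is an assumption.
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]


def single_edit_variants(text):
    """Yield every string obtained by undoing exactly one confusion in text."""
    for intended, seen in CONFUSIONS:
        start = text.find(seen)
        while start != -1:
            yield text[:start] + intended + text[start + len(seen):]
            start = text.find(seen, start + 1)


def suggest(ocr_name, known_names):
    """Return known names reachable from ocr_name by undoing one confusion."""
    if ocr_name in known_names:
        return [ocr_name]
    return sorted({v for v in single_edit_variants(ocr_name) if v in known_names})


def error_rate(wrong, total):
    """Percentage of names that were transcribed incorrectly."""
    return 100.0 * wrong / total


if __name__ == "__main__":
    names = {"Salmo trutta", "Salmo salar"}  # hypothetical name list
    print(suggest("Salmo triitta", names))
    print(round(error_rate(1056, 3003), 2))
```

Restricting the search to one substitution keeps the candidate set small; a fuller approach would also try inserted and omitted spaces and rank candidates by how frequent each confusion is.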
Manual techniques for text correction
• WikiSource
• Trove - National Library of Australia
WikiSource Example
http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293
Goal: Semi-automated text correction
• OCR + Machine Learning + Users
  – Let machines do raw processing
  – Develop algorithms for natural language processing & machine learning
  – Build a community of (human) users to help
• reCAPTCHA as an example
  – Why not just use reCAPTCHA?
    • Google bought it
*More work needed here*
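One way the human side of such a pipeline could work is to accept a correction only when several independent users agree on it. The Python sketch below illustrates that consensus idea under stated assumptions; the function name and both thresholds are hypothetical and are not part of any BHL or reCAPTCHA system.

```python
from collections import Counter


def consensus(transcriptions, min_votes=2, min_share=0.6):
    """Return the agreed transcription, or None if users do not agree enough.

    min_votes and min_share are illustrative thresholds, not tuned values.
    """
    # Ignore empty submissions and surrounding whitespace.
    votes = Counter(t.strip() for t in transcriptions if t.strip())
    if not votes:
        return None
    text, count = votes.most_common(1)[0]
    total = sum(votes.values())
    if count >= min_votes and count / total >= min_share:
        return text
    return None


if __name__ == "__main__":
    print(consensus(["Salmo trutta", "Salmo trutta", "Salmo tnutta"]))
```

Lines that fail to reach consensus could be routed back to more users or left as raw OCR.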
Scientific names mapping
http://biodiversitylibrary.org/page/27782237
Name Finding in action with uBio’s TaxonFinder…
• Image from scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• TaxonFinder API response
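To make the name-finding step concrete, here is a minimal Python sketch that scans OCR text for candidate Latin binomials and keeps those whose genus appears in a lookup set. It is a stand-in for illustration only: it does not reproduce TaxonFinder's algorithm or call its API, and the regular expression, the function name, and the small genus list are all assumptions.

```python
import re

# Hypothetical mini-lexicon; a real service would consult a large names index.
KNOWN_GENERA = {"Salmo", "Quercus", "Homo"}

# Capitalized word followed by a lowercase word of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_names(ocr_text, genera=KNOWN_GENERA):
    """Return (name, offset) pairs for candidate binomials whose genus is known."""
    hits = []
    for match in BINOMIAL.finditer(ocr_text):
        if match.group(1) in genera:
            hits.append((match.group(0), match.start()))
    return hits


if __name__ == "__main__":
    page = "The trout Salmo trutta lives in rivers. Quercus alba is an oak."
    for name, offset in find_names(page):
        print(offset, name)
```

Because the lookup is an exact match, any OCR error in the genus makes the name invisible to this sketch, which is why OCR quality matters so much for name finding.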
Crowdsourcing
http://biodiversitylibrary.org/page/20965795
CiteBank: http://citebank.org
• New search index to BHL content
• Platform for journals/publishers/societies in need of tools to store & share their digitized content
• Access to “crowdsourced” articles from BHL scans
Crowdsourcing Statistics & Analysis
• Analysis
  – http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html
  – At that time, more than 80% of the PDFs created had metadata attached by users
    • More than 50% contributed accurate article-level information
• New analysis over more data this summer / fall
  – Now have more than 58,000 PDFs to analyze
Open Data = More Use
• Scholars
  – Rod Page
    • iPhylo
    • BioGUID
    • BioStor
  – Ryan Schenk
• Other Apps
  – EarthCape
  – ZipcodeZoo
Conclusion
• BHL is a massive dataset useful for multidisciplinary research
  – Systematics
  – Natural Language Processing
  – Humanities
• BHL is open
  – Free to use at http://biodiversitylibrary.org
  – Open access data for scholarly use & reuse
• BHL has APIs and data exports to enable reuse
  – BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others)
Questions?
Chris Freeland
Technical Director, Biodiversity Heritage Library
Director, Center for Biodiversity Informatics, Missouri Botanical Garden
Missouri Botanical Garden
4344 Shaw Blvd.
St. Louis, MO 63110 USA
Email: [email protected]
Twitter: @chrisfreeland
Blog / info: chrisfreeland.com