OCR and SALIX Parsing
Daryl Lafferty, Arizona State University
October, 2012
SALIX: Semi-Automatic Label Information eXtraction
SALIX was developed at Arizona State University from 2009 through 2012.
Over 55,000 ASU Herbarium specimen labels were digitized using SALIX.
Ideal SALIX Process Flow
The ideal process flow is:
Photograph the specimen label
Perform OCR on the photograph
Have SALIX parse the resulting text into database categories
Upload the results to the database
Practical SALIX Process Flow
The actual process flow has added steps:
Photograph the specimen label
Perform OCR on the photograph
Correct any OCR errors and tweak the text layout
Have SALIX parse the resulting text into database categories
Correct any mis-parsed results
Upload the results to the database
OCR Workflow
We use ABBYY Professional Version 10.
We capture an image of the full specimen, and another of just the label for OCR.
Processing is done in batch mode, usually run overnight on a folder containing hundreds of images.
The result is a single text file with one label per page.
OCR errors are corrected in the text file before processing with SALIX.
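The "one label per page" output can be split back into individual labels before correction and parsing. The following is a minimal sketch, not part of SALIX or ABBYY: it assumes the exported text file marks page breaks with a form-feed character, which should be checked against the actual export settings. The function name split_labels and the two sample labels are made up for illustration.

```python
def split_labels(ocr_text: str, page_separator: str = "\f") -> list[str]:
    """Split batch OCR output into one text block per label.

    Assumption: each page (one label) ends with `page_separator`.
    A form feed is a common page-break marker in plain-text exports,
    but the real separator depends on the OCR export settings.
    """
    pages = ocr_text.split(page_separator)
    # Drop empty pages, e.g. the empty string after a trailing separator.
    return [page.strip() for page in pages if page.strip()]


# Made-up sample standing in for a two-label batch file.
sample = (
    "Flora of Arizona\nSalix gooddingii\n\f"
    "Plants of Maricopa County\nLarrea tridentata\n"
)

labels = split_labels(sample)
for number, label in enumerate(labels, start=1):
    first_line = label.splitlines()[0]
    print(f"label {number}: {first_line}")
```

Keeping each label as a separate string makes it easier to correct OCR errors one label at a time and to track which labels have already been passed to the parser.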
The SALIX User Interface
Manual Data Entry
A label that results in many OCR errors
A label that results in few OCR errors
Label Length and Quality
We first categorized 4 different label types, with the following average characteristics:
We then had 3 students each process 10 labels of each category (40 labels total), both through SALIX and by typing into the Symbiota form.
Sample Throughput Data
Conclusions
OCR quality has a strong effect on semi-automated parsing throughput using SALIX.
OCR using ABBYY in Batch Mode was most efficient for our workflow.
The relationship is roughly:
S = 4 / √E
where
S = ratio of SALIX throughput to typing throughput
and
E = OCR error rate, stated as OCR errors per 100 words
(Obviously, the relationship isn't accurate as E approaches zero, i.e. less than about 2 errors per 100 words.)
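To get a feel for the rough relationship S = 4 / √E, the short sketch below evaluates it for a few error rates. The function names and the chosen error rates are illustrative assumptions, not measured data from the study, and the formula is only the approximate fit stated above.

```python
import math


def ocr_error_rate(error_count: int, word_count: int) -> float:
    """Return E: OCR errors per 100 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100.0 * error_count / word_count


def salix_speedup(errors_per_100_words: float) -> float:
    """Return S = 4 / sqrt(E), the rough SALIX/typing throughput ratio.

    The fit is stated to be unreliable below about 2 errors per
    100 words, so very small values of E should not be trusted.
    """
    if errors_per_100_words <= 0:
        raise ValueError("E must be positive")
    return 4.0 / math.sqrt(errors_per_100_words)


# Illustrative error rates only (not data from the study).
for e in (2, 4, 9, 16, 25):
    print(f"E = {e:>2} errors/100 words -> S = {salix_speedup(e):.2f}")
```

Taken at face value, the formula gives S = 2 at E = 4 and S = 1 at E = 16, which suggests that at around 16 errors per 100 words SALIX and plain typing would be about equally fast, if the rough fit holds that far.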
Acknowledgements
All of the data presented here was from Anne Barber's Master's Thesis, completed at ASU in May, 2012.
Anne also developed the process flow that helped optimize SALIX throughput.
The overall project was under the direction of Les Landrum, curator of the ASU Herbarium.