Post on 28-Jan-2016
Recognition and Retrieval from Document Image Collections
Million Meshesha (Roll No.: 200299004)
Centre for Visual Information Technology,
International Institute of Information Technology,
Hyderabad, India
Advisor: Dr. C. V. Jawahar
Introduction
• Emergence of large digital libraries such as UDL and DLI.
– One-million-book archival activities at the Mega Scanning Center, IIIT-H.
• Involvement of Google, Yahoo and Microsoft in massive digitization projects.
The aim of digitization is easier preservation and making documents freely accessible across the globe.
• Global effort to digitize and archive large collections of multimedia data.
– Most of them are printed books.
Efficient means of access to the content need to be designed.
The Direct Approach
• Recognition-based access to documents
– Easy to integrate into a standard IR framework
– Success of text image retrieval depends mainly on the performance of OCRs
[Figure: recognition-based retrieval pipeline. Block labels: Document Images, Preprocessing and Segmentation, Feature Extraction, Classification, Post-processing, Optical Character Recognition, Text Documents, Database, Search engine, Textual Query, Cross-lingual Retrieval.]
Challenges
• State-of-the-art OCR engines recognize documents printed in Latin and some Oriental scripts, with few errors per page for high-quality images.
• Unavailability of robust OCRs for indigenous scripts of African and Indian languages.
• Challenges in developing OCRs for scripts with complex shape and large number of characters.
• Lack of specialized recognizers for large document image collections.
• Diversity and quantity of documents archived in digital libraries.
Alternate Approach: Recognition-Free
[Figure: recognition-free retrieval pipeline. Block labels: Document Images, Preprocessing and Segmentation, Feature Extraction, Clustering and Indexing, Database, Search engine, Textual Query, Rendering, Cross-lingual Retrieval, Document Images.]
Comparison of the Two Approaches
Recognition-based | Recognition-free
Needs recognition before retrieval, e.g. text search engines | Retrieves without explicit recognition, e.g. CBIR, CBVR
Less offline processing (excluding recognition) | High offline processing
Fast and efficient algorithms | Slow and inefficient schemes
Compact representation | Bulky representation
Content/language dependent | More content/language independent
Challenging to build (because of recognizers) | Relatively easy to build with a certain level of acceptable performance
Review of OCR Systems
• Conventional OCRs follow sequential steps:
– Preprocessing: thresholding, normalization, skew detection/correction, noise removal algorithms.
– Document Layout Analysis: text/image block identification, geometric layout analysis.
– Segmentation: line segmentation, word segmentation, component analysis.
– Feature Extraction: structural features such as shape and contour; transformation-domain features such as DFT and DCT; global and local features.
– Classification: Bayesian statistical classifier, SVM classifier, neural network.
– Post Processing: lexical information, dictionary and punctuation rules, statistical information.
“Anatomy of a Versatile Page Reader”, H. Baird, Proc. of IEEE, Vol. 80, No. 7, July 1992.
“Omnidocument Technologies”, M. Bokser, Proc. of IEEE, Vol. 80, No. 7, July 1992.
Review of Recognition-Free
• Manmatha et al.:
– Proposed the word spotting idea for matching word images from handwritten historical manuscripts.
– Used dynamic time warping (DTW) for word image matching.
– Selected profile features for matching handwritten word images.
• Chaudhury et al.:
– Exploited the structural characteristics of Indian scripts to access them at word level.
– Employed geometric features and suffix trees for indexing.
• Trenkle and Vogt:
– Experimented with word-level image matching.
– Extracted features at the baseline, concavities, line segments, junctions, dots and stroke directions, and computed a distance metric.
• Srihari et al.:
– Spotted words in document images of Devanagari, Arabic and Latin scripts.
– Used Gradient, Structural and Concavity (GSC) features.
– Implemented a correlation similarity measure for word spotting.
• A. K. Jain and Anoop M. Namboodiri:
– Employed DTW-based word spotting for indexing and retrieval of on-line documents.
– Extracted features such as the height of the sample point and the direction and curvature of strokes.
T. Rath and R. Manmatha, "Word Image Matching Using Dynamic Time Warping", Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2, pp. 521--527, 2003.
Santanu Chaudhury, Geetika Sethi, Anand Vyas and Gaurav Harit, "Devising Interactive Access Techniques for Indian Language Document Images", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, Pp. 885-889
J. M. Trenkle and R. C. Vogt, "Word Recognition for Information Retrieval in the Image Domain", Symposium on Document Analysis and Information Retrieval, pp. 105-122, 1993.
S. N. Srihari, H. Srinivasan, C. Huang and S. Shetty, "Spotting Words in Latin, Devanagari and Arabic Scripts," Vivek: Indian Journal of Artificial Intelligence
A.K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 655-659
Major Contributions
1. Study indigenous African scripts for document understanding.
• First attempt to introduce the challenges in recognition and retrieval of indigenous African scripts.
2. Design an OCR for recognizing printed Amharic documents.
• Tested on real-life document images (books, magazines and newspapers).
3. Propose an architecture for a self-adaptable book recognizer.
• Demonstrated on document images of books.
4. Propose efficient matching and feature extraction schemes.
• Performance analysis on datasets of word-form variants, degradations and printing variations in word images.
5. Construct an indexing scheme by applying IR principles for efficient searching in document images.
• Efficiency evaluated on document images of books and newspapers.
African Scripts
• Africa is the second-largest continent in the world, after Asia.
• Around 2500 languages are spoken in Africa, which are either:
– Installed by conquerors of the past, using a modification of the Latin or Arabic scripts.
– Indigenous languages with their own scripts, e.g. Amharic (Ethiopia), Vai (West Africa), Bassa (Liberia), Mende (Sierra Leone), etc.
• Document image analysis and understanding research is very limited for indigenous African scripts.
– A few attempts are available for the Amharic script.
– Other indigenous scripts are not yet studied.
• Characters are complex in shape.
• Their existence is not known by most researchers.
• Most are not used as official languages.
[Figure: samples of the Mende, Vai and Bassa scripts.]
Amharic Language/Script
• Large number of characters
– More than 300 characters
• Vowel formation
• Existence of visually similar characters
• Frequently occurring characters
• Amharic word morphology
– Amharic has a rich word morphology.
• Amharic (like Hindi) is a verb-final language; modifiers usually precede the nouns they modify.
– The word order in English sentences is Subject-Verb-Object.
– The word order in Amharic and Hindi is Subject-Object-Verb.
Recognition from “A” Document Image
• Preprocessing
– Binarization: convert gray pixels into binary.
– Skew detection and correction: ensure that the page is aligned properly.
– Noise removal: remove artifacts in the image.
• Segmentation
– Line segmentation: identify lines in a text.
– Word segmentation: identify words in a text line.
– Character segmentation: detect each character in a segmented word.
• Feature extraction
– Consider the entire component image as a feature.
– PCA: used for dimensionality reduction; reduces to the character/connected-component sub-space.
– LDA: extracts optimal discriminant vectors and reduces to the classification sub-space.
• Classification
– DDAG-based architecture for multi-class SVMs.
– Support Vector Machines (SVMs) at each node.
[Figure: DDAG for four classes, with pairwise nodes (1,4), (1,3), (2,4), (1,2), (2,3), (3,4) and leaves 1, 2, 3, 4.]
Amharic OCR is developed on top of an OCR for Indian Languages.
C. V. Jawahar, MNSSK Pavan Kumar, SS Ravi Kiran: A Bilingual OCR for Hindi-Telugu Documents and its Applications. ICDAR 2003: 408-412
D. H. Foley and J. W. Sammon. An optimal set of discriminant vectors. IEEE Trans. on Computers, 24:271-278, 1975.
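The DDAG classification step above can be illustrated with a minimal sketch. It assumes one trained binary classifier per class pair; `toy_pairwise` below is a hypothetical stand-in for those SVMs, not the classifier used in this work. Each node compares the first and last remaining classes and eliminates the loser, so N classes need N-1 evaluations.

```python
from typing import Callable, Sequence


def ddag_predict(x, classes: Sequence[int],
                 pairwise: Callable[[int, int, object], int]) -> int:
    """Evaluate a decision DAG: every node eliminates exactly one class."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise(first, last, x)  # binary decision at this node
        if winner == first:
            remaining.pop()      # 'last' is eliminated
        else:
            remaining.pop(0)     # 'first' is eliminated
    return remaining[0]


def toy_pairwise(a: int, b: int, x: float) -> int:
    # Hypothetical stand-in for a trained binary SVM: the class whose
    # label is numerically closer to x wins.
    return a if abs(x - a) <= abs(x - b) else b


label = ddag_predict(2.9, [1, 2, 3, 4], toy_pairwise)
```

For the four-class case in the figure, the root node is the (1,4) classifier and the path visits three nodes before reaching a leaf.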
Experimental Results
Document | Accuracy (%)
LaserJet printouts: fonts (PowerGeez, Visual Geez, Agafari, Alpas) | 96.51
LaserJet printouts: sizes (10, 12, 14, 16) | 98.49
LaserJet printouts: styles (Normal, Bold, Italic) | 95.65
Real-life: books | 91.45
Real-life: newspapers | 88.23
Real-life: magazines | 90.37
[Figure labels: Blob, Cut, Merge.]
Comments
• Present-day OCRs do not improve their performance over time.
– Performance on the first and last pages of a book is statistically identical.
• OCRs are designed to convert a single document image into a textual representation.
• Omni-font OCRs are rare even for English.
– Performance degrades with quality, unseen fonts, etc.
An OCR for a collection (e.g. a book) has to be different from an OCR designed for an isolated single page.
Can we design a recognizer for document image collections, say, a book recognizer?
Our Strategy
• Enable the OCR to learn from its experience through feedback that comes from the post-processor during normal operation.
– The conventional open-loop system of a classifier followed by a post-processor is closed.
• Learn from both correctly classified and misclassified examples.
• Extend knowledge gained from one page to other pages.
– Iterate and perfect on a page (or a set of pages).
• Improve performance over time on document image collections that vary in font, size, style and quality.
Apply machine learning procedures to build an intelligent OCR.
Comparison
Conventional OCRs
• Designed for a single page
• No feedback; top-down serial process
• Failures are costly: any error at an intermediate level results in wrong output from the system
• Offline training
• Performance declines or stays static
Our new approach: book recognizer
• Designed for multiple pages
• Feedback-based flexible design
• Any error at an intermediate level can be corrected by using proper feedback
• More online learning
• Performance improves over time
Self adaptable OCR Design
[Figure: block diagram of the self-adaptable OCR. Block labels: Document Images, Recognizer (Classifier, Model Base), Recognized Texts, Post Processor, Validator, Refined Samples, Sample Database, Sampler, Filtered Samples, Selected Samples, Rejected Samples, Labeler, Labeled Samples, Model. The diagram also shows sample OCR outputs such as "lnformation", "lntormatlon", "told", "iold" and "idol".]
Annotations in the diagram:
• Produces error-corrected words; such words are candidates for feedback.
• Detection of outliers; validation in image space.
• Pass new samples for training.
• Label unlabelled data.
• Incremental learning.
• Add samples to their proper class.
Learning online
• Experiment on a poor-quality book.
• Initial accuracy was very low, below 70%.
• Within a few iterations of learning, the recognition accuracy improved to about 95%.
Initial accuracy = 65.24%
2nd iteration accuracy = 88.24%
More iterations: accuracy = 91.08%
More iterations: accuracy = 94.82%
Final accuracy = 95.26%
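The feedback idea behind these iterations can be sketched in a few lines. This is an illustration under stated assumptions, not the recognizer described here: the classifier is a simple nearest-centroid model (the actual system uses SVMs), and `postprocess` is a hypothetical callback that returns a corrected label for each sample, or `None` when the post-processor cannot verify it.

```python
import numpy as np


class IncrementalCentroidClassifier:
    """Nearest-centroid classifier that can absorb new labelled samples."""

    def __init__(self):
        self.sums = {}    # label -> running sum of feature vectors
        self.counts = {}  # label -> number of samples seen

    def add_sample(self, x, label):
        x = np.asarray(x, dtype=float)
        self.sums[label] = self.sums.get(label, 0.0) + x
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return min(self.sums, key=lambda c: np.linalg.norm(
            x - self.sums[c] / self.counts[c]))


def adapt(clf, samples, postprocess, rounds=3):
    """Closed loop: recognize, post-process, feed verified labels back."""
    for _ in range(rounds):
        predicted = [clf.predict(x) for x in samples]
        corrected = postprocess(predicted)  # hypothetical feedback source
        for x, label in zip(samples, corrected):
            if label is not None:           # learn only from verified samples
                clf.add_sample(x, label)
    return clf


clf = IncrementalCentroidClassifier()
clf.add_sample([0.0, 0.0], "a")   # seed model trained on a clean font
clf.add_sample([4.0, 4.0], "b")
page = [[3.0, 3.0], [7.0, 7.0]]   # the same two characters in a shifted font
before = clf.predict(page[0])
adapt(clf, page, postprocess=lambda predicted: ["a", "b"])
after = clf.predict(page[0])
```

In this toy run the first sample is misread before adaptation and read correctly afterwards, because the class centroids move toward the appearance of the new page.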
Results on font and style variations
Further Issue
• OCR is a long-term solution.
– It needs some time to come up with a workable system.
• But our problem is immediate.
– A number of documents are already archived and ready for use.
Can we access the content of document images without explicit recognition?
Word Spotting
Collection | Query | Matching Score
Professor | University | 10.38
Alexander | University | 14.44
Smith | University | 12.21
until | University | 9.32
recently | University | 16.43
head | University | 17.34
chemistry | University | 14.56
Columbia | University | 15.10
University | University | 0.51
American | University | 18.71
Chemical | University | 14.32
Society | University | 12.13
died | University | 19.11
native | University | 18.10
Word Search by Word Spotting
[Figure: word search by word spotting. Block labels: Query (example: "Christian"), Render, Feature Extraction, Matching.]
Efficient Matching Scheme
• Matching techniques:
– Cross correlation
– Dynamic Time Warping (DTW)
• DTW aligns and finds the best match between pairs of word images of different sizes.
• Trace back to identify the optimal warping path (OWP).
Performance analysis shows that DTW outperforms cross correlation.
Method | Recall | Precision | F-score
DTW | 89.58% | 90.81% | 90.19%
Cross correlation | 76.43% | 78.83% | 77.61%
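As a minimal sketch of the DTW matching and trace-back described above, assume each word image has already been converted into a sequence of per-column feature vectors. The squared Euclidean local cost and the three allowed steps are common textbook choices and are assumptions here, not necessarily the exact settings used in this work.

```python
import numpy as np


def dtw(a: np.ndarray, b: np.ndarray):
    """Align sequences a (n, d) and b (m, d).

    Returns (cost normalized by path length, optimal warping path).
    """
    n, m = len(a), len(b)
    # Local cost: squared Euclidean distance between every pair of columns.
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                            D[i, j - 1],      # deletion
                                            D[i - 1, j - 1])  # match
    # Trace back from the end to recover the optimal warping path (OWP).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(D[i - 1, j - 1], i - 1, j - 1),
                 (D[i - 1, j], i - 1, j),
                 (D[i, j - 1], i, j - 1)]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return D[n, m] / len(path), path


a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])  # same shape, stretched in width
cost, path = dtw(a, b)
```

Because the second sequence is only a stretched copy of the first, the alignment cost is zero even though the lengths differ, which is the property that makes DTW suitable for word images of different sizes.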
Challenges in Word Image Search
• Degradation of documents
– Cuts, blobs, salt and pepper, erosion of border pixels, etc.
• Print variations
– A word image may vary in size, style, font and quality.
• Morphological variation
– A word may have different variants.
“Stemming” of Word Images
• Two possible kinds of variants of a word:
(i) formed by adding a prefix and/or suffix to the root word, e.g. ‘connect’: ‘connects’, ‘connecting’, ‘reconnect’, …
(ii) synonymous words, e.g. ‘connect’: ‘join’, ‘attach’, …
• It is observed that most word-form variations take place either at the beginning or at the end of the word.
• A matching algorithm is needed that can discount mismatches at the beginning or at the end.
Propose a novel DTW-based partial matching scheme
DTW-based Morphological Matcher
Partition the OWP (of length L) into beginning, middle and end regions of length k = L/3 each
for i = 1 to k do
    if there is matching cost concentration at the beginning
        reduce the extra cost from the total matching score
    else break
end for
for i = L down to 2k do
    if there is matching cost concentration at the end
        reduce the extra cost from the total matching score
    else break
end for
Normalize the matching score by the length of the optimal warping path.
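One possible reading of the matcher above, as a runnable sketch. Two details are assumptions because the slide does not define them: "cost concentration" is taken to mean that the local cost of a path step exceeds a `threshold`, and "reduce the extra cost" is taken to mean removing that step's cost entirely. The end loop here scans only the last k steps.

```python
def partial_match_score(path_costs, threshold):
    """Score a DTW alignment while discounting mismatches at either end.

    path_costs: local matching cost at each step of the optimal warping path.
    threshold:  assumed cut-off above which a step counts as a mismatch.
    """
    L = len(path_costs)
    if L == 0:
        return 0.0
    k = L // 3
    total = float(sum(path_costs))
    # Beginning region: drop leading high-cost steps (e.g. a prefix).
    for i in range(k):
        if path_costs[i] > threshold:
            total -= path_costs[i]
        else:
            break
    # End region: drop trailing high-cost steps (e.g. a suffix).
    for i in range(L - 1, L - k - 1, -1):
        if path_costs[i] > threshold:
            total -= path_costs[i]
        else:
            break
    # Normalize by the length of the optimal warping path.
    return total / L


prefix_suffix = partial_match_score([5, 5, 0, 0, 0, 0, 0, 0, 4], threshold=1)
middle = partial_match_score([0, 0, 0, 0, 5, 0, 0, 0, 0], threshold=1)
```

With these assumptions a word that differs only by a prefix or suffix scores as a match, while a mismatch in the middle of the word keeps its full cost.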
Performance of partial matching
(All values in %.)
Item | Recall (before) | Precision (before) | F-score (before) | Recall (after) | Precision (after) | F-score (after)
Font | 83.35 | 91.83 | 86.95 | 95.90 | 98.20 | 97.03
Size | 87.38 | 91.39 | 89.30 | 96.80 | 99.42 | 98.09
Style | 75.62 | 80.25 | 77.84 | 88.94 | 94.73 | 91.69
Degradation | 85.82 | 88.49 | 87.04 | 91.74 | 96.26 | 93.92
Degraded Words
[Figure: examples of degraded words. Labels: Salt and Pepper, Blobs, Cuts, Complex script, Historic documents.]
Degradation Modeling
• Cuts and breaks
• Blobs
• Salt and pepper
• Erosion of boundary pixels
We built datasets using our degradation models for English, Hindi and Amharic.
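The four degradation types listed above can be simulated on a binary word image (1 = ink, 0 = background) roughly as follows. This is an illustrative sketch with hypothetical function names and parameters; the degradation models actually used to build the datasets are not specified on the slide.

```python
import numpy as np


def salt_and_pepper(img, fraction, rng):
    """Flip a random fraction of pixels."""
    out = img.copy()
    mask = rng.random(img.shape) < fraction
    out[mask] = 1 - out[mask]
    return out


def add_cut(img, row, thickness=1):
    """Erase a horizontal band of ink to imitate a cut or break."""
    out = img.copy()
    out[row:row + thickness, :] = 0
    return out


def add_blob(img, center, radius):
    """Fill a disc with ink to imitate an ink blob."""
    out = img.copy()
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    out[(yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2] = 1
    return out


def erode_boundary(img):
    """Keep an ink pixel only if all four of its neighbours are ink."""
    p = np.pad(img, 1)  # zero padding: the image border counts as background
    return img & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]


rng = np.random.default_rng(0)
word = np.ones((5, 5), dtype=np.uint8)
noisy = salt_and_pepper(word, 0.1, rng)
eroded = erode_boundary(word)
```

Applying these functions with varying parameters to clean rendered words gives controlled test sets for each degradation type.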
Invariant Feature Selection
• Investigate various features:
– Profiles (upper, lower, projection, transition)
– Statistical moments (mean, standard deviation, skew)
– Region-based moments (zero-order moment, first-order moment, central moment)
– Transform-domain (Fourier) representations
• Global vs. local features
– Global features: compute a single value per word image.
– Local features: compute a 1D representation over vertical strips of a word.
– Local features perform better than global features.
• For better performance, combine local features of profiles, moments and transform-domain representations.
Features | Recall | Precision | F-score
Global features | 53.32% | 50.24% | 51.73%
Local features | 82.92% | 80.53% | 81.71%
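A minimal sketch of local profile features, computed per pixel column of a binary word image (1 = ink). The exact definitions and the normalization by image height are assumptions made for illustration; the slide names the profiles but does not define them.

```python
import numpy as np


def local_profile_features(word):
    """Return a (width, 4) array: upper, lower, projection, transition."""
    ink = np.asarray(word).astype(bool)
    h = ink.shape[0]
    has_ink = ink.any(axis=0)
    # Upper profile: distance from the top edge to the first ink pixel.
    upper = np.where(has_ink, ink.argmax(axis=0), h)
    # Lower profile: distance from the bottom edge to the last ink pixel.
    lower = np.where(has_ink, ink[::-1].argmax(axis=0), h)
    # Projection profile: number of ink pixels in the column.
    projection = ink.sum(axis=0)
    # Transition profile: number of ink/background changes down the column.
    transitions = np.abs(np.diff(ink.astype(int), axis=0)).sum(axis=0)
    # Dividing by the height makes the values comparable across font sizes.
    return np.stack([upper, lower, projection, transitions], axis=1) / h


word = np.array([[0, 1, 0],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 0, 0]])
features = local_profile_features(word)
```

Each row of `features` describes one vertical strip, so the whole word becomes a sequence that a DTW matcher can align column by column.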
Invariant Feature Selection
• To test the performance of combined features the DTW matching algorithm is modified
• Combined local features of profiles and moments are invariant to degradations and printing variations.
Test result on degraded word images
(All values in %.)
Degradation | Hindi Recall | Hindi Precision | Hindi F-score | Amharic Recall | Amharic Precision | Amharic F-score | English Recall | English Precision | English F-score
Cuts | 92.34 | 92.41 | 92.37 | 93.72 | 94.93 | 94.32 | 93.76 | 88.15 | 90.87
Salt & pepper | 93.28 | 93.17 | 93.20 | 96.88 | 97.11 | 96.99 | 96.56 | 96.02 | 96.29
Blobs | 85.95 | 92.33 | 89.03 | 89.46 | 93.48 | 91.43 | 89.79 | 86.43 | 88.08
Erosion | 92.77 | 92.58 | 92.67 | 94.91 | 95.72 | 95.31 | 92.38 | 93.29 | 92.83
Information Retrieval from “Document Images”
• Users expect more than just searching for documents that contain their query word.
– Their expectations are shaped by the popularity of text search.
• Retrieve relevant documents in ranked order.
• Remove effects of stopwords in the retrieval process.
• Fast search and efficient delivery of documents.
• How can we meet users' requirements?
Construct an indexing scheme to organize word images following IR principles.
Mapping IR techniques for Document Image Retrieval
Module | Purpose | Algorithm used: text search engine | Algorithm used: current work
Stemming words | Group word variants | Language modeling, e.g. Porter algorithm | Morphological matching using DTW
Stopword detection | Remove common words | Stop word list | Inverse document frequency (IDF)
Relevance measurement | Rank documents | Term frequency (TF) | Modified TF/IDF
Clustering | Group index terms | --- | Improved hierarchical clustering
Indexing data structure | Organize index lists | Inverted index and signature file | Inverted index
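To make the mapping concrete, here is a small sketch of an inverted index with TF and IDF weighting. It assumes each page has already been reduced to a list of cluster identifiers (one per word image); the plain letters below are hypothetical cluster ids. It uses the textbook TF-IDF formula, not the modified TF/IDF of the current work, whose details are not given on the slide.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: [cluster_id, ...]} -> {cluster_id: {doc_id: tf}}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, count in Counter(terms).items():
            index[term][doc_id] = count / len(terms)  # term frequency
    return index


def idf(index, n_docs, term):
    """Inverse document frequency; 0 for terms present in every document."""
    if term not in index:
        return 0.0
    return math.log(n_docs / len(index[term]))


def search(index, n_docs, term):
    """Return (doc_id, score) pairs ranked by TF * IDF, best first."""
    weight = idf(index, n_docs, term)
    hits = ((doc, tf * weight) for doc, tf in index.get(term, {}).items())
    return sorted(hits, key=lambda pair: -pair[1])


docs = {
    "p1": ["a", "b", "a"],
    "p2": ["a", "c"],
    "p3": ["a", "b", "b", "b"],
}
index = build_inverted_index(docs)
ranked = search(index, len(docs), "b")
```

A cluster that occurs on every page, like "a" here, gets an IDF of zero, which is how IDF can stand in for an explicit stop word list.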
Indexing Document Images
[Figure: indexing pipeline. Block labels: Word Images, Stemming, Stopword Detection, Relevance Measure, IR Measures and Clustering, Template (Keywords), Index terms, Inverted Indexing, Index list.]
Clustered English Words
Clustered words vary in fonts, sizes, styles, forms and quality.
Clustered Amharic Words
Test results on datasets of the various fonts, sizes and styles
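(All values in %.)
Type | Hindi Recall | Hindi Precision | Hindi F-score | Amharic Recall | Amharic Precision | Amharic F-score
Fonts | 91.28 | 92.85 | 91.88 | 93.05 | 93.87 | 93.45
Styles | 83.29 | 84.01 | 83.64 | 89.09 | 89.80 | 89.44
Sizes | 94.59 | 96.94 | 95.74 | 95.99 | 96.34 | 96.26
Styles: Normal, Bold, Italic. Sizes: 10, 12, 14, 16. Fonts: PowerGeez, VisualGeez, Agafari, Alpas.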
Performance: Precision vs. Recall graph
• The graph shows effectiveness of our scheme
• It increases both precision and recall, moving the entire curve up and out to the right.
Concluding Remarks
• African scripts
– Introduced indigenous African scripts for the first time.
– Initial attempt to recognize Amharic documents, with results good enough to extend the work to other indigenous African scripts.
– Engineering effort is needed to make it applicable to real-life situations.
• Recognizer design
– New attempt to propose a self-adaptable recognizer for document image collections with the help of machine learning algorithms.
– Encouraging results for developing a recognizer for large document image collections.
– Further work is needed to extend the framework to many of the complex Indian and African scripts.
• Document image indexing and retrieval
– Proposed a DTW-based partial matching scheme to perform morphological matching.
– Designed a feature extraction scheme that is invariant to degradation and printing variations.
– Applied IR principles and constructed a clustering and indexing scheme.
– System-related issues need to be solved for practical online retrieval from a large corpus.
Million Meshesha and C. V. Jawahar, “Indigenous Scripts of African Languages”, African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.
Million Meshesha and C. V. Jawahar, “Optical Character Recognition of Amharic Documents”, African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.
Million Meshesha and C. V. Jawahar, “Self-Adaptable Recognizer for Document Image Collections”, In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.
Million Meshesha and C. V. Jawahar, “Matching Word Images for Content-based Retrieval from Printed Document Images”, International Journal of Document Analysis and Recognition (IJDAR) (in press).
Million Meshesha and C. V. Jawahar, “Indexing Word Images for Recognition-free Retrieval from Printed Document Databases”, Information Sciences: An International Journal (revised & submitted).
Scope for Future Work
• Develop an online system for searching hundreds of books over the Web
• Recognition and retrieval of complex documents (such as camera-based, handwritten, etc.).
• Apply advanced image preprocessing techniques to enhance image quality for large collection of document images.
• Retrieval of documents in presence of OCR errors and scope for hybrid approaches.
Publications: Conference Papers
• Million Meshesha and C. V. Jawahar, “Self-Adaptable Recognizer for Document Image Collections”, In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.
• A. Balasubramanian, Million Meshesha, C. V. Jawahar, “Retrieval from Document Image Collections", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp 1-12.
• Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indiraneel Deb Sikdar, A. Balasubramanian and C. V. Jawahar, “Semi-automatic Adaptive OCR for Digital Libraries", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp 13-24.
• K. Pramod Sankar, Million Meshesha, C. V. Jawahar, “Annotation of Images and Videos based on Textual Content without OCR", In Workshop on Computation Intensive Methods for Computer Vision, Part of 9th European Conference on Computer Vision (ECCV), Austria, 2006.
• Million Meshesha and C. V. Jawahar, “Recognition of Printed Amharic Documents", In Proceedings of 8th International Conference of Document Analysis and Recognition (ICDAR), Seoul, Korea, Sep 2005, Volume 1, pp 784-788
• C. V. Jawahar, Million Meshesha, A. Balasubramanian, “Searching in Document Images", In Proceedings of Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2004, pp. 622-627.
Publications: Journal Articles
• Million Meshesha and C. V. Jawahar, “Matching Word Images for Content-based Retrieval from Printed Document Images”, International Journal of Document Analysis and Recognition (IJDAR) (in press).
• C. V. Jawahar, A. Balasubramanian, Million Meshesha and Anoop Namboodiri, “Retrieval of Online Handwriting by Synthesis and Matching”, Pattern Recognition (in press).
• Million Meshesha and C. V. Jawahar, “Optical Character Recognition of Amharic Documents”, African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.
• Million Meshesha and C. V. Jawahar, “Indigenous Scripts of African Languages”, African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.
• Million Meshesha and C. V. Jawahar, Indexing Word Images for Recognition-free Retrieval from Printed Document Databases, Information Sciences: An International Journal (revised & submitted).
Thank you