
Page 1:

Language Models for Text Recognition: An Overview

Shravya Shetty and Sargur Srihari, CEDAR, University at Buffalo, State University of New York, USA

Page 2:

Plan of Presentation

1. Language Models
   - N-gram probabilities
   - Generative model for text recognition
   - Improvements to n-gram models
2. Text Recognition
3. Sentence Level Language Models
4. Word Level Language Models
5. Conclusion
6. References

Page 3:

1. Language Models

Language models are probabilistic models that capture language regularities.

N-gram models:
- unigram, bigram and trigram word models: capture word dependencies
- part-of-speech models: capture word class dependencies

Shown to improve performance across language technologies

Page 4:

Applications of Language Models

- Speech Recognition
- Machine Translation
- Information Extraction
- Text Recognition (OCR, Handwriting)
- Post-processing of recognition results
- Spelling correction
- MINDS

ML (statistical), IR, NLP, DAR, ASR

Page 5:

Language Model for Text Recognition

Minimum error rule for recognizing a text image:

arg max_{word sequences} P(word sequence | text image)

The task is to determine this conditional probability for each word sequence. Alternatively, by Bayes' rule:

arg max_{word sequences} P(text image | word sequence) × P(word sequence)

Page 6:

Generative Model for Text Recognition

arg max_{word sequences} P(text image | word sequence) × P(word sequence)

• Basically a generative model where the joint probability is computed
• The first term is a conditional probability, computed by, say, Naïve Bayes
• The second term is calculated using the n-gram model (and the chain rule)
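As a minimal illustration of this decomposition, the sketch below scores candidate word sequences by combining the two terms in log space. The callables p_image_given_word and p_next_word are hypothetical stand-ins for the word recognizer and the n-gram model (both assumed to return non-zero probabilities).

```python
import math

def score_sequence(words, image_segments, p_image_given_word, p_next_word):
    """Generative-model score of one candidate word sequence:
    log P(text image | word sequence) + log P(word sequence)."""
    # First term: P(word image | word) for each segment, e.g. from a
    # Naive Bayes word recognizer.
    log_likelihood = sum(math.log(p_image_given_word(img, w))
                         for img, w in zip(image_segments, words))
    # Second term: bigram approximation of P(word sequence) via the chain rule.
    log_prior = sum(math.log(p_next_word(prev, w))
                    for prev, w in zip(["<s>"] + words[:-1], words))
    return log_likelihood + log_prior

def best_sequence(candidates, image_segments, p_image_given_word, p_next_word):
    """arg max over candidate word sequences of the combined score."""
    return max(candidates,
               key=lambda ws: score_sequence(ws, image_segments,
                                             p_image_given_word, p_next_word))
```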

Page 7:

N-gram Language Model

The probability of word sequence w_1 .. w_N is

p(w_1, ..., w_N) = Π_i p(w_i | w_{i-1})              (bigram)
or
p(w_1, ..., w_N) = Π_i p(w_i | w_{i-1}, w_{i-2})     (trigram)

Shown to be effective in natural language applications
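A minimal sketch of how such a model can be estimated, assuming unsmoothed maximum-likelihood counts over a toy corpus (the corpus and test sentence are illustrative only):

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram estimates p(w_i | w_{i-1}) from a toy corpus."""
    context, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        context.update(tokens[:-1])              # words counted as left contexts
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda prev, w: bigrams[(prev, w)] / context[prev] if context[prev] else 0.0

def sequence_probability(p_bigram, sentence):
    """p(w_1, ..., w_N) = product over i of p(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        prob *= p_bigram(prev, w)
    return prob

p = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(sequence_probability(p, "the cat sat"))    # 1.0 * (2/3) * (1/2) = 0.333...
```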

Page 8:

Limitations of n-gram models

- Sequences not occurring in the training data have zero probability
- Many n-grams occur too few or too many times, depending on the corpus
- Long-term dependencies in language are not captured
- Improvements to n-gram models are possible

Page 9:

Techniques for Improving n-gram Models

Smoothing: Probability mass of n-grams with frequency greater than a threshold is redistributed across all the n-grams (a simple sketch follows this list)

Clustering: Predict a cluster of similar words instead of a single word

Caching: Recently observed words are likely to occur again; combine with more general models

Skipping: Words not directly adjacent to the target word contain useful information

Sentence-mixture models: Model different kinds of sentences separately
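The sketch below illustrates the idea of redistributing probability mass using simple linear interpolation between bigram and unigram estimates; it is not the interpolated Kneser-Ney scheme used later in the talk, just a minimal stand-in with an illustrative toy corpus.

```python
from collections import Counter

class InterpolatedBigram:
    """Linear-interpolation smoothing: mix the bigram estimate with the
    unigram estimate so unseen bigrams keep non-zero probability."""
    def __init__(self, sentences, lam=0.7):
        self.lam = lam
        self.uni, self.bi = Counter(), Counter()
        for sent in sentences:
            toks = ["<s>"] + sent.split()
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
        self.total = sum(self.uni.values())

    def prob(self, prev, word):
        p_bi = self.bi[(prev, word)] / self.uni[prev] if self.uni[prev] else 0.0
        p_uni = self.uni[word] / self.total
        return self.lam * p_bi + (1 - self.lam) * p_uni

lm = InterpolatedBigram(["the cat sat", "the dog sat"])
print(lm.prob("the", "dog"))   # seen bigram
print(lm.prob("cat", "dog"))   # unseen bigram, still non-zero
```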

Page 10:

2. Text Recognition as a Document Processing Task

OCR, ICR, OHR, ILR

Page 11:

Processing Steps

Scan → Line Segmentation → Word Segmentation

Page 12:

Isolated Word Recognition

Task: Convert Word Image to its Textual Form

Analytic Recognition: Segments of the word image are matched to characters of words in the lexicon

Holistic Recognition: The word shape is matched to prototypes or features of words in the lexicon

Combining Results: Neural Network

Page 13:

Language Models in Text Recognition

Language models can be at the inter-word (sentence) level or the inter-character (word) level

Sentence level: language model learnt from a corpus consisting of sentences

Word level: language model learnt using dictionaries

Page 14:

3. Sentence Level Language Models

Page 15:

Recognition Post-processing: Finding the most likely word sequence using word-level models

- Word n-grams
- Word-class n-grams (POS, NE)

To make recognition choices or to limit choices

Page 16:

Language model for correcting recognition results

An implementation for handwritten essays:

An N-best list of word recognition results is used

A second-order HMM is used to incorporate the trigram model

Find the most likely sequence of hidden states given the sequence of observations in the second-order HMM (the Viterbi path)

Can improve performance when the sentence follows the average statistics of the language

Trigram language model with smoothing using interpolated Kneser-Ney:
• Modified backoff distribution based on the number of contexts
• Higher- and lower-order distributions are combined
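As an illustration of the decoding step, the sketch below runs a Viterbi search over the recognizer's N-best lists. For brevity it uses a first-order (bigram) model rather than the second-order HMM / trigram model described above, and the lm_prob and rec_prob callables are hypothetical stand-ins for the smoothed language model and the word recognizer (probabilities assumed non-zero).

```python
import math

def viterbi_correct(nbest, lm_prob, rec_prob):
    """Find the most likely word sequence through the recognizer's N-best lists.
    nbest: one list of candidate words per word position.
    lm_prob(prev, w): smoothed bigram probability.
    rec_prob(t, w): recognizer score for candidate w at position t."""
    # Initialisation: start-of-sentence transition plus recognition score.
    delta = {w: math.log(lm_prob("<s>", w)) + math.log(rec_prob(0, w))
             for w in nbest[0]}
    backptr = []
    for t in range(1, len(nbest)):
        new_delta, ptr = {}, {}
        for w in nbest[t]:
            # Best previous candidate under the transition probability.
            prev = max(delta, key=lambda p: delta[p] + math.log(lm_prob(p, w)))
            new_delta[w] = (delta[prev] + math.log(lm_prob(prev, w))
                            + math.log(rec_prob(t, w)))
            ptr[w] = prev
        delta = new_delta
        backptr.append(ptr)
    # Trace the Viterbi path back from the best final word.
    path = [max(delta, key=delta.get)]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    return list(reversed(path))
```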

Page 17:

Results with Language Modeling (Handwritten Essays)

Four Essays

ORIGINAL TEXT: 204 words (100%)
Lady Washington role was hostess for the nation. It’s different because Lady Washington was speaking for the nation and Anna Roosevelt was only speaking for the people she ran into on wer travet to see the president.

WORD RECOGNITION: 124 words (62%)
Lady washingtons role was hostess for the nation first to different because lady washingtons was speeches for for martha and taylor roosevelt was only meetings for did people first vote polio on her because to see the president

LANGUAGE MODELING: 145 words (72%)
Lady washingtons role was hostess for the nation but is different because george washingtons was different for the nation and eleanor roosevelt was only everything for the people first ladies late on her travel to see the president

Page 18:

Correction using Word Class

- Words in the language model are replaced by their corresponding POS and NE tags
- Bigram probabilities for word classes are learnt from the corpus
- Word correction is performed using Viterbi decoding
- Slightly improved performance
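A minimal sketch of the class-based scoring this implies, assuming tag-transition and word-emission probabilities estimated from a tagged corpus (both callables are hypothetical):

```python
def class_bigram_score(words, tags, p_tag_bigram, p_word_given_tag):
    """Class-based bigram score: product over positions of
    P(t_i | t_{i-1}) x P(w_i | t_i), with transitions defined over
    POS/NE classes instead of words."""
    prob, prev_tag = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= p_tag_bigram(prev_tag, t) * p_word_given_tag(w, t)
        prev_tag = t
    return prob
```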

Page 19:

4. Word Level Language Models

Page 20:

Word level models for Latin alphabet

[Figure: a segmented word image with candidate letters for each character segment]

For each character segment, train a recognizer (NN, SVM, etc.) to recognize the letter
Learn P(letter | character image)
Include the language model:
P(letter | character image) × P(letter | previous letters)

Recognition and the language model are tightly integrated
Word recognition uses continuous-density HMMs
The search space is a network of word HMMs, with word transitions modeled by a language model
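A minimal sketch of the per-character combination above: it scores lexicon words against a fixed segmentation with a letter bigram rather than the HMM network the talk describes, and p_letter_given_image and p_letter_bigram are hypothetical callables (assumed to return non-zero probabilities).

```python
import math

def word_score(segments, word, p_letter_given_image, p_letter_bigram):
    """Score a lexicon word against a sequence of character segments:
    sum over positions of log P(letter | character image)
    plus log P(letter | previous letter)."""
    if len(word) != len(segments):
        return float("-inf")          # only words matching the segmentation length
    score, prev = 0.0, "<s>"
    for seg, letter in zip(segments, word):
        score += math.log(p_letter_given_image(seg, letter))   # recognizer term
        score += math.log(p_letter_bigram(prev, letter))       # language-model term
        prev = letter
    return score

def recognize_word(segments, lexicon, p_letter_given_image, p_letter_bigram):
    """Pick the lexicon word with the best combined score."""
    return max(lexicon, key=lambda w: word_score(segments, w,
                                                 p_letter_given_image,
                                                 p_letter_bigram))
```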

Page 21:

Character and Alphabet Models in Indic Languages

N-grams can be at the character level or the alphabet level

Example of a character-level bigram

Example of an alphabet-level bigram

Page 22:

Character/Alphabet N-grams in Devanagari

126 alphabets and 5,538 characters
Bigram and trigram frequency counts can be obtained for them, e.g., from the Emille corpus

N-grams at character level give lower perplexity (less uncertainty)
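Perplexity here is the usual exponentiated negative average log-probability. The sketch below computes it for any bigram model, so the same text tokenized at the character level or at the alphabet level can be compared; the p_bigram callable is a hypothetical smoothed model returning non-zero probabilities.

```python
import math

def perplexity(token_sequences, p_bigram):
    """Perplexity = exp(-average log-probability per token) under a bigram model."""
    log_prob, n_tokens = 0.0, 0
    for seq in token_sequences:
        prev = "<s>"
        for tok in seq:
            log_prob += math.log(p_bigram(prev, tok))
            n_tokens += 1
            prev = tok
    return math.exp(-log_prob / n_tokens)
```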

Page 23:

Word Level Language Model for Devanagari

Determine the character path → Recognize characters → Use the language model

Page 24:

Syntactic Rules for Devanagari Words

Formation of characters is determined by rules, which can be used to reject character sequences. Example rules:

- A half-consonant cannot have a vowel modifier
- Only one vowel modifier is allowed on a consonant

[Figure: examples of illegal character sequences]
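As a rough illustration only, the sketch below checks these two rules at the Unicode level, treating a virama (U+094D) as marking a half-consonant and the common dependent vowel signs U+093E–U+094C as vowel modifiers; the actual rules operate on composed characters, and a full implementation would need the complete sign inventory.

```python
VIRAMA = "\u094D"                                   # DEVANAGARI SIGN VIRAMA
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}    # common dependent vowel signs

def violates_rules(text):
    """Check the two example rules on a Devanagari string:
    1. a half-consonant (consonant + virama) cannot take a vowel modifier,
    2. only one vowel modifier is allowed on a consonant."""
    for prev, cur in zip(text, text[1:]):
        if prev == VIRAMA and cur in MATRAS:
            return True        # vowel modifier attached to a half-consonant
        if prev in MATRAS and cur in MATRAS:
            return True        # two vowel modifiers in a row
    return False
```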

Page 25:

5. Conclusion

Statistical language models are found useful in all language technologies

Can reduce error rate by 1/3 or 1/2

Tight coupling of HMMs for text recognition and n-gram language models is possible

Indic languages have special challenges:
- Character and alphabet sets
- More structure in Indic languages

POS models (vibhakti) should work well for Sanskrit

Hybrid models are useful


Page 26:

References

i. J. Hull, A Computational Theory of Visual Word Recognition, PhD dissertation, CEDAR, 1988.
ii. R. Srihari, C. Ng, C. Baltus, Language Models in On-line Handwriting Recognition, 3rd IWFHR, 1993.
iii. G. Kim, Handwritten Word and Phrase Recognition, PhD dissertation, CEDAR, 1996.
iv. A. Vinciarelli, S. Bengio, H. Bunke, Off-line Recognition of Unconstrained Handwritten Sentences Using HMM and Statistical Language Models, 2003.
v. S. Srihari, H. Srinivasan, C. Huang, S. Shetty, Spotting Words in Latin, Devanagari and Arabic Scripts, Vivek, 2006.
vi. S. Srihari, R. Srihari, H. Srinivasan, S. Shetty, "On Scoring Handwritten Essays", IJCAI 2007.
vii. S. Kompalli, A Stochastic Framework for Font-Independent Devanagari OCR, PhD dissertation, CEDAR, 2007.