How did we get to where we are in DIA and OCR, and
where are we now?
George Nagy, Professor Emeritus, RPI
3/2/2012
How did we get to where we are in DIA and OCR, and where are we now?
George Nagy, DocLab†, RPI

Abstract
Some properties of pattern classifiers used for character recognition are presented. The improvements in accuracy that result from exploiting various types of context – language, data, shape – are illustrated. Further improvements are proposed through adaptation and field classification. The presentation is intended to be accessible, though not necessarily interesting, to those without a background in digital image analysis and optical character recognition.
Last week (SPIE - DRR)
• Interactive verification of table ground truth
• Calligraphic style recognition
• Asymptotic cost of document processing
Today
• Classifiers
  – pattern recognition and machine learning
• Context
  – Data
  – Language
• Style (shape context)
  – Intra-class (adaptation)
  – Inter-class (field classification)
Tomorrow
“Three problems that, if solved, would make an impact on the conversion of documents to symbolic electronic form”
1. Feature design for OCR
2. Integrated segmentation and recognition
3. Green interaction
CLASSIFIERS

A pattern is a vector whose elements represent numerical observations of some object.
Each object can be assigned to a category or class. The label of the pattern is the name of the object's category.

Classifiers are machines or algorithms.
Input: one or more patterns
Output: the label(s) of the pattern(s)
Example (average RGB, perimeter)

Object     Pattern              Label
[image]    (22, 75, 12, 33)     aspen
[image]    (14, 62, 24, 49)     sycamore
[image]    (40, 55, 17, 98)     maple
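A minimal sketch of a classifier in the sense just defined – input patterns, output labels. The feature vectors and tree labels are taken from the example above; the nearest-neighbor rule is one of the classifiers listed on a later slide, and the query pattern is invented for illustration.

```python
import math

# Labeled training patterns from the example slide:
# (average R, average G, average B, perimeter) -> tree species
training = [
    ((22, 75, 12, 33), "aspen"),
    ((14, 62, 24, 49), "sycamore"),
    ((40, 55, 17, 98), "maple"),
]

def nearest_neighbor(pattern):
    """Assign the label of the closest training pattern (Euclidean distance)."""
    return min(training, key=lambda pair: math.dist(pattern, pair[0]))[1]

print(nearest_neighbor((20, 70, 15, 35)))   # -> aspen
```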
Traditional open-loop OCR System

[Block diagram: the training set (patterns and labels) feeds parameter estimation, controlled by meta-parameters (e.g. regularization, estimators), which produces the classifier parameters. The operational data (features) are the patterns presented to the CLASSIFIER, which outputs labels and rejects; correction and reject entry yield the transcript.]
Classifiers and Features

• OCR classifiers operate on feature vectors that are either the pixels of the bitmaps to be classified, or values computed by a class-independent transformation of the pixels.
• The dimensionality of the feature space is usually lower than the dimensionality of the pixel space.
• Features should be invariant to aspects of the patterns that don't affect classification (position, size, stroke-width, color, noise).
Representation

[Figure: a feature space of two features (x1, x2) with training samples of two classes (O and X), equiprobability contours for each class, and the resulting decision boundary.]
Some classifiers

NEAREST NEIGHBOR
GAUSSIAN QUADRATIC
LINEAR
BAYES
MULTILAYER NEURAL NETWORK
SUPPORT VECTOR MACHINE
SIMPLE PERCEPTRON
Nonlinear vs. Linear classifiers

x and y are features; A and B are classes.

2-D quadratic classifier:
Iff ax + by + cx² + dy² + exy + f > 0, then (x, y) ∈ A

5-D linear classifier with s = x, t = y, u = x², v = y², w = xy:
Iff as + bt + cu + dv + ew + f > 0, then (x, y) ∈ A
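A small sketch (with arbitrary illustrative coefficients and test points, not from the talk) showing that the 2-D quadratic rule and the 5-D linear rule above give identical decisions once the features are lifted to (s, t, u, v, w) = (x, y, x², y², xy).

```python
def quadratic_rule(x, y, coeffs):
    """2-D quadratic classifier: a*x + b*y + c*x^2 + d*y^2 + e*x*y + f > 0 ?"""
    a, b, c, d, e, f = coeffs
    return a*x + b*y + c*x*x + d*y*y + e*x*y + f > 0

def linear_rule_5d(x, y, coeffs):
    """The same rule as a linear classifier on the lifted features (s, t, u, v, w)."""
    s, t, u, v, w = x, y, x*x, y*y, x*y
    a, b, c, d, e, f = coeffs
    return a*s + b*t + c*u + d*v + e*w + f > 0

coeffs = (1.0, -2.0, 0.5, 0.3, -0.1, -1.0)   # arbitrary illustrative coefficients
for x, y in [(0.2, 1.5), (-1.0, 0.4), (2.0, -2.0)]:
    assert quadratic_rule(x, y, coeffs) == linear_rule_5d(x, y, coeffs)
print("The quadratic and lifted-linear classifiers agree on all test points.")
```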
Quadratic to linear decision boundary

[Figure: in one dimension x, the classes X and O are separated by the quadratic rule ax² > c, which is not linear in x; after the transformation u = x, v = x², the same samples in the (u, v) plane are separated by the linear rule bv > c.]
SUPPORT VECTOR MACHINE (V. Vapnik)

SVMs, like many other classifiers, need only dot products to compute the decision boundary from the training samples – the "kernel trick".

Transformation only implicit: max min { f(vi·vj) } by QP.
Mercer's theorem: vi·vj = K(xi, xj), i.e., dot products in the high-dimensional space are computed via kernels in the low-dimensional space.

Resists over-training.

[Figure: kernel-induced transformation x → v = (y, z); classes 0 and X that are not linearly separable on the x axis become linearly separable in the (y, z) plane.]
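A small numerical check (illustrative, not from the talk) of the kernel trick: for the degree-2 polynomial kernel K(x, x') = (x·x')², the kernel value computed in the low-dimensional space equals the dot product of the explicitly lifted feature vectors.

```python
import numpy as np

def poly_kernel(x, xp):
    """Degree-2 polynomial kernel evaluated in the low-dimensional space."""
    return float(np.dot(x, xp)) ** 2

def lift(x):
    """Explicit feature map for the degree-2 kernel in 2-D:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(poly_kernel(x, xp), np.dot(lift(x), lift(xp)))
print("K(x, x') equals phi(x) . phi(x'):", poly_kernel(x, xp))
```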
Some classifier considerations
• Linear decision functions are easier to optimize
• Classes may not be linearly separable
• Estimation of decision boundary parameters in higher-dimensional spaces needs more training samples
• The training set is often not representative of the test set
• The size of the training set is always too small
• Outliers that don’t belong to any class cause problems
Bayes error

The Bayes error is unreachable without an infinite training set; it is zero unless some patterns belong to more than one class.
CLASSIFIER BIAS AND VARIANCE

[Figure: error versus number of training samples for a complex classifier and a simple classifier, each decomposed into bias and error variance.]

Classifier bias and variance don't add!

Any classifier can be shown to be better than any other.

[Figure: decision boundaries learned from training set #1 and training set #2, compared with the true decision boundary.]
CONTEXT

Assigned label, part-of-speech, meaning, value, shape, style, position, color, ... of other character/word/phrase patterns.
Data context

There is no February 30.
Legal amount = courtesy amount (bank checks)
Total = sum of values
Date of birth < date of marriage ≤ date of death

Data frames:
email addresses, postal addresses, telephone numbers, $ amounts, chemical formulas (C20H14O4), dates, units (m/s², m·s⁻², or m s⁻²), library catalog numbers (978-0-306-40615-7, LB2395.C65 1991, 823.914), copyrights, license plates, ...
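A small illustration (my own sketch, not from the talk) of two of the data-frame constraints above: rejecting the nonexistent date February 30 and checking the ISBN-13 checksum of 978-0-306-40615-7.

```python
import datetime

def valid_date(year, month, day):
    """Data context: reject impossible dates such as February 30."""
    try:
        datetime.date(year, month, day)
        return True
    except ValueError:
        return False

def valid_isbn13(isbn):
    """Data frame check: ISBN-13 checksum (alternating weights 1 and 3 sum to 0 mod 10)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

print(valid_date(2012, 2, 30))              # False: there is no February 30
print(valid_isbn13("978-0-306-40615-7"))    # True
```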
Language models

Letter n-grams
  unigrams: e (12.7%); z (0.07%)
  bigrams: th (3.9%), he, in, nt, ed, en (1.4%)
  trigrams: the (3.5%), and, ing, her, hat, his, you, are

Lexicon (stored as a trie, hash table, or n-grams)
  (20K-100K words – about 3× as many for Italian as for English)
  domain-specific: chemicals, drugs, family/first names, geographic gazetteers, business directories, abbreviations and acronyms, ...

Syntax – probabilistic rather than rule-based grammars
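A minimal sketch (with a toy word list of my own) of one of the lexicon representations mentioned above: a trie that answers whether a candidate OCR output is a valid word.

```python
def build_trie(words):
    """Store a lexicon as a trie (nested dicts); '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def in_lexicon(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

lexicon = build_trie(["the", "then", "this", "that"])   # toy lexicon
print(in_lexicon(lexicon, "then"))   # True
print(in_lexicon(lexicon, "thc"))    # False: a plausible OCR confusion of 'e' -> 'c'
```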
Post-processing

Hanson, Riseman, & Fisher 1975: contextual postprocessor (based on n-grams)

• Current OCR systems generate multiple candidates for each word.
• They use a lexicon (with or without frequencies) or n-grams to select the top candidate word.
• Language-independent OCR engines have higher error rates. Context beyond the word level is seldom used.
• Entropy of English text over letters, bigrams, ..., 100-grams:
  Shannon 1951: 4.16, 3.56, 3.3, ..., 1.3 bits/char
  Brown et al. 1992: < 1.75 bits/char (used 5 × 10⁸ tokens)
Probabilistic context algorithms

• Markov chains
• Hidden Markov Models (HMM)
• Markov Random Fields (MRF)
• Bayesian networks
• Cryptanalysis
Raviv 1967

No context:        P(C | x) ∝ P(x | C)
No features:       P(C | x) ∝ P(C)
0th-order Markov:  P(C | x) ∝ P(x | C) P(C)   (prior)
1st-order Markov:  P(C | x1, ..., xn) ∝ P(xn | C) P(C | x1, ..., xn-1)
(This is an iterative calculation.)

Raviv estimated letter bigram and trigram probabilities from 8 million characters.
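A toy sketch (my own, with made-up confusion likelihoods and bigram probabilities) of Raviv-style first-order contextual classification: the posterior for each character combines the shape likelihood P(x | C) with a bigram prior carried forward from the previous character.

```python
# Toy alphabet and illustrative (made-up) probabilities.
classes = ["h", "b"]
bigram = {("t", "h"): 0.9, ("t", "b"): 0.1}        # P(next letter | previous = 't')
likelihood = [{"h": 0.4, "b": 0.6}]                 # P(x | C) for one ambiguous glyph

def first_order_posterior(prev_char, likelihoods):
    """Iteratively combine shape likelihoods with a bigram language prior."""
    prior = {c: bigram[(prev_char, c)] for c in classes}
    for p_x_given_c in likelihoods:
        unnorm = {c: p_x_given_c[c] * prior[c] for c in classes}
        z = sum(unnorm.values())
        prior = {c: v / z for c, v in unnorm.items()}   # posterior becomes the next prior
    return prior

# The glyph alone favours 'b' (0.6 vs 0.4), but after a 't' the context favours 'h'.
print(first_order_posterior("t", likelihood))
```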
Estimating rare n-grams (like ...rta...) (smoothing, regularization)

The quick brown fox jumped over the tired dog

P(e) = 5/39 = 13%   (English text: 12.7%)
P(u) = 2/39 =  5%   (English text:  2.8%)
P(a) = 0/39 =  0%   (English text:  8.2%)

Laplace's Law of Succession, for k instances from N items of m classes:
P̂(x) = (k + 1) / (N + m)
P̂(a) = (0 + 1) / (39 + 26) = 1.5%

Next approximation: P̂(x) = (k + α) / (N + mα), which also sums to unity.
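A minimal implementation of the additive (Laplace) smoothing rule above, using the sample sentence; this is a sketch, and the exact counts depend on how the sentence is tokenized.

```python
from collections import Counter
import string

sentence = "The quick brown fox jumped over the tired dog"
letters = [c for c in sentence.lower() if c in string.ascii_lowercase]
counts = Counter(letters)
N = len(letters)           # total letters observed
m = 26                     # number of classes (letters of the alphabet)

def laplace(letter, alpha=1.0):
    """Additive smoothing: P(letter) = (k + alpha) / (N + m*alpha)."""
    return (counts[letter] + alpha) / (N + m * alpha)

for letter in "eua":
    print(letter,
          "raw:", round(counts[letter] / N, 3),
          "smoothed:", round(laplace(letter), 3))
# 'a' never occurs in the sentence, yet its smoothed probability is small but non-zero.
```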
HIDDEN MARKOV MODELS: A or B? (3 states, 2 features)

[Figure: two three-state HMMs, MODEL A and MODEL B, each drawn as a trellis of states (1, 2, 3) over time. The same observed feature sequence (0,1) (0,0) (0,1) (1,1) is scored under both models; the state-conditional feature probabilities shown are (0.2, 0.3), (0.3, 0.6), (0.7, 0.8) for one model and (0.3, 0.1), (0.5, 0.4), (0.4, 0.8) for the other.]

TRAINING: joint probabilities via the Baum-Welch forward-backward (EM) algorithm.
HMMs are situated between Markov chains and Bayesian networks.

Find the joint posterior probability of a sequence of feature vectors.
States are invisible; only features can be observed.
Transitions may be unidirectional or not, and may skip states or not.
States may represent partial characters, characters, words, or separators and merge-indicators.
Observable features are window-based vectors.
HMMs can be nested (e.g. character HMMs within a word HMM).
Training: estimate state-conditional feature distributions and state transition probabilities with the Baum-Welch forward-backward EM algorithm.
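A compact sketch (with made-up transition and emission probabilities, not the ones in the figure) of the forward algorithm behind "A or B?": it computes P(observation sequence | model) for each candidate model so the more likely one can be chosen.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | model) for an HMM with initial probs pi, transitions A,
    and per-state emission probabilities B[state][symbol]."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([1.0, 0.0, 0.0])                     # always start in state 1
obs = [0, 1, 1, 0]                                  # an observed symbol sequence

# Two candidate 3-state models with different (illustrative) parameters.
A_a = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
B_a = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
A_b = np.array([[0.3, 0.7, 0.0], [0.0, 0.4, 0.6], [0.0, 0.0, 1.0]])
B_b = np.array([[0.5, 0.5], [0.6, 0.4], [0.1, 0.9]])

la, lb = forward_likelihood(pi, A_a, B_a, obs), forward_likelihood(pi, A_b, B_b, obs)
print("Model A" if la > lb else "Model B", "explains the sequence better.")
```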
Widely used for script (handwriting) recognition. E.g.:

Thomas Packer, Oliver Nina, Ilya Raykhel, "Using a Hidden-Markov Model in Semi-Automatic Indexing of Historical Handwritten Records," Computer Science, Brigham Young University, FHTW 2009.
Markov Random Fields ( ~ 2000 for OCR)
• Maximize an energy function
• Models more complex relationships within cliques of features
• Applied so far mainly to on- and off-line Chinese and Japanese handwriting
• More widely used in computer vision
Bayesian Networks (J. Pearl, > 1980)
• Directed acyclic graphical model
• Models cliques of dependencies between variables
• Nodes are variables; edges represent dependencies
• Often designed by intuition instead of structure learning
• Learning and inference algorithms (message passing)
• Applied to document classification
Inter-pattern Class Dependence (Linguistic Context)

[Figure: the classes G E O R G E above their feature patterns; from shape alone the patterns read G E O N G E, and the linguistic dependence among the class labels corrects the misread letter.]
Inter-pattern Feature Dependence (order-independent: Style)

The shape of the 'w' depends on the shape of the 'v'.
OCR via decoding a substitution cipher

[Diagram: the bitmaps of an unknown sentence are clustered, yielding a cipher text of cluster indices (1 2 . 2 . 2 . . 2 5 2 . . 5 2 . 5). A DECODER combines the cipher text with a LANGUAGE MODEL (n-gram frequencies, lexicon, ...) to assign a letter to each cluster, e.g. 1 → a, 2 → n, 5 → e.]

[Nagy & Casey, 1966; Nagy & Seth, 1987; Ho & Nagy, 2000. Thanks to J.J. Hull]
Decoded text

GT:
chapter I 2 LOOMINGS
Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have ...

Decoded:
chapter i _ bee_inds
_all me ishmaels some years ago__never mind how long precisely __having little or no money in my purses and nothing particular to interest me on shores i thought i would sail about a little and see the watery part of the worlds it is a way i have ...
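A toy sketch (my own, far simpler than the cited decoders) of the idea: map cluster indices to letters so that the cluster frequencies best match English letter frequencies, i.e., break the substitution cipher with a unigram language model. The cipher sequence and the truncated frequency list are invented for illustration.

```python
from collections import Counter

# Cipher text: each symbol is a cluster index produced by clustering the bitmaps.
cipher = [1, 2, 3, 2, 4, 2, 3, 5, 2, 1, 2, 3, 5, 2, 3, 5]

# A truncated, illustrative unigram language model: letters by descending frequency.
letters_by_frequency = ["e", "t", "a", "o", "i", "n"]

# Assign the most frequent cluster to the most frequent letter, and so on.
clusters_by_frequency = [c for c, _ in Counter(cipher).most_common()]
key = dict(zip(clusters_by_frequency, letters_by_frequency))

print("".join(key[c] for c in cipher))
# Real decoders also use bigram/trigram frequencies and a lexicon to resolve ties.
```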
STYLE-CONSTRAINED CLASSIFICATION
(a.k.a. style-conscious or style-consistent classification)
Intra-class and inter-class style (a.k.a. weak and strong style)

[Figure: handwriting samples illustrating INTRA-CLASS style and INTER-CLASS style. With thanks to Prof. C.-L. Liu.]

Adaptation for intra-class style, field classification for inter-class style.
Adaptation and Style

[Figure: five training/test scenarios: (1) representative, (2) adaptable (long test fields), (3) discrete styles, (4) continuous styles (short test fields), (5) weakly constrained.]
Adaptive algorithms

cf.: stochastic approximation (Robbins-Monro algorithm), self-training, self-adaptation, self-correction, unsupervised / semi-supervised learning, transfer learning, inductive transfer, co-training, decision-directed learning, ...
Traditional open-loop OCR System

[Block diagram, repeated for comparison: training set (patterns and labels) → parameter estimation (with meta-parameters, e.g. regularization, estimators) → classifier parameters → CLASSIFIER; operational data (bitmaps) → CLASSIFIER → labels and rejects → correction / reject entry → transcript.]
Supervised learning

Generic OCR system that makes use of post-processed rejects and errors.

[Block diagram: as in the open-loop system, but the keyboarded labels of rejects and errors from correction / reject entry are added to the training set for parameter estimation.]
Adaptation (Jain, PAMI '00: "decision-directed classifier")

[Block diagram: as above, but the classifier-assigned labels of the operational data are fed back to parameter estimation.]

Field estimation, singlet classification.
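A bare-bones sketch (my own, with a nearest-mean classifier and synthetic data) of decision-directed adaptation as in the diagram: the classifier labels the operational data, and those self-assigned labels are then used to re-estimate the class parameters for the new font or writer.

```python
import numpy as np

def classify(x, means):
    """Nearest-mean singlet classifier."""
    return int(np.argmin([np.linalg.norm(x - m) for m in means]))

def adapt(means, operational_data, rounds=3):
    """Decision-directed adaptation: re-estimate class means from
    classifier-assigned labels of the (unlabeled) operational data."""
    means = [m.copy() for m in means]
    for _ in range(rounds):
        labels = [classify(x, means) for x in operational_data]
        for c in range(len(means)):
            members = [x for x, y in zip(operational_data, labels) if y == c]
            if members:
                means[c] = np.mean(members, axis=0)
    return means

rng = np.random.default_rng(1)
trained_means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]     # from the training set
# Operational data from a "new font": both classes shifted by the same offset.
new_font = [m + np.array([1.0, 0.5]) + 0.3 * rng.normal(size=2)
            for m in trained_means for _ in range(50)]
print("Adapted class means:", np.round(adapt(trained_means, new_font), 2))
```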
Self-corrective recognition (@IBM, IEEE IT 1966) (hardwired features and reference comparisons)

[Block diagram: SOURCE DOCUMENT → SCANNER → FEATURE EXTRACTOR → CATEGORIZER. The categorizer starts from the INITIAL REFERENCES; its accepted outputs feed a REFERENCE GENERATOR, and the resulting NEW REFERENCES are fed back to the categorizer; outputs are accepted or rejected.]
Results: self-corrective recognition (Shelton & Nagy 1966)

Training set: 9 fonts, 500 characters/font, U/C
Test set: 12 fonts, 1500 characters/font, U/C (22,500 patterns)
96 n-tuple features, ternary reference vectors

                         Error   Reject
Initial:                 3.5%    15.2%
After self-correction:   0.7%     3.7%
Self-corrective classification

[Figure: samples of '1' and '7' in a feature plane; the original boundary of the omnifont classifier shifts as the classifier adapts to a single font.]
Baird skeptical! Results (DR&R 1994)

100 fonts, 80 symbols each, from Baird's defect model (6,400,000 characters)

Size (pt)   Error reduction   % fonts improved   Best     Worst
6           × 1.4             100                × 4      × 1.0
10          × 2.5              93                × 11     × 0.8
12          × 4.4              98                × 34     × 0.9
16          × 7.2              98                × 141    × 0.8

His conclusion: a good investment – large potential for gain, low downside risk.
Results: adapting both means and variances (Harsha Veeramachaneni 2003, IJDAR 2004)

NIST hand-printed digit classes, with 50 "Hitachi features"

Train      Test   % Error before   Adapt means   Adapt variances
SD3        SD3    1.1              0.7           0.6
SD3        SD7    5.0              2.6           2.2
SD7        SD3    1.7              0.9           0.8
SD7        SD7    2.4              1.6           1.7
SD3+SD7    SD3    0.9              0.6           0.6
SD3+SD7    SD7    3.2              1.9           1.8
Writer Adaptation by Style Transfer for many-class problems (Chinese script)

• Patterns of a single-writer field are transformed to a style-free space using a style transfer matrix (STM).
• Supervised adaptation: learning the STM from labeled samples.
• Unsupervised adaptation: learning the STM from test samples.

Zhang & Liu, "Style transfer matrix learning for writer adaptation," CVPR, 2011

• STM formulation
  – source point set
  – target point set
  – objective and solution
• Application to writer adaptation
  – source point set: writer-specific data
  – target point set: parameters in the basic classifier
Field classifiers and Adaptive classifiers

• A field classifier classifies consecutive finite-length fields of the test set.
  – Subsequent patterns do not benefit from knowledge gained in earlier patterns. Exploits inter-class style.
• An adaptive classifier is a field classifier with a field that encompasses an entire (isogenous) test set.
  – The last pattern benefits from information from the first pattern, and the first pattern benefits from information from the last pattern. Exploits only intra-class style.
Field Classifier for Discrete Styles (Prateek Sarkar, ICPR 2000, PAMI '05)

Optimal for multimodal feature distributions from weighted Gaussians.

m-dimensional feature vector, n-pattern field, s styles
field-class c* = (c1, c2, ..., cn) = f(style means and covariances)

Parameters estimated via Expectation Maximization.
For a field of n characters from m classes there are m^n ordered and (m+n-1)! / [n!(m-1)!] unordered field classes.
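A quick, illustrative check of the field-class counts above for 10 digit classes: a field of n = 2 digits has 10² = 100 ordered field classes but only (10+2-1)!/(2!·9!) = 55 unordered ones, which is why order-independent style constraints need fewer parameters.

```python
from math import comb, factorial

def field_class_counts(m, n):
    """Ordered (m^n) and unordered ((m+n-1)! / (n!(m-1)!)) field-class counts
    for a field of n patterns drawn from m classes."""
    ordered = m ** n
    unordered = factorial(m + n - 1) // (factorial(n) * factorial(m - 1))
    assert unordered == comb(m + n - 1, n)      # same quantity in binomial form
    return ordered, unordered

for n in (2, 4):
    print(f"10 classes, field length {n}:", field_class_counts(10, n))
# 10 classes, field length 2: (100, 55)
# 10 classes, field length 4: (10000, 715)
```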
Singlet and Style-constrained field classification boundaries (2 classes, 2 styles, 1 Gaussian feature per pattern)

[Figure: decision regions in the (first pattern, second pattern) feature plane, comparing the singlet and style-constrained field classification boundaries.]
Top-style: a computable approximation (with style-unsupervised estimation via EM)

[Figure: results comparing the style-conscious, top-style, and singlet classifiers.]

• Experiments
  – Digits of six fonts
  – Moment features
  – Training fields of L = 13, 14,430 patterns
  – Test fields of L = 2, 4
Field Classifier for Continuous Styles (H. Veeramachaneni, IJDAR '04, PAMI '05, '07)

With Gaussian features, the feature distribution for any field length can be computed from the class means and the class-pair-conditional feature cross-covariance matrices.
Style-conscious quadratic discriminant field (SQDF) classifier

• Class-style means are normally distributed about the class means.
• Σk is the singlet class-conditional feature covariance matrix.
• Cij is the class-pair-conditional cross-covariance matrix, estimated from pairs of same-source singlet class means.
• SQDF approximates the optimal discrete-style field classifier well when the inter-class distance >> the style variation.
• Inter-class style is order-independent: P(5 7 9 | [x1 x2 x3]) = P(7 5 9 | [x2 x1 x3])
• + SQDF avoids the Expectation Maximization of the discrete-style method.
• – Supralinear in the number of features and classes, because the size of the N-pattern field covariance matrix is (N×d)² and for M classes there are (M+N-1)! / [N!(M-1)!] matrices.
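A sketch (with arbitrary dimensions and placeholder matrices, not the estimator from the paper) of the size bookkeeping in the last bullet: the covariance of an N-pattern field with d features per pattern is an (N·d) × (N·d) block matrix whose diagonal blocks are singlet class covariances and whose off-diagonal blocks are class-pair cross-covariances.

```python
import numpy as np

def field_covariance(sigmas, cross):
    """Assemble the (N*d) x (N*d) field covariance from per-pattern class
    covariances sigmas[i] (d x d) and cross-covariances cross[(i, j)] (d x d)."""
    N, d = len(sigmas), sigmas[0].shape[0]
    K = np.zeros((N * d, N * d))
    for i in range(N):
        for j in range(N):
            if i == j:
                block = sigmas[i]
            else:
                block = cross[(i, j)] if (i, j) in cross else cross[(j, i)].T
            K[i*d:(i+1)*d, j*d:(j+1)*d] = block
    return K

d, N = 2, 3
sigmas = [np.eye(d) for _ in range(N)]                      # placeholder singlet covariances
cross = {(i, j): 0.3 * np.eye(d) for i in range(N) for j in range(i + 1, N)}
print(field_covariance(sigmas, cross).shape)   # (6, 6): an (N*d) x (N*d) matrix
```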
Results: style-constrained classification – short fields

Field error rate (%)
              Field length L = 2          Field length L = 5
Test data     w/o style    with style     w/o style    with style
SD3           1.4          1.3            3.0          2.5
SD7           2.7          2.4            5.3          4.5

Continuous style-constrained classifier, trained on ~17,000 characters and tested on ~17,000 characters; 25 top principal components of the "Hitachi" blurred directional features.
Field-trained (i.e. word) classification vs. style-constrained classification

Field length = 4

Training set for field classification:
0000, 0001, 0010, ..., 9998, 9999   (10⁴ classes)

Training set for style classification:
00, 01, 02, ..., 98, 99   (10² classes, with order)

Classifier parameters for longer field lengths are computed from the pair parameters (because Gaussian variables are defined completely by their covariances).
Style context versus Linguistic context

Two digits in an isogenous field: ...... 5 6 .....
with feature vectors x, y and class labels 5, 6

Language context:   P(x y | 5, 6)        P(y x | 6, 5)
Intra-class style:  P(x y | 5, 5) ≠ P(x | 5) P(y | 5)
Inter-class style:  P(x y | 5, 6) = P(y x | 6, 5) ≠ P(x | 5) P(y | 6)
Weakly-constrained data

[Figure: training and test samples; 3 classes, 4 multi-class styles.]

Given p(x), find p(y), where y = g(x).
Recommendations for OCR systems that improve with use
Never let the machine rest: design it so that it puts every coffee-break to good use.
Don’t throw away edits (corrected labels): use them.
Classify style-consistent fields, not characters: adapt on long fields, exploit inter-class style in short fields.
Use order rather than position.
Let the machine guess: lazy decisions.
Make use of all possible contexts: style, language, shape, layout, structure, and function.
Please help to increase computer literacy!
Classification of style-constrained pattern-fields

Prateek Sarkar                    George Nagy
[email protected]                 [email protected]
Rensselaer Polytechnic Institute, U.S.A.
Style is a manner of rendering patterns. Patterns are rendered in many different styles.
Style consistency constraint: Patterns in a field are rendered in the same style.
A field is a group of patterns with a common origin (isogenous patterns). [Figure: a field of 'a' patterns rendered in the same style.]
[Figure: feature plane (x1, x2) showing two-pattern fields drawn from classes 1 and 7 in two styles; fields such as 11, 17, 71, 77 cluster by style, with inter-class separation d.]

Modeling style consistency can help improve classification accuracy.

% field error
d    Style    Singlet
0    74.7     74.7
1    57.2     60.3
2    38.5     45.1
3    22.3     28.7
4    10.5     14.8
5     3.9      6.0
6     1.1      1.9
Singlet model:
p(x1 x2 | 17) = p(x1 | 1) × p(x2 | 7)
  = [ .5 p(x1 | 1A) + .5 p(x1 | 1B) ] × [ .5 p(x2 | 7A) + .5 p(x2 | 7B) ]
  = .25 [ p(x1 | 1A) p(x2 | 7A) + p(x1 | 1A) p(x2 | 7B) + p(x1 | 1B) p(x2 | 7A) + p(x1 | 1B) p(x2 | 7B) ]

Style consistency model:
p(x1 x2 | 17) = .5 p(x1 x2 | 1A 7A) + .5 p(x1 x2 | 1B 7B)
  = .5 [ p(x1 | 1A) p(x2 | 7A) + p(x1 | 1B) p(x2 | 7B) ]

Classifier: (c1*, c2*) = arg max over all (c1 c2) of p(x1 x2 | c1 c2) · P[c1 c2]

(Here A and B denote the two styles, each with prior probability .5.)
Unsupervised style estimation: The patterns in a field and their class labels are observed, but the style of the field is unobserved. Apply the EM algorithm to estimate the model parameters.

Simulation of a two-class, two-style problem with unit-variance Gaussian distributions.

Application to recognition of digit fields: a 15-25% relative reduction in errors was observed in laboratory experiments on handprinted digit recognition. The improvement in accuracy was greater for longer fields.
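A small simulation in the spirit of the two-class, two-style, unit-variance Gaussian setup described above (the class/style means are my own choices, not the poster's): it compares the singlet rule with the style-consistent field rule on fields of two patterns.

```python
import math, random
from itertools import product

means = {("1", "A"): 0.0, ("7", "A"): 2.0,    # style A class means (assumed values)
         ("1", "B"): 2.0, ("7", "B"): 4.0}    # style B: both classes shifted by 2
classes, styles = ["1", "7"], ["A", "B"]

def g(x, mu):
    """Unit-variance Gaussian density (unnormalized; constants cancel)."""
    return math.exp(-0.5 * (x - mu) ** 2)

def singlet_rule(x):
    """Classify one pattern with the style marginalized out."""
    return max(classes, key=lambda c: sum(g(x, means[(c, s)]) for s in styles))

def field_rule(x1, x2):
    """Classify a 2-pattern field assuming both patterns share one unknown style."""
    score = lambda c1, c2: sum(g(x1, means[(c1, s)]) * g(x2, means[(c2, s)]) for s in styles)
    return max(product(classes, repeat=2), key=lambda cc: score(*cc))

random.seed(0)
err_singlet = err_field = 0
trials = 2000
for _ in range(trials):
    c1, c2 = random.choice(classes), random.choice(classes)
    s = random.choice(styles)                      # style is consistent within the field
    x1, x2 = random.gauss(means[(c1, s)], 1), random.gauss(means[(c2, s)], 1)
    err_singlet += (singlet_rule(x1) != c1) + (singlet_rule(x2) != c2)
    err_field += sum(a != b for a, b in zip(field_rule(x1, x2), (c1, c2)))
print("singlet error:", err_singlet / (2 * trials), " field error:", err_field / (2 * trials))
```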
Reference: Prateek Sarkar. Style consistency in pattern fields. PhD thesis, Rensselaer Polytechnic Institute, U.S.A., 2000.
Prateek Sarkar, August 2000 for ICPR, Barcelona, September 2000