
How? gn 1

How did we get to where we are in DIA and OCR, and

where are we now?

George Nagy
Professor Emeritus, RPI

3/2/2012

How? gn 2

How did we get to where we are in DIA and OCR, and where are we now?
George Nagy, DocLab†, RPI

Abstract
Some properties of pattern classifiers used for character recognition are presented. The improvements in accuracy that result from exploiting various types of context – language, data, shape – are illustrated. Further improvements are proposed through adaptation and field classification. The presentation is intended to be accessible, though not necessarily interesting, to those without a background in digital image analysis and optical character recognition.

3/2/2012

How? gn 3

Last week (SPIE - DRR)

• Interactive verification of table ground truth
• Calligraphic style recognition
• Asymptotic cost of document processing

3/2/2012

How? gn 4

Today

• Classifiers
  – pattern recognition and machine learning
• Context
  – Data
  – Language
• Style (shape context)
  – Intra-class (adaptation)
  – Inter-class (field classification)

3/2/2012

How? gn 5

Tomorrow

“Three problems that, if solved, would make an impact on the conversion of documents to symbolic electronic form”

1. Feature design for OCR
2. Integrated segmentation and recognition
3. Green interaction

3/2/2012

How? gn 6

The beginning (almost): 1953

3/2/2012

How? gn 7

Shepard’s 9 features (slits on a Nipkow disk)

3/2/2012

How? gn 8

CLASSIFIERS

A pattern is a vector whose elements represent numerical observations of some object.

Each object can be assigned to a category or class. The label of the pattern is the name of the object's category.

Classifiers are machines or algorithms.
Input: one or more patterns
Output: label(s) of the pattern(s)

3/2/2012

Example (average RGB, perimeter)

3/2/2012 How? gn 9

Object     Pattern            Label
[image]    (22, 75, 12, 33)   aspen
[image]    (14, 62, 24, 49)   sycamore
[image]    (40, 55, 17, 98)   maple
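To make the classifier definition concrete, here is a minimal sketch (mine, not from the talk) of a nearest-neighbor classifier over (average R, average G, average B, perimeter) vectors; the stored examples are the illustrative values above and the query vector is made up.

import math

# Minimal nearest-neighbor classifier over (avg R, avg G, avg B, perimeter) vectors.
training = [
    ((22, 75, 12, 33), "aspen"),
    ((14, 62, 24, 49), "sycamore"),
    ((40, 55, 17, 98), "maple"),
]

def classify(pattern):
    """Return the label of the nearest stored pattern (Euclidean distance)."""
    return min(training, key=lambda item: math.dist(item[0], pattern))[1]

print(classify((20, 70, 15, 35)))   # -> 'aspen' (closest stored pattern)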

How? gn 10

Traditional open-loop OCR System

3/2/2012

[Block diagram: the training set supplies patterns and labels to parameter estimation (governed by meta-parameters, e.g. regularization, estimators), which produces the classifier parameters; operational data (features) supply patterns to the CLASSIFIER, whose output labels form the transcript; rejects go to correction / reject entry.]

Classifiers and Features

• OCR classifiers operate on feature vectors that are either the pixels of the bitmaps to be classified, or values computed by a class-independent transformation of the pixels.
• The dimensionality of the feature space is usually lower than the dimensionality of the pixel space.
• Features should be invariant to aspects of the patterns that don’t affect classification (position, size, stroke-width, color, noise).

3/2/2012 How? gn 11

How? gn 12

Representation

3/2/2012

[Figure: samples of two classes (X and O) plotted in a feature space of two features, x1 and x2, with equiprobability contours and the decision boundary between the classes.]

How? gn 13

Some classifiers

3/2/2012

Nearest neighbor, Gaussian quadratic, linear, Bayes, multilayer neural network, support vector machine, simple perceptron

Nonlinear vs. Linear classifiers

x and y are features; A and B are classes

2-D quadratic classifier:
Iff ax + by + cx² + dy² + exy + f > 0, then (x, y) ∈ A

5-D linear classifier with s = x, t = y, u = x², v = y², w = xy:
Iff as + bt + cu + dv + ew + f > 0, then (x, y) ∈ A

3/2/2012 How? gn 14

Quadratic to linear decision boundary

3/2/2012 How? gn 15

[Figure: on the x axis, the O samples lie between two clusters of X samples, so the boundary is quadratic: ax² > c. The transformation u = x, v = x² maps the samples to a plane in which the same boundary is linear: bv > c.]
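As a quick illustration of this feature-space trick (a sketch of mine, not part of the talk; the sample values are made up), expanding x into (x, x²) lets a linear classifier realize a quadratic boundary:

import numpy as np

# 1-D toy data in the spirit of the slide: class O lies between two clusters of class X,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.5, -0.3, 0.2, 2.4, 3.1])
y = np.array([1, 1, -1, -1, 1, 1])               # +1 = X, -1 = O

# Map each sample to (u, v) = (x, x^2); the quadratic boundary a*x^2 > c becomes linear in v.
U = np.column_stack([x, x**2, np.ones_like(x)])  # append 1 for the bias term

w = np.zeros(3)
for _ in range(1000):                            # simple perceptron updates
    for ui, yi in zip(U, y):
        if yi * (w @ ui) <= 0:
            w += yi * ui

print(np.sign(U @ w))                            # all training samples correctly separated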

SUPPORT VECTOR MACHINE (V. Vapnik)

SVMs, like many other classifiers, need only dot products to compute the decision boundary from the training samples, via the “kernel trick”.

The transformation is only implicit: max min { f(vi·vj) } by quadratic programming. Mercer’s theorem: vi·vj = K(xi, xj), i.e., dot products in the high-dimensional space are computed via kernels in the low-dimensional space. Resists over-training.

[Figure: a kernel-induced transformation x → v = (y, z) maps the 1-D arrangement 0 X X 0, which is not linearly separable in x, into a (y, z) plane where the X's and 0's are separated by a line.]

3/2/2012 How? gn 16
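A minimal numeric sketch of the kernel trick (mine, not from the talk): for 2-D inputs, the polynomial kernel K(x, z) = (x·z)² equals the ordinary dot product after the explicit quadratic feature map φ(x) = (x1², √2·x1·x2, x2²), so a classifier that needs only dot products never has to form φ explicitly.

import numpy as np

def phi(x):
    """Explicit degree-2 monomial map for a 2-D vector."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # dot product in the 3-D feature space
print(poly_kernel(x, z))        # same value, computed without leaving 2-D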

Some classifier considerations

• Linear decision functions are easier to optimize

• Classes may not be linearly separable

• Estimation of decision boundary parameters in higher-dimensional spaces needs more training samples

• The training set is often not representative of the test set

• The size of the training set is always too small

• Outliers that don’t belong to any class cause problems

3/2/2012 How? gn 17

Bayes error

The Bayes error is unreachable without an infinite training set; it is zero unless some patterns belong to more than one class.

3/2/2012 How? gn 18

CLASSIFIER BIAS AND VARIANCE

[Figure: error, bias, and error variance plotted against the number of training samples for a complex classifier and for a simple classifier. Classifier bias and variance don't add! A second figure shows the true decision boundary and the boundaries learned from training set #1 and training set #2.]

Any classifier can be shown to be better than any other.

3/2/2012 How? gn 19

How? gn 20

CONTEXT

Assigned label, part-of-speech, meaning, value, shape, style, position, color, ... of other character/word/phrase patterns

3/2/2012

3/2/2012 How? gn 21

Data context

There is no February 30
Legal amount = courtesy amount (bank checks)
Total = sum of values
Date of birth < date of marriage ≤ date of death

Data frames:
email addresses, postal addresses, telephone numbers, $amounts, chemical formulas (C20H14O4), dates, units (m/s², m·s⁻², or m s⁻²), library catalog numbers (978-0-306-40615-7, LB2395.C65 1991, 823.914), copyrights, license plates, ...
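Data frames give the recognizer hard constraints to check. As one illustrative sketch (mine, not the talk's): an ISBN-13 such as 978-0-306-40615-7 carries a check digit, so a candidate reading that violates the checksum can be rejected or corrected.

def isbn13_ok(isbn: str) -> bool:
    """Validate an ISBN-13 check digit (weights alternate 1, 3; total must be 0 mod 10)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

print(isbn13_ok("978-0-306-40615-7"))   # True  - the example from the slide
print(isbn13_ok("978-0-306-40615-1"))   # False - a single misread digit breaks the checksum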

3/2/2012 How? gn 22

Language models

Letter n-grams
  unigrams: e (12.7%); z (0.07%)
  bigrams: th (3.9%), he, in, nt, ed, en (1.4%)
  trigrams: the (3.5%), and, ing, her, hat, his, you, are

Lexicon (stored as a trie, hash table, or n-grams)
  (20K–100K words; about 3× as many for Italian as for English)
  domain-specific: chemicals, drugs, family/first names, geographic gazetteers, business directories, abbreviations and acronyms, …

Syntax – probabilistic rather than rule-based grammars
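A tiny sketch (not from the talk) of how such letter n-gram statistics are gathered, here unigram and bigram counts from a sample string; a real model would use a large corpus.

from collections import Counter

def letter_ngrams(text: str, n: int) -> Counter:
    """Count letter n-grams in lowercased text, ignoring non-alphabetic characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))

corpus = "the quick brown fox jumped over the tired dog"   # stand-in for a large corpus
unigrams = letter_ngrams(corpus, 1)
bigrams = letter_ngrams(corpus, 2)

total = sum(unigrams.values())
print(unigrams["e"] / total)          # relative frequency of 'e' in this tiny sample
print(bigrams.most_common(3))         # the most frequent bigrams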

3/2/2012 How? gn 23

Post-processing

Hanson, Riseman & Fisher 1975: contextual postprocessor (based on n-grams)

• Current OCR systems generate multiple candidates for each word.
• They use a lexicon (with or without frequencies) or n-grams to select the top candidate word.
• Language-independent OCR engines have higher error rates. Context beyond word level is seldom used.
• Entropy of English text over letters, bigrams, 100-grams, ...
  Shannon 1951: 4.16, 3.56, 3.3, ..., 1.3 bits/char
  Brown et al. 1992: < 1.75 bits/char (used 5 × 10⁸ tokens)

Probabilistic context algorithms
• Markov chains
• Hidden Markov Models (HMM)
• Markov Random Fields (MRF)
• Bayesian networks
• Cryptanalysis

3/2/2012 How? gn 24

Raviv 1967

No context: P(C|x) ~ P(x|C)
No features: P(C|x) ~ P(C)
0th order Markov: P(C|x) ~ P(x|C) P(C)  (prior)
1st order: P(C|x1, ..., xn) ~ P(xn|C) P(C|x1, ..., xn-1)
(This is an iterative calculation.)

Raviv estimated letter bigram and trigram probabilities from 8 million characters.
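A sketch (mine) of the kind of iterative calculation involved: the posterior over the current character class is proportional to its feature likelihood times a prediction propagated from the previous posterior through a bigram transition model. All numbers below are made up.

import numpy as np

classes = ["a", "b", "c"]
transition = np.array([[0.6, 0.3, 0.1],      # P(next class | previous class), rows sum to 1
                       [0.2, 0.5, 0.3],
                       [0.3, 0.3, 0.4]])
prior = np.array([0.5, 0.3, 0.2])

def update(posterior_prev, likelihood):
    """One step of the recursion: predict with the bigram model, weight by P(x_n | C)."""
    predicted = transition.T @ posterior_prev
    posterior = likelihood * predicted
    return posterior / posterior.sum()

# Likelihoods P(x_n | C) delivered by the shape classifier for three consecutive characters.
likelihoods = [np.array([0.7, 0.2, 0.1]),
               np.array([0.3, 0.4, 0.3]),
               np.array([0.1, 0.1, 0.8])]

posterior = prior * likelihoods[0]
posterior /= posterior.sum()
for lik in likelihoods[1:]:
    posterior = update(posterior, lik)
print(dict(zip(classes, posterior.round(3))))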

3/2/2012 How? gn 25

Error–Reject curves show improvement in classification accuracy

3/2/2012

How? gn 26

Estimating rare n-grams (like ...rta...)
(smoothing, regularization)

The quick brown fox jumped over the tired dog
P(e) = 5/39 = 13%   (12.7%)
P(u) = 2/39 = 5%    (2.8%)
P(a) = 0/39 = 0%    (8.2%)

Laplace’s Law of Succession for k instances from N items of m classes:
P^(x) = (k+1) / (N+m)
P^(a) = (0+1) / (39+26) = 1.5%

Next approximation: (k+α) / (N+mα), which also sums to unity
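A small sketch (mine) applying the slide's add-one (Laplace) estimate to the letter counts of the sample sentence; only the smoothing formula itself comes from the slide.

from collections import Counter
from string import ascii_lowercase

sentence = "The quick brown fox jumped over the tired dog"
counts = Counter(c for c in sentence.lower() if c.isalpha())
N = sum(counts.values())          # total letters observed
m = len(ascii_lowercase)          # 26 letter classes

def laplace(k, N, m):
    """Add-one smoothed probability estimate: (k + 1) / (N + m)."""
    return (k + 1) / (N + m)

for letter in "eua":
    print(letter, counts[letter] / N, laplace(counts[letter], N, m))
# 'a' never occurs in the sentence, yet gets a small nonzero smoothed probability.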

3/2/2012 How? gn 27

HIDDEN MARKOV MODELS: A or B?  (3 states, 2 features)

TRAINING: joint probabilities via Baum-Welch forward-backward (EM)

[Figure: two 3-state models, MODEL A and MODEL B, each emitting two-element feature vectors; a trellis of states (1, 2, 3) against time shows observed sequences such as (0,1) (0,0) (0,1) (1,1) and (0.2, 0.3) (0.3, 0.6) (0.7, 0.8).]

3/2/2012 How? gn 28

HMM situated between Markov chains & Bayesian networks

Find the joint posterior probability of a sequence of feature vectors.
States are invisible; only the features can be observed.
Transitions may be unidirectional or not, skip or not.
States may represent partial characters, characters, words, or separators and merge indicators.
Observable features are window-based vectors.
HMMs can be nested (e.g. a character HMM within a word HMM).
Training: estimate state-conditional feature distributions and state transition probabilities via the (complex) Baum-Welch forward-backward EM algorithm.
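A minimal sketch (mine) of how the "Model A or Model B?" question from the earlier slide is answered once both HMMs are trained: the forward algorithm gives each model's likelihood for the observed sequence, and the larger one wins. All parameters below are invented, and the observations are reduced to a single binary symbol for brevity.

import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(observation sequence | HMM) via the forward algorithm.
    pi: initial state probs, A: state transition matrix, B: per-state observation probs."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two 3-state models over a binary observation symbol (invented parameters).
pi = np.array([1.0, 0.0, 0.0])
A_trans = np.array([[0.7, 0.3, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
A_emit  = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])   # rows: states, cols: symbols 0/1
B_trans = np.array([[0.4, 0.6, 0.0], [0.0, 0.4, 0.6], [0.0, 0.0, 1.0]])
B_emit  = np.array([[0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])

obs = [0, 0, 1, 1]                       # the observed feature sequence
pa = forward_likelihood(pi, A_trans, A_emit, obs)
pb = forward_likelihood(pi, B_trans, B_emit, obs)
print("Model", "A" if pa > pb else "B")  # pick the model with the higher likelihood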

3/2/2012 How? gn 29

Widely used for script

E.g.:

Using a Hidden-Markov Model in Semi-Automatic Indexing of Historical Handwritten Records
Thomas Packer, Oliver Nina, Ilya Raykhel
Computer Science, Brigham Young University, FHTW 2009

3/2/2012 How? gn 30

Markov Random Fields ( ~ 2000 for OCR)

• Maximize an energy function

• Models more complex relationships within cliques of features

• Applied so far mainly to on- and off-line Chinese and Japanese handwriting

• More widely used in computer vision

3/2/2012 How? gn 31

Bayesian Networks (J. Pearl, > 1980)

• Directed acyclic graphical model
• Models cliques of dependencies between variables
• Nodes are variables, edges represent dependencies
• Often designed by intuition instead of structure learning
• Learning and inference algorithms (message passing)

• Applied to document classification

3/2/2012 How? gn 32

How? gn 33

Inter-pattern Class Dependence (Linguistic Context)

3/2/2012

[Figure: class labels (G E O R G E) linked to the feature vectors of the individual characters; the sequence G E O N G E illustrates an error that class dependence (linguistic context) can correct.]

How? gn 34

Inter-pattern Class-Feature Dependence

3/2/2012

How? gn 35

Inter-pattern Feature Dependence (Order-dependent: Ligatures, Co-articulation)

3/2/2012

36

Inter-pattern Feature Dependence (order-independent: Style)

3/2/2012 How? gn

The shape of the ‘w’ depends on the shape of the ‘v’

3/2/2012 How? gn 37

OCR via decoding a substitution cipher

Cluster the bitmaps and replace each bitmap by its cluster ID.
Cipher text: 1 2 . 2 . 2 . . 2 5 2 . . 5 2 . 5   (an unknown sentence)

A DECODER matches the cipher text against a LANGUAGE MODEL (n-gram frequencies, lexicon, …) and recovers the mapping, e.g. 1 → a, 2 → n, 5 → e.

[Nagy & Casey 1966; Nagy & Seth 1987; Ho & Nagy 2000. Thanks to J.J. Hull]

3/2/2012 How? gn 38
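A toy sketch of the idea (mine, not the cited systems): assign letters to cluster IDs by matching the clusters' frequency ranks against English unigram frequency ranks. Real decoders use n-grams and a lexicon, but even this crude version conveys the principle.

from collections import Counter

# Frequency-rank decoding of a substitution cipher over cluster IDs (toy illustration).
english_by_frequency = "etaoinshrdlcumwfgypbvkjxqz"   # most to least frequent letters

def decode(cipher_ids):
    """Map each cluster ID to a letter so that frequency ranks agree."""
    ranked_ids = [cid for cid, _ in Counter(cipher_ids).most_common()]
    mapping = {cid: english_by_frequency[rank] for rank, cid in enumerate(ranked_ids)}
    return "".join(mapping[cid] for cid in cipher_ids)

# Cluster-ID sequence produced by clustering the bitmaps of a short text (made up).
cipher = [3, 1, 4, 4, 2, 1, 5, 2, 3, 1, 4]
print(decode(cipher))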

An unusual typeface

[Ho, Nagy 2000]

3/2/2012 How? gn 39

Text printed with Spitz glyphs

3/2/2012 How? gn 40

Decoded text

GT:
chapter I 2 LOOMINGS
Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have ...

Decoded:
chapter i _ bee_inds
_all me ishmaels some years ago__never mind how long precisely __having little or no money in my purses and nothing particular to interest me on shores i thought i would sail about a little and see the watery part of the worlds it is a way i have ...

How? gn 41

STYLE-CONSTRAINED CLASSIFICATION

(AKA style-conscious or style-consistent classification)

3/2/2012

How? gn 42

Inter-pattern Feature Dependence (Style)

3/2/2012

How? gn 43

Style-constrained field classification

3/2/2012

How? gn 44

Intra-class and inter-class style (aka weak and strong style)

3/2/2012

With thanks to Prof. C-L Liu

INTRA-CLASS INTER-CLASS

Adaptation for intra-class style, field classification for inter-class style

How? gn 45

Adaptation and Style

3/2/2012

[Figure: training and test distributions compared under five scenarios: (1) representative, (2) adaptable (long test fields), (3) discrete styles, (4) continuous styles (short test fields), (5) weakly constrained.]

Adaptive algorithms

cf.: stochastic approximation (Robbins-Monro algorithm), self-training, self-adaptation, self-correction, unsupervised / semi-supervised learning, transfer learning, inductive transfer, co-training, decision-directed learning, ...

3/2/2012 How? gn 46

How? gn 47

Traditional open-loop OCR System

3/2/2012

[Block diagram as before: training set (patterns and labels) → parameter estimation (with meta-parameters, e.g. regularization, estimators) → classifier parameters → CLASSIFIER; operational data (bitmaps) → CLASSIFIER → labels → transcript; rejects → correction / reject entry.]

How? gn 48

Supervised learning

3/2/2012

[Block diagram: a generic OCR system that makes use of post-processed rejects and errors. Training set → parameter estimation (with meta-parameters) → classifier parameters → CLASSIFIER; operational data (features) → CLASSIFIER → labels → transcript; rejects → correction / reject entry; keyboarded labels of rejects and errors are fed back into the training set.]

How? gn 49

Adaptation (Jain, PAMI 2000: “decision-directed classifier”)

3/2/2012

[Block diagram: as above, but classifier-assigned labels (rather than keyboarded corrections) are fed back into the training set. Field estimation, singlet classification.]

How? gn 50

Self-corrective recognition (@IBM, IEEE IT 1966)
(hardwired features and reference comparisons)

3/2/2012

[Block diagram: SOURCE DOCUMENT → SCANNER → FEATURE EXTRACTOR → CATEGORIZER. The categorizer starts from the initial references; its accepted outputs feed a REFERENCE GENERATOR that produces new references, while rejected patterns are set aside.]
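A compact sketch of the decision-directed / self-corrective idea (mine, under simplifying assumptions; not the IBM hardware or Jain's formulation): classify with the initial references, then re-estimate each class reference from the confidently accepted samples and classify again.

import numpy as np

def nearest_mean_labels(X, means):
    """Label each sample by its nearest class mean; also return the distances."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def self_corrective(X, initial_means, accept_radius=2.5, rounds=3):
    """Decision-directed adaptation: re-estimate class means from accepted samples."""
    means = initial_means.copy()
    for _ in range(rounds):
        labels, dist = nearest_mean_labels(X, means)
        for c in range(len(means)):
            accepted = X[(labels == c) & (dist < accept_radius)]   # confident samples only
            if len(accepted):
                means[c] = accepted.mean(axis=0)
    return nearest_mean_labels(X, means)[0]

# Toy isogenous test data: two classes whose means have drifted from the training means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.5, 1.5], 0.4, (50, 2)),    # class 0, shifted
               rng.normal([4.5, 4.5], 0.4, (50, 2))])   # class 1, shifted
initial = np.array([[0.0, 0.0], [6.0, 6.0]])            # means from the (mismatched) training set
print(self_corrective(X, initial)[:5], self_corrective(X, initial)[-5:])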

How? gn 51

Results: self-corrective recognition (Shelton & Nagy 1966)

3/2/2012

Training set: 9 fonts, 500 characters/font, U/C
Test set: 12 fonts, 1500 characters/font, U/C
96 n-tuple features, ternary reference vectors
22,500 patterns

                         Error   Reject
Initial:                 3.5%    15.2%
After self-correction:   0.7%     3.7%

How? gn 52

Self-corrective classification

3/2/2012

[Figure: samples of classes 1 and 7 in feature space; the original decision boundary of the omnifont classifier shifts as the classifier is adapted to a single font.]

Adaptive classification

3/2/2012

54

Baird skeptical! Results (DR&R 1994)

3/2/2012 How? gn

100 fonts, 80 symbols each, from Baird's defect model (6,400,000 characters)

Size (pt)   Error reduction   % fonts improved   Best      Worst
6           × 1.4             100                × 4       × 1.0
10          × 2.5             93                 × 11      × 0.8
12          × 4.4             98                 × 34      × 0.9
16          × 7.2             98                 × 141     × 0.8

His conclusion: a good investment: large potential for gain, low downside risk

55

Results: adapting both means and variances
(Harsha Veeramachaneni 2003, IJDAR 2004)

3/2/2012 How? gn

NIST hand-printed digit classes, with 50 “Hitachi features”

Train      Test   % Error before   Adapt means   Adapt variance
SD3        SD3    1.1              0.7           0.6
           SD7    5.0              2.6           2.2
SD7        SD3    1.7              0.9           0.8
           SD7    2.4              1.6           1.7
SD3+SD7    SD3    0.9              0.6           0.6
           SD7    3.2              1.9           1.8

Examples from NIST dataset

3/2/2012 How? gn 56

Writer Adaptation by Style Transfer for many-class problems (Chinese script)

• Patterns of a single-writer field are transformed to a style-free space using a style transfer matrix (STM)
• Supervised adaptation: learning the STM from labeled samples
• Unsupervised adaptation: learning the STM from test samples

Zhang & Liu, Style transfer matrix learning for writer adaptation, CVPR, 2011

• STM formulation
  – Source point set
  – Target point set
  – Objective:
  – Solution:
• Application to writer adaptation (a generic least-squares sketch of the mapping idea follows below)
  – Source point set: writer-specific data
  – Target point set: parameters of the basic classifier
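The sketch below is only a generic least-squares version of mapping a writer's source points onto style-free target points (plain linear regression, not Zhang & Liu's actual regularized objective); all variable names and data are mine.

import numpy as np

def fit_style_transfer(source, target):
    """Least-squares matrix A mapping writer-specific source points to target points:
    minimize sum_i || A @ source_i - target_i ||^2  (generic sketch, no regularization)."""
    X, *_ = np.linalg.lstsq(source, target, rcond=None)   # solves source @ X ≈ target
    return X.T                                            # A = X.T, so A @ source_i ≈ target_i

# Toy data: the writer's features are a linearly distorted version of the style-free ones.
rng = np.random.default_rng(1)
target = rng.normal(size=(200, 2))                        # style-free points
true_A = np.array([[1.2, 0.3], [-0.2, 0.9]])
source = target @ np.linalg.inv(true_A).T                 # writer-specific points

A = fit_style_transfer(source, target)
print(np.round(A, 2))                                     # recovers approximately true_A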

How? gn 59

Field classifiers and Adaptive classifiers

• A field classifier classifies consecutive finite-length fields of the test set.
  – Subsequent patterns do not benefit from knowledge gained in earlier patterns. Exploits inter-class style.
• An adaptive classifier is a field classifier whose field encompasses an entire (isogenous) test set.
  – The last pattern benefits from information from the first pattern, and the first pattern benefits from information from the last pattern. Exploits only intra-class style.

3/2/2012

60

Field Classifier for Discrete Styles (Prateek Sarkar, ICPR 2000, PAMI '05)

Optimal for multimodal feature distributions formed from weighted Gaussians.

m-dimensional feature vector, n-pattern field, s styles
field class c* = (c1, c2, ..., cn) = f(style means and covariances)

Parameters estimated via Expectation Maximization.
For a field of n characters from m classes there are m^n ordered and (m+n-1)! / (n!(m-1)!) unordered field classes (see the counting sketch below).

3/2/2012 How? gn
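A two-line check of those counts (a sketch; m and n are the classes and field length of the slide):

from math import comb

def field_class_counts(m, n):
    """Ordered and unordered field classes for n patterns drawn from m classes."""
    return m ** n, comb(m + n - 1, n)     # (m+n-1)! / (n! (m-1)!) == C(m+n-1, n)

print(field_class_counts(10, 4))          # digits, field length 4 -> (10000, 715)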

Example: 2 classes, 2 styles, field-length = 2

3/2/2012 How? gn61

Singlet and Style-constrained field classification boundaries (2 classes, 2 styles, 1 Gaussian feature per pattern)

3/2/2012 How? gn 62

[Figure: singlet and style-constrained classification boundaries plotted against the first pattern's feature (horizontal axis) and the second pattern's feature (vertical axis).]

How? gn 63

Top-style: a computable approximation
(with style-unsupervised estimation via EM)

3/2/2012

[Figure: comparison of the style-conscious, top-style, and singlet classifiers.]

• Experiments
  – Digits of six fonts
  – Moment features
  – Training fields of L=13; 14,430 patterns
  – Test fields of L=2, 4

Field Classifier for Continuous Styles (H. Veeramachaneni IJDAR `04, PAMI `05, `07)

3/2/2012 How? gn 65

With Gaussian features, the feature distribution for any field length can be computed from the class means and the class-pair-conditional feature cross-covariance matrices.

66

Style-conscious quadratic discriminant field classifier (SQDF)

• Class-style means are normally distributed about the class means
• Σk is the singlet class-conditional feature covariance matrix
• Cij is the class-pair-conditional cross-covariance matrix, estimated from pairs of same-source singlet class means
• SQDF approximates the optimal discrete-style field classifier well when inter-class distance >> style variation
• Inter-class style is order-independent: P(5 7 9 | [x1 x2 x3]) = P(7 5 9 | [x2 x1 x3])

+ SQDF avoids the Expectation Maximization of the discrete-style method
- Supralinear in the number of features and classes, because the size of the N-pattern field covariance matrix is (N×d)² and for M classes there are (M+N-1)! / (N!(M-1)!) matrices

3/2/2012 How? gn

How? gn 67

Example of continuous-style feature distributions (two classes, one feature)

3/2/2012

Results: style-constrained classification - short fields

3/2/2012 How? gn 68

Field error rate (%)
             Field length L=2           Field length L=5
Test data    w/o style   with style     w/o style   with style
SD3          1.4         1.3            3.0         2.5
SD7          2.7         2.4            5.3         4.5

Continuous style-constrained classifier, trained on ~ 17,000 characters and tested on ~17,000 characters. 25 top principal component “Hitachi” blurred directional features.

Field-trained (i.e. word) classification vs. style-constrained classification

3/2/2012 How? gn 69

Training set for field classification: 0000, 0001, 0010, ..., 9998, 9999  (10⁴ classes)

Training set for style classification: 00, 01, 02, ..., 98, 99  (10² classes, with order)

Field length = 4

Classifier parameters for longer field lengths are computed from the pair parameters (because Gaussian variables are defined completely by their covariance).

How? gn 70

Style context versus Linguistic context

3/2/2012

Two digits in an isogenous field: ...... 5 6 .....
with feature vectors x, y and class labels 5, 6

Language context: P(x y | 5, 6) ≠ P(y x | 6, 5)

Style:
  Intra-class style: P(x y | 5, 5) ≠ P(x | 5) P(y | 5)
  Inter-class style: P(x y | 5, 6) = P(y x | 6, 5) ≠ P(x | 5) P(y | 6)

How? gn 71

Weakly-constrained data

3/2/2012

[Figure: training and test sets with 3 classes and 4 multi-class styles.]

Given p(x), find p(y), where y = g(x).

72

Recommendations for OCR systems that improve with use

3/2/2012 How? gn

Never let the machine rest: design it so that it puts every coffee-break to good use.

Don’t throw away edits (corrected labels): use them.

Classify style-consistent fields, not characters: adapt on long fields, exploit inter-class style in short fields.

Use order rather than position.

Let the machine guess: lazy decisions.

Make use of all possible contexts: style, language, shape, layout, structure, and function.

Please help to increase computer literacy!

73

Thank you!

3/2/2012 How? gn

http://www.ecse.rpi.edu/~nagy/

Prateek Sarkar George Nagy [email protected] [email protected]

Rensselaer Polytechnic Institute, U.S.A.


Classification of style-constrained pattern-fields

Style is a manner of rendering patterns. Patterns are rendered in many different styles.

Style consistency constraint: Patterns in a field are rendered in the same style.

A field is a group of patterns with a common origin (isogenous patterns).

[Figure: a field of nine 'a' patterns rendered in the same style.]

Modeling style consistency can help improve classification accuracy.

[Figure: fields of '1's and '7's (e.g. 11, 17, 71, 77) plotted in the (x1, x2) feature plane; the two rendering styles form separate clusters whose class means lie a distance d apart, and the fields are classified jointly.]

% field error
d    Style   Singlet
0    74.7    74.7
1    57.2    60.3
2    38.5    45.1
3    22.3    28.7
4    10.5    14.8
5     3.9     6.0
6     1.1     1.9

Writing 1A, 1B for the two styles of class '1' and 7A, 7B for the two styles of class '7' (equal style priors of .5):

Singlet model:
p(x1 x2 | 17) = p(x1 | 1) × p(x2 | 7)
  = [.5 p(x1 | 1A) + .5 p(x1 | 1B)] × [.5 p(x2 | 7A) + .5 p(x2 | 7B)]
  = .25 [p(x1 | 1A) p(x2 | 7A) + p(x1 | 1A) p(x2 | 7B) + p(x1 | 1B) p(x2 | 7A) + p(x1 | 1B) p(x2 | 7B)]

Style consistency model:
p(x1 x2 | 17) = .5 p(x1 x2 | 1A 7A) + .5 p(x1 x2 | 1B 7B)
  = .5 [p(x1 | 1A) p(x2 | 7A) + p(x1 | 1B) p(x2 | 7B)]

Decision rule, over all field classes (c1 c2):
(c1*, c2*) = arg max p(x1 x2 | c1 c2) · P[c1 c2]
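A numeric sketch of these two models (my toy parameters; one Gaussian feature per pattern): with well-separated styles, the style-consistency model assigns a same-style field a higher likelihood than the singlet model does.

import math

# One feature per pattern; each class ('1', '7') has two styles, A and B, with different means.
means = {("1", "A"): 0.0, ("1", "B"): 3.0, ("7", "A"): 1.0, ("7", "B"): 4.0}

def p(x, c, s):
    """Class- and style-conditional feature density (unit-variance Gaussian, toy values)."""
    mu = means[(c, s)]
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def singlet(x1, x2, c1, c2):
    """Singlet model: patterns treated independently, styles mixed per pattern."""
    return (0.5 * p(x1, c1, "A") + 0.5 * p(x1, c1, "B")) * \
           (0.5 * p(x2, c2, "A") + 0.5 * p(x2, c2, "B"))

def style_consistent(x1, x2, c1, c2):
    """Style-consistency model: both patterns share the (unknown) style."""
    return 0.5 * p(x1, c1, "A") * p(x2, c2, "A") + 0.5 * p(x1, c1, "B") * p(x2, c2, "B")

x1, x2 = 3.2, 4.1          # a field written in style B
for c1, c2 in [("1", "7"), ("7", "1"), ("1", "1"), ("7", "7")]:
    print(c1 + c2, round(singlet(x1, x2, c1, c2), 4), round(style_consistent(x1, x2, c1, c2), 4))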


Unsupervised style estimation: the patterns in a field and their class labels are observed, but the style of the field is unobserved. Apply the EM algorithm to estimate the model parameters.

Simulation of a two-class, two-style problem with unit-variance Gaussian distributions.

Application to recognition of digit fields: a 15-25% relative reduction in errors was observed in laboratory experiments on handprinted digit recognition. The improvement in accuracy was greater for longer fields.

Reference: Prateek Sarkar. Style consistency in pattern fields. PhD thesis, Rensselaer Polytechnic Institute, U.S.A., 2000.

Prateek Sarkar, August 2000 for ICPR, Barcelona, September 2000

3/2/2012 How? gn 74