Trivandrum

Unlocking the Handwritten Content in Document Images

Venu Govindaraju


[Figure: system overview. Labels: Scanner; Storage; OCR; Noisy Text; Query; Relevance. Handwritten Documents: Forms, Letters, Notes. Example: Newton Kinematics Notes]

Outline

Recognition: Postal Applications, Paradigms, Fusion

Search: IR Models, Word Spotting

Challenge of Handwriting

Input

Output: 20187 +2246

Handwriting Recognition

Postal Context (138 million records)

ZIP Code
• 30% of ZIP Codes contain a single street name
• 5% of ZIP Codes contain a single primary number
• 2% of ZIP Codes contain a single add-on

<ZIP Code, primary number>: maximum number of records returned is 3,071

<ZIP Code, add-on>: maximum number of records returned is 3,070

Lex     Top 1    Top 2
10      96.5     98.7
100     89.2     94.1
1000    75.3     86.3

LDR

Paradigms

Context Ranked Lexicon

Lexicon Driven OCR

LDR

Lexicon Free OCR

LFR

Segmentation Recognition Post-processing

Lexicon Free (LFR)

[Figure: segmentation graph over segment points 1 to 8. Edge labels give character hypotheses with confidences: i[.8], l[.8], u[.5], v[.2], w[.6], m[.3], w[.7], i[.7], u[.3], m[.2], m[.1], r[.4], d[.8], o[.5]]

• Image from segment 1 to 3 is a ‘u’ with 0.5 confidence
• Image from segment 1 to 4 is a ‘w’ with 0.7 confidence
• Image from segment 1 to 5 is a ‘w’ with 0.6 confidence and an ‘m’ with 0.3 confidence

Find the best path in graph from segment 1 to 8
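The best-path search can be sketched as a small dynamic program over the segment points. This is only an illustration: the graph below is hypothetical (loosely modelled on the confidences quoted on the slide), and scoring a path by the product of its confidences is an assumption, since the slide only says "best path".

```python
import math

def best_path(edges, start, end):
    """Return (score, text) for the highest-scoring path from start to end.

    edges maps (from_point, to_point) -> list of (char, confidence).
    A path is scored by the product of its confidences (an assumption).
    """
    best = {start: (0.0, "")}  # point -> (log score, decoded text so far)
    for node in range(start + 1, end + 1):
        for (u, v), hypotheses in edges.items():
            if v != node or u not in best:
                continue
            for ch, conf in hypotheses:
                candidate = (best[u][0] + math.log(conf), best[u][1] + ch)
                if node not in best or candidate[0] > best[node][0]:
                    best[node] = candidate
    if end not in best:
        return None
    log_score, text = best[end]
    return math.exp(log_score), text

# Hypothetical segmentation graph over points 1..8.
edges = {
    (1, 3): [("u", 0.5)],
    (1, 4): [("w", 0.7)],
    (1, 5): [("w", 0.6), ("m", 0.3)],
    (3, 4): [("i", 0.7)],
    (4, 6): [("o", 0.5)],
    (5, 6): [("o", 0.5)],
    (6, 7): [("r", 0.4)],
    (7, 8): [("d", 0.8)],
}

score, text = best_path(edges, 1, 8)
print(text, round(score, 3))
```

No lexicon is consulted here; the decoded string is whatever the highest-scoring path spells, which is why lexicon-free output usually needs post-processing.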

Lexicon Driven (LDR)

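[Figure: matching the lexicon entry ‘word’ against segment points 1 to 9. Each label gives the distance between a character and a span of segments: w[7.6], w[7.2], r[3.8], w[5.0], w[8.6], o[7.6], r[6.3], d[4.9], w[5.0], o[6.6], o[6.0], o[7.2], o[10.6], d[6.5], d[4.4], r[7.5], r[6.4], o[7.8], r[8.6], o[8.7], r[7.4], r[7.6], o[8.3], o[7.7], r[5.8], o[6.1]]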

Find the best way of accounting for the characters ‘w’, ‘o’, ‘r’, ‘d’ by consuming all segments 1 to 8

Distance between the first character ‘w’ of the lexicon entry ‘word’ and the image between:
• segments 1 and 4 is 5.0
• segments 1 and 3 is 7.2
• segments 1 and 2 is 7.6
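A minimal sketch of lexicon-driven matching as dynamic programming follows. It assumes a table of character-to-span distances is already available; only the three ‘w’ distances come from the slide, and every other entry, the `max_span` limit, and the function name `match_word` are hypothetical.

```python
def match_word(word, dist, first, last, max_span=4):
    """Minimum total distance for spelling `word` while consuming every
    segment between points `first` and `last`.

    dist maps (char, from_point, to_point) -> distance (lower is better).
    Returns float('inf') if the word cannot account for all segments.
    """
    INF = float("inf")
    cost = {first: 0.0}  # end point -> best cost of the characters matched so far
    for ch in word:
        nxt = {}
        for p, c in cost.items():
            for q in range(p + 1, min(p + max_span, last) + 1):
                d = dist.get((ch, p, q))
                if d is not None and c + d < nxt.get(q, INF):
                    nxt[q] = c + d
        cost = nxt
    return cost.get(last, INF)

# The 'w' distances are quoted on the slide; the rest are made up.
dist = {
    ("w", 1, 2): 7.6, ("w", 1, 3): 7.2, ("w", 1, 4): 5.0,
    ("o", 4, 5): 7.7, ("o", 4, 6): 6.1,
    ("r", 6, 7): 3.8, ("r", 6, 8): 6.3,
    ("d", 7, 9): 4.4, ("d", 8, 9): 6.5,
}

print(match_word("word", dist, 1, 9))  # total distance of the best alignment
```

Running `match_word` for each lexicon entry and sorting by the returned distance yields the ranked lexicon.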

Grapheme Models (LFR)

grapheme pos orientation angle

Down cusp 3.0 -90°

Up loop

Down arc

Writer Specific Modeling

Holistic Features

a) Amherst b) Buffalo c) Boston d) None of the above

ABLE TRIP TRAP

A T N

Words

Letters

Features

Interactive Models (LDR)

1-way activation [McClelland and Rumelhart 1981]

2-way interaction

Interactive Models (LDR): Phrase Level

T-crossings, loops, ascenders, descenders, length

Lexicon 1: West Central Street, West Main Street, Sunset Avenue

Lexicon 2: West Central Street, East Central Street, Sunset Avenue

Lexicon 3: West Central Street, West Central Avenue, Sunset Avenue

Interactive Model

features

image

2-way interaction

Interactive Models: Character Recognition

Adaptive feature selection

Adaptive number of features

Adaptive resolutions

Gradient (4) and Moment (5) Features

0 1 0 1 1 1 0 0 1

[Park and Govindaraju, IEEE CVPR 2000]

Active Recognition

Results

                Active Model    Neural Net    KNN
Top 1 %         95.7            96.4          95.7
Temp            612             976           3,777
Msec            1.45            11.5          384
Training hrs    1               24            1

10-class digit recognition

25656 training and 12242 test

(Postal + NIST)

Lex size       LDR %    GM %
10             96.86    96.56
100            91.36    89.12
1000           79.58    75.38
  (Top 50)     98.00    98.40
20000          62.43    58.14
  (Top 100)    93.59    93.39

Fusion

Identification Task

Verification Task

LDR

LFR

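Question: if we find the optimal $f_N$ (identification) and the optimal $f_1$ (verification), is it necessarily true that $f_N = f_1$?

Fusion of Recognizers: Type III

Here $s_i^j$ is the score that recognizer $j$ assigns to class $i$.

[Figure: for each lexicon entry (Amherst, Buffalo, …), LDR produces scores (5.6, 7.4, …) and LFR produces scores (.52, .81, …)]

Identification task:

$$S_1 = f_N(s_1^1, s_1^2), \quad S_2 = f_N(s_2^1, s_2^2), \quad \dots, \qquad \arg\max_{i=1,\dots,N} S_i$$

Verification task (e.g. Amherst with scores 5.6 and .52):

$$S = f_1(s^1, s^2) \;\rightarrow\; \text{Accept / Reject}$$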

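Traditional Fusion Rules

• Sum rule: $f_1(s^1, s^2) = s^1 + s^2$
• Weighted sum rule: $f_1(s^1, s^2) = w_1 s^1 + w_2 s^2$
• Product rule: $f_1(s^1, s^2) = s^1 s^2$
• Max rule: $f_1(s^1, s^2) = \max(s^1, s^2)$
• Rank-based methods: with $r_i^1 = \mathrm{rank}(s_i^1, \{s_1^1, \dots, s_N^1\})$ (and $r_i^2$ likewise), $f_1(s_i^1, s_i^2) = r_i^1 + r_i^2$ or $f_1(s_i^1, s_i^2) = P(r_i^1, r_i^2 \mid gen)$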
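The fixed rules can be written down directly. The sketch below assumes both recognizers' scores have already been normalized so that higher is better (the raw LDR scores on the slides are distances, so this is an assumption); the score lists and the weights 0.3 and 0.7 are made up.

```python
def sum_rule(s1, s2):
    return s1 + s2

def weighted_sum_rule(s1, s2, w1, w2):
    return w1 * s1 + w2 * s2

def product_rule(s1, s2):
    return s1 * s2

def max_rule(s1, s2):
    return max(s1, s2)

def ranks(scores):
    """Rank of each class within one recognizer's score list (1 = best)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

def identify(rule, scores1, scores2):
    """Index of the class with the highest combined score."""
    return max(range(len(scores1)), key=lambda i: rule(scores1[i], scores2[i]))

# Hypothetical normalized scores for three lexicon entries.
scores1 = [0.9, 0.4, 0.1]  # recognizer 1
scores2 = [0.3, 0.7, 0.2]  # recognizer 2

print(identify(sum_rule, scores1, scores2))
print(identify(product_rule, scores1, scores2))
print(identify(lambda a, b: weighted_sum_rule(a, b, 0.3, 0.7), scores1, scores2))
print(ranks(scores1), ranks(scores2))
```

On this toy input the rules do not all agree on the top choice, which is the motivation for asking which combination function is optimal.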

Likelihood Ratio: Verification Tasks

[Figure: scatter plot of impostor and genuine samples; axes: Recognizer score 1, Recognizer score 2]

• 2 classes: impostor and genuine
• Pattern classification task

Minimum risk criterion: optimal decision boundaries coincide with the contours of the likelihood ratio function:

$$f_{lr}(s^1, s^2) = \frac{p_{gen}(s^1, s^2)}{p_{imp}(s^1, s^2)}$$

Metaclassification with NN, SVM, etc. is also possible

$$f_V = f_{lr}$$

[Prabhakar, Jain 02] [Nandkumar, Jain, Das 08]

Optimal Combination functions

LFR is correct 54.8%

LDR is correct 77.2%

Both are correct 48.9%

Either is correct 83.0%

Likelihood Ratio 69.8%

Weighted Sum 81.6%

• LR combination is worse than single matcher

$f_V = f_{lr}$

Identification Task Results

Top choice correct rate

Verification Task Results

ROC

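Independence of Scores in a Single Trial

[Figure: in a single trial, LDR assigns scores (5.6, 7.4, …) and LFR assigns scores (.52, .81, …) to the lexicon entries Lexicon 1, …, Lexicon i, …, Lexicon N (e.g. Amherst, Buffalo, …); the per-entry combinations are $f(s_1^1, s_1^2)$, $f(s_2^1, s_2^2)$, …]

$$S_i = f\left(s_i^1, s_i^2, \dots, s_i^M, \{s_k^1, s_k^2, \dots, s_k^M\}_{k \neq i}\right)$$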

Independence of Scores in a Single Trial

Recognizer 1

Recognizer M

Dependent

Dependent

Tulyakov & Govindaraju, TIFS 2009

Independent?

Optimal Combination: $f_N = f_{lr}$?

Set size    LFR      LDR      Both correct    Either correct    LR       Weighted sum
6147        3366     4744     3005            5105              4293     5015
            54.8%    77.2%    48.9%           83.0%             69.8%    81.6%

        2nd choice    3rd choice    4th choice    Mean
LFR     .4359         .4755         .4771         .1145
LDR     .7885         .7825         .7673         .5685

Correlated Scores

Dependent on input signal

Optimal Trainable Combination Function

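Minimizing misclassification cost: classify as $i$ rather than $j$ if

$$p(s_1^1, s_1^2, \dots, s_N^1, s_N^2 \mid i) > p(s_1^1, s_1^2, \dots, s_N^1, s_N^2 \mid j)$$

Assume that scores assigned to different classes are independent:

$$p(s_1^1, s_1^2, \dots, s_N^1, s_N^2 \mid i) = p(s_1^1, s_1^2 \mid i) \cdots p(s_i^1, s_i^2 \mid i) \cdots p(s_N^1, s_N^2 \mid i) = p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_i^1, s_i^2) \cdots p_{imp}(s_N^1, s_N^2)$$

The comparison then becomes

$$p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_i^1, s_i^2) \cdots p_{imp}(s_N^1, s_N^2) > p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_j^1, s_j^2) \cdots p_{imp}(s_N^1, s_N^2)$$

Dividing both sides by $\prod_{k=1}^{N} p_{imp}(s_k^1, s_k^2)$ leaves

$$\frac{p_{gen}(s_i^1, s_i^2)}{p_{imp}(s_i^1, s_i^2)} > \frac{p_{gen}(s_j^1, s_j^2)}{p_{imp}(s_j^1, s_j^2)}$$

$$f_N: \quad \arg\max_{i=1,\dots,N} f_{lr}(s_i^1, s_i^2)$$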

Tulyakov & Govindaraju IJPRAI 2009
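The derivation can be checked numerically: if scores of different classes are independent, picking the class with the largest joint likelihood is the same as picking the class with the largest likelihood ratio. The sketch below assumes Gaussian genuine and impostor densities with made-up parameters, and additionally treats the two recognizers as independent, purely to keep the densities simple; none of these numbers come from the talk.

```python
import math

def gauss(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical (mean, std) per recognizer for genuine and impostor scores.
GEN = [(0.8, 0.10), (0.7, 0.15)]
IMP = [(0.3, 0.15), (0.3, 0.20)]

def density(scores, params):
    p = 1.0
    for x, (mean, std) in zip(scores, params):
        p *= gauss(x, mean, std)
    return p

def f_lr(scores):
    """Likelihood ratio p_gen / p_imp for one class's score pair."""
    return density(scores, GEN) / density(scores, IMP)

def joint_likelihood(trial, i):
    """p(all scores | class i is genuine), classes assumed independent."""
    p = 1.0
    for k, scores in enumerate(trial):
        p *= density(scores, GEN if k == i else IMP)
    return p

# One identification trial: a score pair (s^1, s^2) for each of N = 3 classes.
trial = [(0.35, 0.20), (0.75, 0.60), (0.40, 0.45)]

by_joint = max(range(len(trial)), key=lambda i: joint_likelihood(trial, i))
by_lr = max(range(len(trial)), key=lambda i: f_lr(trial[i]))
print(by_joint, by_lr)
```

The two decisions coincide only because of the independence assumption; when scores within a trial are dependent, the equivalence no longer has to hold.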

Combination Methods Identification Tasks

[Figure: three scatter plots of impostor and genuine scores; axes: Recognizer score 1, Recognizer score 2]

No!

Traditional training mixes the genuine and impostor scores from different trials.

[Figure: three scatter plots of impostor and genuine scores; axes: Recognizer score 1, Recognizer score 2]

Model Training MUST process scores from one identification trial as a single training sample.

Combination Methods Identification Tasks

• Initialize a combination function $f(s^1, s^2, \dots, s^M)$
• Get scores from the same identification trial (for all trials)
• Update the function so that the genuine score is better than any impostor score

$$f(\cdot) = \frac{p_{gen}(s_i^1, s_i^2, \dots, s_i^M)}{p_{imp}(s_i^1, s_i^2, \dots, s_i^M)}$$

Best Impostor Function

Sum of Logistic Functions: $f(\cdot)$ is built from logistic terms of the form $\frac{1}{1 + e^{-(\cdot)}}$

Iterative Methods

             Likelihood Ratio    Weighted sum    Best Impostor Likelihood Ratio    Logistic Sum    Neural Network
LFR & LDR    69.84               81.58           80.07                             81.43           81.67
li & C       97.24               97.23           97.01                             97.34           97.39
li & G       95.90               95.47           95.99                             96.17           96.29
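The three training steps (initialize, take the scores of one identification trial at a time, update until the genuine class beats every impostor) can be illustrated with a weighted-sum combination and a perceptron-style update. This is a hedged sketch, not the method behind the table: the update rule, learning rate, initial weights and trial data are all assumptions.

```python
def combine(w, scores):
    """Weighted-sum combination of one class's scores."""
    return sum(wi * si for wi, si in zip(w, scores))

def identify(w, trial_scores):
    return max(range(len(trial_scores)), key=lambda i: combine(w, trial_scores[i]))

def train(trials, w, rate=0.5, epochs=10):
    """Each trial is ONE training sample: (scores for all classes, genuine index)."""
    w = list(w)
    for _ in range(epochs):
        for scores, genuine in trials:
            impostors = [i for i in range(len(scores)) if i != genuine]
            best_imp = max(impostors, key=lambda i: combine(w, scores[i]))
            # Update only when the genuine class fails to beat the best impostor.
            if combine(w, scores[genuine]) <= combine(w, scores[best_imp]):
                w = [wi + rate * (g - b)
                     for wi, g, b in zip(w, scores[genuine], scores[best_imp])]
    return w

# Hypothetical trials: per class a pair (recognizer 1 score, recognizer 2 score).
trials = [
    ([(0.6, 0.9), (0.7, 0.2), (0.1, 0.3)], 0),
    ([(0.8, 0.1), (0.5, 0.8), (0.2, 0.2)], 1),
    ([(0.3, 0.4), (0.9, 0.3), (0.4, 0.9)], 2),
]

w0 = [1.0, 0.0]  # start by trusting recognizer 1 only
w = train(trials, w0)
print(w, [identify(w, scores) == genuine for scores, genuine in trials])
```

The point of the sketch is the data flow: genuine and impostor scores are only ever compared within the same trial, never pooled across trials.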

Outline


Search: Lexicon Reduction, Word Spotting, IR Models

Search for Handwritten Documents

              Good Quality     Historical       Medical
Lexicon       10K     1K       10K     1K       4K
Top 1 (%)     57      67       12      28       20
Top 3 (%)     69      72       22      44       27
Top 10 (%)    74      75       32      72       42

• Lexicons are typically large: >5K
• Need around 70% accuracy

Strategy
• Reduce lexicon size using topic categorization (DAS 06;08)
• Use Top-N choices returned by OCR (ICDAR 07)

• Pre-Hospital Care Report (PCR)
  WNY: 250,000 filed a year
  NYC: 50,000 filed in a day
  PDAs not popular

• OHR issues
  Loosely constrained writing style
  Large lexicons
  Heterogeneous data

6,700 carbon forms stored at 300 DPI
1,000 PCR forms ground truthed

Search Engine: Handwritten Forms

Search Engine for Medical Forms

• Find all people who reported asthma problems in NY
• How many people with high blood pressure are on medication X?
• Is there an epidemic breaking?

Topic Categorization Lexicon Reduction

Lex Free

Large Lexicon > 5K

Handwritten Medical Documents

ICR Features

~33% word recognition rate (10 points gain)

Topic Categorization

Select Reduced Lexicon ~2.5K

Lex Driven

ICR Features Index

$$\mathrm{cohesion}(w_a, w_b) = z \cdot \frac{f(w_a, w_b)}{\sqrt{f(w_a) \cdot f(w_b)}}$$

DIGESTIVE-SYSTEM
FQ    CHSN    PHRASE
30    0.72    PAIN INCIDENT
5     0.31    PAIN TRANSPORTED
42    0.54    PAIN CHEST
52    0.81    STOMACH PAIN
9     0.25    HOME PAIN
6     0.43    VOMITING ILLNESS
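A small sketch of the cohesion score, reading the slide's formula as the pair frequency divided by the square root of the product of the individual frequencies. That reading, the choice to count a pair once per document in which both words occur, the default `z = 1.0`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cohesion_table(documents, z=1.0):
    """Map each unordered word pair to z * f(a, b) / sqrt(f(a) * f(b)).

    f(a) counts documents containing a; f(a, b) counts documents containing both.
    """
    word_freq = Counter()
    pair_freq = Counter()
    for doc in documents:
        words = sorted(set(doc))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    return {
        pair: z * count / math.sqrt(word_freq[pair[0]] * word_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Hypothetical tokenized form contents.
documents = [
    ["stomach", "pain"],
    ["chest", "pain"],
    ["stomach", "pain"],
]

table = cohesion_table(documents)
print(table[("pain", "stomach")])
```

Pairs with a high cohesion and a high frequency (such as STOMACH PAIN in the table above) are the ones that are useful as topic features.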

Topic Features

(Chu-Carroll, et al., 1999)

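$$B_{t,c} = \frac{A_{t,c}}{\sqrt{\sum_{e=1}^{n} A_{t,e}^2}}$$

$$IDF(t) = \log_2 \frac{n}{c(t)}$$

$$X_{t,c} = IDF(t) \cdot B_{t,c}$$

$$z = \cos(x, y) = \frac{x y^T}{\sqrt{\sum_{i=1}^{n} x_i^2 \cdot \sum_{i=1}^{n} y_i^2}}$$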
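The four formulas chain together into a simple categorizer: normalize each term's row of category counts, weight it by IDF, and compare a query vector with each category column by cosine. The sketch below assumes $A_{t,c}$ is a term-by-category count matrix, $n$ is the number of categories and $c(t)$ is the number of categories containing term $t$; the counts and category names are made up.

```python
import math

def weight_matrix(A):
    """A maps term -> list of counts per category; returns X with X[t][c] = IDF(t) * B[t][c]."""
    n = len(next(iter(A.values())))
    X = {}
    for term, row in A.items():
        norm = math.sqrt(sum(a * a for a in row))
        c_t = sum(1 for a in row if a > 0)
        if norm == 0 or c_t == 0:
            X[term] = [0.0] * n
            continue
        idf = math.log2(n / c_t)
        X[term] = [idf * a / norm for a in row]
    return X

def cosine(x, y):
    denom = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / denom if denom else 0.0

def categorize(query_terms, X, n):
    terms = sorted(X)
    q = [1.0 if t in query_terms else 0.0 for t in terms]
    return max(range(n), key=lambda c: cosine(q, [X[t][c] for t in terms]))

categories = ["digestive", "cardiac"]  # hypothetical topic categories
A = {
    "stomach": [5, 0],
    "vomiting": [3, 0],
    "chest": [0, 6],
    "pain": [4, 4],
}

X = weight_matrix(A)
print(categories[categorize({"stomach", "pain"}, X, len(categories))])
```

A term that occurs in every category, such as "pain" here, gets an IDF of zero and so does not influence the decision.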

Topic Categorization


Results

              CLT to RLT    CL to RL    CLT to ALT    CLT to SLT
HR            7.48%         7.42%       17.58%        7.42%
Error Rate    10.78%        10.88%      24.53%        10.21%

C: complete lexicon
R: reduced lexicon
A: category given
S: features synthetic
T: truth present

Outline


Search: Lexicon Reduction, Word Spotting, IR Models

Urgent Issue of our Times

Vast, irreplaceable, culturally vital legacy collections of historical documents are competing ineffectively for attention with billions of digital documents.

Thus historical archives are threatened with neglect, perceived irrelevance, … and eventually, oblivion?

Threat: ‘If it’s not in Google, it doesn’t exist!’

Baird 2003

What is possible today?

• View Document Images

Document Enhancement

[Shi, Setlur, and Govindaraju 2008]

Transcript-Mapping

1787 Thomas Jefferson letter and its transcript

Image

Transcript

+ +

What is not possible today?

Multilingual Document Corpus

Retrieved Documents

English

Hindi Sanskrit

Translations of “strength”

Crosslingual Retrieval

SEARCH: Handwritten Documents

Image-Based: use image-based features

OCR-Based: use OCR recognition results

Query rendered

Poor performance in multiple writer scenarios

Image Based Methods

(Rath 07 IJDAR)

SEARCH: Handwritten Documents

Image-Based: use image-based features

OCR-Based: use OCR recognition results

Indexing Retrieval

Handwriting Recognition

Vector IR Model (TF-IDF)

Set of terms $\{t_i\}$

Set of documents $\{d_j\}$ of length $\{L_j\}$

Term Frequency (TF):

$$tf_{i,j} = \frac{freq_{i,j}}{L_j}$$

Inverse Document Frequency (IDF):

$$idf_i = \log \frac{\#\{d_j\}}{\#\{d_j \mid freq_{i,j} > 0\}}$$

Query TF:

$$tf_{i,q} = \begin{cases} 1, & \text{if } t_i \text{ is in query } q \\ 0, & \text{otherwise} \end{cases}$$

Similarity:

$$\mathrm{sim}(d_j, q) = \sum_i tf_{i,j} \cdot idf_i \cdot tf_{i,q}$$

Example with $q = \{\text{"back"}, \text{"pain"}\}$:

terms    tf_{i,j}    idf_i    tf_{i,q}
back     0.024       4.1      1
pain     0.008       2.4      1
...      ...         ...      0

[Baeza-Yates99]
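A minimal sketch of the vector model on a toy corpus, using the natural logarithm for idf (the slide does not state the base). The documents are made up. The last lines redo the slide's example arithmetic, under the assumption that the idf values 4.1 and 2.4 belong to "back" and "pain" respectively.

```python
import math

def tf(doc, term):
    return doc.count(term) / len(doc)

def idf(docs, term):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def sim(doc, docs, query):
    # tf_{i,q} is 1 for query terms and 0 otherwise, so only query terms contribute.
    return sum(tf(doc, t) * idf(docs, t) for t in set(query))

# Hypothetical transcribed documents.
docs = [
    ["back", "pain", "severe"],
    ["chest", "pain"],
    ["fall", "injury", "arm", "back"],
]
query = ["back", "pain"]

ranking = sorted(range(len(docs)), key=lambda j: -sim(docs[j], docs, query))
print(ranking)

# Slide example (pairing of idf values with terms is assumed).
slide_sim = 0.024 * 4.1 * 1 + 0.008 * 2.4 * 1
print(round(slide_sim, 4))
```

With clean text this is all that is needed; the following slides replace the exact counts with expectations because recognized handwriting is noisy.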

Modifications to VM

Classic VM: computes the tf and idf from the OCR’ed text (top-1)

$$tf_{i,j} = \frac{freq_{i,j}}{L}, \qquad idf_i = \log \frac{\#\{d_j\}}{\#\{d_j \mid freq_{i,j} > 0\}}$$

Modified VM: computes the tf and idf from the top-n choices of word recognition

$$tf_{i,j}^{ocr} = \frac{E\{freq_{i,j}\}}{L}, \qquad idf_i^{ocr} = \log \frac{\#\{d_j\}}{\#\{d_j \mid E\{freq_{i,j}\} > 0.5\}}$$

Required Inputs

Word segmentation result

Word recognition likelihoods

Estimation

$w = [w_1\ w_2\ \dots\ w_L]$: word images

$$E\{freq_{i,j}\} = \sum_{k=1}^{L} \Pr(t_i \mid w_k)$$

Example for Doc $d_j$: $\Pr(\text{"pain"} \mid w_k)$ = 0.02, 0.01, 0.2, 0.01, 0.01, … gives $E\{freq_{\text{"pain"},j}\}$

[Rath 04, Howe 05]
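The estimate is a plain sum of per-word likelihoods. The sketch below uses the five likelihoods shown on the slide (ignoring the elided remainder of the document) and then plugs expected frequencies into the modified idf, where a document counts as containing the term when its expected frequency exceeds 0.5. The per-document values in `expected_by_doc` and the fallback of 0.0 when no document qualifies are assumptions.

```python
import math

def expected_freq(likelihoods):
    """E{freq_{i,j}}: sum over word images w_k of Pr(t_i | w_k)."""
    return sum(likelihoods)

def tf_ocr(likelihoods):
    """Expected term frequency divided by the number of word images L."""
    return expected_freq(likelihoods) / len(likelihoods)

def idf_ocr(expected_by_doc, threshold=0.5):
    """log(#docs / #docs whose expected frequency exceeds the threshold)."""
    df = sum(1 for e in expected_by_doc if e > threshold)
    return math.log(len(expected_by_doc) / df) if df else 0.0

# Pr("pain" | w_k) for the first five word images, as listed on the slide.
pain_likelihoods = [0.02, 0.01, 0.2, 0.01, 0.01]
e_pain = expected_freq(pain_likelihoods)
print(round(e_pain, 2), round(tf_ocr(pain_likelihoods), 2))

# Hypothetical expected frequencies of "pain" in four documents.
expected_by_doc = [e_pain, 1.3, 0.7, 0.0]
print(idf_ocr(expected_by_doc))
```

Because every candidate contributes its probability mass, a term can be retrieved even when it was never the recognizer's top choice.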

Estimating Term Frequency

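$$E\{freq_{i,j}\} = \sum_{I_w} \Pr(I_w) \cdot \Pr(t_i \mid I_w)$$

$\{t_i\}$: $\{t_0 = \text{"head"},\ t_1 = \text{"arm"},\ t_2 = \text{"pelvis"},\ \dots\}$

Word image $I_w$              1       2       3        4       ...
$\Pr(I_w)$                    1       1       0.5      1       ...
$\Pr(\text{"head"} \mid I_w)$      0.2     0.07    0.8      0.01    ...
$\Pr(\text{"arm"} \mid I_w)$       0.05    0.7     0.01     0.07    ...
$\Pr(\text{"pelvis"} \mid I_w)$    0.01    0.01    0.002    0.03    ...

$$E\{freq_{1,j}\} = \sum_{I_w \in d_j} \Pr(I_w) \cdot \Pr(\text{"arm"} \mid I_w) = 1 \times 0.05 + 1 \times 0.7 + 0.5 \times 0.01 + 1 \times 0.07 + \dots$$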

Estimating Segmentation

Word segmentation: a gap $d$ between adjacent connected components above a threshold $D$ ($d > D$) separates words

Generate multiple hypotheses with multiple $D$

If hypothesis $I_w$ overlaps $m$ other hypotheses, then

$$\Pr(I_w) = \frac{1}{m + 1}$$

Example with 3 hypotheses: $m = 1, 2, 1$ gives $\Pr(I_w) = \frac{1}{2}, \frac{1}{3}, \frac{1}{2}$
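A sketch of the segmentation weight and of the segmentation-weighted expected frequency. Representing a hypothesis as a horizontal pixel span `(start, end)` and the concrete spans are assumptions; they are chosen so that the overlap counts come out as the slide's m = 1, 2, 1. The second part uses the values read from the "arm" example in the term-frequency estimate, ignoring the elided remainder.

```python
def overlaps(a, b):
    """True if two (start, end) spans share any horizontal extent."""
    return a[0] < b[1] and b[0] < a[1]

def segmentation_probs(spans):
    """Pr(I_w) = 1 / (m + 1), where m is the number of other hypotheses overlapped."""
    probs = []
    for i, span in enumerate(spans):
        m = sum(1 for j, other in enumerate(spans) if j != i and overlaps(span, other))
        probs.append(1 / (m + 1))
    return probs

def expected_freq(seg_probs, term_probs):
    """E{freq} = sum over hypotheses of Pr(I_w) * Pr(t_i | I_w)."""
    return sum(p * q for p, q in zip(seg_probs, term_probs))

# Hypothetical spans: a small threshold D yields two words, a larger D merges them.
spans = [(0, 12), (0, 30), (15, 30)]
print(segmentation_probs(spans))

# Pr(I_w) and Pr("arm" | I_w) for four word images.
print(round(expected_freq([1, 1, 0.5, 1], [0.05, 0.7, 0.01, 0.07]), 3))
```

The weighting keeps competing segmentations of the same ink from being counted as several separate occurrences of a term.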

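Word Recognition: $\Pr(t_i \mid I_w)$

• Top-Rank (Top-S candidates involved):

$$\Pr(t_i \mid I_w) = \begin{cases} \frac{1}{S}, & \text{if } 1 \le \mathrm{rank}(t_i) \le S \\ 0, & \text{otherwise} \end{cases}$$

• Weighted Top-Rank:

$$\Pr(t_i \mid I_w) = (\text{top-}R \text{ OCR rate}) - (\text{top-}(R-1) \text{ OCR rate}), \qquad R = \mathrm{rank}(t_i)$$

• Empirical

$$\Pr(t_i \mid I_w) = \frac{\Pr(t_i)\, e^{-d_i^2 / 2\sigma^2}}{\sum_{k} \Pr(t_k)\, e^{-d_k^2 / 2\sigma^2}}$$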
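The three ways of turning recognizer output into $\Pr(t_i \mid I_w)$ can be sketched as below. The cumulative top-r rates, the convention that the top-0 rate is 0, the priors, the distances and the value of `sigma` are all hypothetical, and the distance-based posterior is only one reading of the last formula.

```python
import math

def top_rank(rank, S):
    """Uniform mass 1/S over the top-S candidates, 0 elsewhere."""
    return 1 / S if 1 <= rank <= S else 0.0

def weighted_top_rank(rank, top_rates):
    """top_rates[r-1] is the cumulative top-r OCR rate; the mass for rank R is
    the gain from top-(R-1) to top-R (top-0 rate taken as 0)."""
    if rank < 1 or rank > len(top_rates):
        return 0.0
    previous = top_rates[rank - 2] if rank > 1 else 0.0
    return top_rates[rank - 1] - previous

def posterior_from_distances(distances, priors, sigma):
    """Pr(t_i | I_w) proportional to Pr(t_i) * exp(-d_i^2 / (2 sigma^2))."""
    weights = [p * math.exp(-d * d / (2 * sigma * sigma))
               for d, p in zip(distances, priors)]
    total = sum(weights)
    return [w / total for w in weights]

top_rates = [0.57, 0.65, 0.69]  # hypothetical cumulative top-1, top-2, top-3 rates
print(top_rank(2, 3), weighted_top_rank(2, top_rates))

posterior = posterior_from_distances([5.0, 7.2, 7.6], [1 / 3, 1 / 3, 1 / 3], sigma=2.0)
print([round(p, 3) for p in posterior])
```

Top-rank treats all S candidates alike, whereas the weighted variant and the distance-based posterior give more mass to candidates the recognizer prefers.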

Thank you!
