Trivandrum

Unlocking the Handwritten Content in Document Images Venu Govindaraju [email protected]


Venu-talk at TCS

Transcript of Trivandrum

Page 1: Trivandrum

Unlocking the Handwritten Content in

Document Images

Venu Govindaraju

[email protected]

Page 2: Trivandrum

Scanner

Storage

OCR

Noisy Text

Newton Kinematics Notes

Query

Forms, Letters, Notes

Handwritten Documents Relevance

Page 3: Trivandrum

Outline

Recognition: Postal Applications, Paradigms, Fusion

Search: IR Models, Word Spotting

Page 4: Trivandrum

Challenge of Handwriting

Page 5: Trivandrum

Input

Output: 20187 +2246

Handwriting Recognition

Page 6: Trivandrum

Postal Context (138 million records)

ZIP Code

30% of ZIP Codes contain a single street name

5% of ZIP Codes contain a single primary number

2% of ZIP Codes contain a single add-on

<ZIP Code, primary number>: maximum number of records returned is 3,071

<ZIP Code, add-on>: maximum number of records returned is 3,070

LDR

Lex     Top 1   Top 2
10      96.5    98.7
100     89.2    94.1
1000    75.3    86.3

Page 7: Trivandrum

Paradigms

Context Ranked Lexicon

Lexicon Driven OCR

LDR

Lexicon Free OCR

LFR

Segmentation Recognition Post-processing

Page 8: Trivandrum

Lexicon Free (LFR)

[Figure: word image cut into segments 1 to 8, with a segmentation graph whose edges carry character hypotheses and confidences: i[.8], l[.8], u[.5], v[.2], w[.6], m[.3], w[.7], i[.7], u[.3], m[.2], m[.1], r[.4], d[.8], o[.5]]

- Image from segment 1 to 3 is a ‘u’ with 0.5 confidence
- Image from segment 1 to 4 is a ‘w’ with 0.7 confidence
- Image from segment 1 to 5 is a ‘w’ with 0.6 confidence and an ‘m’ with 0.3 confidence

Find the best path in the graph from segment 1 to 8
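The best-path search can be sketched with a small dynamic program. This is only an illustration: the edge list `EDGES` is a hypothetical graph that borrows a few confidences from the slide, and scoring a path by the product of its confidences is an assumption, not something the slide states.

```python
import math

# Hypothetical segmentation graph: (start cut, end cut, character, confidence).
# A few values follow the slide; the overall layout is assumed for illustration.
EDGES = [
    (1, 2, "i", 0.8), (1, 2, "l", 0.8),
    (1, 3, "u", 0.5), (1, 3, "v", 0.2),
    (1, 4, "w", 0.7),
    (1, 5, "w", 0.6), (1, 5, "m", 0.3),
    (4, 5, "i", 0.7),
    (4, 6, "o", 0.5),
    (5, 6, "o", 0.5),
    (6, 7, "r", 0.4),
    (7, 8, "d", 0.8),
]

def best_path(edges, start, goal):
    """Return (score, text) for the path with the largest product of confidences."""
    best = {start: (0.0, "")}  # cut point -> (log score, decoded text)
    # Every edge goes forward, so visiting cut points in increasing order is enough.
    for node in range(start, goal):
        if node not in best:
            continue
        log_score, text = best[node]
        for s, e, ch, conf in edges:
            if s != node:
                continue
            cand = (log_score + math.log(conf), text + ch)
            if e not in best or cand[0] > best[e][0]:
                best[e] = cand
    if goal not in best:
        return None
    log_score, text = best[goal]
    return math.exp(log_score), text

score, text = best_path(EDGES, 1, 8)
```

No lexicon is consulted here: the decoded string is whatever the highest-scoring path spells, which is why a post-processing step is usually needed.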

Page 9: Trivandrum

Lexicon Driven (LDR)

[Figure: word image with cut points 1 to 9; each candidate span is labeled with its distance to a character of the lexicon entry, e.g. w[7.6], w[7.2], w[5.0], o[6.1], r[3.8], d[4.4]]

Find the best way of accounting for characters ‘w’, ‘o’, ‘r’, ‘d’ by consuming all segments 1 to 8

Distance between lexicon entry ‘word’ first character ‘w’ and the image between:
- segments 1 and 4 is 5.0
- segments 1 and 3 is 7.2
- segments 1 and 2 is 7.6
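A minimal sketch of lexicon-driven matching as a dynamic program follows. The three ‘w’ distances are the ones quoted on the slide; every other entry in `DIST`, the cut-point range 1 to 9, and the `max_span` limit are assumptions made for illustration.

```python
def match_word(word, dist, first, last, max_span=4):
    """Minimum total distance when each character consumes a consecutive span.

    dist maps (character, start cut, end cut) to a distance; smaller is better.
    Returns float('inf') if the word cannot consume all segments.
    """
    inf = float("inf")
    # cost[k][p]: best distance after matching k characters, ending at cut point p
    cost = [{first: 0.0}] + [dict() for _ in word]
    for k, ch in enumerate(word):
        for p, c in cost[k].items():
            for q in range(p + 1, min(p + max_span, last) + 1):
                d = dist.get((ch, p, q))
                if d is not None and c + d < cost[k + 1].get(q, inf):
                    cost[k + 1][q] = c + d
    return cost[len(word)].get(last, inf)

DIST = {
    ("w", 1, 2): 7.6, ("w", 1, 3): 7.2, ("w", 1, 4): 5.0,  # from the slide
    ("o", 4, 5): 7.7, ("o", 4, 6): 6.1,                    # hypothetical
    ("r", 5, 7): 7.5, ("r", 6, 7): 6.4, ("r", 6, 8): 5.8,  # hypothetical
    ("d", 7, 9): 4.4, ("d", 8, 9): 6.5,                    # hypothetical
}

# Rank a toy lexicon by matching distance (smallest first).
ranking = sorted(["wood", "word"], key=lambda w: match_word(w, DIST, 1, 9))
```

Unlike the lexicon-free search, every lexicon entry is matched against the image, so the cost grows with lexicon size but the output is always a valid word.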

Page 10: Trivandrum

Grapheme Models (LFR)

grapheme pos orientation angle

Down cusp 3.0 -90°

Up loop

Down arc

Writer Specific Modeling

Holistic Features

Page 11: Trivandrum

a) Amherst b) Buffalo c) Boston d) None of the above

ABLE TRIP TRAP

A T N

Words

Letters

Features

Interactive Models (LDR)

1-way activation [McClelland and Rumelhart 1981]

2-way interaction

Page 12: Trivandrum

Interactive Models (LDR): Phrase Level

T-crossings, loops, ascenders, descenders, length

Lexicon 1: West Central Street, West Main Street, Sunset Avenue

Lexicon 2: West Central Street, East Central Street, Sunset Avenue

Lexicon 3: West Central Street, West Central Avenue, Sunset Avenue

Interactive Model

features

image

2-way interaction

Page 13: Trivandrum

Interactive Models: Character Recognition

Adaptive feature selection

Adaptive number of features

Adaptive resolutions

Gradient (4) and Moment (5) Features

0 1 0 1 1 1 0 0 1

[Park and Govindaraju, IEEE CVPR 2000]

Page 14: Trivandrum

Active Recognition

Page 15: Trivandrum

Results

                Active Model   Neural Net   KNN
Top 1 %         95.7%          96.4%        95.7%
Temp            612            976          3,777
Msec            1.45           11.5         384
Training hrs    1              24           1

10 class digit recognition

25656 training and 12242 test (Postal + NIST)

Lex size     LDR %    GM %
10           96.86    96.56
100          91.36    89.12
1000         79.58    75.38
(Top 50)     98.00    98.40
20000        62.43    58.14
(Top 100)    93.59    93.39

Page 16: Trivandrum

Fusion

Identification Task

Verification Task

LDR

LFR

Page 17: Trivandrum

Fusion of Recognizers: Type III

Question: if we find optimal $f_N$ and $f_1$, is it necessarily $f_1 = f_N$?

Identification task:

            LDR    LFR
Amherst     5.6    .52    $S_1 = f_N(s_1^1, s_1^2)$
Buffalo     7.4    .81    $S_2 = f_N(s_2^1, s_2^2)$

Decision: $\arg\max_{i=1,\ldots,N} S_i$

Verification task:

            LDR    LFR
Amherst     5.6    .52    $S = f_1(s^1, s^2)$, then Accept or Reject

Page 18: Trivandrum

Traditional Fusion Rules

• Sum rule: $f_1(s^1, s^2) = s^1 + s^2$

• Weighted sum rule: $f_1(s^1, s^2) = w_1 s^1 + w_2 s^2$

• Product rule: $f_1(s^1, s^2) = s^1 s^2$

• Max rule: $f_1(s^1, s^2) = \max(s^1, s^2)$

• Rank-based methods: $r_i^1 = \mathrm{rank}(s_i^1, \{s_1^1, \ldots, s_N^1\})$, then

$f_1(s_i^1, s_i^2) = r_i^1 + r_i^2$

$f_1(s_i^1, s_i^2) = P(r_i^1, r_i^2 \mid gen)$
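The fixed rules can be sketched in a few lines. The scores below are hypothetical similarity values in [0, 1] where larger is better (the LDR scores on the slides look like distances and would need to be converted first), and the weights 0.7 and 0.3 are arbitrary.

```python
def sum_rule(s1, s2):
    return s1 + s2

def weighted_sum_rule(s1, s2, w1=0.7, w2=0.3):
    return w1 * s1 + w2 * s2

def product_rule(s1, s2):
    return s1 * s2

def max_rule(s1, s2):
    return max(s1, s2)

def ranks(scores):
    """Rank 1 goes to the highest score within one recognizer's score set."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cls: pos + 1 for pos, cls in enumerate(ordered)}

def identify(scores1, scores2, rule):
    """Apply the same rule to every class and return the arg max."""
    return max(scores1, key=lambda c: rule(scores1[c], scores2[c]))

# Hypothetical scores from two recognizers for one input image.
rec1 = {"Amherst": 0.9, "Buffalo": 0.6, "Boston": 0.2}
rec2 = {"Amherst": 0.52, "Buffalo": 0.81, "Boston": 0.1}

winners = {rule.__name__: identify(rec1, rec2, rule)
           for rule in (sum_rule, weighted_sum_rule, product_rule, max_rule)}
```

On these numbers the product rule picks a different class than the other three, which illustrates why the choice of combination function matters.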

Page 19: Trivandrum

Likelihood Ratio: Verification Tasks

[Figure: scatter plot of impostor and genuine samples with decision boundary $f_V$; axes: Recognizer score 1, Recognizer score 2]

• 2 classes: impostor and genuine
• Pattern classification task

$f_{lr}(s^1, s^2) = \frac{p_{gen}(s^1, s^2)}{p_{imp}(s^1, s^2)}$

Minimum risk criteria: optimal decision boundaries coincide with the contours of the likelihood ratio function: $f_V = f_{lr}$

Metaclassification with NN, SVM, etc. also possible

[Prabhakar, Jain 02] [Nandkumar, Jain, Das 08]
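A minimal sketch of likelihood-ratio verification follows. The densities are modeled as one Gaussian per recognizer score, which is a simplifying assumption (the slide does not say how the densities are estimated); the training samples and the threshold of 1.0 are hypothetical.

```python
from statistics import NormalDist

def fit(samples):
    """Fit one Gaussian per score dimension (independent dimensions assumed)."""
    return [NormalDist.from_samples(column) for column in zip(*samples)]

def density(model, s):
    p = 1.0
    for dist, x in zip(model, s):
        p *= dist.pdf(x)
    return p

def likelihood_ratio(gen_model, imp_model, s):
    """f_lr(s) = p_gen(s) / p_imp(s)."""
    return density(gen_model, s) / density(imp_model, s)

# Hypothetical (score 1, score 2) pairs collected from labeled trials.
genuine = [(0.9, 0.8), (0.8, 0.7), (0.85, 0.9), (0.7, 0.75)]
impostor = [(0.3, 0.2), (0.4, 0.35), (0.2, 0.3), (0.35, 0.25)]
gen_model, imp_model = fit(genuine), fit(impostor)

def verify(s, threshold=1.0):
    """Accept when the likelihood ratio exceeds the threshold."""
    return likelihood_ratio(gen_model, imp_model, s) > threshold
```

Moving the threshold trades false accepts against false rejects, which traces out the ROC curve.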

Page 20: Trivandrum

Optimal Combination Functions

Identification Task Results (top choice correct rate)

LFR is correct       54.8%
LDR is correct       77.2%
Both are correct     48.9%
Either is correct    83.0%
Likelihood Ratio     69.8%
Weighted Sum         81.6%

• LR combination is worse than single matcher

Verification Task Results

ROC: $f_V = f_{LR}$

Page 21: Trivandrum

$S_i = f(s_i^1, \ldots, s_i^M, \{s_k^1, \ldots, s_k^M\}_{k \neq i})$

Independence of Scores: In a single trial

            LDR    LFR
Amherst     5.6    .52    $f(s_1^1, s_1^2)$
Buffalo     7.4    .81    $f(s_2^1, s_2^2)$
….          ….

Page 22: Trivandrum

$S_i = f(s_i^1, \ldots, s_i^M, \{s_k^1, \ldots, s_k^M\}_{k \neq i})$

Independence of Scores: In a single trial

Lexicon 1, …, Lexicon i, …, Lexicon N

Recognizer 1: Dependent

Recognizer M: Dependent

Independent?

Tulyakov & Govindaraju, TIFS 2009

Page 23: Trivandrum

Optimal Combination: $f_N = f_{lr}$ ?

            LFR      LDR      Both correct   Either correct   LR       Weighted sum
Rate        54.8%    77.2%    48.9%          83.0%            69.8%    81.6%
Count       3366     4744     3005           5105             4293     5015

Set size: 6147

Correlated Scores

        2nd choice   3rd choice   4th choice   Mean
LFR     .4359        .4755        .4771        .1145
LDR     .7885        .7825        .7673        .5685

Dependent on input signal

Page 24: Trivandrum

Optimal Trainable Combination Function

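Minimizing misclassification cost: classify as $i$ rather than $j$ if

$p(s_1^1, s_1^2, \ldots, s_N^1, s_N^2 \mid i) > p(s_1^1, s_1^2, \ldots, s_N^1, s_N^2 \mid j)$

Assume that scores assigned to different classes are independent:

$p(s_1^1, s_1^2, \ldots, s_N^1, s_N^2 \mid i) = p(s_1^1, s_1^2 \mid i) \cdots p(s_i^1, s_i^2 \mid i) \cdots p(s_N^1, s_N^2 \mid i) = p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_i^1, s_i^2) \cdots p_{imp}(s_N^1, s_N^2)$

The condition becomes

$p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_i^1, s_i^2) \cdots p_{imp}(s_N^1, s_N^2) > p_{imp}(s_1^1, s_1^2) \cdots p_{gen}(s_j^1, s_j^2) \cdots p_{imp}(s_N^1, s_N^2)$

and dividing both sides by the product of all impostor densities gives

$\frac{p_{gen}(s_i^1, s_i^2)}{p_{imp}(s_i^1, s_i^2)} > \frac{p_{gen}(s_j^1, s_j^2)}{p_{imp}(s_j^1, s_j^2)}$

$f_N$: $\arg\max_{i=1,\ldots,N} f_{lr}(s_i^1, s_i^2)$

Tulyakov & Govindaraju IJPRAI 2009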

Page 25: Trivandrum

Combination Methods: Identification Tasks

[Figure: three scatter plots of impostor and genuine samples; axes: Recognizer score 1, Recognizer score 2]

No!

Traditional training mixes the genuine and impostor scores from different trials.

Page 26: Trivandrum

Combination Methods: Identification Tasks

[Figure: three scatter plots of impostor and genuine samples; axes: Recognizer score 1 (Biometric score 1 in the third panel), Recognizer score 2]

Model Training MUST process scores from one identification trial as a single training sample.

Page 27: Trivandrum

Iterative Methods

• Initialize a combination function $f(s^1, s^2, \ldots, s^M)$
• Get scores from the same identification trial (for all trials)
• Update function so genuine score is better than any impostor score

Best Impostor Function

$f() = \frac{p_{gen}(s_i^1, s_i^2, \ldots, s_i^M)}{p_{imp}(s_i^1, s_i^2, \ldots, s_i^M)}$

Sum of Logistic Functions

[Equation: $f()$ as a sum of logistic functions of the scores $s^1, \ldots, s^M$]

                Likelihood Ratio   Weighted sum   Best Impostor Likelihood Ratio   Logistic Sum   Neural Network
LFR & LDR       69.84              81.58          80.07                            81.43          81.67
li & C          97.24              97.23          97.01                            97.34          97.39
li & G          95.90              95.47          95.99                            96.17          96.29
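The iterative procedure (initialize, take all scores of one trial together, update until the genuine score beats every impostor) can be illustrated with a perceptron-style update on a weighted sum. This is a stand-in technique chosen for brevity, not the training method behind the table; `TRIALS`, the learning rate, and the starting weights are hypothetical.

```python
def combine(w, s):
    """Weighted-sum combination f(s^1, ..., s^M) = sum_j w_j * s^j."""
    return sum(wj * sj for wj, sj in zip(w, s))

def train(trials, w, lr=1.0, epochs=20):
    """Each trial is one training sample: (genuine scores, [impostor scores, ...])."""
    w = list(w)
    for _ in range(epochs):
        changed = False
        for genuine, impostors in trials:
            best_imp = max(impostors, key=lambda s: combine(w, s))
            if combine(w, genuine) <= combine(w, best_imp):
                # Push the genuine score above the best impostor of the same trial.
                w = [wj + lr * (g - b) for wj, g, b in zip(w, genuine, best_imp)]
                changed = True
        if not changed:
            break
    return w

def top_choice_correct(trials, w):
    """Number of trials in which the genuine class gets the highest combined score."""
    return sum(all(combine(w, g) > combine(w, s) for s in imps) for g, imps in trials)

# Hypothetical trials: recognizer 1 is unreliable, recognizer 2 is informative.
TRIALS = [
    ((0.5, 0.9), [(0.7, 0.3), (0.6, 0.2)]),
    ((0.4, 0.8), [(0.6, 0.4), (0.3, 0.1)]),
    ((0.9, 0.7), [(0.8, 0.2), (0.95, 0.3)]),
]
w = train(TRIALS, [1.0, 0.0])
```

The key point is that impostor scores are only ever compared with the genuine score from the same trial, never pooled across trials.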

Page 28: Trivandrum

Outline

Recognition: Postal Applications, Paradigms, Fusion

Search: Lexicon Reduction, Word Spotting, IR Models

Page 29: Trivandrum

Search for Handwritten Documents

Lexicon       Good Quality      Historical       Medical
              10K      1K       10K      1K      4K
Top 1 (%)     57       67       12       28      20
Top 3 (%)     69       72       22       44      27
Top 10 (%)    74       75       32       72      42

• Lexicons are typically large: >5K
• Need around 70% accuracy

Strategy
• Reduce lexicon size using topic categorization (DAS 06;08)
• Use Top-N choices returned by OCR (ICDAR 07)

Page 30: Trivandrum

• Pre Hospital Care Report
  WNY: 250,000 filed a year
  NYC: 50,000 filed in a day
  PDAs not popular

• OHR issues
  Loosely constrained writing style
  Large lexicons
  Heterogeneous data

6,700 carbon forms stored at 300 DPI
1000 PCR forms ground truthed

Search Engine: Handwritten Forms

Page 31: Trivandrum

Search Engine for Medical Forms

• Find all people who reported asthma problems in NY
• How many people with high blood pressure are on medication X?
• Is there an epidemic breaking?

Page 32: Trivandrum

Topic Categorization: Lexicon Reduction

Lex Free, Large Lexicon > 5K

Handwritten Medical Documents

ICR Features

~33% word recognition rate (10 points gain)

Topic Categorization

Select Reduced Lexicon ~2.5K

Lex Driven

Page 33: Trivandrum

ICR Features Index

Page 34: Trivandrum

$cohesion(w_a, w_b) = z \cdot \frac{f(w_a, w_b)}{\sqrt{f(w_a) \cdot f(w_b)}}$

DIGESTIVE-SYSTEM

FQ    CHSN    PHRASE
30    0.72    PAIN INCIDENT
5     0.31    PAIN TRANSPORTED
42    0.54    PAIN CHEST
52    0.81    STOMACH PAIN
9     0.25    HOME PAIN
6     0.43    VOMITING ILLNESS

Topic Features
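A small sketch of the cohesion score, assuming it has the form z · f(a, b) / sqrt(f(a) · f(b)) with f counting occurrences. Counting only adjacent word pairs, setting z = 1, and the three toy documents are all assumptions for illustration.

```python
import math
from collections import Counter

def phrase_cohesion(docs, z=1.0):
    """Cohesion of each adjacent word pair: z * f(a, b) / sqrt(f(a) * f(b))."""
    word_freq = Counter(w for doc in docs for w in doc)
    pair_freq = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return {
        (a, b): z * f_ab / math.sqrt(word_freq[a] * word_freq[b])
        for (a, b), f_ab in pair_freq.items()
    }

# Hypothetical tokenized form entries.
docs = [
    ["stomach", "pain"],
    ["chest", "pain"],
    ["stomach", "pain", "vomiting"],
]
scores = phrase_cohesion(docs)
```

Pairs whose words almost always occur together score close to z, while pairs that involve a very frequent word such as "pain" are discounted by the square-root normalization.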

Page 35: Trivandrum

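Topic Categorization (Chu-Carroll, et al., 1999)

$B_{t,c} = \frac{A_{t,c}}{\sqrt{\sum_{e=1}^{n} A_{t,e}^2}}$

$IDF(t) = \log_2 \frac{n}{c(t)}$

$X_{t,c} = IDF(t) \cdot B_{t,c}$

$z = \cos(x, y) = \frac{x y^T}{\sqrt{\sum_{i=1}^{n} x_i^2 \cdot \sum_{i=1}^{n} y_i^2}}$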
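The weighting and cosine matching can be sketched as follows, reading A as a term-by-category count matrix, n as the number of categories, and c(t) as the number of categories containing term t. Representing the query as raw term counts, the category names other than DIGESTIVE-SYSTEM, and all counts are assumptions.

```python
import math

def weighted_matrix(counts, terms, cats):
    """X[t][k] = IDF(t) * B[t][k], where B is A with each term row scaled to unit length."""
    n = len(cats)
    X = {}
    for t in terms:
        row = [counts[c].get(t, 0) for c in cats]
        norm = math.sqrt(sum(a * a for a in row))
        c_t = sum(1 for a in row if a > 0)
        idf = math.log2(n / c_t) if c_t else 0.0
        X[t] = [idf * a / norm if norm else 0.0 for a in row]
    return X

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def categorize(query, counts):
    """Return (best category, all cosine scores) for a list of query terms."""
    cats = sorted(counts)
    terms = sorted({t for c in cats for t in counts[c]})
    X = weighted_matrix(counts, terms, cats)
    x = [query.count(t) for t in terms]
    scores = {c: cosine(x, [X[t][k] for t in terms]) for k, c in enumerate(cats)}
    return max(scores, key=scores.get), scores

# Hypothetical term counts per category.
counts = {
    "DIGESTIVE-SYSTEM": {"stomach": 5, "pain": 4, "vomiting": 3},
    "CARDIAC": {"chest": 6, "pain": 5},
    "TRAUMA": {"fall": 4, "pain": 2, "arm": 3},
}
best, scores = categorize(["stomach", "pain", "vomiting"], counts)
```

Because "pain" occurs in every category its IDF is zero, so it contributes nothing to the match; the decision rests on the discriminative terms.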

Page 36: Trivandrum

Results

              CLT to RLT   CL to RL   CLT to ALT   CLT to SLT
HR            7.48%        7.42%      17.58%       7.42%
Error Rate    10.78%       10.88%     24.53%       10.21%

C: complete lexicon
R: reduced lexicon
A: category given
S: features synthetic
T: truth present

Page 37: Trivandrum

Outline

Recognition: Postal Applications, Paradigms, Fusion

Search: Lexicon Reduction, Word Spotting, IR Models

Page 38: Trivandrum

Urgent Issue of our Times

Vast, irreplaceable, culturally vital legacy

collections of historical documents are competing

ineffectively for attention with billions of digital

documents

Thus historical archives are threatened with

neglect, perceived irrelevance, …. & eventually,

oblivion?

Threat: ‘If it’s not in Google, it doesn’t exist!’

Baird 2003

Page 39: Trivandrum

What is possible today?

• View Document Images

Page 40: Trivandrum

Document Enhancement

[Shi, Setlur, and Govindaraju 2008]

Page 41: Trivandrum

Transcript-Mapping

1787 Thomas Jefferson letter and its transcript

Image

Transcript

+ +

Page 42: Trivandrum

What is not possible today?

Page 43: Trivandrum
Page 44: Trivandrum

Multilingual Document Corpus

Retrieved Documents

English

Hindi Sanskrit

Translations of “strength”

Crosslingual Retrieval

Page 45: Trivandrum

SEARCH: Handwritten Documents

Image – Based

Use Image Based

Features

OCR - Based

Use OCR Recognition

Results

Query rendered

Page 46: Trivandrum

Poor performance in multiple writer scenarios

Image Based Methods

(Rath 07 IJDAR)

Page 47: Trivandrum

SEARCH: Handwritten Documents

Image – Based

Use Image Based

Features

OCR - Based

Use OCR recognition

results

Page 48: Trivandrum

Indexing Retrieval

Handwriting Recognition

Page 49: Trivandrum

Vector IR Model (TF-IDF)

Set of terms $\{t_i\}$; set of documents $\{d_j\}$ of length $\{L_j\}$

Term Frequency (TF): $tf_{i,j} = \frac{freq_{i,j}}{L_j}$

Inverse Document Frequency (IDF): $idf_i = \log \frac{\#\{d_j\}}{\#\{d_j \mid freq_{i,j} > 0\}}$

Query TF: $tf_{i,q} = 1$ if $t_i$ is in query $q$, $0$ otherwise

Similarity: $sim(d_j, q) = \sum_i tf_{i,j} \cdot idf_i \cdot tf_{i,q}$

Example: $q = \{$"back", "pain"$\}$

terms    $tf_{i,j}$   $idf_i$   $tf_{i,q}$
back     0.024        4.1       1
pain     0.008        2.4       1
...      ...          ...       0

[Baeza-Yates99]
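A minimal sketch of the vector model as defined on this slide. The toy documents are hypothetical, the natural logarithm is an assumption (the slide does not state the base), and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def similarities(docs, query):
    """sim(d_j, q) = sum over query terms of tf_{i,j} * idf_i (tf_{i,q} is 0 or 1)."""
    n = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    sims = []
    for doc in docs:
        freq = Counter(doc)
        sim = 0.0
        for t in set(query):
            if doc_freq[t] == 0:
                continue  # term occurs in no document, idf undefined
            tf = freq[t] / len(doc)
            idf = math.log(n / doc_freq[t])
            sim += tf * idf
        sims.append(sim)
    return sims

# Hypothetical transcribed documents.
docs = [
    ["back", "pain", "severe", "back"],
    ["chest", "pain"],
    ["leg", "fracture"],
]
sims = similarities(docs, ["back", "pain"])
```

Documents are then returned in decreasing order of `sims`; a term that appears in every document gets idf 0 and does not affect the ranking.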

Page 50: Trivandrum

Modifications to VM

Classic VM: computes the tf and idf from the OCR’ed text (top-1)

$tf_{i,j} = \frac{freq_{i,j}}{L_j}$

$idf_i = \log \frac{\#\{d_j\}}{\#\{d_j \mid freq_{i,j} > 0\}}$

Modified VM: computes the tf and idf from the top-n choices of word recognition

$tf_{i,j}^{ocr} = \frac{E\{freq_{i,j}\}}{L_j}$

$idf_i^{ocr} = \log \frac{\#\{d_j\}}{\#\{d_j \mid E\{freq_{i,j}\} > 0.5\}}$

Page 51: Trivandrum

Required Inputs

Word segmentation result

Word recognition likelihoods

Estimation

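$w = [w_1 w_2 \ldots w_L]$: word images

$E\{freq_{i,j}\} = \sum_{k=1}^{L} \Pr(t_i \mid w_k)$

Example for Doc $d_j$: $\Pr($"pain"$ \mid w_k)$ = 0.02, 0.01, 0.2, 0.01, 0.01, …, which are summed to give $E\{freq_{\text{"pain"},j}\}$

[Rath 04, Howe 05]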

Page 52: Trivandrum

Estimating Term Frequency

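$E\{freq_{i,j}\} = \sum_{I_w} \Pr(I_w) \cdot \Pr(t_i \mid I_w)$

$\{t_i\}$: $\{t_0, t_1, t_2, \ldots\}$ = {"head", "arm", "pelvis", ...}

Each word-image hypothesis $I_w$ carries a segmentation probability $\Pr(I_w)$ and recognition probabilities $\Pr($"head"$ \mid I_w)$, $\Pr($"arm"$ \mid I_w)$, $\Pr($"pelvis"$ \mid I_w)$, ...

$\Pr(I_w)$: 1, 1, 0.5, 1, ...

Recognition probabilities listed per hypothesis on the slide: (0.2, 0.05, 0.01), (0.7, 0.07, 0.01), (0.8, 0.01, 0.002), (0.01, 0.07, 0.03), ...

Example for $t_1$ = "arm":

$E\{freq_{1,j}\} = \sum_{I_w \in d_j} \Pr(I_w) \cdot \Pr($"arm"$ \mid I_w) = 1 \times 0.05 + 1 \times 0.7 + 0.5 \times 0.01 + 1 \times 0.07 + \ldots$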
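The expected term frequency and the modified tf and idf can be sketched as below. The four "arm" hypotheses reproduce the terms of the worked sum on the slide (read as 1 × 0.05 + 1 × 0.7 + 0.5 × 0.01 + 1 × 0.07, ignoring the elided remainder); the function names, the document length of 4, the three-document collection, and the natural logarithm are assumptions.

```python
import math

def expected_freq(term, hypotheses):
    """E{freq} = sum over word-image hypotheses of Pr(I_w) * Pr(term | I_w).

    hypotheses: list of (Pr(I_w), {term: Pr(term | I_w)}).
    """
    return sum(p_seg * probs.get(term, 0.0) for p_seg, probs in hypotheses)

def tf_ocr(term, hypotheses, doc_length):
    """Modified term frequency: E{freq} / L_j."""
    return expected_freq(term, hypotheses) / doc_length

def idf_ocr(expected_freqs, cutoff=0.5):
    """Modified idf: a document counts as containing the term if E{freq} > cutoff."""
    containing = sum(1 for e in expected_freqs if e > cutoff)
    return math.log(len(expected_freqs) / containing) if containing else 0.0

# First four hypotheses of the "arm" example.
doc_j = [
    (1.0, {"arm": 0.05}),
    (1.0, {"arm": 0.7}),
    (0.5, {"arm": 0.01}),
    (1.0, {"arm": 0.07}),
]
e_arm = expected_freq("arm", doc_j)
```

Setting every Pr(I_w) to 1 recovers the simpler estimate that sums recognition likelihoods over a fixed word segmentation.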

Page 53: Trivandrum

Estimating Segmentation

Word Segmentation: gap d between adjacent connected components above a threshold D (d > D)

Generate multiple hypotheses with multiple D

If hypothesis $I_w$ overlaps m other hypotheses, then

$\Pr(I_w) = \frac{1}{1 + m}$

Example with 3 hypotheses:

m             1      2      1
$\Pr(I_w)$    1/2    1/3    1/2
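A short sketch of the overlap rule Pr(I_w) = 1 / (1 + m). Representing each hypothesis as a half-open horizontal pixel span is an assumption; the three spans are chosen so that the overlap counts match the slide's example (m = 1, 2, 1).

```python
def segmentation_probs(hyps):
    """hyps: list of (start, end) spans, end exclusive. Returns Pr(I_w) per span."""
    probs = []
    for i, (a0, a1) in enumerate(hyps):
        # m = number of other hypotheses whose span intersects this one
        m = sum(
            1
            for j, (b0, b1) in enumerate(hyps)
            if j != i and a0 < b1 and b0 < a1
        )
        probs.append(1 / (1 + m))
    return probs

# Two short word hypotheses and one long hypothesis that covers both.
hyps = [(0, 10), (0, 20), (10, 20)]
probs = segmentation_probs(hyps)
```

A hypothesis that conflicts with many alternatives is down-weighted, so ambiguous regions contribute less to the expected term frequency.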

Page 54: Trivandrum

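Top-Rank (Top-S candidates involved)

$\Pr(t_i \mid I_w) = \begin{cases} 1/S, & \text{if } 1 \le \mathrm{rank}(t_i) \le S \\ 0, & \text{otherwise} \end{cases}$

Weighted Top-Rank (Empirical)

$\Pr(t_i \mid I_w) = (\text{top-}R \text{ OCR rate}) - (\text{top-}(R-1) \text{ OCR rate}), \quad R = \mathrm{rank}(t_i)$

Word Recognition $\Pr(t_i \mid I_w)$

$\Pr(t_i \mid I_w) = \frac{\Pr(t_i) \, e^{-d_i^2 / 2\sigma^2}}{\sum_{i'} \Pr(t_{i'}) \, e^{-d_{i'}^2 / 2\sigma^2}}$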
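The three estimates of Pr(t_i | I_w) can be sketched as below. The reading of the garbled formulas is itself an assumption (uniform mass over the top S ranks, differences of cumulative top-R OCR rates, and a prior-weighted Gaussian of the recognizer distance with a single spread sigma); all numbers are hypothetical.

```python
import math

def top_rank(rank, S):
    """Uniform probability over the top-S candidates, zero elsewhere."""
    return 1 / S if 1 <= rank <= S else 0.0

def weighted_top_rank(rank, top_rates):
    """top_rates[R] is the empirical top-R OCR rate, with top_rates[0] == 0."""
    if not 1 <= rank < len(top_rates):
        return 0.0
    return top_rates[rank] - top_rates[rank - 1]

def distance_posterior(dists, priors, sigma=1.0):
    """Pr(t_i | I_w) proportional to Pr(t_i) * exp(-d_i^2 / (2 sigma^2))."""
    weights = {
        t: priors[t] * math.exp(-d * d / (2 * sigma * sigma))
        for t, d in dists.items()
    }
    total = sum(weights.values())
    return {t: wt / total for t, wt in weights.items()}

# Hypothetical cumulative OCR rates for ranks 0..3 and recognizer distances.
rates = [0.0, 0.20, 0.25, 0.27]
posterior = distance_posterior(
    {"head": 1.0, "arm": 2.0, "pelvis": 3.0},
    {"head": 1 / 3, "arm": 1 / 3, "pelvis": 1 / 3},
)
```

The first two estimates need only the rank of each candidate, while the third uses the recognizer's distances directly and therefore depends on how well sigma is calibrated.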

Page 55: Trivandrum

Thank you!