Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12....

31
Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta Andrea Vedaldi Andrew Zisserman Visual Geometry Group (VGG) University of Oxford ICVGIP 2018, Hyderabad

Transcript of Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12....

Page 1: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Learning to Read by SpellingTowards Unsupervised Text Recognition

Ankush Gupta Andrea Vedaldi Andrew Zisserman

Visual Geometry Group (VGG)

University of Oxford

ICVGIP 2018, Hyderabad

Page 2: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Recognition

tion of regular fits of the gout, one or more joints

Imaged Text ASCII Text

Assumes localisation is given

• Word / line level bounding boxes

Page 3: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Let’s solve this

Page 4: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Recognition: Sequence Learning 101

ConvNet

a..no..

<stop>

Sequence Model (e.g. RNNs)

Paired Image / Annotations

tion of regular fits of the gout one or more joints

part of the brain the tunica arachnoides was

rated and of a pale yellow colour and with the

Lots of paired data

Page 5: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Recognition: Paired Data?

Manual labour

• Expensive

• Boring..

Synthetic Data

• New engine for each domain

• Complex pipelines

• Domain gapJaderberg et al., NIPS DLW 2014

Gupta et al., CVPR16, BMVC18

Page 6: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Can we do without paired data?

Page 7: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

We leverage this structure for supervision.

Language is highly structured

• There are ~8 billion random strings of length 7 (26 characters)

but only 15K are valid English strings

• The frequency of characters and words, and their co-occurrence

(n-grams etc.) further constrain the output

Page 8: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Method

Page 9: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Recognition

2 sub-problems

1. Segment text image into characters, and cluster to consistent class

2. Assign each cluster to correct “character” label

à Solve for a |A | x |A | permutation matrix

where, A is the alphabet, e.g.: 26 English letters {a,b,c,…,z}

visual

language

Page 10: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Unpaired Text RecognitionLearn from unaligned text corpora, and text-images

Fully-Conv Recognition Net

images with <= L characters

|A |“fake” text

Valid Language

Strings

e.g. from: WMT,

NewsGroup etc.

Discriminator

“real” text

Adversarial Loss

L

28:

English letters {a,b,c,…,z}

+ space

+ pad

Softmax

each position

one-hot

visual language

Page 11: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Unpaired Text Recognition

Page 12: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

The “recognizer” can generate valid text without “recognizing”

Pitfall: Uncorrelated Predictions

à Fool the discriminator without solving the task

e.g. use “text-image” as noise à learn generator for valid English strings

Fully-Conv Recognition Net

“ this is a valid English string which looks real to the discriminator ”invalid

recognition

Page 13: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Uncorrelated Predictions: Solution

Local recognizer

restricted

receptive field

(~ 3 characters)

. . . . . tion of regular fits of the gout one or more joints

Global discriminator

checks validity of

the entire sentence Discriminator

No Reconstruction unlike CycleGAN

Because text à image is highly ambiguous

Page 14: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Experiments

Page 15: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Synthetic Text Images

• Fixed-width font

• WMT Newscrawl text source (EMNLP datasets)

• Control over nuisance factors à used for analysis

• 100K training sequences, 1K test sequences

Page 16: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Real Text Images

• Google Books scans

• Non-fixed width font

• Varying word spacing due to alignment

• “See-through” from back

• Different case (small / capital)

• Italics / noise / fading etc.

• 3K training lines, 300 test lines

(no overlapping pages)

• Use Google OCR output as

unaligned “text source”

Page 17: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Training StrategySample images and valid strings independently

Fully-Conv Recognition Net Valid Language

Strings

Discriminator

Adversarial Loss

softmax

predictionsone-hot

strings

Page 18: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Results

Page 19: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Synthetic Text Images

• ~99% character accuracy

• ~95% word accuracy

• Trained on 24-length sequences à test on 3, 5, 7, 9, 11, 24, 32, 48

à generalization to other lengths

test accuracy

Page 20: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Real Text Images

Why?

Varying spacing / non-fixed width font challenging for

fully-conv. recognizer

45% word accuracy!

Page 21: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Real Text Images

Why?

Varying spacing / non-fixed width challenging for fully-conv. recognizer

à Let features travel using a “skip-RNN” in the last layer

Conv

RNN

Skip

Page 22: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Real Text Images

85% character accuracy (now vs. 45% before)!

96.2% character accuracy.

Page 23: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Real Text Example

the different forms in which this disease ap

pears have rendered it necessary to divide it

into regular and aregular gout in the former

the attacks of which are known by the denomina

tion of regular fits of the got onne o more joints

of the extremities become inflamed painful and

tender and frequently in an exquisite deqree a

symptoneatic fever proportioned to the degree of

pain and inflammation with evening exacerba

tions accompand the other complaints which dis

tress the patient for uncertain perions sometimes

for several weeks wo he the fit goes off the piont

which have been the seat of the disease are always

found to have become rigid and inflexible in pro

portion to the degree in which the disease has

existed in themuffeequently remaining enlarged

and incapable of free motion for a considerable

timer cen he other hand the patient at the sas

time experiences so perfect an exemption from

disease as generally to lead to the opinion that

the fit has occasioned the most salutary changes

in the sten in ■ne■ e erd y i er

in the irregular gout the affection of the joints

is much less confined than in the former some e

times it leaves the joints at fort tttached and

fixes on some distant part■ and sometimes after

harassing the patient by making a circuit in■

Ground Truth Prediction

the different forms in which this disease ap

pears have rendered it necessary to divide it

into regular and irregular gout in the former

the attacks of which are known by the denomina

tion of regular fits of the gout one or more joint

of the extremities become inflamed painful and

tender and frequently in an exquisite degree a

symptomatic fever proportioned to the degree of

pain and inflammation with evening exacerba

tions accompany the other complaints which dis

tress the patient for uncertain periods sometimes

for several weeks when the fit goes off the joints

which have been the seat of the disease are always

found to have become rigid and inflexible in pro

portion to the degree in which the disease has

existed in them■ frequently remaining enlarged

and incapable of free motion for a considerable

time on the other hand the patient at the same

time experiences so perfect an exemption from

disease ag generallyto leadu to the opinion that

the fit has occasioned the most salutary changes

in the system ■ ■■■ ■ ■i■■

in the irregular gout the affection of the joints

is much less confined than in the former yiomeu

timcs it leaves the joints at first attacked and

fixes on some distant part■ and sometimes after

harassing the patient by making a circuit in■

Image

Page 24: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Analysis

Page 25: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Effect of Sequence Length on Training Convergence

Training with sequences of

different lengths:

• Short lengths 3-5:

no convergence

• Longer sequence à

faster convergence

13 > 11 > 9 > 7

Page 26: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Which symbol is learnt first?

learning order

1

2

3

4

train

ing r

un

s i e u a o l t d h g k m y c p r n f b x w v z j q

g e i s n l a u r d o h t k y m c v w x p b j f q z

g i e o a s u p t d l c n r w h k y v m b z q f j x

a e t s d i u l o h r n y m v w b j x f z c g p k q

low high

• Symbols are learnt roughly in the order of their frequency.

(Spearman’s rank correlation coefficient ρ = 0.80, p-value < 1e−5).

Page 27: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Corpus Ablation

• Completely unrelated lexicon (#3): small adverse effect

• Related lexicon (#2): no such effect

88

90

92

94

96

98

100

WMT WMT (No overlap) War & Peacechar word

Page 28: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Extensions

• Not text-image specific

à apply to any input domain, as long as output is still language

• Examples:

• Speech

• Sign language / gestures

• Lip reading

Page 29: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Learning to Read by SpellingTowards Unsupervised Text Recognition

Ankush Gupta Andrea Vedaldi Andrew Zisserman

Any Questions?

Visual Geometry Group (VGG)

University of Oxford

ICVGIP 2018, Hyderabad

Page 30: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Which symbol is learnt first?

0 2000 4000 6000 8000 10000

iterations

0.0

0.2

0.4

0.6

0.8

1.0

chara

cte

r accura

cy

a

et

s

d

iulohr n ym v w b

jx fz

cg p

k

q

Page 31: Learning to Read by Spelling - University of Oxfordvgg/publications/2018/Gupta... · 2018. 12. 28. · Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Learning Dynamics

tttttttttttttttttttttttttttttttttttttttttttrssssss

ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz

bcorote to whol th ticunthss tidio tiostolonzzzz

trougfht to ferr oy disectins it has dicomered

brought to view by dissection it was discovered

Tra

inin

g

ite

rati

on

s

1. First {space} / segmentation into words is learnt

2. Next, common words like {to, it}

3. Last, less frequent characters like {v, w}.

The last transcription also corresponds to the ground-truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy).