Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition...

69
Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals

description

04 Jul 2005Istvan Marosi OCR Internals Main tasks of an OCR system: Image acquisition Get image B/W Scanning Gray Scanning Color Scanning Load from image file Preprocess image Layout recognition Text recognition User assisted correction Result exportation

Transcript of Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition...

Page 1: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

Dr. István MarosiScansoft-Recognita, Inc., Hungary

SSIP 2005, Szeged

Character Recognition Internals

Page 2: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognitionText recognitionUser assisted correctionResult exportation

Page 3: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionGet image

B/W ScanningGray ScanningColor ScanningLoad from image file

Preprocess imageLayout recognitionText recognitionUser assisted correctionResult exportation

Page 4: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionGet imagePreprocess image

Color separationThresholdingDespecklingRotationDeskewing

Layout recognitionText recognitionUser assisted correctionResult exportation

Color SeparationDe-speckle, de-skew

Page 5: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

The Preprocessed ImageJoined chars

Page 6: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Joined charsThe Preprocessed Image

Page 7: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

The Preprocessed ImageJoined chars

Page 8: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

The Preprocessed ImageBroken chars

Page 9: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

The Preprocessed ImageBroken chars

Page 10: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

The Preprocessed ImageBroken chars

Page 11: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisition

Layout recognitionText zones

Columns of flowed textTablesInverse text

Graphic zonesText recognitionUser assisted correctionResult exportation

Page 12: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisition

Layout recognitionText zonesGraphic zones

Line ArtPhoto

Text recognitionUser assisted correctionResult exportation

Page 13: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognition

Text recognitionSegmentationCalculation of Feature Vector ElementsClassificationLanguage AnalysisVoting

User assisted correctionResult exportation

Page 14: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 15: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 16: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 17: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 18: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 19: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 20: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

SegmentationWhat are those pixel groups belonging to a single letter?

Page 21: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognition

Text recognitionSegmentationCalculation of Feature Vector ElementsClassificationLanguage AnalysisVoting

User assisted correctionResult exportation

Page 22: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Calculation of FV Elements: Contour Tracing

Find a (new) white-black transitionFollow the “edge” of the pixels using the MIN or MAX ruleAdministrate the already traced white-black transitionsCollect information while going aroundAnd repeat the process on new shapes ...

Page 23: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Contour TracingFind a (new) white-black transition

Follow the “edge” of the pixels using the MIN or MAX ruleAdministrate the already traced white-black transitionsCollect information while going aroundAnd repeat the process on new shapes ...

Page 24: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Contour TracingFind a (new) white-black transitionFollow the “edge” of the pixels using the MIN or MAX rule

if black(a) then turn(ccw)else if black(b) then forwardelse turn(cw)

a b

Page 25: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Contour TracingFind a (new) white-black transitionFollow the “edge” of the pixels using the MIN or MAX rule

if black(a) then turn(ccw)else if black(b) then forwardelse turn(cw)

a b

if white(b) then turn(cw)else if white(a) then forwardelse turn(ccw)

ab

Page 26: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Contour TracingFind a (new) white-black transitionFollow the “edge” of the pixels using the MIN or MAX ruleAdministrate the already traced white-black transitionsCollect information while going aroundAnd repeat the process on new shapes ...

Page 27: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Some Easily Calculatable Data

Problem #1

Turning CW: In=In-1+1Turning CCW: In=In-1-1 Going Forward: In=In-1

Page 28: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Some Easily Calculatable Data

Problem #2

Turning CW: In=In-1+1Turning CCW: In=In-1-1 Going Forward: In=In-1

Page 29: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Some Easily Calculatable Data

Problem #3

Going Up: In=In-1-Xn

Going Down: In=In-1+Xn

Going Right: In=In-1

Going Left: In=In-1

Page 30: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Some Easily Calculatable Data

Problem #4

Going Up: In=In-1-Xn

Going Down: In=In-1+Xn

Going Right: In=In-1

Going Left: In=In-1

Page 31: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognition

Text recognitionSegmentationCalculation of Feature Vector ElementsClassificationLanguage AnalysisVoting

User assisted correctionResult exportation

AB A B

Page 32: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

A

B AB

Classification; Training modelsRestricted Coulomb Energy (RCE) Network(Dr. Leon Cooper, Dr. Charles Elbaum)

Page 33: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsRestricted Coulomb Energy (RCE) Network(Dr. Leon Cooper, Dr. Charles Elbaum)

A

B AB

Page 34: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 35: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Default radius Rmax

Page 36: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 37: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Default radius Rmax

Page 38: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 39: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 40: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Default radius Rmax

Page 41: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 42: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Decreased radius

Page 43: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Page 44: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Decreased radius Rmin

Page 45: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Classification; Training modelsNestor Learning System (NLS)

Pass 2Decreased radius

Page 46: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognition

Text recognitionSegmentationCalculation of Feature Vector ElementsClassificationLanguage AnalysisVoting

User assisted correctionResult exportation

Page 47: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

VotingText recognition in OmniPage Pro

OCR Engines available:Caere’s engine (codename: Salt & Pepper)

Recognita’s engine (codename: Paprika)

ScanSoft’s engine (codename: Fireworx)

Page 48: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Text recognition in OmniPage ProOCR Engines available:

Caere’s engine (Salt & Pepper)

Uses a Matrix Matching based algorithmfeature set: 40 cells of an 8x5 gridgood overall description of a shapeweaker at detailed structure

Recognita’s engine (Paprika)

Uses a Contour Tracing based algorithmfeture set: convex and concave arcs on the contourgood detailed description of a shapeweaker at overall structure

Voting

Page 49: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Text recognition in OmniPage ProOCR Engines available:

Caere’s engine (Salt & Pepper)

Recognita’s engine (Paprika)

ScanSoft’s engine (Fireworx)

Segmentation algorithms:

Voting

Page 50: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Text recognition in OmniPage ProOCR Engines available:

Caere’s engine (Salt & Pepper)

Recognita’s engine (Paprika)

ScanSoft’s engine (Fireworx)

Segmentation algorithms:Developed by independent groupsHave different strengths and weaknesses

Voting

Page 51: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Text recognition in OmniPage ProOCR Engines availableSegmentation algorithms

Conclusion:They are complementaryLet’s create a voting system

Voting

Page 52: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Voting strategiesExternal „Black box”voting

~20% gain

Image

Paprika Salt &Pepper

Vote

Txt 3 Txt 1

Dict

Final Txt

Voting

Fire-worx

Txt 2

Page 53: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Voting strategiesExternal „Black box”voting

Internal „Shape”voting

Voting Image

Paprika

Fire-worx

BronzeTxt 3

Txt 2

Dict

Final Txt

Salt &Pepper

Txt 1

Page 54: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Paprika

Original segmentation:Every independent connected component is a

character

Good segmentation: recognizeBad segmentation: reject

Image

Recognize originalsegmentationK.B.

Voting

Page 55: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Paprika

Image

Recognize originalsegmentation

Txt 2Train adaptive classifier

from original shapes

K.B.

AdaptiveK.B.

VotingTxt 1

Page 56: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Paprika

Try several segmentationsLoop if unrecognizable

Image

Recognize originalsegmentation

Txt 2Train adaptive classifier

from original shapes

Recognize broken andjoined shapes

K.B.

AdaptiveK.B.

VotingTxt 1

Dict

Page 57: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Paprika

Image

Recognize originalsegmentation

Txt 2Train adaptive classifier

from original shapes

Recognize broken andjoined shapes

K.B.

AdaptiveK.B.

Train adaptive classifierfrom ‘ugly’ shapes

VotingTxt 1

Dict

Page 58: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Paprika

Image

Recognize originalsegmentation

Txt 3

Txt 2Train adaptive classifier

from original shapes

Recognize broken andjoined shapes

K.B.

AdaptiveK.B.

Train adaptive classifierfrom ‘ugly’ shapes

Recognize more brokenand joined shapes Try several segmentations

Loop if unrecognizable

VotingTxt 1

Dict

Page 59: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

Image

Paprika

Fire-worx

BronzeTxt 3

Txt 1

Dict

Final Txt

Salt &Pepper

Txt 1

Voting strategies

~60% gain

Voting

Page 60: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognitionText recognition

User assisted correctionBy the user’s random editing...

Pop-up verifierManual Training

By proofreading of doubtful wordsResult exportation

Page 61: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognitionText recognition

User assisted correctionBy the user’s random editing...By proofreading of doubtful words

Correct: User dictionaryChanged: IntelliTrain

Remember trained charactersApply them on following pages

Result exportation

Page 62: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

IntelliTrainRecognized word: sorneUüng

Page 63: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

IntelliTrainRecognized word: sorneUüngFixed word: something

Page 64: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

IntelliTrainRecognized word: sorneUüngFixed word: something

Page 65: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

IntelliTrainRecognized word: sorneUüngFixed word: somethingSubstitutions found: m rn

thi Uü

Page 66: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

IntelliTrainRecognized word: sorneUüngFixed word: somethingSubstitutions found: m rn

thi UüPerform automatically:

Learn image pattern and substitution infoFind similar substituted (‘blue’) text on actual pageMatch against pattern of substitution and correctFind such errors on following pages, too

Page 67: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognitionText recognitionUser assisted correction

Result exportationCombine pages into a Document

Header / Footer recognitionPage numbersHyperlinks (e.g. „See Table 20”)

Save results

Page 68: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi

OCR InternalsMain tasks of an OCR system:

Image acquisitionLayout recognitionText recognitionUser assisted correction

Result exportationCombine pages into a DocumentSave results

doc filee-mailSpeech synthesizer

Page 69: Dr. István Marosi Scansoft-Recognita, Inc., Hungary SSIP 2005, Szeged Character Recognition Internals.

04 Jul 2005 Istvan Marosi