Document Image Analysis Lecture 11: Word Recognition and Segmentation


UC Berkeley CS294-9 Fall 2000 11- 1

Document Image Analysis
Lecture 11: Word Recognition and Segmentation

Richard J. Fateman, University of California – Berkeley
Henry S. Baird, Xerox Palo Alto Research Center

UC Berkeley CS294-9 Fall 2000 11- 2

The course so far….

• DIA overview, objectives, measuring success

• Isolated-symbol recognition:
  – Symbols/glyphs, models/features/classifiers
  – Image metrics, scaling up to 100 fonts of full ASCII
  – Last 2 lectures:
    • No single ‘best’ classifier – none dominates, but voting helps
    • Combinations of randomized features/classifiers!

UC Berkeley CS294-9 Fall 2000 11- 3

Recall: we can often spot words when characters are unclear…

• Crude segmentation into columns, paragraphs, lines, words
• Bottom up, by smearing horiz/vert … or
• Top down, by recursive X-Y cuts (sketched below)
• What we really want is WORD recognition, most of the time.
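A minimal recursive X-Y cut sketch, to recall how the top-down route works (Python over a binary numpy image with 1 = ink; the min_gap threshold is an illustrative assumption):

import numpy as np

def widest_gap(profile, min_gap):
    """Widest run of zero entries in a projection profile that is at
    least min_gap long; returns (start, end) or None."""
    best, run_start = None, None
    for i, v in enumerate(list(profile) + [1]):   # sentinel closes a trailing run
        if v == 0 and run_start is None:
            run_start = i
        elif v != 0 and run_start is not None:
            if i - run_start >= min_gap and (best is None or i - run_start > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best

def xy_cut(img, y0=0, x0=0, min_gap=5, boxes=None):
    """Recursive X-Y cut on a binary image (1 = ink): split at the widest
    blank horizontal gap, else vertical; leaves become boxes (y0, y1, x0, x1)."""
    if boxes is None:
        boxes = []
    if img.size == 0 or img.sum() == 0:           # blank region: nothing to report
        return boxes
    gap = widest_gap(img.sum(axis=1), min_gap)    # horizontal cut (blank rows)
    if gap:
        s, e = gap
        xy_cut(img[:s], y0, x0, min_gap, boxes)
        xy_cut(img[e:], y0 + e, x0, min_gap, boxes)
        return boxes
    gap = widest_gap(img.sum(axis=0), min_gap)    # vertical cut (blank columns)
    if gap:
        s, e = gap
        xy_cut(img[:, :s], y0, x0, min_gap, boxes)
        xy_cut(img[:, e:], y0, x0 + e, min_gap, boxes)
        return boxes
    boxes.append((y0, y0 + img.shape[0], x0, x0 + img.shape[1]))
    return boxes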

UC Berkeley CS294-9 Fall 2000 11- 4

Recall the scenario (lecture 9)

Lopresti & Zhou (1994)

UC Berkeley CS294-9 Fall 2000 11- 5

The flow goes one way

• No opportunity to correct failures in segmentation at the symbol stage
• No opportunity to object to implausible text at the next stage.
• (Providing alternative character choices gives only limited flexibility.)

UC Berkeley CS294-9 Fall 2000 11- 6

Recall: Character-by-Character Voting Succeeds & Fails

Majority vote (the most commonly used method)
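A minimal sketch of per-character majority voting; the three classifier outputs below are invented for illustration:

from collections import Counter

def majority_vote(votes):
    """Label that the most classifiers agree on (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for one five-character word.
outputs = [
    ['w', 'o', 'r', 'd', 's'],   # classifier 1
    ['w', 'o', 'n', 'd', 's'],   # classifier 2
    ['w', 'o', 'r', 'd', '5'],   # classifier 3
]

# Vote position by position; voting succeeds wherever a true majority
# exists, and fails exactly where it does not.
print(''.join(majority_vote(col) for col in zip(*outputs)))   # -> words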

UC Berkeley CS294-9 Fall 2000 11- 7

High accuracy requires some cleverness

• In fact, some words, even in cleanly typeset, high-resolution scanned text, have touching characters
• In noisy or low-resolution images, adjacent characters may be nearly or entirely touching, or broken (or both touching and broken!)
• If we accept the flowchart model, we need perfect segmentation to feed the symbol-recognition module
• If we reject the flowchart: OK, where do we go from here?

UC Berkeley CS294-9 Fall 2000 11- 8

Compare alternative approaches

• First clarify the word recognition problem and see how to approach it.

• Next we see how good a job we can do on segmentation (a fall-back when we can’t use the word-recognition model).

• Robustness might require both approaches (multiple algorithms again!)

UC Berkeley CS294-9 Fall 2000 11- 9

Formalize the word recognition problem (TK Ho)

Machine printed, ordinary fonts (variable width)

• Cut down on the variations – but NOT all the way to a trivial case:
  – A word is all in the same font/size [shape = feature]
    [we could trivialize the task with one font, e.g. E-13B]
  – Known lexicon (say 100,000 English words)
    [26^6 is about 309 million; our lexicon is about 0.03% of this]
    [or trivialize with a one-item lexicon (check the box, say “yes” …)]

• Applications in mind: post office, UNLV bakeoff

UC Berkeley CS294-9 Fall 2000 11- 10

Word Recognition: Objective

UC Berkeley CS294-9 Fall 2000 11- 11

At Least Three Approaches

UC Berkeley CS294-9 Fall 2000 11- 12

In reality, a combination:

Later we will find that additional processing – inter-word statistics, or even natural-language parsing – may be incorporated into the ranking.

UC Berkeley CS294-9 Fall 2000 11- 13

Character Recognition Approach

Symbol recognition is done at the character level. Contextual knowledge is used only at the ranking stage.
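A sketch of that flow: characters are recognized first, and the lexicon enters only at the ranking stage. The confidence table and toy lexicon are invented for illustration:

# Per-character classifier output: candidate letters with confidences
# (hypothetical numbers for a three-letter word image).
char_scores = [
    {'c': 0.7, 'e': 0.2},
    {'a': 0.6, 'o': 0.3},
    {'t': 0.8, 'l': 0.1},
]

lexicon = ['cat', 'cot', 'eat', 'car']   # toy lexicon

def word_score(word):
    """Ranking stage: score a lexicon word by how well each of its
    letters matches the per-position character confidences."""
    if len(word) != len(char_scores):
        return 0.0
    score = 1.0
    for letter, scores in zip(word, char_scores):
        score *= scores.get(letter, 0.01)   # small floor for unseen letters
    return score

print(max(lexicon, key=word_score))   # -> cat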

UC Berkeley CS294-9 Fall 2000 11- 14

One error in character segmentation can distort many characters

Pipeline: input word image → character segmentation → segmented and normalized characters → recognition decisions

UC Berkeley CS294-9 Fall 2000 11- 15

How to segment words into characters?

• Aspect ratio (fixed width, anyway)
• Projection profile (see the sketch below)
• Other tricks
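A minimal projection-profile cutter, assuming a binary numpy word image with 1 = ink:

import numpy as np

def cut_at_gaps(word_img):
    """Split a binary word image at runs of blank columns; returns
    (start, end) column ranges, one per character candidate."""
    inked = word_img.sum(axis=0) > 0      # projection profile, thresholded
    segments, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, word_img.shape[1]))
    return segments

Touching characters come back as one merged segment, and broken characters as several – exactly the failure modes illustrated next.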

UC Berkeley CS294-9 Fall 2000 11- 16

Projection Profiles

UC Berkeley CS294-9 Fall 2000 11- 17

Modified Projection Profiles

“and” adjacent columns
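One reading of the “and” trick, sketched under that assumption: count only ink that persists into the adjacent column, so incidental touching pixels contribute less and near-gaps show up as deeper profile minima.

import numpy as np

def anded_profile(word_img):
    """Profile of (column AND next column) over a binary 0/1 image."""
    return (word_img[:, :-1] & word_img[:, 1:]).sum(axis=0)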

UC Berkeley CS294-9 Fall 2000 11- 18

Poor images: confusing profiles

UC Berkeley CS294-9 Fall 2000 11- 19

The argument for more context

Similar shapes appear in different contexts; in each case they are different characters, or parts of characters.

UC Berkeley CS294-9 Fall 2000 11- 20

Segmentation-based Approach

Segment the word into characters. Extract features from the normalized character images. Concatenate the feature vectors to form a word feature vector. The character features are then compared in the context of a word.

(Works if segmentation is easy but characters are difficult to recognize in isolation)
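A sketch of the idea, with per-cell ink densities standing in for whatever character features the real system used; the 4x4 grid and the nearest-prototype matcher are illustrative assumptions:

import numpy as np

def char_features(char_img, grid=4):
    """Reduce a segmented, normalized character image to a grid of ink densities."""
    h, w = char_img.shape
    cells = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            cell = char_img[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            cells[i, j] = cell.mean() if cell.size else 0.0
    return cells.ravel()

def word_features(char_imgs):
    """Concatenate per-character vectors into one word feature vector,
    so matching happens at the word level, in context."""
    return np.concatenate([char_features(c) for c in char_imgs])

def recognize(char_imgs, prototypes):
    """Nearest-prototype match: 'prototypes' maps each lexicon word to a
    vector built the same way from reference renderings of that word."""
    v = word_features(char_imgs)
    same_shape = {w: p for w, p in prototypes.items() if p.shape == v.shape}
    return min(same_shape, key=lambda w: np.linalg.norm(same_shape[w] - v))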

UC Berkeley CS294-9 Fall 2000 11- 21

Segmentation-based Word Recognition

Note that you would not have much chance to recognize these individual characters!

UC Berkeley CS294-9 Fall 2000 11- 22

Word-shapeAnalysisApproach

Squeeze out extra white space; locate global reference lines (upper, top, base, bottom – as in “Xxp”).

TKH partitions a word into 40 cells: 4 vertical regions by 10 horizontal ones (a sketch follows below).

Some words have no descender or ascender region: “Hill”, for instance, has no descenders.
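A sketch of that cell partition, taking the reference-line rows as given (pass in five row indices bounding the four bands; which lines bound which band follows the next slides). Per-cell ink density is my stand-in for the features actually used:

import numpy as np

def cell_features(word_img, row_bounds, n_cols=10):
    """Partition a word image into len(row_bounds)-1 bands at the given
    reference-line rows, slice each band into n_cols equal columns, and
    return per-cell ink density; five bounds give the 4 x 10 = 40 cells."""
    w = word_img.shape[1]
    feats = []
    for r0, r1 in zip(row_bounds, row_bounds[1:]):
        band = word_img[r0:r1]
        for j in range(n_cols):
            cell = band[:, j*w//n_cols:(j+1)*w//n_cols]
            feats.append(cell.mean() if cell.size else 0.0)
    return np.array(feats)    # the word-shape feature vector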

UC Berkeley CS294-9 Fall 2000 11- 23

Word transformations

UC Berkeley CS294-9 Fall 2000 11- 24

Detecting base, upper, top by smearing
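One simple version, sketched with a row-ink profile rather than a literal smearing pass; the 50%-of-peak threshold is an illustrative assumption:

import numpy as np

def reference_lines(word_img, frac=0.5):
    """Estimate the four reference lines of a non-blank binary word image:
    top/bottom are the first/last inked rows; the x-height band is the
    densest, so its first/last rows give the upper and base lines."""
    rows = word_img.sum(axis=1).astype(float)
    inked = np.flatnonzero(rows > 0)
    dense = np.flatnonzero(rows >= frac * rows.max())
    return int(inked[0]), int(dense[0]), int(dense[-1]), int(inked[-1])  # top, upper, base, bottom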

UC Berkeley CS294-9 Fall 2000 11- 25

The 40 area partitions

UC Berkeley CS294-9 Fall 2000 11- 26

Stroke Directions

UC Berkeley CS294-9 Fall 2000 11- 27

Edges, Endpoints

UC Berkeley CS294-9 Fall 2000 11- 28

Cases Each Approach is Best At …

UC Berkeley CS294-9 Fall 2000 11- 29

Most effective features?

• Best: defined locally, yet containing shape information – stroke vectors, Baird templates
• Less effective: very high-level (“holes”); very low-level (“pixel values”)
• Uncertainty / partial matching is important
• TK Ho…

UC Berkeley CS294-9 Fall 2000 11- 30

TK Ho’s experiments

• Context: ZIP code recognition
• Redundancy check requires reading the whole address
• 33,850 postal words
• Character recognizer trained on 19,151 images
• 77 font samples were used to make prototypes

UC Berkeley CS294-9 Fall 2000 11- 31

TK Ho’s experiments

Five (10?) methods used in parallel:
1. A fuzzy character template matcher plus a heuristic contextual postprocessor
2. Six character recognizers
3. A segmentation-based word recognizer using pixel values
4. A word shape analyzer using strokes
5. A word shape analyzer using Baird templates

UC Berkeley CS294-9 Fall 2000 11- 32

TK Ho’s experiments

Many interesting conclusions…
1. If several methods agree, they are almost always correct (99.6%), or correct by the second choice (100%)
2. Classifiers can be dynamically selected
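A minimal sketch of the agreement rule in conclusion 1; the recognizer outputs and the fallback combiner are invented for illustration:

def combine(top_choices, fallback):
    """Accept a unanimous top choice (per Ho, such answers were correct
    99.6% of the time); otherwise defer to a fallback combiner."""
    if len(set(top_choices)) == 1:
        return top_choices[0]
    return fallback(top_choices)

# Toy usage: three methods' top choices for one word image; the fallback
# here is simple plurality.
print(combine(['BERKELEY', 'BERKELEY', 'BERKLEY'],
              fallback=lambda c: max(set(c), key=c.count)))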