High Accuracy Optical Character Recognition Using Neural Networks with Centroid Dithering


Hadar I. Avi-Itzhak, Thanh A. Diep, and Harry Garland

IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 2, February 1995

Abstract--Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In this study a neural network approach is introduced to perform high accuracy recognition on multi-size and multi-font characters; a novel centroid-dithering training process with a low noise-sensitivity normalization procedure is used to achieve high accuracy results. The study consists of two parts. The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts. The performance of these two networks is evaluated based on a database of more than one million character images from the testing data set.

Index Terms--Pattern recognition, optical character recognition, neural networks.

I. INTRODUCTION

In today's world of information, countless forms, reports, contracts, and letters are generated each day; hence, the need to archive, retrieve, update, replicate, and distribute printed documents has become increasingly important [1], [3]. An available technology that automates these tasks on computer media is optical character recognition (OCR); printed documents are transformed into ASCII files, which enable compact storage, editing, fast retrieval, and other file manipulations through the use of a computer.

An essential requirement for OCR lies in the development of an accurate recognition algorithm by which digitized images are analyzed and classified into the corresponding characters. A comprehensive and recent benchmark for OCR performance with proprietary commercial algorithms was reported by Jenkins et al. [4], and the best recognition accuracy that has been achieved on data of three point sizes (10, 12, and 14) and three different fonts (Courier, Helvetica, and Times-Roman) ranges from 99.21% to 99.95%. Published literature reports error rates on the order of one percent for single-font recognition and higher error rates for multi-font recognition [5], [6]. While an error rate on the order of one percent may appear impressive, it would generate 30 errors on an average page containing 3,000 characters. Such error rates limit the usefulness of OCR in many applications [1], [2], [8] and illustrate the need for a more accurate recognition algorithm.

In this paper, we introduce a feedforward neural network scheme to recognize multi-size and multi-font character images with extreme accuracy. This high accuracy character recognition is achieved by using a centroid dithering training process and a low noise-sensitivity normalization procedure. The study consists of two parts.

Manuscript received Jan. 21, 1993; revised Feb. 21, 1994.
H.I. Avi-Itzhak and T.A. Diep are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305.
H. Garland is with Canon Research Center America, Palo Alto, CA 94304.
IEEECS Log Number P95053.

The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font.³ The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts.

The recognition process works as follows: First, the two-dimensional pixel array of the input character is preprocessed, normalized, and decomposed into a vector (see Section II.D). Second, the vector is processed by the neural network to yield an output of 94 numbers (see Section II.E). Third, the neuron in the output layer with the highest value is declared the winner, identifying the input character image (see Section II.E). Fourth, a simple postprocessing algorithm is used to detect invalid characters and to discriminate between characters whose images become indistinguishable during preprocessing. These include single quotes and commas of certain fonts and the case information of some characters (see Section II.G). The latter part of postprocessing applies to the multi-size and multi-font case only.
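This four-step flow is easy to express in code. The sketch below is a minimal illustration, not the authors' implementation: the helper names, the rejection threshold, and the omission of bias terms are all assumptions, and the actual preprocessing and postprocessing steps are described in Sections II.D and II.G.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preprocess(img):
    # Simplified stand-in for Section II.D: flatten and normalize to unit power.
    v = img.astype(float).ravel()
    return v / np.linalg.norm(v)

def recognize(img, W1, W2, charset, reject=0.5):
    v = preprocess(img)                 # Step 1: preprocess and vectorize
    h = sigmoid(W1 @ v)                 # Step 2: first layer
    y = sigmoid(W2 @ h)                 #         output layer, 94 numbers
    if y.max() < reject:                # Step 4 (partial): all responses small
        return None                     #   -> invalid input, flag to segmenter
    return charset[int(np.argmax(y))]   # Step 3: highest output wins
```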

II. NEURAL NETWORK IMPLEMENTATION

A. Network Architecture

The neural network used to recognize single-size and single-font character images has 3,000 inputs, 20 neurons in the first layer, and 94 neurons in the second or output layer. The neural network used to recognize multi-size and multi-font character images has 2,500 inputs, 100 neurons in the first layer, and 94 neurons in the output layer. Both networks are fully connected and feedforward, with the nonlinearity in each neuron generated by the following sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$
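For concreteness, the sketch below instantiates both architectures and counts their weights. The small random initialization anticipates Section II.E; the absence of bias terms is an assumption, since the paper does not state whether biases were used.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_network(n_in, n_hidden, n_out=94):
    # Small random initial weights, as used at the start of training (Section II.E).
    W1 = rng.normal(0.0, 0.01, (n_hidden, n_in))
    W2 = rng.normal(0.0, 0.01, (n_out, n_hidden))
    return W1, W2

# Single-size/single-font: 3,000 inputs (a 50-by-60 frame, see Section II.D).
W1s, W2s = make_network(3000, 20)
# Multi-size/multi-font: 2,500 inputs (a 50-by-50 frame), 100 hidden neurons.
W1m, W2m = make_network(2500, 100)
print(W1s.size + W2s.size)   # 61,880 weights
print(W1m.size + W2m.size)   # 259,400 weights
```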

Theoretically, the multi-font neural network could be implemented by a bank of single-font neural networks. Each neural network would be dedicated to recognizing one font, and character recognition would be based on the highest output. The obvious advantage of this approach is that it is straightforward. The disadvantages lie in the computational costs that are involved in training and testing these networks. Perhaps more importantly, the networks cannot benefit from associating the correct characters and disassociating the wrong characters of other fonts, since each network would employ training data from only one particular font. The lack of interdependency between the networks creates redundancy and increases the complexity.

B. Training Algorithm

The training algorithm used in this study is known as backpropagation [9]. It is a steepest-descent algorithm for finding a set of weights which minimizes the squared-error cost function:

$$C = \sum_{p} \sum_{i=1}^{N} \left( d_i^{(p)} - y_i^{(p)} \right)^2 \qquad (2)$$

³ Courier font is important because it is the font most often used in legal documents. The technique used in the development of a neural network for Courier font is general and can be applied to any other single font.



Fig. 1. Preprocessing. (a) Single-size and single-font preprocessing: thresholding and centering of a 12 point Courier image. (b) Multi-size and multi-font preprocessing: thresholding, scaling, and centering of a 16 point Times image.

where $d_i$ and $y_i$ designate the desired and actual outputs of the $i$th neuron in the last layer in response to the $p$th input pattern, and $N$ is the number of neurons in the output layer. The desired output for the $i$th output neuron under the $p$th input pattern is a discrete delta function:

$$d_i^{(p)} = \begin{cases} 1, & i = p \\ 0, & i \neq p \end{cases} \qquad (3)$$
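A minimal sketch of one backpropagation step on this cost is given below, assuming the logistic sigmoid of equation (1), the delta-function target of equation (3), and plain steepest descent on the weights of both layers; variable names are illustrative, not the authors'.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(v, d, W1, W2, lr):
    """One steepest-descent update on the squared-error cost of equation (2).

    v : preprocessed unit-power input vector
    d : desired output, a delta function (1 at the target neuron, 0 elsewhere)
    """
    # Forward pass through the two fully connected sigmoid layers.
    h = sigmoid(W1 @ v)
    y = sigmoid(W2 @ h)
    # Backward pass: derivative of (d - y)^2 through the sigmoids,
    # using sigmoid'(x) = y * (1 - y) in terms of the output y.
    delta2 = -2.0 * (d - y) * y * (1.0 - y)
    delta1 = (W2.T @ delta2) * h * (1.0 - h)
    # Steepest-descent weight updates.
    W2 -= lr * np.outer(delta2, h)
    W1 -= lr * np.outer(delta1, v)
    return np.sum((d - y) ** 2)   # squared error, for the m.s.e. monitoring of Section II.E
```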

C. Database

The data consisted of pre-segmented (though not all perfect) character images, scanned at 400 d.p.i. and in 8-bit gray scale resolution.

The first neural network deals with single-size and single-font character recognition, and the training and testing data is of 12 point Courier font. The training data is comprised of 94 digitized character images, with a one-to-one correspondence between each training image and each member of the 12 point Courier font character set. The neural network is thoroughly evaluated with testing data comprised of 1,072,452 character images from a library of English short stories.

The second neural network deals with multi-size and multi-font character recognition. The allowable point size ranges from 8 to 32, and the fonts include Arial, AvantGarde, Courier, Helvetica, Lucida Bright, Lucida Fax, Lucida Sans, Lucida Sans Typewriter, New Century Schoolbook, Palatino, Times, and Times New Roman. The training data is comprised of 1,128 (or 94 x 12) character images; each member of the complete character set for each font appears exactly once in the training set. It is important to note that all training character images are of 16 point size, even though the network is trained to perform recognition on multi-size characters. An explanation of this property is provided in the next section. The testing data consists of 347,712 characters, or 28,976 characters for each font, and has an even mixture of 8, 12, 16, and 32 point sizes.

D. Data Preprocessing and Normalization

Before it is fed to a neural network, the digitized image is preprocessed and normalized. The preprocessing and normalization procedure serves several purposes; it reduces noise sensitivity and makes the system invariant to point size, image contrast, and position displacement.

The reduction of background noise sensitivity is achieved by thresholding. The input image is filtered by zeroing those pixels whose values are less than 20% of the peak pixel value, while the remaining pixels are unchanged. The threshold setting is heuristic and has been empirically shown to work well for white paper. The threshold setting should be adjusted accordingly when a different paper product is used, e.g., newspaper.

Following thresholding, the resultant image is centered by positioning the centroid of the image at the center of a fixed-size frame. The centroid $(\bar{x}, \bar{y})$ of an image is defined as follows:

$$\bar{x} = \frac{\sum_x \sum_y x \, I(x, y)}{\sum_x \sum_y I(x, y)}, \qquad \bar{y} = \frac{\sum_x \sum_y y \, I(x, y)}{\sum_x \sum_y I(x, y)} \qquad (4)$$

where $I(x, y)$ denotes the pixel value at position $(x, y)$.
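The thresholding and centering steps might look as follows in NumPy. The 20% threshold follows the text; the frame orientation (60 rows by 50 columns for the single-font case) is an assumption.

```python
import numpy as np

def threshold(img, frac=0.20):
    # Zero out pixels below 20% of the peak value; leave the rest unchanged.
    out = img.astype(float)
    out[out < frac * out.max()] = 0.0
    return out

def center_in_frame(img, frame_h=60, frame_w=50):
    # Position the image centroid at the center of a fixed-size frame.
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    total = img.sum()
    cy = (ys * img).sum() / total          # centroid row    (equation (4))
    cx = (xs * img).sum() / total          # centroid column
    frame = np.zeros((frame_h, frame_w))
    oy = int(round(frame_h / 2 - cy))      # offsets that move the centroid
    ox = int(round(frame_w / 2 - cx))      # to the frame center
    for y in range(img.shape[0]):          # simple bounds-checked copy
        for x in range(img.shape[1]):
            ty, tx = y + oy, x + ox
            if 0 <= ty < frame_h and 0 <= tx < frame_w:
                frame[ty, tx] = img[y, x]
    return frame
```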

For the 12 point Courier font case, a frame size of 50-by-60 pixels has been found to be adequate in enclosing all character images. For the multi-size and multi-font case, a frame size of 50-by-50 pixels is employed, and an additional scaling process must be applied to the images. The scaling entails initially computing the radial moment $M_r$:

$$M_r = \frac{\sum_x \sum_y \sqrt{(x - \bar{x})^2 + (y - \bar{y})^2} \; I(x, y)}{\sum_x \sum_y I(x, y)} \qquad (5)$$

Next, the image is enlarged or reduced with a gain factor of $10 / M_r$, producing images of constant radial moments.



Fig. 2. Learning curve of the single-size and single-font neural network (m.s.e. vs. training iterations, log scale).

The value of this radial moment is linked to the selected frame size of 50-by-50. From a broader perspective, this scaling process is equivalent to a point-size normalization procedure and enables a neural network to treat all character images the same way regardless of the point size. An illustration of the thresholding, scaling, and centering operations is provided in Fig. 1.

The next step in the preprocessing and normalization procedure is to convert the two-dimensional images into vectors. The conversion is achieved by concatenating the rows of the two-dimensional pixel array. It follows that the vector for the single-size and single-font case has 3,000 elements and that the vector for the multi-size and multi-font case has 2,500 elements. Additionally, each vector is normalized to unit power:

$$v \leftarrow \frac{v}{\sqrt{\sum_i v_i^2}} \qquad (6)$$
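Continuing the preprocessing sketch, the scaling and vectorization steps could be written as below. SciPy's zoom stands in for whatever resampling method the authors used, and the target radial moment of 10 follows from the gain factor $10 / M_r$ in the text.

```python
import numpy as np
from scipy.ndimage import zoom

def radial_moment(img):
    # Intensity-weighted mean distance from the centroid, per equation (5)
    # as reconstructed above.
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    total = img.sum()
    cy, cx = (ys * img).sum() / total, (xs * img).sum() / total
    r = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    return (r * img).sum() / total

def scale_to_constant_moment(img, target=10.0):
    # Enlarge or reduce by the gain factor 10 / M_r so all images end up with
    # the same radial moment, i.e., point-size normalization.
    return zoom(img, target / radial_moment(img))

def vectorize(frame):
    # Concatenate rows, then normalize to unit power (equation (6)).
    v = frame.ravel()
    return v / np.linalg.norm(v)
```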

The normalization reduces sensitivity to varying scanner gains (image-to-background contrast) as well as different toner darkness (shades of ink). This unit-norm vector is then fed into the neural network.

During training, there is an additional step performed on the input data: centroid dithering. The centroid dithering process applies to both the single-size and single-font case as well as the multi-size and multi-font case. The process involves dithering the centroid of the two-dimensional input image. After centering and scaling, the input image is displaced randomly and independently in both the horizontal and vertical directions over the range of [-2, +2] pixels in each dimension; the image is shifted at random into one of twenty-five possible displacement positions. The resultant image is then converted into a vector, normalized, and fed into the network as previously described.

Centroid dithering effectively creates many "different" images from a single image. The neural network is exposed to the same character at different displacement positions, making the recognition system invariant to input displacements. It is important to emphasize that the dithering is performed exclusively during training and not during testing. The approach also enables the network to tolerate width variations in character strokes which might be caused by different printer settings, toner levels, and variations in font implementation. This is particularly useful when bold face characters are encountered.
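A minimal sketch of centroid dithering, assuming the centered frame from the earlier snippets: each training presentation picks one of the twenty-five displacement positions at random, with zero-padding at the borders.

```python
import numpy as np

rng = np.random.default_rng()

def shift(frame, dy, dx):
    # Displace the image by (dy, dx) pixels, zero-padding at the borders.
    out = np.zeros_like(frame)
    h, w = frame.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        frame[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def dither(frame):
    # Training only: pick one of the 25 positions in [-2,+2] x [-2,+2].
    dy, dx = rng.integers(-2, 3, size=2)
    return shift(frame, int(dy), int(dx))
```

During training one would call dither(frame) before vectorizing; at test time the frame is vectorized directly.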

E. Training

The training procedure is executed using an initial learning-rate parameter of $\mu = 10$ for both the single-size and single-font neural network and the multi-size and multi-font neural network. The learning progress is monitored by computing the mean squared error (m.s.e.) for each output neuron:

$$\text{m.s.e.} = \frac{C}{[\text{no. of shift positions}] \cdot [\text{no. of training patterns}] \cdot [\text{no. of fonts}] \cdot [\text{no. of output neurons}]} \qquad (8)$$

where $C$ is the cost function as defined in equation (2).

With small random initial weights, each output neuron generates an approximate error of $\pm 0.5$, and the resultant mean squared error is 0.25. The m.s.e. values as a function of training iterations are plotted in Fig. 2 for the single-size and single-font neural network. The single-size and single-font neural network is trained with 430,000 iterations, and the multi-size and multi-font neural network is trained with 8,650,000 iterations; in both cases the final m.s.e. falls several orders of magnitude below the initial value of 0.25, as the learning curve in Fig. 2 illustrates for the single-font network.

F. Testing

The testing process entails holding the weights of the neural network constant and evaluating the network with testing data. Each testing pattern is fed to the network, and the squared error for the output layer is computed. This process is repeated for all of the testing patterns, and the cumulative squared error provides a measure of the neural network's performance. More discussion may be found in [9], [10], [11].

Fig. 3 graphically summarizes two types of error indicators for the single-size and single-font neural network: (a) the worst-case squared error over all 94 output neurons and all 25 offset positions for each input character, and (b) the worst-case squared error over all 94 output neurons for each input character without any offsets. Fig. 4 pertains to the multi-size and multi-font neural network, and the worst-case squared errors are taken over all fonts with a fixed point size.
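The worst-case indicators of Figs. 3 and 4 can be computed with a loop like the following, reusing the hypothetical shift helper from the dithering sketch and any forward function mapping a frame to the 94 network outputs.

```python
import numpy as np
from itertools import product

def worst_case_squared_error(frames, labels, forward):
    """Worst-case squared error per character, with and without offsets.

    frames  : centered character frames
    labels  : index of the correct output neuron for each frame
    forward : function mapping a frame to the 94 network outputs
    """
    worst_with_offsets, worst_no_offset = [], []
    for frame, label in zip(frames, labels):
        d = np.zeros(94)
        d[label] = 1.0                                   # desired delta output
        errs = []
        for dy, dx in product(range(-2, 3), repeat=2):   # the 25 positions
            y = forward(shift(frame, dy, dx))
            errs.append(np.max((d - y) ** 2))
        worst_with_offsets.append(max(errs))
        y = forward(frame)                               # zero displacement
        worst_no_offset.append(np.max((d - y) ** 2))
    return worst_with_offsets, worst_no_offset
```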


Fig. 3. Worst case squared errors for the single-size and single-font neural network: worst case over all displacement positions, and zero displacement only.

Fig. 4. Worst case squared errors for the multi-size and multi-font neural network: worst case over all displacement positions, and zero displacement only.

It is evident from the figures that the weights obtained from the training process have achieved extremely close approximations of the desired mappings.

G. Postprocessing

An important task of postprocessing pertains to the detection of invalid character inputs. More specifically, the detection is accomplished by observing the occurrence of small responses on all output neurons. This is an intrinsic property of a trained neural network and is very useful in discounting bad images which might result from segmentation errors or other defects. This information may be used as an error flag which provides feedback to the segmentation algorithm.

The second function of postprocessing involves recovering information lost in scaling and centering multi-size and multi-font character images, and is used for the multi-size and multi-font system only. The characters c, C, k, K, o, O, p, P, s, S, v, V, w, W, x, X, z, and Z of certain fonts lose their case information after scaling and are therefore recognized by the neural network without an affirmative upper/lower case identification. This case information, however, can easily be reconstructed by a context-based approach. The technique resorts to examining the radial moments of the original character


images prior to scaling, and is best explained by an example. Without loss of generality, it is assumed that the neural network identifies an image as the character "c". The first step is to deduce the point size of this "c" by computing the gain $10 / M_r$, where $M_r$ is the radial moment of a neighboring character that is case distinguishable. The next step is to calculate the radial moments of a fabricated upper case "C" and a fabricated lower case "c" of this point size. The case information is then obtained by comparing the radial moment of the input character "c" with those of the fabricated ones.

Commas and single quotes of certain fonts also become indistinguishable after centering and scaling. The discrimination between these two characters is made by comparing the centroid location of the input character image before preprocessing to the height of the line. Finally, the numeral zero "0" cannot be reliably distinguished from the letter "O" in some fonts, and similarly the numeral one "1", lower case L "l", upper case I "I", and vertical bar "|" are ambiguous for some fonts. Under these circumstances the characters are left as they are without any postprocessing.
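The case-recovery rule might be sketched as follows. Here deduce_point_size, render_glyph, and radial_moment are assumed helpers (the last as in the scaling sketch), and the nearest-moment comparison is the rule described above.

```python
def recover_case(input_moment, char, deduce_point_size, render_glyph, radial_moment):
    """Sketch of the context-based case recovery described above.

    input_moment      : radial moment of the input glyph before scaling
    deduce_point_size : assumed helper using a neighboring, case-
                        distinguishable character's gain 10 / M_r
    render_glyph      : assumed helper fabricating a glyph image of `char`
                        at a given point size
    """
    pt = deduce_point_size()
    m_upper = radial_moment(render_glyph(char.upper(), pt))
    m_lower = radial_moment(render_glyph(char.lower(), pt))
    # The input's radial moment is compared with the two fabricated ones;
    # the nearer moment decides the case.
    if abs(input_moment - m_upper) <= abs(input_moment - m_lower):
        return char.upper()
    return char.lower()
```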

III. RECOGNITION PERFORMANCE

A. Results

In order to determine the recognition performance, a computer program is used to compare the output of the recognition system with the ASCII files which were used to generate the testing data. The computer program examines each and every output character, and all discrepancies excluding spaces, tabs, and carriage returns are recorded. The discrepancies are individually examined and classified as either an erroneous or a correct recognition.

There are two situations where discrepancies are classified as correct recognition. The first case involves image corruption which renders the invalid images unrecognizable. Fig. 5 provides examples of invalid inputs due to segmentation error, scanning error, and paper residue. As explained in Section II.G, the neural network automatically generates small responses on all the output neurons to indicate bad inputs. Such an occurrence is detected during postprocessing, and the discrepancy is not counted as a recognition error. The second case involves characters that are indistinguishable. These include the numeral "0", letter "O", lower case L "l", upper case I "I", numeral "1", and the vertical bar "|" in some fonts. The ambiguity arises from the fact that one character from one font looks identical to a different character of another font. Hence, the neural network may output any one of the ambiguous characters when the input is ambiguous, and it is not counted as an error. Any other discrepancies which do not fall into one of these two categories are counted as recognition errors.

The character-ambiguity problem does not apply to the single-size and single-font experiment, since all characters of the Courier font are distinguishable. All discrepancies except those of corrupted images are treated as errors. The neural network is required to recognize all 94 characters, including the difficult distinction between the lower case L ("l") and the numeral one ("1"). The single-size and single-font neural network is tested with 1,072,452 characters of 12 point Courier font, and a perfect recognition accuracy has been achieved. This recognition performance exceeds any previously known results by at least an order of magnitude.

The multi-size and multi-font neural network was tested with 347,712 characters, or 28,976 characters for each font, with an even mixture of 8, 12, 16, and 32 point sizes (see Section II.C). Using the performance criteria as previously described, the multi-size and multi-font neural network has achieved a perfect recognition accuracy. The same network was also tested with the data used for the single-size and single-font neural network. There was one recognition error among the 1,072,452 testing characters of 12 point Courier font. The error is documented in Fig. 6.
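The comparison program reduces to a character-by-character scan once output and ground truth are aligned. The sketch below assumes pre-aligned strings; real segmentation errors would require an alignment step such as edit distance.

```python
def count_discrepancies(recognized, truth):
    # Compare recognized text with the ground-truth ASCII file, ignoring
    # spaces, tabs, and carriage returns as described above.
    ignore = {" ", "\t", "\r", "\n"}
    pairs = [(r, t) for r, t in zip(recognized, truth) if t not in ignore]
    return [(i, r, t) for i, (r, t) in enumerate(pairs) if r != t]
```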

Fig. 5. Neural responses to invalid images as a result of (a) segmentation error, (b) scanning error, and (c) paper residue.

Fig. 6. The single recognition error on the 12 point Courier test data: the input image and the character it was recognized as.

B. Analysis

The question arises: if $n$ independent trials of an experiment have resulted in success, what is the probability that the next trial will result in success? In this context, we employ Laplace's Special Rule of Succession [7], which yields an estimate of the probability of success:

$$p = \frac{n + 1}{n + 2} \qquad (9)$$

For the single-size and single-font neural network, we obtain


$$p = \frac{1{,}072{,}453}{1{,}072{,}454} = 99.99991\% \,,$$

and for the multi-size and multi-font neural network,

$$p = \frac{347{,}713}{347{,}714} = 99.99971\% \,.$$
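As a quick check of these figures, the rule of succession is one line of arithmetic:

```python
def laplace_success(n):
    # Laplace's rule of succession after n successful trials (equation (9)).
    return (n + 1) / (n + 2)

print(f"{laplace_success(1_072_452):.5%}")  # 99.99991% (single-size, single-font)
print(f"{laplace_success(347_712):.5%}")    # 99.99971% (multi-size, multi-font)
```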

In order to avoid bias towards any one particular font, the $n$ used in computing the latter probability of error is based on the multi-size and multi-font database and does not include the million Courier characters.

Alternately, we introduce the following statistical analysis in order to quantify a lower bound for the recognition accuracy on future testing data. Given a testing image corresponding to the $p$th character, where $p \in \{1, 2, \ldots, 94\}$, we define two random variables:

$$A_p = y_{pp} \qquad (10)$$

$$B_p = \max_{i \neq p} y_{ip} \qquad (11)$$

As in equation (2), $y_{ip}$ represents the output of the $i$th neuron of the output layer in response to the $p$th character. While random variable $A_p$ is always linked to the $p$th neuron, $B_p$ assumes the value of the highest output among the remaining neurons, whichever it may be. The correct recognition of a character requires that $A_p > B_p$. The conditional probability of error given an input image of the $p$th character is derived below:

$$\mathrm{Prob}(\mathrm{error} \mid p) = \mathrm{Prob}(A_p \leq B_p)$$

Sample statistics of $A_p - B_p$ for the most frequent characters are tabulated below:

Character | Samples | $E(A_p - B_p)$ | $\mathrm{Var}(A_p - B_p)$ (Chebyshev upper bound)
e | 113,060 | 0.99 | 1.7 · 10^-…
t | 80,273 | 0.99 | 5.8 · 10^-…
a | 72,565 | 0.99 | 2.9 · 10^-…
o | 70,423 | 0.98 | 5.6 · 10^-…
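The Chebyshev column of the table can be reproduced from sample statistics of $A_p - B_p$. The sketch below assumes the two-sided Chebyshev form $\mathrm{Prob}(D \leq 0) \leq \mathrm{Var}(D) / E[D]^2$, which is one plausible reading of the bound used.

```python
import numpy as np

def chebyshev_error_bound(a, b):
    """Upper-bound Prob(A_p <= B_p) from samples of A_p and B_p.

    Applies the Chebyshev inequality to D = A_p - B_p:
        Prob(D <= 0) <= Prob(|D - E[D]| >= E[D]) <= Var(D) / E[D]^2
    (valid when E[D] > 0; the one-sided Cantelli form would be tighter).
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    mu, var = d.mean(), d.var(ddof=1)
    return var / mu**2
```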
