DevanagiriOCR on CELL BROADBAND ENGINE

20
Implementation of Devanagari OCR on Cell Broadband Engine Team Details Name Roll Number Email Id Contact No Amulya K (Team Leader) MT2012017 [email protected] 9880345463 Astha Goyal MT2012028 [email protected] 9740126090 Kodamasimham Pridhvi MT2012066 [email protected] 8123160887 Mehak Dhiman MT2012080 [email protected] 9535942619 Technical Report IIITB-OS-2012-10b April 2013

description

 

Transcript of DevanagiriOCR on CELL BROADBAND ENGINE

Page 1: DevanagiriOCR on CELL BROADBAND ENGINE

Implementation of DevanagariOCR on Cell Broadband Engine

Team Details

Name Roll Number Email Id Contact NoAmulya K (Team Leader) MT2012017 [email protected] 9880345463

Astha Goyal MT2012028 [email protected] 9740126090Kodamasimham Pridhvi MT2012066 [email protected] 8123160887

Mehak Dhiman MT2012080 [email protected] 9535942619

Technical Report IIITB-OS-2012-10b

April 2013

Page 2: DevanagiriOCR on CELL BROADBAND ENGINE

Abstract

Devanagari OCR is an Optical Character Recognition systemwhich takes scanned images of Devanagari text as input and convertsthem to corresponding editable text files. The major modules inthe system are segmentation of the image, feature extraction ofthe segmented image, recognition using these features and thenmapping the segmented image to its corresponding unicode format, toobtain an editable document. In this project, an open source systemdeveloped under GNU public license for English OCR, called GOCRsystem has been used. It has been modified GOCR system [1] suchthat it works for Devanagari script. The code now converts Devana-gari documents in the form of images, to editable Unicode text format.

Processing the images and converting it to the corresponding Uni-code format takes a lot of processing time if implemented on a singleprocessor. This is where Cell Broadband Engine (CBE) comes intopicture. It is a multicore processor having nine microprocessors onwhich data can be processed parallely which in turn reduces the over-all processing time as compared to a single processor system where theimages are processed serially. This has been utilized in this projectto process the images faster. In a document containing several im-ages, each image is assigned to a different processor, which pass theimage through OCR system parallely. Thus the existing OCR systembecomes more efficient by reducing the total processing time.

1

Page 3: DevanagiriOCR on CELL BROADBAND ENGINE

Acknowledgements

We would like to express our thanks to Prof. Shrisha Rao for the supportand guidance given to us by him in completing this project .We would like to extend our thanks to Mr. Jorg Schulenburg. He is theauthor of the open source English OCR code named ‘GOCR’, which wehave used and modified to suit Devanagari.We would also like to thank the authors of the various papers we haverefered.

2

Page 4: DevanagiriOCR on CELL BROADBAND ENGINE

List of Figures

1 CBE Architecture . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Architecture for implementation of OCR on CBE . . . . . . . 12

3 Plot of performance of Devanagari OCR on single core vsmulti core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Plot of no. of images vs time taken in serial mode againstparallel mode . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3

Page 5: DevanagiriOCR on CELL BROADBAND ENGINE

List of Tables

1 Comaprison of performance of Devanagari OCR on single coreagainst multiple cores of the CBE . . . . . . . . . . . . . . . . 15

2 Comparison of no. of images vs time taken in serial modeagainst parallel mode . . . . . . . . . . . . . . . . . . . . . . . 16

4

Page 6: DevanagiriOCR on CELL BROADBAND ENGINE

Contents

List of Figures 3

List of Tables 4

1 Introduction 6

2 Problem Statement 6

3 Gap Analysis 7

4 Literature Survey 8

5 Architecture 8

5.1 Optical Character Recognition System (OCR) . . . . . . . . . 8

5.2 Cell Broadband Engine (CBE) . . . . . . . . . . . . . . . . . 10

5.3 Overall system Architecture: Implementation of OCR on CBE 12

6 Implementation details 13

7 Results 14

7.1 Recognition rate . . . . . . . . . . . . . . . . . . . . . . . . . 14

7.2 Performance of the OCR system on CBE . . . . . . . . . . . 14

8 Conclusion and future work 17

9 References 18

5

Page 7: DevanagiriOCR on CELL BROADBAND ENGINE

1 Introduction

Devanagari is an age old script used for writing many languages likeSanskrit, Hindi, Marathi, Punjabi, Konkani, Thula etc. This has generateda lot of interest in developing new methods of processing documents inDevanagari. Optical Character Recognition (OCR) is an indispensible partof Document Image Analysis.

OCR is basically the process of converting printed or hand-written textinto machine-encoded text by recognizing the characters in the documentimage and converting them to a Unicode format. The output obtainedis a parsable document which a machine can interpret and perform someoperations or execute certain commands based on the text interpreted.The general applications of this can include helping the blind “read” byconverting the text in Unicode format to speech, developing intelligent carswhich are able to interpret the signs on the road and navigate on thierown without the intervention of a human being, intelligent robotics, amongmany others.

The Cell Broadband Engine Architecture (CBEA) and the Cell BroadbandEngine are the result of a joint venture by Sony, Toshiba, and IBM (STI).The Cell Broadband Engine was initially intended for application in gameconsoles and media-rich consumer-electronics devices. Implementationof an OCR on a CBE comes with the advantage of fast processing andrecognition, owing to the immense parallel processing ability the CBEbrings with it.

2 Problem Statement

The project is aimed at processing the image of a document to producemachine-encoded text (in Unicode format) as the output. This will beimplemented on the Cell processor, for printed text and possibly extendedfor hand-written text. The challenges which make the problem interestingare, the clarity of the document image provided as the input, the lightingconditions under which the image has been taken, possible degrading ofthe paper quality due to age and skew in the scanned documents. Also, asignificant difference between English and Devanagari for instance, lies in

6

Page 8: DevanagiriOCR on CELL BROADBAND ENGINE

the fact that alphabet in English language does not have anything above orbelow the line, which is not the case in Devanagari. This project caters toall these challenges.

3 Gap Analysis

The following points are noteworthy:

• The high performance provided by the CBE can be exploited for largedata applications. OCR is one such application, which has a largedataset of characters and sometimes even words, which need to bepre-processed and stored in the database, after feature extraction. Theparallelization provided by CBE will make the time taken taken forthis process negligible. OCR has not been implemented on CBE be-fore, and the fact that there is no literature about OCR on CBE is atestimony to the same.

• Devanagari is a language used widely across a large geographical area.Yet, it is yet to receive the amount of importance that really needs tobe attached with it. Whatever research that has happened on Devana-gari OCR has happened sparsely and it can be observed that there hasbeen no co-ordination among the individual research groups. Hence itis difficult to clearly pin point the exact state-of-the-art in this field.Also, the number of IEEE papers published in this field in the years2011 and 2012 put together, is only three. This clearly shows that thisarea needs to be addressed urgently.

• A lot of research on OCR in Devanagari has focused on the FeatureExtraction techniques and very few papers have been published onSegmentation techniques. Literature survey has shown that most ofthe OCR implementations suffer due to errors in the segmentationstage. This usually happens because of improper segmentation due tocharacters from one line getting joined with the characters on the linebelow, and so on. This area needs to be addressed in this project.

7

Page 9: DevanagiriOCR on CELL BROADBAND ENGINE

4 Literature Survey

Literature in OCR can be classified into two - that in segmentation andthat in recognition. The most widely used method for segmentation is thatof horizontal and vertical projection profiles[3] of an image. Kompalli et.al. [4] have proposed a recognition driven framework along with stochasticmodels, where a graph representation is used to segment characters. Thispaper also talks about the usage of linguistics and language models. Asfor recognition, Neural Networks (NN) [5] [6] have been most widely used.Singh et. al. [5] have proposed an OCR system based on artificial NN, wherethe features include mean distance, histogram of projection based on spatialposition of pixels and histogram of projection based on pixel value. [7] talksabout topographic feature extraction, where the topographic features are,closed region, convexity of strokes and straight line strokes. These featuresare represented as a shape based graph which acts as an invariant featureset for discriminating very similar type characters efficiently. [8] proposes abilingual recognizer based on PCA, followed by support vector classificationfor a Hindi-Telugu OCR. A good survey paper can be found in [9].

5 Architecture

Figure 2 shows the architecture of for the implementation of the OCRsystem on CBE. This section will be in three parts:

• The first part will describe the exact steps involved and algorithmsused in the OCR

• The second part will talk about the CBE architecture

• The third part will show how the OCR system described abovewill be implemented on CBE

5.1 Optical Character Recognition System (OCR)

An OCR system involves the following four main stages:

8

Page 10: DevanagiriOCR on CELL BROADBAND ENGINE

1. Pre-processing: Pre-processing involves threshold value detection,the process of converting the input image into a binary image. It alsoincludes filtering of the image in which we have modules to increasecontrast, clean the dust and remove coffee mug stains. Removal ofserifs ( horizontal lines glued on the lower end of the vertical lines)and making a barely connected character as connected are the otherpre processing steps.

Algorithm:

• Binarization of an image is done by taking a threshold value andconverting all intensity values above the threshold to 1 and allintensity values below the threshold to 0. This produces a ‘blackand white’ image, which is nothing but the binarized image

• Blind pixels i.e dust is removed by using fill algorithm• Serifs are removed by classifying horizontal and vertical pixels by

comparing the distance between the next vertical and next hori-zontal white pixels and then measuring mean thickness of verticaland horizontal clusters and erasing unusually thin horizontal pix-els at the bottom line

• A 2x2 filter sets pixels to 0 near barely connected pixels hencemaking them connected

2. Segmentation: Segmentation is the process of splitting up thepre-processed image, first into individual lines, then into each word ina particular line, and finally into individual characters in each word.This step is required as these characters need to be recognized in oneof the following steps in order to print it in Unicode format.

Algorithm: Horizontal Projection Profile (HPP) and VerticalProjection Profile (VPP) are the two most widely used algorithmsfor segmentation of a document image. After an image is binarized,each row of the image is summed up to get its HPP. Once the HPPis obtained, there will be a dip in the HPP at places where the imagehas white spaces between two lines. This will be taken as the referenceto segment lines. Each line is then taken and a VPP is taken on thatline to get inter-word spaces (and hence a dip in the VPP), and henceword segmentation will be done. For individual characters of a word,a connected component analysis will be done to separate individual

9

Page 11: DevanagiriOCR on CELL BROADBAND ENGINE

characters, where every connected entity (each character in this case)will be labelled as a separate object. Every part of the word below theline will be output as a separate entity and will be sent to a differentclassifier.

3. Feature Extraction: This is the process of extracting specificcharacteristics of an image (in this case, a character), which arereferred to as it’s “features”. These features will be used to separateone character from the other during the recognition stage.

Algorithm: Features which are extracted for recognition are shapebased image invariants and other essential features of a character.Other essential features include list of longest lines and bows. It alsoincludes the no. of points and lines connected between these points tomake a character. Also the relative position of these points are takeninto account.

4. Recognition: Training samples are processed initially and a databaseof their features is created and stored. Recognition is the processwhere each character will be labelled. This is done by comparingthe same features of each ‘test’ image, against the features of every‘trained’ image in the database. The class of an image with themaximum match will be reported as the match class.

Algorithm: Euclidean distance will be used for matching. Supportvector machine (SVM) will act as classifier.

5.2 Cell Broadband Engine (CBE)

The IBM Cell Broadband Engine has a multi-core architecture designed forhigh performance computing and distributed multi-core computing. Thearchitecture features nine microprocessors on single chip. The multi corechip capable of massive floating-point processing is optimized for computingintensive workloads, rich broadband media applications, signal and imageprocessing.

10

Page 12: DevanagiriOCR on CELL BROADBAND ENGINE

Figure 1: CBE Architecture

Figure 1 shows the architecture of the Cell Broadband Engine. The Cellprocessor can be split into four components:

• External input and output structures

• Power Processing Element (PPE)

• Synergistic Processing Elements (SPE)

• Element Interconnect Bus (EIB)

The Power Processing Element (PPE) which is the main processor,controls and coordinates all remaining 8 co-processing Synergetic Pro-cessing Elements (SPEs). SPEs are designed to accelerate media andstreaming workloads. Area and power efficiency are important enablersfor multi-core designs that take advantage of parallelism in applications.It is on these SPEs that the OCR will run and will be controlled by the PPE.

11

Page 13: DevanagiriOCR on CELL BROADBAND ENGINE

5.3 Overall system Architecture: Implementation of OCRon CBE

Figure 2: Architecture for implementation of OCR on CBE

Figure 2 shows the architecture for the implementation of the OCR systemon CBE.

1. Training and testing are the most important steps in an OCR system.Training is a one time process, where character samples of the Devana-gari script are taken and their features (obtained through a combina-tion of PCA and image-invariant methods) are extracted. This stepcan be performed on the CBE or the on the local Ubuntu machineitself, as it is a one time process. The features so obtained are storedin a feature database.

2. During testing, an input image (document image) is given as the inputto the CBE. The PPE instructs one of the SPEs to carry out thesegmentation process. Since there are eight SPEs, mutiple images can

12

Page 14: DevanagiriOCR on CELL BROADBAND ENGINE

be segmented during the time usually taken to segment one image.Segmentation is done by the CBE and that is the intermediate output.

3. The PPE gets the segmented words and instructs the SPE to performfeature extraction. The extracted features will be the output of thisstep. The PPE stores this in its memory temporarily. This is requiredas multiple images will be processed at a time.

4. The PPE now fetches the features of an individual image and alsothe features stored in memory. Comparison of the test features withthe trained features will happen across multiple SPEs. The matchvalues obtained by all the SPEs is compared by the PPE and the classmatched with the least value (in case of Euclidean distance) is reportedas the match class.

5. The output given by the CBE will be the matched class. This will beused to write the corresponding character in Unicode format using theDevanagari fonts present in the system.

6 Implementation details

Performance comparison has been done with respect to two aspects:

• For different number of lines of input of a single document, thetime taken to produce the OCR output using a single core has beencompared against the time taken to produce the OCR output usingmultiple cores of the CBE. The number of lines has been varied from5 to 70 in increments of 5, and the time taken to run it using a singlecore has been compared against the time taken to run it on multipleSPUs.

• For different number of input documents, the time taken to pro-duce the OCR output using a single core has been compared againstthe time taken to produce the OCR output using multiple cores ofthe CBE. The number of input documents given simultaneously to asystem has been varied from 1 to 6 (so that each image is processedon one core of the SPU - there are 6 SPUs in the Sony PS3 on whichthe system has been tested), and the time taken for processing thosemany documents using a single core has been compared with the timetaken to process all of them simultaneously using multiple SPUs.

13

Page 15: DevanagiriOCR on CELL BROADBAND ENGINE

In order to obtain a fair comparison of the time obtained, compari-son has to be done using systems with the same RAM capacity andprocessor speeds. Hence, when reporting time taken to run on a sin-gle core, only the PPU of the Cell processor has been used to run theOCR system. This represents serialization, as there is only 1 PPU andgiven a number of inputs, they have to be processed one by one. Whenreporting the time taken to run on multiple cores, the PPU along with6 SPUs of the Sony PS3 system has been used. Here, the PPU takesin the input and creates multiple threads for the SPUs. At a time, 6input documents can be processed simultaneously on the 6 SPUs, andthe time taken for such parallel execution has been reported.

7 Results

7.1 Recognition rate

The recognition rate obtained for the GOCR system modified forDevanagari is 87%. The recognition rate has been calculated as the ratioof properly recognised words to the total number of words, and has beencalculated over 50 documents.

7.2 Performance of the OCR system on CBE

Dependency of time taken, on the no. of lines of input

Table 1 shows the time taken to produce the OCR output using a singlecore as against the time taken to produce the OCR output using multiplecores of the CBE, for different no. of lines of input.

14

Page 16: DevanagiriOCR on CELL BROADBAND ENGINE

Table 1: Comaprison of performance of Devanagari OCR on single coreagainst multiple cores of the CBE

No. Of Lines Single Processor (in sec) Multi Processor(in sec)5 81 7410 182 17315 277 25120 365 33325 478 43530 560 51035 694 63040 1325 120045 1480 134050 1657 150060 1950 164070 2334 2130

Figure 3: Plot of performance of Devanagari OCR on single core vs multicore

15

Page 17: DevanagiriOCR on CELL BROADBAND ENGINE

Figure 3 shows the plot of time taken on single core and multiple coresfor different no. of lines of input. It can be seen that as the number oflines increase, the performance of the multi core CBE becomes better thanthat obtained from a single core. That is, the CBE takes much lesser timecompared to a single core processor, when the size of the input increases.

Dependency of time taken, on the no. of input documents run serially vs parallely

Table 2 shows the time taken to produce the OCR output on a single coreas against the time taken on multiple cores of the CBE, for different no. ofinput documents.

Table 2: Comparison of no. of images vs time taken in serial mode againstparallel mode

No. Of Images Serial (in sec) Parallel (in sec)1 96 952 191 1943 287 2664 382 3685 478 4356 575 523

16

Page 18: DevanagiriOCR on CELL BROADBAND ENGINE

Figure 4: Plot of no. of images vs time taken in serial mode against parallelmode

Figure 4 shows the plot of time taken on single core and multiple cores fordifferent no. of input documents. As the number of documents that are runparallely increases, the performance of the CBE system is better than thatobtained on a single processor, in terms of the time taken.

8 Conclusion and future work

Devanagari OCR has been successfully implemented on a CBE. Since theOCR is implemented in multicore architecture, the efficiency is better thanthe single core architecture.It was observed that for large input sizes CBE gave better performance ascompared to the smaller ones.

In the current implementation, the OCR is able to recognize the scannedimages of the printed text. More algorithms can also be incorporated into the

17

Page 19: DevanagiriOCR on CELL BROADBAND ENGINE

OCR and the database can be trained accordingly to work for handwrittendocuments and for different languages. The algorithms can be optimizedmore to give better recognition rate.

9 References

[1] gocr.http://jocr.sourceforge.net - Schulenburg

[2] Michael Kistler, Michael Perrone, Fabrizio Petrini, “CELL Multipro-cessor Communication Network: Built for Speed”, IEEE Micro, May-June 2006.

[3] Manoj Kumar Shukla, Dr. Haider Banka, “An Efficient SegmentationScheme for the Recognition of Printed Devanagari Script”, IJCST,Vol. 2, Issue 4, Oct.- Dec. 2011.

[4] Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju, “De-vanagari OCR using a recognition driven segmentation framework andstochastic language models”, IJDAR (2009), Vol. 12, pp. 123-138, DOI10.1007/s10032-009-0086-8.

[5] Raghuraj Singh1, C. S. Yadav, Prabhat Verma, Vibhash Yadav, “Op-tical Character Recognition (OCR) for Printed Devnagari Script UsingArtificial Neural Network”, International Journal of Computer ScienceCommunication, Vol. 1, No. 1, January-June 2010, pp. 91-95.

[6] Anilkumar N. Holambe, Sushilkumar N. Holambe,Dr.Ravinder.C.Thool, “Comparative Study of Devanagari Hand-written and printed Character Numerals Recognition usingNearest-Neighbor Classifiers”, IEEE 2010, 978-1-4244-5540-9.

[7] Soumen Bag, Gaurav Harit, “Topographic Feature Extraction ForBengali and Hindi Character Images”, Signal Image Processing : AnInternational Journal (SIPIJ), Vol.2, No.2, June 2011

[8] C. V. Jawahar, M. N. S. S. K. Pavan Kumar, S. S. Ravi Kiran, “ABilingual OCR for Hindi-Telugu Documents and its Applications”,International Institute of Information Technology, Hyderabad, 2006.

[9] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal,“Offline Recognition of Devanagari Script: A Survey”, IEEE Trans-actions on Systems, Man, and Cybernetics-Part C: Applications andReviews, Vol. 41, No. 6, November 2011.

18

Page 20: DevanagiriOCR on CELL BROADBAND ENGINE

[10] U. Pal, T. Wakabayashi, and F. Kimura, “Comparative study of De-vanagari handwritten character recognition using different featuresand classifiers”, in Proc. 10th Conf. Document Anal. Recognit., 2009,pp. 1111-1115.

[11] S. Arora, D. Bhattacharjee, M. Nasipuri, D. K. Basu, M. Kundu,and L. Malik, “Study of different features on handwritten Devnagaricharacter”, in Proc. 2nd Emerging Trends Eng. Technol., 2009, pp.929-933.

19