INVESTIGATION INTO A SEGMENTATION BASED OCR FOR …Ms. Sobia Tariq Javed received a Bachelor of...

1

INVESTIGATION INTO A SEGMENTATION BASED OCR

FOR THE NASTALEEQ WRITING SYSTEM

MS Thesis

Submitted in Partial Fulfillment

Of the Requirements of the

Degree of

Master of Science (Computer Science)

AT

NATIONAL UNIVERSITY OF COMPUTER & EMERGING SCIENCES

LAHORE, PAKISTAN

DEPARTMENT OF COMPUTER SCIENCE

By

Sobia Tariq Javed

August 2007

2

Approved:

____________________

Head

(Department of Computer Science)

Approved by Committee Members:

Advisor

_________________________

Dr. Sarmad Hussain

Professor

FAST - National University

Other Members:

_________________________

Dr. Mehreen Saeed

Assistant Professor

FAST - National University

The undersigned hereby certify that they have read and recommend to the Faculty of Graduate

Studies for acceptance of a thesis entitled “Investigation into a Segmentation Based OCR for the

Nastaleeq Writing System" by Sobia Tariq Javed in partial fulfillment of the requirements for the

degree of Master of Science.

Dated: August 2007

3

To my parents

4

Vita

Ms. Sobia Tariq Javed received a Bachelor of Science degree in Computer Science from National

University of Computer and Emerging Science (NUCES), Lahore in 2005. Sobia Tariq Javed has been a

member of CRULP since 2004. She has been working as a Research Officer in different R&D works. The

research in this dissertation was carried out from 2006 to 2007.

5

Acknowledgements

First I would like to express my gratitude to Almighty Allah for the blessings He has bestowed on

me.

I would like to thank Dr. Sarmad Hussain, Associate Professor at NU-FAST for his generous

help, support and precious time that he gave me for the completion of this thesis. I would like to

thank him for his insight and supervision.

I would also like to thank Dr. Mehreen Saeed for encouraging me to complete this research.

My special thanks to Ameera Maqbool for helping me in understanding HTK tool kit for Hidden

Markov Model. She contributed a lot by providing all sorts of information regarding HTK that I

needed.

And last and the most important people my parents who made me whatever I am today. Their

prayers and continuous encouragement have led me to this point.

Sobia Tariq Javed

6

TABLE OF CONTENTS

1 OCR ............................................................................................................................ 9

1.1 The Need for OCR Systems................................................................................ 9

1.2 Generic Model of an OCR System ..................................................................... 9

1.2.1 Image Acquisition..................................................................................... 10

1.2.2 OCR Software........................................................................................... 10

1.2.2.1 Document Analysis............................................................................... 11

1.2.2.1.1 Skew Detection and Removal......................................................... 11

1.2.2.1.2 Binarization..................................................................................... 11

1.2.2.1.3 Filtering and Smoothing ................................................................. 12

1.2.2.1.4 Thinning.......................................................................................... 12

1.2.2.1.5 Normalization ................................................................................. 12

1.2.2.2 Character Recognition .......................................................................... 13

1.2.2.3 Contextual Processing........................................................................... 13

2 Literature Review ..................................................................................................... 13

2.1 Background and History ................................................................................... 13

2.2 Research Techniques for Character Recognition.............................................. 14

2.2.1 Segmentation ............................................................................................ 14

2.2.1.1 Segmentation Free Approach................................................................ 14

2.2.1.2 Segmentation Approach........................................................................ 15

2.2.1.2.1 Segmenting into Characters ............................................................ 16

2.2.1.2.2 Segmenting into Primitives............................................................. 16

2.2.2 Recognition............................................................................................... 18

2.2.2.1 Feature Extraction................................................................................. 18

2.2.2.1.1 Structural Features .......................................................................... 19

2.2.2.1.2 Statistical Features .......................................................................... 21

2.2.2.1.3 Global Transformations .................................................................. 21

2.2.2.2 Pattern Matching / Classification.......................................................... 22

2.2.2.2.1 Template Matching ......................................................................... 22

2.2.2.2.2 Statistical Classifier ........................................................................ 23

2.2.2.2.3 Clustering........................................................................................ 28

2.2.2.2.4 Nearest Neighbor ............................................................................ 28

2.2.2.2.5 Syntactical / Structural.................................................................... 29

2.2.2.2.6 Expert System ................................................................................. 30

3 Urdu Writing System ................................................................................................ 31

3.1 Character to Unicode and Glyph Mapping ....................................................... 35

3.2 Advance Script Properties ................................................................................ 36

3.2.1 Character Composition and Decomposition ............................................. 36

3.2.1.1 Different Shapes ................................................................................... 36

3.2.1.1.1 Isolated Form .................................................................................. 36

3.2.1.1.2 Initial Form ..................................................................................... 36

3.2.1.1.3 Final Form....................................................................................... 37

3.2.1.1.4 Medial Form ................................................................................... 37

3.2.1.2 Connection Form .................................................................................. 37

3.2.1.3 Diacritic Placement............................................................................... 38

7

3.2.1.4 Presence of Base Line........................................................................... 38

3.2.1.5 Overlapping .......................................................................................... 38

3.2.1.6 Upper and Lower Cases ........................................................................ 39

3.2.1.7 Ligature Formation ............................................................................... 39

3.2.1.8 Strokes .................................................................................................. 39

3.2.1.9 Script ..................................................................................................... 39

3.3 Nastaleeq Script ................................................................................................ 40

3.3.1 Properties of Nastaleeq Script................................................................... 41

3.4 Noori Nastaleeq ................................................................................................ 47

4 Problem Statement.................................................................................................... 48

4.1 Problem Scope .................................................................................................. 49

5 Methodology............................................................................................................. 50

5.1 Scan Image / Open Image................................................................................. 53

5.2 Separate Line of Text........................................................................................ 54

5.3 Set Base Line .................................................................................................... 55

5.4 Isolate Ligature and Diacritics.......................................................................... 55

5.5 Segmentation of Main body.............................................................................. 55

5.6 Framing............................................................................................................. 60

5.7 Hidden Markov Model...................................................................................... 60

5.7.1 HMM Training.......................................................................................... 61

5.7.2 HMM Recognition.................................................................................... 62

5.8 Recognizing the Ligature.................................................................................. 62

6 Results....................................................................................................................... 65

6.1 Purpose.............................................................................................................. 65

6.2 Sample .............................................................................................................. 66

6.3 Method .............................................................................................................. 66

6.4 Test Results....................................................................................................... 66

6.4.1 Character level Results ............................................................................. 66

6.4.1.1 Class of Alif .......................................................................................... 66

6.4.1.2 Class of Swad........................................................................................ 66

6.4.1.3 Class of Dal........................................................................................... 67

6.4.1.4 Class of Choti Yeh................................................................................ 67

6.4.1.5 Class of Ain .......................................................................................... 67

6.4.1.6 Class of Bay .......................................................................................... 67

6.4.1.7 Cumulative Results ............................................................................... 67

6.4.2 Ligature Level Results .............................................................................. 68

6.5 Examples........................................................................................................... 68

7 Discussion................................................................................................................. 70

7.1 Distortion in the Image ..................................................................................... 70

7.1.1 Problem Description ................................................................................. 70

7.1.2 Example .................................................................................................... 70

7.1.3 Solution..................................................................................................... 71

7.2 Similarity in Shape............................................................................................ 71


7.2.2 Examples................................................................................................... 71

7.2.3 Solution..................................................................................................... 72

8

7.3 Inconsistency in Font ........................................................................................ 72


7.3.2 Example .................................................................................................... 72

7.3.3 Solution..................................................................................................... 73

8 Conclusion ................................................................................................................ 73

9 Future Work and Enhancement ................................................................................ 74

Reference .......................................................................................................................... 75

Appendix........................................................................................................................... 80

Appendix A................................................................................................................... 80

Appendix B ................................................................................................................... 88

9

1 OCR

Optical character recognition, usually abbreviated to OCR, is computer software that is designed

to translate text images into machine-editable text document that can be opened in any word

processor or text editor. It helps to quickly digitize paper documents for further manipulations

without any manual effort. OCR began as a field of research in pattern recognition, artificial

intelligence and machine vision. [35].

1.1 The Need for OCR Systems

OCR (Optical Character Recognition) allows us to manipulate the printed data, using computer

with minimum effort and time. It converts the scanned image to text-based document, which can

be easily processed by a word processor or text editor.

Consider the benefits [51]:

� OCR saves space. Bulky files, which require a lot of space, can be easily replaced by

small sized digital documents. It saves acres of space that has been once given to paper

cabinets and boxes.

� OCR saves time. No need of time consuming retyping process. Not only the data

manipulation but also the information retrieval becomes very easy.

� OCR saves efforts. The text can be reused, edited or reformatted effortlessly.

� OCR saves worry. Digital backup copies of crucial documents can be maintained which

reduces the probability of error.

1.2 Generic Model of an OCR System

One of the main characteristics of Optical Character Recognition is to study automatic reading.

The most impressive and astonishing capability of human brain is to recognize patterns in nature.

Therefore, the aim of OCR is to emulate the human ability to read at a much faster rate by

associating symbolic identities with images of character.

The language independent text recognition process can be broadly classified into three main

categories, which are:

• Image Acquisition.

• OCR Software.

10

• Output Interface.

Figure 1 : Basic Model for OCR

1.2.1 Image Acquisition

In the very first phase of OCR we need to acquire an image, which needs to be converted.

Scanners are normally used to acquire an image. In some cases digital cameras are also used.

1.2.2 OCR Software

This is the main step of OCR where the actual translation of the text image to machine editable

text is performed. The characters from the preprocessed image are recognized and are converted

to text. This process comprises of three main operations.

11

1.2.2.1 Document Analysis

It is also called pre-processing process in which text is extracted from the document image. This

operation is very important for the better results.

Different steps involved in the Pre-Processing phase are as follows

1.2.2.1.1 Skew Detection and Removal

Skew is the distortion or tilt, by an angle, in the input image i.e. the angle of the baseline that is

not written horizontally. Sometimes it so happens that while scanning a page using a scanner, the

page is tilted to a certain angle. As a result whole text is tilted and the angle of the line makes it

inappropriate for OCR processing. So we may say that skew detection and removal plays a

significant role in output of OCR project [47].

(a) Skewed Image (b) Image after Skew Removal

Figure 2 : Skew Detection and Removal

1.2.2.1.2 Binarization

The binarization is a process, which converts a gray scale image to a black and white image. The

simplest way to do this is through thresholding, in which a histogram of the grey values of an

image is computed and the cut-off point (valley between the two peaks) is calculated. All the

pixels whose value is above the cut-off point are converted to one while all others to zero [47].

(a) Colored Image (b) Image after Binarization

Figure 3 : Binarization

12

1.2.2.1.3 Filtering and Smoothing

‘Filtering and Smoothing’ is basically done to remove noise and distortion from the image, which

may be produced in image acquisition. The filtering process is used to remove distortion and

noise produced at the time of scanning due to shot noise, dark current noise, thermal noise or

cross-coupling noise. Fine-textured noise is removed through smoothing [47].

(a) Distorted Image (b) Image after Filtering and Smoothing

Figure 4 : Filtering and Smoothing

1.2.2.1.4 Thinning

The thinning is a morphological process, which is an efficient way to express the structural

relationships in the character recognition as it removes the selected foreground pixels from the

image. It reduces the computational time and effort to traverse an image. But the technique is

very much sensitive to noise; a little disturbance can change the shape of an image.

(a) Normal Image (b) Skeleton zed Image

Figure 5 : Thinning

1.2.2.1.5 Normalization

Normalization is generally done to overcome size and orientation variation problem. Our purpose

is to correctly recognize the similar shapes with different sizes and orientations. The translation,

rotation and scaling is performed prior to the start of the actual OCR processing, to normalize the

text for size.

13

(a) Image before Normalization (b) Image after Normalization

Figure 6 : Normalization

All these steps are performed to enhance the image for better recognition.

1.2.2.2 Character Recognition

This is the main process where text from the image is read and translated into a form that the

computer can manipulate. Various techniques are applied to correctly recognize the text from the

image. The features are extracted from the images by the most common method of character

recognition, called “feature extraction” and then its shape is analyzed and its features are

compared against a set of rules that distinguishes each character. This analysis and comparison

identify the characters.

1.2.2.3 Contextual Processing

Results of recognition can be significantly improved by using the contextual information. Higher

level of information, which is not available to the classifier, is used to verify the accuracy of the

solutions returned by the classifier. One of the methods for post processing is to apply spell

checker on the classified output to correct the misspelled words due to wrong classification [47].

2 Literature Review

As Urdu (Noori Nastaleeq) is a cursive script, so most of the literature provided in this document

is either related to Urdu offline Handwritten OCR or Cursive Handwritten Latin OCR.

2.1 Background and History

The attempts to automate printed material, for further data manipulation, started prior to World

War II. The need of OCR was first realized in banks for automatic form checking. The modern

OCR technology started with the invention of GISMO-a Robert by M. Sheppard in 1951. In

1954, J. Rainbow developed a prototype machine, which was able to read uppercase typewritten

output at the incredible speed of one character per minute [32].

14

Then a series of dramatic developments took place in technology, during the late 1960s. Then in

1971, the OCR technology revolutionized bill payment systems and many others [32].

Very fast, reliable, fonts independent and less expensive with higher accuracy OCR systems are

available today.

2.2 Research Techniques for Character Recognition

This process can be further divided into two steps.

2.2.1 Segmentation

The accuracy of recognition process is heavily dependent on the type of the word decomposition

technique used because the wrong segmentation produces misrecognition or rejection. For

recognizing cursive script, there are two main approaches to deal with connected characters in a

word.

2.2.1.1 Segmentation Free Approach

In segmentation free approach, the ligature as a whole is used instead of segmenting it into

smaller units. The above approach has been used with different pattern matching techniques,

which focuses on using features of image as a whole to classify the word, for better performance.

One approach is to use line as a whole instead of further segmenting it into ligatures or words.

This technique is mainly applied with HMM where language independent global transformation

features are extracted from line and fed to HMM for recognition [5], [6], [7], [18].

In 1994, Elms, A.J [20] used segmentation free and HMM based, level building recognizer, for

the recognition of connected characters in which whole connected body is considered as one

segment. Though this method can work effectively for the Latin Script but it will not produce

fruitful results for Urdu script, which has main bodies as well as diacritics.

Figure 7 : Connected Component Method

Sometimes a combination of connected component labeling and centroid-to-centroid distance

method is used to extract ligatures from a line [26]. First a connected component labeling is

15

applied to separate each connected body irrespective of main body or diacritic. Then centroid-to-

centroid distance method is applied to associate each main body with its diacritics.

Figure 8 : Connected Component and Centroid-to-centroid Distance Method

Another effective method is to consider all the connected components touching along the baseline

as a ligature. Zahra Shah and Farah Saleem [22] used this method to separate main bodies from

diacritics.

Figure 9 : Connected Component and Baseline Method

Much of the recognition by expert system was done using segmentation free approach. The

Generalized Hough Transformation is applied on the image to detect objects in the image [13].

Most commonly used method for Latin script is to calculate vertical projection profile of each

line and to segment a line where the histogram has a zero value. That means a contour between

two gaps in the line is considered as a ligature. But again this method cannot be applied to

Nasteleeq script where the ligatures overlap.

Though a segmentation free approach has proved remarkable for the recognition process and

diminished the segmentation and computational effort, but it has also made the recognition

process more ligatures dependent in certain cases.

2.2.1.2 Segmentation Approach

The performance of any Arabic text recognition depends on how successful the system

overcomes the obstacles of cursive ness and context-sensitivity. The conventional approach is to

segment the words into either characters or symbols. Therefore, segmentation based approach has

further two divisions.

16

2.2.1.2.1 Segmenting into Characters

The ligature is correctly segmented into characters. In this approach segmenting accurately is the

most crucial part of an OCR. In segmenting the ligatures into characters, such techniques are

applied that prevent breaking up of character into more than one part. One approach is to segment

a sub-word by applying three sets of rules for the different positions in a sub-word: the first, the

last and the middle [23]. Another effective segmentation method is to search for the connection

points along the base line. One of the most common methods of segmentation uses the vertical

projection histogram and defines the connection points to be the locations where the value of the

threshold dips below a certain value.

Figure 10 : Histogram and Connection Point Method

Roberto J.Rodrigues and Antonio Carlos Gay Thome [31] used projection histogram for the

initial character segmentation and then further refined until a satisfactory performance was

reached. In [33] the combination of component labeling and vertical profile methods was used to

segment characters. First vertical projection profile method was applied on the connected

characters and then component labeling was applied to correct the mis-segmentation if any. M.

Mohamed and P. Gader [2] stated that combining segmentation-free and segmentation-based

techniques would give better results. Few segmented the word based on the special features of the

characters like slopes, peaks etc.

2.2.1.2.2 Segmenting into Primitives

Segmenting a word into parts smaller than a character called symbols i.e. segmentation at all

potential connection points and then combine them to make a character. This means segmenting

the sub-word into primitives possibly smaller than a character, like strokes, intersection points

and loops etc. This technique is a precondition for feature analysis [4].

Words are segmented into primitives based upon each pixel neighborhood. In 1995, H. Bunke, M.

Roth and E. C. Schukat-Talamazzini [17] proposed that the skeleton of a word can be broken

down at the pixel, with either one or more than three neighbors, to get loops and edges.

17

Figure 11 : Neighborhood Method

M. Mohamed and P. Gader [2] segmented the word image into primitives and then used

dynamic programming to find the best sequence of segments to match a given string.

Badr Al-Badr and Robert M. Haralick [11] described symbols in terms of shape primitives to

recognize machine-printed Arabic words without prior segmentation.

Another method is to cut the connected component into smaller parts called frames. In 1997, A.

Kornai [3] favored the character-level pre-segmentation due to the fact that sliding window

feature extraction is on average 10% worse. But it will be wrong to make such a conclusion on

this 10% difference because this minor difference may be due to the relatively little cursive nature

of data set.

Similarly few primitives can be extracted based on the special features of the characters. A 1999

system [4] by Khorsheed and Clocksin segmented the skeletonized script at end points, branch

point and cross points. Then these segments were fed to the HMM recognizer, in the descending

order relative to the horizontal value of the starting feature point, for the final recognition.

Figure 12 : Edge Based Segmentation Method

Another effective segmentation method is to search for the connection points along the base line.

The mixture of neighborhood and baseline method can also effectively segment the skeleton into

primitives [34]. The segments can then be recognized by a structure analysis approach.

Figure 13 : Neighborhood and Baseline Method

18

This approach has an advantage over the full character segmentation, as it is easier to find a set of

potential connection points, than to find the actual connection points directly.

2.2.2 Recognition

The character recognition algorithm can be sub divided in to two main components – The Feature

Extractor and the Classifier.

The descriptors, or feature set, used to describe all characters in a given character image are

analyzed and determined by the Feature Extractor which are then used as input to the character

classifier. The classifiers basically perform pattern-matching task.

Figure 14 : Character Recognizer

2.2.2.1 Feature Extraction

Simplification of the features required to describe a large set of complex data accurately and

efficiently is the main task of feature extraction. The analysis of complex data requires a large

amount of memory and computation power or a classification algorithm, which over fits the

training sample and generalizes poorly to new samples. Therefore, we can say that feature

extraction is a term used for methods of constructing combinations of the variables to overcome

these problems well still describing the data with sufficient accuracy [35].

Once the pattern is acquired, the next step is to extract the features of the pattern and pass them to

the classifier for recognition. The process of pattern recognition is heavily dependent on feature

extraction. The small feature set, which can efficiently discriminate between patterns of different

classes, should be selected. One should make sure that these features should be similar for

patterns within the same class [47].

19

The two main control approaches for feature extraction and classification are:

• Interleaved Control: In such systems, the pattern is classified incrementally i.e. few features

are extracted from a pattern and then a pattern is tried to be recognized in number of passes with

the addition of new features in each pass. So it’s the alteration between feature extraction and

classification [47].

Figure 15 : Interleaved Control

• One-Step Control: In One-Step control, first all the required features from a pattern are

extracted and then the pattern is classified [47].

Figure 16 : One-Step Control

There are three different types of features, which are:

2.2.2.1.1 Structural Features

Structural features describe a pattern in terms of its structure and geometry by giving its global

and local properties. The pattern is carefully analyzed to extract these features from it. Thus it

may be said that these features are heavily dependent on the kind of the pattern to be classified. In

20

the case of characters, the features include strokes, and bays in various directions, end points,

intersection points, loops, dots and zigzags [47].

A pseudo- neuronal system [29] for reading cursive script extracted ascender, decender and loops

from the image. A 1999 system [4] by Khorsheed and Clocksin extracted normalized length,

curvatures, endpoints, relative location and curved features which include the number of pixels

above the top feature point, below the bottom feature point, left of the left most feature point and

right of the right most feature point of that segment, from a word's skeleton for recognition. In a

template matching approach [22], features like location of dots and placement of other diacritics

for each ligature are extracted. In 2001, M. Fahmy and S. Al Ali [25] used features like junctions,

turning points, loops, strokes etc and fed them to neural networks to automatically recognize off-

line handwritten Arabic words and obtained a recognition rate of 69.7%. Then in 2002, S. Snoussi

Maddouri, H. Amiri, A. Belaïd and Ch. Choisy[27] used a combination of global (baseline,

ascenders, descenders, positions of primitives and loops) and local features (position

normalization, starting point normalization, harmonic phase normalization and size

normalization) for recognition.

Some approaches use the height, width, number of cross-points, and the category of the pattern

(character body, dot, etc), the presence and number of dots and their position with respect to the

baseline, the number of concavities in the four major directions, the number of holes and the state

of several key pixels, the number of strokes or radicals and the size of the stroke’s frame, and the

connectivity of the character.

In 2000, D. Megherbi, S.M. Lodhi, J.A. Boulenouar [12] used Fuzzy logic to make a

classification of 36 Urdu characters into seven sub-classes, namely sub-classes characterized by

seven well defined fuzzy features (number of dot(s) present in the character, place of the dot,

branch or presence of secondary stroke, ratio (height to width), slope between initial point and

final point, initial angle and begins with intersection) to characterize Urdu characters. H. Bunke,

M. Roth and E. C. Schukat-Talamazzini [17] used the feature set, having information related to

spatial location of an edge, curvature and percentage of pixels around the curved edges for offline

cursive handwriting recognition. In [8] a single hidden Markov model (HMM) was trained with

the structural features (dots, end points, branch points, cross points, simple loop, complex loop

and turning points) extracted from the words.

Sometimes features based on the concept of topological, contour and water reservoir are extracted

[33].

21

Extraction of structural features from an image is not an easy task but it can very well

accommodate distortions and variations in writing styles [47].

2.2.2.1.2 Statistical Features

The statistical features describe a pattern in terms of a set of characteristics measurements

(median, stand deviation, variance, ratios etc) extracted from the pattern. That is to say the

features based on the statistics derived from the pattern. These features normally require some

kind of computation or calculation. Zoning, characteristic loci, crossings and moments are the

statistical features that are mainly used for Arabic text recognition [47]. A multi-tier holistic

approach [26] for the offline recognition of cursive Urdu Text (Noori Nastaliq Script) used

measurements like solidity, Axes Ratio, Eccentricity; Moment based features, normalized length

feature, and curvature feature and number of holes.

Sometimes Local Line Fitting is used as features by fitting the pixels belonging to each receptive

field to a straight-line using eigenvector line fitting or orthogonal regressions [10].

Sometimes each column in a word is represented as a feature vector (transition features) [2].

In general, statistical features can be easily and rapidly extracted from a text image. Apart from

their tolerance to accommodate moderate noise and variation, they also make systems robust for

new fonts [47].

2.2.2.1.3 Global Transformations

By the transformation scheme, the pixel representation of the pattern is converted to an

alternative more abstract representation. As a result, the dimensionality of features is reduced

[47].

One approach is to use a transformation based on vertical and horizontal projections of a pattern

i.e. length of the ON pixels in columns or rows respectively. The comparison of each row or

column can further add to the information. [20], [1].

This scheme is basically used with HMMs. Sometimes same advanced technology that is used for

speech recognition is utilized that is to compute Intensity (percentage of black pixels within each

cell) as a function of vertical position; Vertical derivative of intensity (across vertical cells);

Horizontal derivative of intensity (across overlapping frames); Local slope and correlation across

a window of 2-cell square for each frame of image and pass it as a feature vector to HMM[5].

22

In another approach pixel values are used as the basic global transformational features but this

approach results in a vast amount of features for each frame. In order to reduce the number of

features a Loeve-Karhunen Transformation is performed on the grey values of each frame [7].

Unlike statistical and structural features, transformation schemes can be easily applied. They can

tolerate noise and variation but sometimes require the use of additional features to obtain high

recognition rates [47].

2.2.2.2 Pattern Matching / Classification

The classification is the main decision making stage in OCR. Classifier tries to classify the

pattern, as a member of a certain class, by comparing the extracted features to those of the model

set. Classification does not produce a unique solution instead give a set of approximate solutions

as an output [47].

Some widely used recognition techniques are discussed below.

2.2.2.2.1 Template Matching

Template matching is a pixel-by-pixel comparison of a pattern with a set of pre-defined pattern

templates. It is considered that the pattern, under consideration, belongs to the class of the

template to which it is most similar. Each extracted word from the image is compared with a

predefined shape words from a predefined data set. The basic method is to go through all the

pixels in the image and compare them to the pre-defined templates. After all templates have been

compared with the image, the unknown image’s identity is assigned as the identity of the most

similar template. Though this method is the most simple and easy to implement but it is very slow

in process [47]. In 1998, Syed S. Hyder and Ali Khoujah [23] used template matching to classify

the word. They first determined the different shapes (final, medial, isolated and initial) of each

letter in the printed cursive script and then developed context sensitive grammar to accept a shape

by pattern matching with sub-shapes or templates that constituted a class of character shapes.

Zahra Shah and Farah Saleem [22] used template matching for the recognition of type written

Urdu Nastaleeq script and obtained an accuracy rate of 79%. The two pass method for the

recognition of visible features was used. The diacritics were recognized in the first pass by

template matching and then the visible features recognition strategy was applied on the main

bodies to correct any misrecognition of diacritic as a main body.

There are several disadvantages of using template matching; firstly the limited number of fonts

can be recognized using the template matching. Secondly, a little variation in the image can

23

greatly affect the output of the OCR. Thirdly, it requires great computational effort and time for

recognition.

2.2.2.2.2 Statistical Classifier

The statistical pattern recognition is a machine intelligence approach, which generates stochastic

parameters to characterize the properties of the pattern to be recognized. Its main purpose is to

determine the class or category for a given sample. This is achieved by using statistical

methodologies such as statistical hypothesis testing, correlation and Bayes classification [32].

2.2.2.2.2.1 Neural Network

In recent years, neural network have been successfully applied in many tasks of pattern

recognition and machine learning systems. An Artificial Neural Network (ANN), information

processing paradigm, consists of larger number of highly interconnected processing elements

called neurons, whose structure is drawn from analogies with biological neural systems, working

in union to solve a specific problem. ANN, like people, learns by example [50].

Figure 17 : Neural Network

1n 1995, Myriam Côté, Eric Lecolinet, Mohamed Cheriet and Ching Y. Suen [29] recommended

pseudo- neuronal system for reading cursive script. The system was based on hierarchically

organized three levels of detectors: the feature detectors, the letter detectors and the word

detectors. Interaction between the detectors was through excitatory activation of neural type. The

system was in fact modified form of perception model of McClelland & Rumelhart with addition

of some characteristics specific to cursive script.

J. Hébert, M. Parizeau and N. Ghazzali [24] used Neural Networks for isolating handwritten

characters in cursive words without prior segmentation and without applying any constrain. The

24

key idea behind this strategy was to move a window of attention around the cursive word, search

for instances of known characters. If the current window contains some significant part of a

character, then translate and scale the window of attention in such a way that it converges to the

bounding box of that character. However, this approach can be completely trained using data sets

of isolated characters only.

In 2001, M. Fahmy and S. Al Ali [25] used geometrical features and neural networks to

automatically recognize off-line handwritten Arabic words and obtained a recognition rate of

69.7%.

Then in 2002, S. Snoussi Maddouri, H. Amiri, A. Belaïd and Ch. Choisy[27] used the idea of the

preceptor system, developed by M. Cote for Latin word recognition, for the recognition of Arabic

hand-written words. He combined a global and a local vision modeling (GVM- LVM) of the

word in a NN called Transparent Neural Network. A recognition rate of 97% was achieved. Text

was recognized first by examining the words and if that failed then the individual letters were

examined.

Again in 2002,Syed Afaq Husain and Syed Hassan Amin [26] proposed a new multi-tier holistic

approach for the offline recognition of cursive Urdu Text (Noori Nastaliq Script). The special

ligatures (Dot, Tay, Hamza and Mad) were identified first from the base ligatures using a Feed

Forward Back propagation neural network with 15 inputs, 25 hidden and 25 output neurons. The

feature vectors were fed to this neural network. It then identified the ligatures as either special

ligatures or base ligatures. In the second step, these special ligatures (diacritics) were associated

to the main bodies by using minimum centroid-to-centroid distance. Finally, the extracted

information (main bodies along with their diacritics) was fed to the Feed Forward Back

Propagation neutral network for the final recognition task. The network architecture consisted of

34 inputs, 65 hidden neurons and 45 output neurons. The performance of the system on images

containing the trained ligatures only was 100%. However untrained additional ligatures were

classified to the closet match in the training set since no rejection class was utilized. The main

disadvantage of this strategy was that only trained data set would be recognized properly.

A Back propagation neural network, which is a multilayer perception model with an input layer,

one or more hidden layers and an output layer is used in simple OCR application. Alex

Cherkasov [28] presented a technique for the development of OCR using Artificial Neural

Network.

25

Though recognition by NN has remarkable contribution to the development of OCR, but training

of neural networks is a difficult task, moreover the network created by NN is very complicated. A

mesh kind of network is created which is difficult to learn and understand.

2.2.2.2.2.2 HMM

A hidden Markov model, commonly abbreviated as HMM, is a statistical model where the system

being modeled is assumed to be a Markov process with unknown parameters. Based on this

assumption, the hidden parameters are determined from the observable parameters. These

extracted model parameters are used in the pattern recognition applications for further analysis.

The pattern may be a speech pattern or an image pattern [48].

The Hidden Markov Model has a finite set of states where each state is associated with another

through some probability distribution. Transitions among the states are governed by a set of

probabilities called state-transition probabilities. At a particular state, the outcome or observation

is generated according to the associated probability distribution called an observation symbol

probability. The state is not visible but only the outcome of the state can be seen to an external

observer, therefore states are ``hidden'' to the outside; hence the named Hidden Markov Model.

Since two types of probabilities govern the movement from one state to another so the system is

doubly stochastic in nature [48].

Figure 18 : Hidden Markov Model [49]

Continuous Hidden Markov Model and level building dynamic programming algorithm [19] were

used for the recognition of connected and degraded characters. Using a structural analysis

algorithm, a number of competing HMMs were used to recognize characters, derived from the

small segments of each word.

26

Elms, A.J [20] proposed a character recognition technique without a prior segmentation in which

global transformational features (horizontal profile) were fed to HMMs for recognition. The level

building recognizer provided the internal segmentation as well as recognition. The recognition

rate of 97.8% was achieved but the character pairs (pb) and (dq) still caused problems.

A.J. Elms along with J. Illingworth [1] again used global transformational features and HMMs

for the recognition of isolated characters and claimed a performance of 94%. They represented

the character by analyzing the pixel pattern in columns/rows of its image and linked sequential

column/rows patterns together with a HMM. Shift Invariant Hamming Distance and the center of

gravity of each column/row were used as a feature vector. Since the horizontal and vertical shape

profile requires significant computational effort and some pre-segmentation at character level, so

the system becomes inappropriate for the languages with connected script.

In 1995, H. Bunke, M. Roth and E. C. Schukat-Talamazzini [17] came forward with the off-line

recognition technique for cursive handwriting based on Hidden Markov Model. The features

were derived from the arcs of skeleton graphs of the word and fed to HMM for recognition. The

recognition rate of 98% has been achieved in experiments with cooperative writers using two

dictionaries of 150 words each.

An MD-HMM approach (one-model per word) is called Non-Ergodic HMM (NEHMM). Non-

Ergodic HMM grows linearly with the dictionary size and is computationally very expensive.

Amlan Kundu, Yang He and Mou-Yen Chen[21] came with the idea that if the computed duration

statistics from VDHMM is used to implement MD-HMM approach, then the experimental results

will improve significantly.

A 1999 system [4] by Khorsheed and Clocksin extracted structural features from a word's

skeleton for recognition, and converted the skeletonised script into an observation sequence

suitable for an HMM recognizer. Though recognition rates of up to 97% were achieved but this

technique still requires a considerable computation effort and much better preprocessing.

Moreover, this technique does not suit Arabic handwritten words, since diacritics are not written

exactly below or above each character or edge feature as described in the paper. Also, some of the

geometrical features don’t suit handwritten Arabic words [16].

Zhidong Lu, Issam Bazzi, Andras Kornai, John Makhoul [5] , suggested that image recognition

will be the same as speech recognition in which global transformational features will be

calculated for each frame of the window along the image and will be passed as feature vectors to

27

HMM for final recognition. The system only requires a sample retraining along with ground truth

for each new language or script.

Marija Bojovic & Milan D. Savic claimed that systems based on discrete HMMs are generally

much faster than systems based on semi-continuous HMMs.[6]

Simon Günter and Horst Bunke [15] proposed a new combination method for HMM based

handwritten word recognizers. They combined various HMMs at a more elementary level. The

sum of time complexities of the recognition process of the individual classifiers was much lower

than the time complexity of the above mentioned technique. Though this method produced

favorable results but the overall high time complexity of the recognition process made the system

inappropriate.

Somaya Alma’adeed, Colin Higgens, and Dave Elliman [16] presented a complete scheme for

totally unconstrained Arabic handwritten word recognition based on a Model discriminant HMM.

After applying pre-processing and skeletonization, the classification process based on the HMM

approach was used. The output was a word in the dictionary. A detailed experiment was carried

out and recognition rate of 45% was reported, which is very low rate.

There is some work done on the offline recognition system for Arabic handwritten words. The

recognition system of Mario Pechwitz and Volker Maergner was based on a semi-continuous 1-

dimensional HMM. Sliding window approach was used to collect the features and testing was

performed using the new IFN/ENIT - database of 26459 handwritten Arabic words (Tunisian

town/village names) by 411 different writers. On the word level the recognition rate was 89%[7].

In another method [8] features were extracted from the manuscript words and trained a single

hidden Markov model (HMM). The HMM was composed of multiple character models where

each model represented one letter from the alphabet. Testing was done using samples extracted

from a historical handwritten manuscript.[8]

R. El-Hajj, L. Likforman-Sulem and C. Mokbel [18] used continuous 1D HMM for the offline-

handwritten recognition. Language independent features, such as lower and upper baselines, were

extracted from the image. Thus the word variability due to lower and upper parts of words was

tackled properly. Internal segmentation was applied by the continuous 1D HMM. Initially the

recognition rate was 74.90% but it improved later on to 86.51% with the addition of the features

related to the detected baselines. This technique cannot be applied to Nastaleeq style, a cursive

writing style, where there is no specific baseline.

28

2.2.2.2.3 Clustering

Clustering is the process of organizing objects into groups whose members are similar in some

way. i.e. the objects within a same cluster are similar while they are dissimilar to the objects

belonging to other clusters. New features can be constructed, which are the abstractions of the

existing features, by using these clustering algorithms. Some algorithms, like k-means, simply

partition the feature space. Other algorithms, like single-link agglomeration, create nested

partitioning which form taxonomy [50].

Fuzzy logic is a problem-solving control system methodology, which is derived from fuzzy set

theory dealing with reasoning that is approximate rather than exactly deduced from classical

predicate logic. That is to say, it provides a remarkably simple way to draw definite conclusions

based upon imprecise, ambiguous, vague, fuzzy, blurred, noisy, or missing input information. It

mimics the human ability to make decisions. Fuzzy reasoning involves three steps Fuzzification

of the terms that appear in the conditions of rules, Inference from fuzzy rules, Defuzification of

the fuzzy terms that appear in the conclusions of rules [12].

In 2000, D. Megherbi, S.M. Lodhi, J.A. Boulenouar [12] used Fuzzy logic to make a

classification of 36 Urdu characters into seven sub-classes. Fuzzy logic is used, as it is viewed as

generalization of multi-valued logic. All kind of uncertainty and imprecision in knowledge

representation, inference, and decision analysis would be dealt properly and easily with fuzzy

logic [12].

2.2.2.2.4 Nearest Neighbor

Nearest Neighbor classifier takes a point in vector form. Then computes the distance between this

test point and the vector representation of each training set. The training data closest to the test

point is considered its nearest neighbor. This helps to allocate a class to the test point [50].

In 2005, Ali Ahmadi, Yoshinori Shirakawa, Md. Anwarul Abedin, Kazuhiro Takemura, Kazuhiro

Kamimur, Hans Jurgen Mattausch and Tetsushi Koide [30] proposed an associative memory

based system for real-time character recognition. The nearest-distance search algorithm applying

the associative memory was used for classification. Though a template matching was used for the

classification but mixed analog-digital and fully parallel architecture of the associated memory

made the technique very speedy with a small hardware size. The average search time for each

character was 45ns.

29

2.2.2.2.5 Syntactical / Structural

In Syntactical or structural pattern recognition, the pattern representation is through the symbolic

data structures such as arrays, strings, trees, or graphs, which portray the relations between

fundamental pattern components and also allow hierarchical model representations. Symbolic

representation of an unknown pattern is compared with a number of predefined objects for

recognition. This comparison is made by a symbolic match so as to figure out the similarity

measurement between the unknown input and the number of object models [32].

Very little has been done on machine printed OCR. Badr Al-Badr and Robert M. Haralick [11]

used the mathematical morphology operations to detect primitives, from machine-printed Arabic

words without prior segmentation, which were then matched with symbol models. The system

then yielded the word with the highest posterior probability as a recognized word. The advantage

of using this whole word approach versus a segmentation approach is that the result of

recognition is optimized with regard to the whole word. The accuracy rate, of preliminary

experiments using a lexicon of 42,000 words, was 99.4% for noise-free text and 73% for scanned

text.

Stephen Pearce and Maher Ahmed [14] used a powerful graph representation and an evolutionary

algorithm framework for segmentation and recognition. First segmentation hypotheses were

initialized and then refinements were made to have a fully segmented and recognized string. The

evolutionary segmentation was tested in many domains including connected digits, connected

characters and simple circuit diagrams. The performance of the evolutionary algorithm depends

heavily on the symbol recognition system used.

Roberto J.Rodrigues and Antonio Carlos Gay Thome [31] stated the use of the decision tree

algorithm for cursive script recognition based on the use of histogram as a projection profile

technique. The basic idea was to construct a tree with successive refinements, from the data of the

histograms until an acceptable performance was reached. Successive levels of the tree were

allowed based on heuristic criteria. The algorithm included three steps: Image compensation,

Initial segmentation and refinement. The results of first stage classifier were quite satisfactory.

The use of projection histograms for the process of character segmentation only solved 70% of

the problems with connected digits but the problem of trace slant still persisted.

A graph-based structural segmentation approach based on the topological relation between the

baseline and the line adjacency graph (LAG) representation of the text was used in [34] . The text

was segmented to sub-character units, which were then recognized by a structure analysis

30

approach. Separate dots and diacritics classifiers were used for their recognition. A regular

grammar that described how characters were composed from scripts was used for the final

character recognition. The segmentation algorithm failed due to characters touching in irregular

positions especially in the case of dot, descender and ascender touching.

The main disadvantage of this approach is its time and computational complexity.

2.2.2.2.6 Expert System

An expert system is a knowledge based computer program that contains some of the subject-

specific knowledge/information of one or more human experts.

Eric Lecolinet [9] proposed a model for cursive script recognition, which was based both on a

top-down recognition scheme called Backward Matching and a bottom-up feature extraction

process. Both worked in a competitive way. Visual Indexes (stroke extraction, loop extraction,

baseline determination and VI clustering and normalization) were extracted first by the bottom-up

recognition stage. Then, the symbolic descriptions of the words were verified by Backward

Matching process. This technique requires pre-processing for better results.

A new representation method for handwritten character recognition called LLF (Local Line

Fitting), was proposed by Juan-Carlos Perez, Enrique Vidal and Lourdes Sanchez [10]. They

applied geometric operation to extract the features (density and the other features that represent

the fitted straight line) from the image. The method yielded a relatively low dimensional and

distortion invariant representation. This method can be used by any geometrically based

classification systems such as Neural Networks, vector space distance-based methods,

discriminant functions, etc as it produced a fixed-size vectors. Later on many handwritten digits

and letters recognition applications used this method.

Sofien Touj, Najoua Essoukri Ben Amara and Hamid Amiri [13] proposed an approach for

Arabic character recognition based on the use of a Generalized Hough Transform (GHT).

Different feature points were extracted and different characters’ model was build. A recognition

success average rate of 93% was obtained.

U. Pal and Anirban Sarkar [33] used a combination of topological, contour and water reservoir

concept based features for the recognition of individual characters of Printed Urdu Script. A

classifier tree was designed using these features. The characters were recognized in two stages. In

the first stage, the characters were classified into groups by a feature based tree classifier. In the

second stage, more sophisticated features were used to recognize similar characters. The design

31

of a tree classifier had three components: (1) a tree skeleton or hierarchical ordering of the class

labels, (2) the choice of features at each non-terminal node, and (3) the decision rule at each non-

terminal node. It was basically a CART tree. A prototype of the system had been tested on printed

Urdu characters and achieved 97.8% character level accuracy on average.

3 Urdu Writing System

Urdu is National Languages of Pakistan. It is from Indo-Iranian subfamily of the Indo-European

family of languages. Arabic script in Nastaliq style with several extra characters is used to write

Urdu [38]. Many languages share Arabic script of writing, thus written from right to left (RTL)

[36]. In Urdu, number system is written form left to right, so Urdu writing system has both the

properties of left to right and right to left writing systems as shown in the figure [37]:

Figure 19 : LTR and RTL Directions in Urdu script

The language is not related to Arabic, but to languages of Northern India, especially Hindi [36].

Some people in the Hindustan region (Northern India) embraced Islam during the Mughal period.

They started using Arabic script and gave birth to a new language called Urdu. [36]. Others

remained Hindu and used the Devanagari alphabet. This language became Hindi [36].

Several letters of the Persian script, which itself is an adaptation of Arabic script, specific to Urdu

were added. Urdu script is cursive, the appearance of a letter changes depending on its

context/position: isolated, initial (joined on the left), medial (joined on both sides), and final

(joined on the right).

Another interesting property of Urdu writing system shows that Urdu writing system is context

sensitive, that is, characters change their shapes depending upon the characters following and

preceding it as in figure below. This change follows some rules like, shape of the next letter to

join with and the shape of the character, which is joining [37].

32

Figure 20 : Context Based Multiple Shapes of س “Seen”

A famous Muslim scholar Ibne-Muqalla developed a new font called Naskh in the year 310 AH.

Naskh font dominated all the previously invented fonts due to its easy and fast writing system

[37].

One traditional device of writing Urdu is Qalam (chisel-tip of flat-nib bamboo). Size of the flat-

nib is primary parameter of size of the shape of an alphabet. In early days, distinguished writers

also used feather as writing device. These days, normal rounded-nib writing instruments (namely:

fountain-pen, ballpoint-pen, pencil, etc.) are widely used. However, Qalams are also available in

the market and used by professional and serious script specialist (termed “Khattaat” in Urdu).

Urdu is written in Pakistan, Middle East, and Muslim parts of India, Mauritius, and South Africa

[38]. There are almost 104 million (1999 WA) people (including second language speakers), out

of which around 60 million use it as first language. But this data is about Urdu speakers; how

many of them can read and write depend on literacy rate and essentiality of Urdu in education.

Following are the primary alphabets along with their IPA symbols and pronunciation (in Urdu

and IPA):

33

Figure 21 : The Primary Alphabets along with their IPA Symbols and Pronunciation

There are some other characters, which are not unanimously considered part of Urdu primary

alphabets; they are rather considered variations of the above given alphabets. These disputed

characters are listed latter in the table showing Unicode & Glyph Mapping using sub-serial (a)

and (b) along with the serial of main primary alphabet.

Hindi/ Urdu together represent the third most spoken language in the world [41]. Urdu has also

been written using the Roman script since the days of the British Raj (British Rule in India). [41].

Technically, linguists do not distinguish between Hindi and Urdu as separate languages [35]. It is

only the script that distinguishes between the two at a substantial magnitude. Hindi is non-

cursive, straight-line, left-to-right, Sanskrit-based script, which is a total contrast to Urdu. This

argues the importance of writing systems for identity of language.

Other symbols used in Urdu writing systems are:

34

Table 1 : Diacritics (Erabs) in Urdu

Zabar

Zair

Pesh

Jazm

Mad

Tashdid

Alif -short

Chota toein

Do zabar

Do zair

Do paish

Ulti paish

Table 2 : Urdu Numerals

Zero

Five

One

Six

Two

Seven

Three

Eight

Four

Nine

Table 3 : Punctuation Marks in Urdu

Question Mark ؟

Comma ، Semi-colon

Colon ؛

: Full stop

Dash ۔

- Opening bracket

) Closing bracket

(

35

Exclaiming Mark

! Forward slash

/ Open quote

“ Close quote

“

3.1 Character to Unicode and Glyph Mapping

Urdu speakers need both Arabic script support and Urdu support. Arabic code points in the

U+0600 - U+06FF range represent all of the letters without regard to their position. It is up to the

font to show the letter with the proper appearance.

Table 4 : Different Shapes of Character in Urdu

Serial # Character-Name Unicode Alternating

Glyphs

Alternating

Glyphs

with

Context (in

a ligature*)

0

Hamzah

(without

kursi) ہ��ہ U+0621 ء___ ء#

1 Alif ا�� U+0627 ا ا__

(FINAL)

ا__

1a alif madd �ّ __� � U+0622 ا��

(FINAL)

� __�

2 Bē ے� U+0628 ب ب

(INITIAL)

� � __

(MEDIAL)

� _� _

(FINAL)

� ب_

3 Pē ے� U+067E پ پ

(INITIAL)

� � __

(MEDIAL)

� __ � __

(FINAL)

� __ �

36

3.2 Advance Script Properties

Some of the Properties of Urdu Script are:

3.2.1 Character Composition and Decomposition

Few letters can combine together to form a new letter and similarly letters can be decomposed.

Figure 22 : Character Composition and Decomposition

3.2.1.1 Different Shapes

The Arabic script is cursive, and all primary letters have different forms depending on whether

they are at the beginning, middle or end of a word, so they may exhibit 4 distinct forms (initial,

medial, final or isolated).

3.2.1.1.1 Isolated Form

If a writing group consists of only one letter, the isolated form is used.

Figure 23 : Isolate form of Bay

3.2.1.1.2 Initial Form

The starting letter of the word takes the initial form i.e. joined on the left.

37

Figure 24 : Initial form of Bay

3.2.1.1.3 Final Form

The ending letter of the word takes the final form i.e. joined on the right.

Figure 25 : Final Form of Bay

3.2.1.1.4 Medial Form

The letter, which does not take the initial, final or isolated form takes the medial form i.e. joined

on both sides.

Figure 26 : Medial Form of Bay

3.2.1.2 Connection Form

In specified situations, default glyphs are replaced with alternate forms that provide better joining

behavior.

Figure 27 : Connection Forms

38

3.2.1.3 Diacritic Placement

For the proper pronunciation of the words, the Urdu characters need some kind of diacritics. The

diacritics appear above or below a character to define a vowel or to emphasize a particular sound.

There are a number of diacritics, the common ones being Zabar, Zeir, and Pesh.

Figure 28 : the diacritic: Zeir, Zabar and Pesh included with the character say

Figure 29 : An Urdu Phrase Written in Nastaleeq Script with Diacritics

3.2.1.4 Presence of Base Line

The base line of Urdu is a horizontal line, which runs through the text, cutting all the words at

some point.

Figure 30 : Base Line in Urdu Text

3.2.1.5 Overlapping

The characters in Urdu overlap vertically and do not touch each other.

Figure 31 : Overlap in the Character

39

3.2.1.6 Upper and Lower Cases

Unlike English, there is no concept of upper case and lower case in Urdu language writing.

3.2.1.7 Ligature Formation

In Urdu text, more than one character joins together to form ligature. A word may comprise of

one or more ligatures and ligature in turn may consist of one or more characters. For example, in

the word Pakistan (ن�� ) there are 3 ligatures i.e. Pa ( ), kista (��) and n (ن).

3.2.1.8 Strokes

There are two types of strokes of Urdu characters. The main stroke is the longest continuous

portion of the character that is written before lifting the pen.

Figure 32 : Main Strokes

The secondary stroke is the other portion of the character written after the main stroke.

Figure 33 : Encircled are the Secondary Strokes

3.2.1.9 Script

Urdu is written using different scripts. Few of them are listed below

40

Figure 34 : Examples of Different Urdu Scripts

3.3 Nastaleeq Script

Most widely used Nastaleeq script is taken as a standard for writing Urdu language. It is a

combination of two different fonts, Naskh and Taleeq. It was initially created by Mir Ali Tabrezi

and has been refined over the past 600 years. It is a complex script as it based not only on the pre-

defined rules but also on the aesthetic sense of the calligrapher. It is written using a flat nib

(traditionally using bamboo pens) and is highly cursive and context sensitive in nature [44].

Apart from the characteristics of Urdu, Nastaleeq also exhibits some other properties, which

make it distinct and complex.

41

Figure 35 : Nastaleeq Script [37]

3.3.1 Properties of Nastaleeq Script

Some of the intricate features of Nastaleeq are:

• Unlike Naskh, which has only four shapes for each letter, Nastaleeq letters adopt different

shapes depending on the context in which it is written [44].

•

Table 5 : Number of Shapes of Nastaleeq Script

NUMBER OF SHAPES WITH

RESPECT TO POSITION

CHARACTER

Isolated Initial Media

l

Final

Bay 1 23 19 2

Jeem 1 19 17 1

Ray 1 - - 2

Dal 1 - - 1

Seen 2 18 17 1

Suad 1 17 17 1

Toay 1 17 17 1

Ain 1 17 17 1

Fay 1 19 17 1

Kaf 1 19 17 2

Lam 1 17 19 1

Meem 1 17 17 2

Vao 1 - - 1

42

Hey 1 17 17 2

Hey (do chasmi) 1 17 17 2

Alif 1 - - 2

Since the Nastaleeq is a highly context sensitive script, a single letter at a particular position may

constitute different shapes.

ISOLATED SHAPES

Following are the isolated shapes of some of the characters

Figure 36 : Isolated Shapes [44]

INITIAL SHAPES

Following are the initial shapes of some of the characters.

Figure 37: Initial Shapes of Bay [44] Figure 38: Initial Shapes of Jeem [44]

Figure 39: Initial Shapes of Seen [44] Figure 40: Initial Shapes of Suad [44]

43

Figure 41: Initial Shapes of Toey [44] Figure 42: Initial Shapes of Ain [44]

Figure 43: Initial Shapes of Fay [44] Figure 44: Initial Shapes of Kaf [44]

Figure 45: Initial Shapes of Laam [44] Figure 46: Initial Shapes of Meem [44]

MEDIAL SHAPES

Following are the medial shapes of some of the characters.

Figure 47: Medial Shapes of Bay [44] Figure 48: Medial Shapes of Hay [44]

44

Figure 49: Medial Shapes of Seen [44] Figure 50: Medial Shapes of Ain [44]

Figure 51: Medial Shapes of Fay [44] Figure 52: Medial Shapes of Kaf [44]

Figure 53: Medial Shapes of Laam [44] Figure 54: Medial Shapes of Meem [44]

Figure 55: Medial Shapes of Goal Hay [44] Figure 56: Medial Shapes of Do chasmi Hay [44]

FINAL SHAPES

Following are the final shapes of some of the characters.

Table 6 : Final Shapes of Nastaleeq Characters

Character shapes Characters

45

• It is written from top right to bottom left with each ligature titled at approximately 45 degree.

That is to say that it is written diagonally. It is space- conserving style of writing and takes

approximately 40% less space than Naskh [44].

Figure 57 : Nastaleeq Diagonality

• It has fine or fragile joins. The strokes are thin where a character shape joins with another

character. The character shapes themselves, are generally thick as compared to the joins. Thus,

there is an alternating sequence of thick and thin strokes in the ligatures [44].

46

Figure 58 : The Join where Fay Connects to Tay is Thin as Indicated by the Red Circle

• The joins are formed by cusp-like shapes, which are concave upwards and have their initial

end higher than the final end [44].

Figure 59 : Cusp-liked joins are circled in red

• It has complex Mark placement rules e.g. for Nukta’ s and diacritics [44].

Figure 60 : Two Different Placements of Dots of Chay

Figure 61 : Two Different Placements of Dots of Pay due to Zair

• There is not a fixed baseline on which characters are written. The baseline changes with the

addition of each character. That is to say that there are multiple baselines as it is different for all

characters within a ligature [45].

Figure 62 : Naskh Versus Nastaleeq Writing Style

47

The exit point of one character is joined with the entry point of next character. In order to do this

the shape of the character is changed.

Figure 63 : Different Base Line for Different Characters

• The overlapping problem is present not only in characters but in ligatures as well, as the

calligrapher always tries to produce a piece of script, which is beautiful and catches the eye [44].

Figure 64 : Overlap Between the Ligatures

• There is no monotony among shapes of letter at same position i.e. initial, final and medial

[44].

Figure 65 : Different Shapes of Kaf at the Same Medial Position

3.4 Noori Nastaleeq

In 1970, Pakistani calligrapher Ahmed Mirza Jamil revolutionized Urdu calligraphy by inventing

a new technique of joining Urdu letters and Words called Noori Nastaleeq. Thus made, the use of

computer for Nastaliq styles a reality [39].

Mr Ahmed Mirza Jamil calligraphically designed each ligature so as to make it suitable for

feeding it into the memory of a Computer. The resulting script was difficult, as it was not only

based on the pre-defined formulas but also on the aesthetic sense of the calligrapher.

Computerized composing replaced manual calligraphy but retained the elegance and beauty of the

written language with Nastaliq Script. Today the fast-computerized calligraphy has been admired

48

all over the world and countless dailies; magazines, journals and books are being published in

almost no time [39].

In 1982, Noori Nastaliq was designated as an Invention Of National Importance and Ahmed

Mirza Jamil was awarded Tamgha - I - Imtiaz by the Government of Pakistan [39].

4 Problem Statement

Significant amount of data that is available on the Internet is in the English language.

Electronically available Urdu data is mostly in image form, which is very difficult to process.

Printed Urdu data is the root cause of the problem. So for the rapid progress of Urdu language we

need OCR systems, which can help us to make Urdu data available for the common person.

Majority of the OCR systems available today are for the printed Arabic Script with reasonable

level of accuracy. Little work has been done on Urdu language that is mainly towards the

recognition of Naskh script. It is a simpler script than Noori Nastaleeq [51].

A widely used script in newspapers, governmental documents and books is the Noori Nastaleeq

script but there is a very little reported effort at developing OCR systems for Noori Nastaleeq

script.

Urdu, the official language of Pakistan, is mostly used in newspapers and in little government

documentation. 80 % of the population uses their native languages for communication. But still

no one has done a significant work for the development of the OCR system.

Therefore, the lack of interest of people and unavailability of fonts and Unicode for a long time

proved to be big hurdles in the development and advancement of Urdu OCR system [47].

All the above facts turned to be the motivation behind our work. Moreover no one has used HMM

technique for the recognition of “Noori Nastaleeq Script”. Emphasizes of this project is to convert

a scanned textual image into an editable file format with the reasonable level of accuracy.

The development of this project will cover another milestone towards the advancement of Urdu

language as well as help us to maintain computerized Urdu data.

49

4.1 Problem Scope

The scope of this work is based on the following assumptions:

• Noori Nastaleeq developed by Mirza Ahmed Jamil will be used for analysis,

development and testing.

• The system will be optimized for a single font-size i.e. point size 36.

• All the shapes (isolated, initial, final and medial) of the following class will be handled.

Table 7 : Shapes that will be Recognized

Characters Class Sr.

no

1

2

3

4

5

6

• Only the single columned document without formatting is handled. As shown below

50

Figure 66 : Single Columned Document without Formatting

• Only the Urdu Noori Nastaleeq text without Punctuation marks (` ‘' ! ؟ ، " :؛ ( ) [ ] { } .),

Arithmetic operators and symbols, Urdu numerals ( ), diacritics,

signs, complex ligatures, etc is handled.

• The text in the table will not be handled.

• The system will be able to process only already pre-processed images. As we will be

concentrating on the main problem here i.e. recognition of Noori Nastaleeq and will

ignore these issues as other researchers have already efficiently addressed them.

• No post processing will be done.

5 Methodology

From the literature review we conclude that the statistical classifiers are better pattern matching

technique. We have seen two statistical classifiers commonly used for character recognition .i.e.

Neural Network and Hidden Markov Model (HMM).

Table 8 : Comparison of Neural Network and HMM

Sr. Neural Network Hidden Markov Model

1 Remarkable contribution to the

development of OCR system

Remarkable contribution to the

development of OCR system in case of

hand writing recognition.

2 Training of neural networks is a

difficult task as the number of samples

should be same for each trainable data

and the network needs to be built again

On the other hand training of HMM is very

straight forward. Network is only updated

with the addition of each new data, not

rebuilt. Number of samples for different

51

with the addition of each new data. So

training is a time consuming process.

data can vary.

3 Addition of training data gives the boost

to the performance but makes the

network very complicated. Mesh kind

of network is created which increases

the computational effort and time to

train the data.

The addition of training data increases the

recognition accuracy but there is no effect

on training process.

4 High recognition rates with a small set

of data. But the performance declines

with the increase of data set. The

system fails with the very large set of

data.

High recognition rates with all kind of data.

From the above table we can conclude that HMM is a better pattern matching technique which

has a high accuracy rate even with large data set.

So our system is based on HMM. The recognition process is divided into two steps

• Training

• Recognition

Methodology of training and recognition is as shown in the diagrams given below.

52

Start

Image Acquistion

Skeletonization of image

Framing

Dct calculation

HMM Training Model

write to .mmf file

End

Flow Chart For Training

Data Cleaning

Figure 67 : Flow Chart for Training

53

Start

Image Acquistion

Line seperation

Set BaselineAll lines

extratced

Extract Ligatures

Seperate Mainbodies and

diacriticsSkeletonization of mainbody

Framing

Dct calculation

HMM Segment Recognition

Recognize whole Ligature

write to unicode file End

Flow Chart For Recognition

Segmentation

All segments

recognized

No

Yes

Yes

No

Figure 68 : Flow Chart for Recognition

5.1 Scan Image / Open Image

The image, which needs to be converted, has to be obtained in this very first phase of OCR

systems. The OCR system needs a scanned image containing Urdu text as input. Since the

54

preprocessing will not be done so it is assumed that the image will be kept properly for scanning

to avoid skew in the image and there will be no noise and distortion.

Figure 69 : Scanned Image

5.2 Separate Line of Text

Our proposed system automatically detects and separates individual text lines from the image

using horizontal and vertical projection of pixels.

A projection profile is a histogram giving the accumulated sum of ON pixels along rows. It is in

fact, a one-dimensional array where each element denotes the number of ON pixels along the

respective row in the image. The trough between two consecutive peaks in the horizontal profile

will denote the boundary between two text lines. Similarly through vertical projection of each line

the right and left of line is marked.

Figure 70 : Separate Lines from the Text by Using Horizontal Histogram

After the lines separation, the boundaries of the lines are defined using vertical histogram.

Figure 71 : Boundary of Lines

55

5.3 Set Base Line

Since Urdu is written on a horizontal line called baseline so we need to mark it to differentiate

between main bodies and diacritics. The information of horizontal projection of pixels is used to

set the baseline of a line. Row that contains the maximum number of pixels will be set as a

baseline.

Figure 72 : Base Line Detection by Using Horizontal Histogram

One problem that arose when using this method is that sometimes baseline may be set towards the

top of line. To solve this problem we set the baseline by selecting the row containing maximum

number of pixels from lower half of line. Moreover instead of using a single row as a base line we

used band of baseline consisting of 5 to 10 lines of row. This base line is used to distinguish

between main body and diacritics.

5.4 Isolate Ligature and Diacritics

One line will be taken at a time and then the ligatures and diacritics are isolated from it. The

connected components that lie on the baseline will be considered as main bodies of ligature and

rest are considered as diacritics.

Figure 73 : Blue are the Diacritics while Black are the Ligatures

5.5 Segmentation of Main body

The proposed OCR system applies the Skeletonization process on the main bodies, to extract a

region-based shape features representing the general form of an object.

56

“Skeletonization is a process for reducing foreground regions in a binary image to a skeletal

remnant that largely preserves the extent and connectivity of the original region while throwing

away most of the original foreground pixels” [46] .

The definition of the skeleton will be illustrated by the prairie-fire analogy: the fire is set on the

boundary of an object and the loci where the fire fronts meet and quench each other is the

skeleton [46].

Figure 74 : Sketetonized Image

Figure 75 : Sketetonized Image of the Word Baad

As we know that Urdu is written right to left, so that means the starting point of the ligature (the

first pixel of the first character in the ligature) will lie some where around the right side of the

ligature. At this point we might think that the right most one neighbor pixel will be the starting

point.

Figure 76 : Right Most Encircled Pixel is the Starting Point of the Ligature

Though this concept can be valid for the Naskh script but the overlapping nature of Nastaleeq

script makes it invalid. Because in certain cases, the rightmost one-neighbor pixel is not

necessarily the starting point of the ligature.

57

Figure 77 : The Rightmost Pixel P1 is not the Starting Point of the Ligature whereas the Pixel P2

is the Actual Starting Point of the Ligature

In the above example though the pixel P1 is the rightmost pixel but that is not the starting point of

the ligature. Same is the case in the example below.

Figure 78 : The Rightmost Pixel P1 is not the Starting Point of the Ligature whereas the Pixel P2

is the Actual Starting Point of the Ligature

Computer is not so much intelligent as human being so it cannot judge what will be the starting

point of the ligature in Noori Nastaleeq Script. Since Nastaleeq is written diagonally and

characters overlap their preceding characters so that means there will be one and only one ending

point of the ligature and there is no ambiguity in that since no character overlaps the last character

of the ligature. So for this reason we have tried to find the last point in the ligature, which is

written when writing with pen.

The skeleton is traversed from left to right i.e. the left most ending point of the last letter will be

taken as the initial node.

In order to find the starting point for the traversal, first the lowest-left most pixel P1 having only

one neighbor is found and declared as potential starting point.

58

Figure 79 : The Starting Point of the Ligature is Circled with Red

Then left most pixel of the ligature is checked for vertical connected line to determine if the last

character in the ligature is Alif. If the last character is Alif then highest-left most pixel P2 is

declared as the starting point otherwise P1 is considered the staring point.

Figure 80 : The Actual Starting Point P2 of the Ligature is Circled with Red

while the Potential Starting Point P1 is Circled with Blue

The order of directions for the traversal is west, north, south and east. On the way of traversal

any pixel having more than two black neighboring pixels will be considered as break point. The

body will be segmented at the break points and all the edges will be disconnected from this point.

Figure 81 : Break points in a Sketetonized Image Circled with Red

Figure 82 : Segments in the Ligature are Colored Differently

59

The segmentation may result in over segmentation. As in the above case the letter ain is over

segmented into two segments.

Figure 83 : Over Segmentation in Medial Ain

Similarly it may result in under segmentation if there is no proper junction point between the two

letters.

For example in the word bib there is no break point and whole ligature is considered as one

segment.

Figure 84 : Under Segmentation in the Word Bib

The following table gives the segmentation in few ligatures.

Table 9 : Segmentation of the Ligatures

Sr. Segments Ligature

1

2

60

3

4

5

5.6 Framing

Figure 85 : The Framing of Segmented Word Baad

After skeletonization, the frames of each segment are fed to HMM for training and recognition.

We took a frame of 8x8 size, which constitutes of only 8 black pixels of the segment. The frame

traversal is same as ligature traversal explained above. We experimentally found out that larger

the frame size more ambiguous would be the system. We tested the system with 5x5, 8x8, 9x9,

12x12 and 16x16 frame sizes and found out that 8x8 gave us the best results.

5.7 Hidden Markov Model

Our selected research technique is Hidden Markov Model (HMM), which has an ability to

perform recognition with great ease and efficiency. Language-independent training and

61

recognition methodology; automatic training on non-segmented data; and simultaneous

segmentation and recognition are the useful aspects of using HMM in developing continuous

recognition technology [51].

5.7.1 HMM Training

When pre-processing is complete, before starting training process, the HMM parameters must be

properly initialized with training data in order to allow a fast and precise convergence of the

training algorithm.

Each segment is considered as a separate HMM. There are total of sixty HMMs.

Figure 86 : HMM Model Used

In order to cater all sorts of noise and variations in the image we need to collect at least 100

samples of each hmm model. [See appendix A]

For each segment we have to define the number of states required [see appendix B].

62

5.7.2 HMM Recognition

Before recognizing, we follow these steps:

• Build the task Grammar

• Build the Dictionary

• Make Network using HMM command

After training the model on the given data, the recognition process is performed. In recognition

process the skeletonized ligature is first segmented and then each segment is fed to HMM for the

segment recognition. HMM gives the output, which has the maximum probability given the

observation sequence.

5.8 Recognizing the Ligature

Once the segments are recognized, the rules are applied to recognize the ligature.

The following table gives us the examples of some of the rules.

Table 10 : Rules

Sr. Segments Letter

1

2

3

63

4

5

6

Let’s take an example of the word baad.

Figure 87 : Segmentation in the Word Baad

Table 11 : HMM Names

Sr. HMM HMM name

1

h022

2

h09

3

h029

4

h013

5

h014

64

Rules:

Rule 1: Bay � h022

Rule 2: Ain � h09 + h029

Rule 3: Dal � h013 + h014

So HMM will give us the HMM names in the following order

Figure 88 : The Results Returned by HMM and Empty Unicode File

Suppose we have a pointer to point the current HMM name. Now starting from the right most

Hmm, our pointer will be at location 5. According to the rule 1, h022 will be converted to bay. So

we will write bay in the Unicode file.

Figure 89 : The Pointer is at Location 5. After Applying Rule 1, Bay is Written in Unicode File

Now our pointer will be at location 4 and according to rule 2,ain is recognized and written in the

file, which results in a ligature comprising of bay-ain.

65

Figure 90 : The Pointer is at Location 4.After Applying Rule 2, Bay-Ain is Written in Unicode

File

Our pointer now jumps to the location 2. According to rule 3, h013 and h014 will form dal and is

written in the Unicode file.

Figure 91 : The Pointer is at Location 2.After Applying Rule 3, Bay-Ain-Dal is Written in

Unicode File

So we get the ligature baad, after applying the rules.

6 Results

This Section gives the results of the work done.

6.1 Purpose

The purpose of scientifically evaluating and analyzing Urdu Nastaleeq OCR is to measure its

accuracy in recognizing ligatures.

66

6.2 Sample

For analyzing the OCR the words were extracted from Nokia dictionary, which constituted the

letters from the above-mentioned classes. Three or more samples of each word were taken. The

Urdu words were written in font Noori Nastaleeq and font size 36. The pages were scanned at dpi

150.

6.3 Method

We tested each type of letter without diacritics by recognizing it using Urdu Nastaleeq OCR.

Three or more tokens of each word were used. Accuracy is calculated based on recognition.

Following measures were used to check system accuracy.

• Percentage of recognition of tokens

• Error rate of tokens

6.4 Test Results

The accuracy is calculated on letters and ligatures basis.

6.4.1 Character level Results

Accuracy of each letter is calculated separately. It has been tried that all the forms (initial, medial,

final and isolated) of the letter are covered in the testing data.

6.4.1.1 Class of Alif

Table 12 : Accuracy of Alif

Number of tokens 765

Number of tokens recognized 761

Percent tokens recognized 99.5%

Error rate in token recognition 0.5

6.4.1.2 Class of Swad

Table 13 : Accuracy of Swad



Percent tokens recognized 80%

67

Error rate in token recognition 20%

6.4.1.3 Class of Dal

Table 14 : Accuracy of Dal




Error rate in token recognition 0.7%

6.4.1.4 Class of Choti Yeh

Table 15 : Accuracy of Choti Yeh





6.4.1.5 Class of Ain

Table 16 : Accuracy of Ain





6.4.1.6 Class of Bay

Table 17 : Accuracy of Bay





6.4.1.7 Cumulative Results

Table 18 : Cumulative Results


68




6.4.2 Ligature Level Results

It has been tried that maximum number of ligature be covered in the training data. Again three

repetitions of each ligature are taken.

Table 19 : Ligature Level Accuracy





6.5 Examples

Following figures show some results of OCR.

Figure 92 : Output 100% Accuracy

The result obtained in the above figure is 100 % correct. The .bmp file is on the right side of the

figure and the result is on the left side.

69

Figure 93 : Output with Some Error Due to Noise and Distortion

The result in the above figure is not 100% correct. The Word “��” is not recognized properly

because of the noise and distortion in the image.

Figure 94 : Output with Training Error

70

The result in the above figure is not 100% correct. The Word “��” is not recognized properly

because of the training error.

7 Discussion

The main focus of the project was to do research and make a ligature in dependent OCR capable

of recognizing Urdu Nastaleeq font using HMM technique. This main objective of our project has

been accomplished. Some letters were not recognized correctly. Following were the problems due

to which OCR was not able to recognize these letters.

7.1 Distortion in the Image

The distortion or the noise in the image can also affect the output of the recognizer.

7.1.1 Problem Description

Sometimes due to noise or poor scanning the actual shape of the letter is distorted. As discussed

above a small change in the original image will greatly affect its skeleton. In this case, the letter is

not recognized correctly.

7.1.2 Example

During scanning the “ ” is distorted. Total number of windows and DCT values differ to great

extent. This results in erroneous recognition of letters.

For example we have a word bibi in the figure 89. First bay-yeh is not recognized properly where

as in the second case it is recognized correctly.

Figure 95 :Bay-Yeh not Recognized Properly due to Training Error

The reason is that there is some distortion in the first bay-yeh which changed its skeleton greatly.

71

Figure 96 : Encircled are the Differences in the Skeleton of the Ligatures

7.1.3 Solution

We have trained OCR on 100 samples (at least) of each segment. This problem may be removed

by training the HMM on more samples of each segment and giving original segment as HMM

input instead of giving skeletonized segment. This will provide more information to HMM for the

final recognition. Moreover if the pre-processing is applied before the recognition and training

step, the results will be improved significantly.

7.2 Similarity in Shape

The similarity in the shapes of different characters can lead the recognizer to confusion.


Sometimes the shapes of different letters in different forms are so similar that they give

approximately same state transition probabilities. So when these inputs are given to HMM for

recognition, same state sequence of HMM is obtained. Erroneous output is thus obtained due to

the slight similarity in shape. This problem cannot be solved by changing number of states during

training.

7.2.2 Examples

Figure 97 : Similarity in Shapes of Swad-Bay-Dal and Swad-Dal

The shape of the letters bay and last stroke in swad are similar to each other when written in

Noori Nastaleeq font. Sometimes the OCR is not able to recognize segments like these correctly

because of the similarity in shape.

72

Figure 98 : Similarity in Shapes of Bay and Last Stokes of Ain and Swad

Another such problem can be seen in “��” and “��” shown above. The shape of bay is more

similar than different to the last stokes of swad and ain. Sometimes they are recognized correctly

and sometimes they are not.

7.2.3 Solution

We used a sliding non-overlapping window of size 8*8 to calculate DCT. This problem can be

removed by making the sliding window over-lapping. By using overlapping windows finer details

of segment shape can be also be covered. Moreover many such problems will be automatically

removed when diacritics are handled. For example sometimes swad-dal is misrecognised as

swad-bay-dal due to the similarity in shapes as shown in the figure above. When no diacritic will

be found for bay, we will come to know that the portion of swad is wrongly classified as bay.

7.3 Inconsistency in Font

Inconsistent font can be the major cause of low accuracy rate.


The font Noori Nastaleeq shows very inconsistent behavior i.e. two words, with the only

difference of diacritic placement, have different shapes. Since the font is ligature dependent and is

hand written so there is so much dissimilarity between the shapes of letters belonging to the same

class.

7.3.2 Example

Figure 99 : Dissimilarity in Shapes of Same Ligature with Different Diacritic Placement

73

Since there is so much inconsistency so it might happen that some form of the shape is missed in

data collection step and that results in an error in recognition step. So much diversity in the same

class leads the HMM to confusion.

Another such example is given below.

Figure 100 : Another example of the dissimilarity in shapes of same ligature with different

diacritic placement

7.3.3 Solution

We have trained OCR on 100 samples (at least) of each segment [see appendix A]. This problem

may be removed by training the HMM on more samples of each segment and giving original

segment as HMM input instead of giving skeletonized segment. This will provide more

information to HMM for the final recognition. Moreover if the pre-processing and post-

processing steps are applied, the results will be improved significantly.

8 Conclusion

In this chapter we have discussed the results and accuracy of OCR. The individual letter accuracy

of OCR in token recognition is 95.76% and in type recognition it is 100%. The ligature accuracy

of OCR in token recognition is 92.73% and in type recognition it is 91.43%. This accuracy will

further improve when diacritics will be handled and pre-processing is applied. The accuracy is

calculated by testing the word from the Nokia dictionary. By seeing the results, we can say that

HMM technique is good for recognizing letters. The results can be more accurate if sliding

window approach is used on the original segment without thinning and pre and post processing is

done.

74

9 Future Work and Enhancement

Many enhancements can be made in Urdu Nastaleeq OCR.

• The software can be made font independent

• It can be made font size independent

• Automatic noise removal

• Automatic skew detection and correction

• Diacritics can be handled

• Due to time limitation this software is developed for the above mentioned 6 classes only;

other letters can be catered afterwards.

• The performance can be further increased by having original stokes as HMM input

instead of thinned stokes.

• A deep analysis of Noori Nastaleeq font can be done for further usage in rules generation

of OCR.

• Post processing can be done.

The Urdu OCR system is an area, which requires much more research to make it efficient and

suitable for consumer use. This thesis is a model towards implementing a more efficient system

and still has a lot to be desired.

75

Reference

[1]. A. J. Elms and J. Illingworth, “Modeling polyfont printed characters with HMMs and a

shift invariant Hamming distance”, Proceedings of the Third International Conference on

Document Analysis and Recognition (Vol. 1), pg.504, August 14-15, 1995.

[2]. M. Mohamed and P. Gader, "Handwritten word recognition using segmentation-free

hidden Markov modeling and segmentation-based dynamic programming techniques", IEEE

Trans. Pattern Anal. Mach. Intel. Vol.18, no.5, pg.548–554, 1996.

[3]. A. Kornai, "Experimental HMM-Based Postal OCR System", Proceeding of

International Conference. Acoustics, Speech, Signal Processing, vol. 4, pg. 3,177-3,180,Munich,

Germany, 1997.

[4]. M. S. Khorsheed and W.F. Clocksin, "Structural Features Of Cursive Arabic Script", in

Proceeding of British Machine Vision Conference, pg.1285-1294, 1999.

[5]. Zhidong Lu, Issam Bazzi, Andras Kornai and John Makhoul, “A Robust, Language-

Independent OCR”, System. In the 27th AIPR Workshop: Advances in Computer Assisted

Recognition, SPIE, 1999.

[6]. Marija Bojovic & Milan D. Savic, "Training of Hidden Markov Models for Cursive

Handwritten Word Recognition", in 15th International Conference on Pattern Recognition

(ICPR'00) - Vol1 pg. 1973, September 2000.

[7]. M. Pechwitz and V. Maergner, “HMM based approach for handwritten Arabic word

recognition using the IFN/ENITdatabase”, Proceeding of 7th ICDAR 2003, pg. 890-89, 2003.

[8]. M.S Khorsheed, “Recognizing Handwritten Arabic manuscripts using a single hidden

Markov model”, Pattern Recognition Letters, Vol.24, pg.2235-2242, 2003.

[9]. E. Lecolinet, "Cursive script recognition by backward matching", Proceeding of Sixth

International Conf. Handwriting and Drawing, pg. 89-91, 1993.

[10]. Juan-Carlos Perez, Enrique Vidal and Lourdes Sanchez, "Simple and effective feature

extraction for optical character recognition", the 5th Spanish Symposium on Pattern recognition

and images analysis, 1994.

76

[11]. Badr Al-Badr and Robert M. Haralick, "Segmentation Free Word recognition with

application to Arabic", in Proceeding of International Conference on Document Analysis and

Recognition, pg.355-359, 1995.

[12]. D. Megherbi, S.M. Lodhi and J.A. Boulenouar, “A fuzzy logic-based technique for Urdu

Character Representation and Recognition”, Proceedings of SPIE -- Vol 3962, pg. 13-24, April

2000.

[13]. Sofien Touj, Najoua Essoukri Ben Amara and Hamid Amiri, “Generalized Hough

Transform for Arabic Optical Character Recognition”, Document Analysis and Recognition,

Proceedings of Seventh International Conference, Aug. 2003.

[14]. Stephen Pearce, Maher Ahmed, "An Evolutionary Algorithm for General Symbol

Segmentation", ICDAR 2003:pg 726-730,2003.

[15]. Simon Günter and Horst Bunke, "A New Combination Scheme for HMM-Based

Classifiers and its Application to Handwriting Recognition", ICPR (2) 2002:pg 332-337, 2002.

[16]. Somaya Alma’adeed, Colin Higgens, and Dave Elliman, "Recognition of Off-Line

Handwritten Arabic Words Using Hidden Markov Model Approach", 16th International

Conference on Pattern Recognition (ICPR'02) - Vol 3, pg. 30481, August 2002.

[17]. H. Bunke, M. Roth and E. C. Schukat-Talamazzini, "Off-line cursive handwriting

recognition using hidden Markov models", Poll. Recogn. 28, 9 (1995): pg. 1399-1413. 1995.

[18]. R. El-Hajj, L. Likforman-Sulem and C. Mokbel, "Arabic Handwriting Recognition

Using Baseline Dependant Features and Hidden Markov Modeling", in the 8th International

Conference on Document Analysis and Recognition, ICDAR 2005, Seoul, Korea, (2005).

[19]. C. Bose and S. Kuo, "Connected and degraded text recognition using hidden Markov

model", in Proceeding of 11th International Conference Pattern Recognition, pg 116-119,1992.

[20]. Elms, A.J., "A Connected Character Recognizer Using Level Building of HMMS", In:

Proceeding of 12th International Conference Pattern. Recognition, pg 439–442, October 1994.

77

[21]. Amlan Kundu, Yang He, Mou-Yen Chen, "Alternatives to Variable Duration HMM in

Handwriting Recognition", PAMI(20), Vol. 20, No. 11, pg. 1275-1280, 1998.

[22]. Zahra Shah and Farah Saleem, “Ligature Based Optical Character Recognition of

Urdu, Nastaleeq Font”, Multi Topic Conference, 2002. INMIC 2002. International, 2002.

[23]. Syed S. Hyder and Ali Khoujah, "Character Recognition Of cursive scripts",

Proceedings of the 1st international conference on Industrial and engineering applications of

artificial intelligence and expert systems - Vol 2, June 1998.

[24]. J. Hébert, M. Parizeau, N. Ghazzali, "Learning to segment cursive words using isolated

characters", Proceeding of the Vision Interface Conference, pg. 33-40.(1999).

[25]. M. Fahmy and S. Al Ali, “Automatic recognition of handwritten Arabic characters

using their geometrical features”, Journal of Studies in Informatics and Control with Emphasis

on Useful Applications of Advanced Technology, 2001.

[26]. Syed Afaq Husain and Syed Hassan Amin, ”A Multi-tier Holistic Approach for Urdu

Nastaliq Recognition”, IEEE INMIC Dec. 2002, Karachi.

[27]. S. Snoussi Maddouri, H. Amiri, A. Belaïd and Ch. Choisy, "Combination of Local and

Global Vision Modeling for Arabic Handwritten Words Recognition", in Eighth IWHFR, pg.

128-132. August 2002.

[28]. Alex Cherkasov, “Creating Optical Character Recognition (OCR) application using

Neural Network", IT toolbox Emerging Technologies Industry Articles, 27 July, 2005.

[29]. Myriam Côté, Eric Lecolinet, Mohamed Cheriet, Ching Y. Suen, “Building a Perception

Based Model for Reading Cursive Script”, ICDAR 1995: pg. 898-901, 1995.

[30]. Ali Ahmadi, Yoshinori Shirakawa, Md. Anwarul Abedin, Kazuhiro Takemura, Kazuhiro

Kamimur, Hans Jurgen Mattausch and Tetsushi Koide, "Real-time Character Recognition

System Using Associative Memory Based Hardware", Circuits and Systems, 2005. 48th Midwest

Symposium on, August 2005.

78

[31]. Roberto J.Rodrigues, Antonio Carlos Gay Thome, “Cursive character recognition – a

character segmentation method using projection profile-based technique”, Proceedings of SCI

2000/ISAS 2000 VOLUME V, 2000.

[32]. Ing Ren Tsang, "Pattern recognition and complex systems", PhD thesis defended at

University of Antwerpen on September 22, 2000.

[33]. U. Pal and Anirban Sarkar, “Recognition of Printed Urdu Script”, ICDAR 2003: 1183-

1187, 2003.

[34]. A. Elgammal and M.A. Ismail, “A graph-based segmentation and feature extraction

framework for Arabic text recognition”, in ICDAR'01, 2001,pg. 145-155, 2001.

[35]. Optical Character Recognition, Available at:

http://en.wikipedia.org/wiki/Optical_character_recognition, [Accessed at 17th December 2006].

[36]. Urdu Computing Information (Penn State), Available at:

http://tlt.its.psu.edu/suggestions/international/bylanguage/urdu.html#script, [Accessed in July

2006].

[37]. Wali, A. et el, “Contextual Shape Analysis of Nastaliq, CRULP Annual Student

Report (2001-2002)”, p 288-302, 2002.

[38]. Urdu: Language of Pakistan. Available at

http://www.ethnologue.com/14/show_language.asp?code=URD [Accessed in July 2006].

[39]. Noori Nastaliq, Available at http://www.elite.com.pk/noori.html, [Accessed at 7th

January 2007].

[40]. The Urdu Alphabet, Available at http://www.unics.uni-hannover.de/nhtcapri/urdu-

alphabet.html, [Accessed at 14th January 2007].

[41]. Urdu, Available at http://www.nvtc.gov/lotw/months/february/urdu.html, [Accessed in

July 2006].

[42]. Henry Rogers, Writing Systems, Blackwell publishing, 2005.

79

[43]. Aamir Wali, Atif Gulzar, Ayesha Zia, Muhammad Ahmad Ghazali,Muhammad Irfan

Rafiq, Muhammad Saqib Niaz, Sara Hussain, and Sheraz Bashir, “Features of Noori

Nastalique”, Center for Research in Urdu Language Processing National University of Computer

and Emerging Sciences.

[44]. Dr. Sarmad hussain, Shafiq-ur-Rehman, Atif Gulzar, Amir wali and Jamil,

“Orthographic Analysis of Nasta’leeq Writing Style for Urdu”, Center for Research in Urdu

Language Processing National University of Computer and Emerging Sciences, 2002.

[45]. Sarmad Hussian, www.LICT4D.asian/fonts/Urdu_nasta'le, Center for Research in Urdu

Language Processing National University of Computer and Emerging Sciences.

[46]. Skeletonization/Medial Axis Transform, Available at

http://homepages.inf.ed.ac.uk/rbf/HIPR2/skeleton.htm, [Accessed in 1st December 2006].

[47]. Zahra Shah and Farah Saleem, “Final thesis Report”, 2002

[48]. Lawrence Rabiner and Biing- Hwang Juang, “Theory and Implementation of Hidden

Markov Models” in the book, “Fundamental of Speech Recognition”, chapter 6, published in

1993.

[49]. Steve Young, Gunnar Evermann, Thomas Hain, Dan Kershaw, Gareth Moore, Julian

Odell, Dave Ollason, Dan Povey, Valtcho Valtchev and Phil Woodland, “The HTK Book”,

December 1995.

[50]. A Statistical Learning/Pattern Recognition Glossary, Available from

http://www.cs.wisc.edu/~hzhang/glossary.html. [Accessed 17th December 2006].

[51]. Sobia Tariq Javed, Ameera Maqbool, Sehrish Jameel and Samia Asloob Qureshi, “Urdu

Nastaleeq OCR”, BS final year report, 2005

80

Appendix

Appendix A

Figure 101 : Hmm00

Figure 102 : Hmm01

Figure 103 : Hmm02

Figure 104 : Hmm03

Figure 105 : Hmm04

Figure 106 : Hmm05

Figure 107 : Hmm06

Figure 108 : Hmm07

81

Figure 109 : Hmm08

Figure 110 : Hmm09

Figure 111 : Hmm010

Figure 112 : Hmm012

Figure 113 : Hmm013

Figure 114 : Hmm014

Figure 115 : Hmm015

Figure 116 : Hmm016

82

Figure 117 : Hmm017

Figure 118 : Hmm018

Figure 119 : Hmm019

Figure 120 : Hmm020

Figure 121 : Hmm021

Figure 122 : Hmm022

Figure 123 : Hmm023

Figure 124 : Hmm024

83

Figure 125 : Hmm025 Figure 126 : Hmm026

Figure 127 : Hmm027

Figure 128 : Hmm028

Figure 129 : Hmm029

Figure 130 : Hmm030

Figure 131 : Hmm031

Figure 132 : Hmm032

Figure 133 : Hmm033

Figure 134 : Hmm034

84

Figure 135 : Hmm035

Figure 136 : Hmm036

Figure 137 : Hmm037

Figure 138 : Hmm038

Figure 139 : Hmm039

Figure 140 : Hmm040

Figure 141 : Hmm041

Figure 142 : Hmm042

85

Figure 143 : Hmm043

Figure 144 : Hmm044

Figure 145 : Hmm045

Figure 146 : Hmm046

Figure 147 : Hmm047

Figure 148 : Hmm048

Figure 149 : Hmm049

Figure 150 : Hmm050

86

Figure 151 : Hmm051

Figure 152 : Hmm052

Figure 153 : Hmm053

Figure 154 : Hmm054

Figure 155 : Hmm055

Figure 156 : Hmm056

Figure 157 : Hmm057

Figure 158 : Hmm058

87

Figure 159 : Hmm059

88

Appendix B

Table 20 : HMM State Analysis

Sr. HMM Name No. of Frames No. of states No. of samples 1 h00 3 5 100

2 h01 2 4 100

3 h02 3 5 109

4 h03 3 5 136

5 h04 3 5 100

6 h05 3 4 110

7 h06 3 4 122

8 h07 10 11 108

9 h08 2 4 137

10 h09 2 4 114

11 h010 2 3 180

12 h011

13 h012 3 5 165

14 h013 2 3 109

15 h014 3 4 118

16 h015 7 8 122

17 h016 3 5 100

18 h017 10 11 152

19 h018 8 10 163

20 h019 4 5 157

21 h020 4 6 100

22 h021 2 4 136

23 h022 5 6 111

24 h023 4 5 108

25 h024 3 5 104

26 h025 5 7 169

27 h026 3 4 104

28 h027 10 12 112

29 h028 2 4 123

30 h029 3 5 100

31 h030 3 4 100

32 h031 4 6 191

33 h032

34 h033 9 10 104

35 h034 13 14 122

36 h035 14 15 104

37 h036 14 15 110

38 h037 12 13 122

39 h038 14 16 100

40 h039 14 15 100

41 h040 7 8 129

42 h041 5 7 112

43 h042 8 9 100

44 h043 7 8 107

89

45 h044 9 10 100

46 h045 6 8 100

47 h046 8 10 104

48 h047 2 4 130

49 h048 3 4 127

50 h049 3 4 126

51 h050 4 5 105

52 h051 5 6 108

53 h052 3 5 124

54 h053 4 5 116

55 h054 15 16 101

56 h055 4 5 108

57 h056 3 5 132

58 h057 4 6 100

59 h058 4 6 100

60 h059 2 4 126

INVESTIGATION INTO A SEGMENTATION BASED OCR FOR …Ms. Sobia Tariq Javed received a Bachelor of...

Documents

Transcript of INVESTIGATION INTO A SEGMENTATION BASED OCR FOR …Ms. Sobia Tariq Javed received a Bachelor of...