[IEEE 2007 International Conference on Computing: Theory and Applications (ICCTA'07) - Kolkata,...



A Fuzzy Technique for Segmentation of Handwritten Bangla Word Images

Subhadip Basu 1 *, Ram Sarkar 2, Nibaran Das 1, Mahantapas Kundu 1, Mita Nasipuri 1, Dipak Kumar Basu 1

1 Computer Sc. & Engg. Dept., Jadavpur University, Kolkata-700032, India.
2 Computer Sc. & Engg. Dept., MCKV Institute of Engineering, Liluah, Howrah-711204, India.

* Corresponding author. E-mail: [email protected]

Abstract

A fuzzy technique for segmentation of handwritten Bangla word images is presented. It works in two steps. In the first step, the black pixels constituting the Matra (i.e., the longest horizontal line joining the tops of individual characters of a Bangla word) in the target word image are identified using a fuzzy feature. In the second step, some of the black pixels on the Matra are identified as segment points (i.e., the points through which the word is to be segmented) using three fuzzy features. In experiments on a set of 210 samples of handwritten Bangla words, collected from different sources, the average success rate of the technique is 95.32%. Despite certain limitations, the technique can be considered a significant step towards the development of a full-fledged Bangla OCR system, especially for handwritten documents.

1. Introduction

For any of the widely used non-holistic OCR approaches, the success of a specific technique depends on how well a word can be segmented into pieces, which are subsequently treated as candidates for its constituent characters. The better the segmentation, the less ambiguity is encountered in recognizing candidate characters or word pieces.

Once the image of an extracted text line is skew corrected, extraction of words from the line becomes a trivial problem: the words can be identified simply by locating the valleys of the vertical pixel density histogram of the text line.
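The valley-based word extraction mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image is assumed to be a list of rows holding 1 for black pixels and 0 for background, and `split_words` and its `min_gap` parameter are hypothetical names introduced here.

```python
from typing import List, Tuple


def split_words(line_img: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start_col, end_col) spans of words in a binary text-line image.

    line_img holds 1 for black (text) pixels and 0 for background.
    min_gap is a hypothetical parameter: the minimum width of an empty
    valley in the histogram that is treated as a word separator.
    """
    width = len(line_img[0])
    # Vertical pixel density histogram: number of black pixels per column.
    hist = [sum(row[c] for row in line_img) for c in range(width)]

    spans = []
    start = None     # first column of the word currently being collected
    last_ink = None  # most recent column that contained a black pixel
    for c, count in enumerate(hist):
        if count > 0:
            if start is None:
                start = c
            elif c - last_ink - 1 >= min_gap:
                # The empty valley is wide enough: close the previous word.
                spans.append((start, last_ink))
                start = c
            last_ink = c
    if start is not None:
        spans.append((start, last_ink))
    return spans
```

Valleys narrower than `min_gap` are treated as gaps inside a word, so the choice of this value decides how aggressively a line is split.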

Segmenting extracted words into constituent characters is not so easy, especially for Bangla, an important South Asian script widely used in India and Bangladesh. In popularity, Bangla ranks fifth in the world, both as a script and as a language.

Consecutive characters appearing in overlapping column positions over a text line make Bangla word segmentation more complex than segmentation of English words. The problem is compounded for handwritten Bangla words because of variation in the sizes and shapes of handwritten characters. Considering all this, a novel technique for segmenting images of handwritten Bangla words is presented in this paper.

Word segmentation is one of the core problems of OCR of handwritten text, which has long been an active area of research. Important contributions made so far in this field include work on English text [1], [2], [3], [4], Chinese script [5] and Arabic characters [6].

Work relating to OCR of Bangla script has few references in the literature. Two such instances, one concentrating on a multistage approach based on different topological features [7] and the other focusing on recognition of isolated handwritten characters based on stroke features [8], have not addressed the problem of Bangla text segmentation.

The problem of Bangla text segmentation has been addressed in [9], which produced a complete OCR system for printed Bangla text. The input to the system consists of document images captured by a flatbed scanner. Before the characters appearing in a document image are classified, the image undergoes graphics separation, line segmentation, zone detection, and word and character segmentation in a stepwise manner. The character segmentation step, which directly affects the later stages of character recognition, is based on detection of an important feature of Bangla text called the Matra.

A Matra is a horizontal line that touches the upper part of many characters of Bangla script, as shown in Fig. 1(a). Depending on the character, it covers at most the entire character width. Consecutive characters in a Bangla word that have Matras are joined through a common Matra, formed by joining the Matras of the individual characters, as shown in Fig. 1(b). This line may be discontinuous at positions where characters without Matras appear in the word.

Although the character segmentation technique described in [9] has shown a high success rate, properly segmenting nearly 98.6% of the characters of printed text, it will not be effective for handwritten text. In handwritten Bangla words, the Matras are not as strictly horizontal as they are in printed words. So the technique of removing the Matra of a word to segment its constituent characters may leave many characters joined with each other. Such under segmentation may complicate classification decisions in the subsequent stage.

Proceedings of the International Conference on Computing: Theory and Applications (ICCTA'07), 0-7695-2770-1/07 $20.00 © 2007

References involving segmentation of handwritten Bangla words are few in the literature. Recursive contour following [10] and the water reservoir principle [11] have each been applied for this purpose in the past.

How to segment handwritten Bangla words into constituent characters efficiently is still a challenging problem in OCR research. This is a major motivation for the present work, which deals with segmenting handwritten Bangla words into constituent characters.

(a) An illustration of Matras of individual characters in a word

(b) An illustration of the common Matra of a word

(c) An illustration of the three zones and region boundaries of a word

Fig. 1(a-c). Illustration of some important features of Bangla script

Compared to Roman script, some special features of Bangla script make character segmentation complex for words appearing in pieces of Bangla text. There are some characters in Bangla script, called Modified shapes, which are not positioned in a strict left to right non-overlapping sequence with adjacent characters in a word. Such a Modified shape, while appearing in a word, may encircle an adjacent character in the word (e.g., the 1st character of the word image of Fig. 1(a)), or it may terminate with some outgrowth covering the upper position of its preceding or succeeding character (e.g., the 4th character of the word image of Fig. 1(a)), or it may appear just below another character with a touching point between the two (e.g., the character below the 6th character of the word image of Fig. 1(a)). For all of these reasons, Bangla words cannot be segmented just by observing the valleys in the histogram drawn by adding the column wise pixel densities of the word image.

Any word written in Bangla script can be partitioned horizontally into three adjacent zones, as shown in Fig. 1(c). The portion of each word on and above the Matra is identified as the ‘upper zone’. The main body of the characters in a word and the portion of the word below the main body are identified as the ‘middle zone’ and the ‘lower zone’ respectively.

2. The Present Work

In view of the enormous complexity of handwritten Bangla words, a modular approach is developed here for fuzzy feature based segmentation of the word images. The overall technique can be subdivided into the following steps. Zone analysis determines the ‘upper’, ‘middle’ and ‘lower’ zones in a word image. The Matra identification module identifies the probable black pixels that constitute the common Matra of the word at the top of the middle zone, using a fuzzy feature of the black pixels therein. Finally, considering three fuzzy features for each of the black pixels lying on the Matra, some of these pixels are identified as potential segmentation points. The black pixels identified as segmentation points on the Matra are used to segment the associated word into isolated components, which may then be identified through connected component labelling.

2.1. Determination of ‘Zones’ in a Word Image

The three adjacent zones of a word image, mentioned before, need to be identified before the word is segmented into constituent characters. More specifically, the top row of the upper zone (R1), the top row of the middle zone (R2), the mid line of the middle zone (R3), the bottom row of the middle zone (R4) and the bottom row of the lower zone (R5) are to be identified from the word image first under the present technique.

A horizontal pixel scan of the word image from top to bottom identifies the first row with at least one black pixel as the top row (R1) of the upper zone. Similarly, a horizontal scan from bottom to top identifies the first row with at least one black pixel as the bottom row (R5) of the lower zone. Identifying the top and bottom row boundaries of the middle zone is a more difficult task in handwritten Bangla words.
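The two scans for R1 and R5 can be sketched in a few lines. The representation (a list of rows with 1 for black and 0 for background) and the function name `find_r1_r5` are assumptions made for this illustration.

```python
from typing import List, Tuple


def find_r1_r5(img: List[List[int]]) -> Tuple[int, int]:
    """Return (R1, R5): the first and last rows that contain a black pixel.

    img holds 1 for black pixels and 0 for background (an assumed encoding).
    """
    inked = [r for r, row in enumerate(img) if any(row)]
    if not inked:
        raise ValueError("image contains no black pixels")
    # Top-down scan gives R1, bottom-up scan gives R5.
    return inked[0], inked[-1]
```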

After finding the topmost and bottommost rows (R1 and R5 respectively) of a word image, the next task is to find the boundary between the upper and middle zones (i.e., R2). The boundary lies along the common Matra of the associated word. Since the Matra in a handwritten word cannot be strictly horizontal, identification of R2 is difficult. To overcome this problem, the horizontalness of the consecutive black pixels in a word image is computed as follows.

Fig. 2. Estimation of ‘R2’ based on horizontal histogram profiles. (a) The original word image. (b) Horizontal pixel density histogram of the word image. (c) Horizontal longest-run histogram of the word image.

For each black pixel in each row of the word image, the length of the longest run of black pixels in the horizontal direction is computed. Then the sum of the lengths of the longest runs of all black pixels in each row is computed and plotted as shown in Fig. 2. The row with the highest sum is the row with maximum horizontalness of consecutive black pixels. This row is taken as the upper boundary of the middle zone (R2).
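A minimal sketch of this estimate of R2 follows. It rests on one reading of the text, stated here as an assumption: each black pixel is credited with the length of the horizontal run that contains it, so a run of length k contributes k × k to its row. The names `run_lengths` and `find_r2` are hypothetical.

```python
from typing import List


def run_lengths(row: List[int]) -> List[int]:
    """Per pixel, the length of the horizontal black run containing it (0 for background)."""
    out = [0] * len(row)
    c = 0
    while c < len(row):
        if row[c]:
            end = c
            while end < len(row) and row[end]:
                end += 1
            for k in range(c, end):
                out[k] = end - c
            c = end
        else:
            c += 1
    return out


def find_r2(img: List[List[int]]) -> int:
    """Return the row whose summed run lengths are largest (ties go to the upper row)."""
    sums = [sum(run_lengths(row)) for row in img]
    return max(range(len(img)), key=sums.__getitem__)
```

Under this reading, long unbroken strokes dominate the sum far more than they would in a plain pixel density histogram, which is why a row on the Matra tends to stand out even when other rows contain a similar number of black pixels.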

To identify the lower boundary of the middle zone (i.e., R4), the number of transitions from text to background pixels and vice versa is computed along each row, from the bottom of the lower zone (i.e., R5) up to the top of the middle zone (i.e., R2). The average of these row wise sums, denoted n_ave, is then computed over all rows of the middle and lower zones. Finally, the first row from the bottommost line (i.e., R5) whose transition sum exceeds n_ave is identified as the lower boundary of the middle zone (i.e., R4). Fig. 3 illustrates the identification of R4 in a word image. Once R2 and R4 are determined, R3 is determined by locating the middle row of the region (R2 - R4).

Fig. 3. Estimation of the lower boundary of the middle zone. (a) The original word image. (b) Horizontal pixel density histogram of the word image. (c) Horizontal text/background transition histogram of the word image.
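The transition-count rule for R4 and the midpoint rule for R3 can be sketched as below. The function names are hypothetical, and the fallback to R5 when no row exceeds the average is an assumption of this sketch; the text does not say what happens in that case.

```python
from typing import List, Tuple


def transitions(row: List[int]) -> int:
    """Count text/background changes between horizontally adjacent pixels."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


def find_r4_r3(img: List[List[int]], r2: int, r5: int) -> Tuple[int, int]:
    """Return (R4, R3) given the top of the middle zone (r2) and the bottom row (r5)."""
    counts = {r: transitions(img[r]) for r in range(r2, r5 + 1)}
    n_ave = sum(counts.values()) / len(counts)

    r4 = r5  # assumed fallback: no lower zone if no row exceeds the average
    for r in range(r5, r2 - 1, -1):  # scan upward from the bottommost row
        if counts[r] > n_ave:
            r4 = r
            break
    r3 = (r2 + r4) // 2  # middle row of the region between R2 and R4
    return r4, r3
```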

2.2. Identification of the Matra from a Word Image: A Fuzzy Approach

The common Matra of a word image may be identified as the continuous horizontal stripe of black pixels appearing at the top of most of the characters and Modified shapes in the word. The horizontalness property of the Matra may be extracted from the row wise sum of the longest run value of each pixel, as discussed in the previous section. The row that represents the upper boundary of the middle zone also represents the central row (R2) of the Matra. Accurate determination of the common Matra is a key step in the segmentation technique presented here. For this, all the black pixels constituting the common Matra need to be precisely identified. Such black pixels may belong to either of two zones, viz. the upper zone and nearly half of the middle zone adjacent to the Matra. Determining their belongingness to the common Matra on the basis of a distance threshold results in a crisp segmentation decision. It is more practical to view this belongingness as a fuzzy relation. The greater the distance of a black pixel from the Matra, the lower its degree of belongingness (µ) to the Matra. On this basis, a fuzzy membership function is designed as shown in Fig. 4. All the black pixels on R2 are taken to be part of the common headline (Matra) of the word image.

Fig. 4. The membership function for the set of black pixels constituting the Matra
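The selection of Matra pixels described in this section, in which a pixel's membership value is weighted by its horizontal run length and compared with the average of those products (η), can be sketched as follows. Since Fig. 4 is not reproduced here, the membership function is assumed to fall linearly with the row distance from R2. The `d_max` normalisation, the use of `>=` in the comparison, and the function names are likewise assumptions of this sketch.

```python
from typing import List, Set, Tuple


def run_lengths(row: List[int]) -> List[int]:
    """Per pixel, the length of the horizontal black run containing it (0 for background)."""
    out = [0] * len(row)
    c = 0
    while c < len(row):
        if row[c]:
            end = c
            while end < len(row) and row[end]:
                end += 1
            for k in range(c, end):
                out[k] = end - c
            c = end
        else:
            c += 1
    return out


def matra_pixels(img: List[List[int]], r1: int, r2: int, r3: int) -> Tuple[Set[Tuple[int, int]], float]:
    """Return (set of (row, col) Matra pixels, threshold eta) for rows R1..R3."""
    # Assumed normalisation: membership reaches 0 just outside the search region.
    d_max = max(r2 - r1, r3 - r2) + 1
    scores = {}
    for r in range(r1, r3 + 1):
        mu = max(0.0, 1.0 - abs(r - r2) / d_max)  # assumed linear membership
        for c, run in enumerate(run_lengths(img[r])):
            if run:
                scores[(r, c)] = run * mu  # run length times membership
    if not scores:
        return set(), 0.0
    eta = sum(scores.values()) / len(scores)  # average product used as threshold
    return {p for p, s in scores.items() if s >= eta}, eta
```

Because the threshold is derived from the word itself, a word written with a short or faint Matra gets a lower η than one with a long, bold Matra, which a fixed distance cut-off would not allow.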

The black pixels lying on a Matra are characterized not only by their respective membership values but also by the lengths of the horizontal lines on which they lie. The latter can be computed as the horizontal run length at each black pixel. The greater the horizontal run length of a black pixel from the approximate headline region, the higher the chance that the black pixel belongs to the Matra. To produce a gross measure of the belongingness of each black pixel to the Matra, the product of its horizontal run length and its membership value is computed. Finally, the average of the products of all such black pixels from the approximate headline region is taken as a threshold (η) to decide which of the black pixels from the upper zone and nearly half of the middle zone belong to the Matra. Such a threshold is more rational than the strict distance based threshold mentioned before.

2.3. Segmentation of the Matra using fuzzy features

Once the black pixels constituting the Matra of a word are identified, the next task is to identify certain column positions on the Matra at which the word can be vertically segmented into constituent characters. Such column positions are called terminal points of segments. One of the prominent features for identifying terminal points of segments is the number of black pixels along each vertical column position on the Matra. The smaller the number of black pixels along a vertical column position on the Matra, the higher its degree of belongingness (µ1) to the set of segment terminal points. On this basis a fuzzy membership function (µ1) is designed as shown in Fig. 5. It is considered here as one of the major features (F1) for identifying segment terminal points along the Matra of a word.

Fig. 5. Fuzzy membership functions µ1, µ2 and µ3
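Feature F1 can be sketched as a function of the Matra thickness in each column. The exact shape of µ1 in Fig. 5 is not recoverable from the text, so a linearly decreasing form is assumed here. The function name `f1_feature` and the normalisation by the thickest column are hypothetical choices.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def f1_feature(matra: Set[Tuple[int, int]]) -> Dict[int, float]:
    """Map each Matra column to an assumed membership value mu1.

    matra is a set of (row, col) positions of Matra pixels.
    Thinner columns receive values closer to 1.
    """
    if not matra:
        return {}
    thickness = Counter(c for _, c in matra)  # black Matra pixels per column
    n_max = max(thickness.values())
    # Assumed linear decrease: 1.0 for a single pixel, lowest for the thickest column.
    return {c: 1.0 - (n - 1) / n_max for c, n in thickness.items()}
```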

Since certain characters of Bangla script do not always appear in a non-overlapping fashion at contiguous character positions, the above feature alone cannot identify segment terminal points on the Matra. Consider three points P1, P2 and P3 on the Matra of the word shown in Fig. 6. On the basis of feature F1, computed separately at these three points, all three are more or less equally suitable as segment terminal points. The vertical distance of the first black pixel in the region (R2 - R3) from each of these three points can help to determine P2 as the segment terminal point among them. Here again, the greater the distance, the lower the degree of belongingness (µ2) of the associated point to the set of segment terminal points. A fuzzy membership function (µ2), as shown in Fig. 5, is drawn. It is used to generate a second feature (F2) for identifying segment terminal points on the Matra of a word under the present work.

Fig. 6. An illustration for feature F2

Because of the shapes of certain characters of Bangla script, the above two features, viz., F1 and F2, may sometimes lead to over segmentation, as illustrated in Fig. 7. To prevent this, another feature (F3), similar to F2, is considered here by extending the region (R2 - R3), previously considered for computing F2, to (R2 - R4). The necessary membership function (µ3) for this is also shown in Fig. 5.
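Features F2 and F3 differ only in how far below the Matra the search extends, so a single routine with a row limit can illustrate both. The sketch below makes several assumptions that the text does not settle: the scan runs downward from the lowest Matra pixel in each column, a column with no black pixel in the region gets the maximum distance, and the membership decreases linearly with distance, following the text's statement that a larger distance means lower belongingness. The name `distance_feature` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def distance_feature(img: List[List[int]], matra: Set[Tuple[int, int]],
                     limit_row: int) -> Dict[int, float]:
    """Map each Matra column to an assumed membership based on the distance
    to the first black pixel below the Matra, searching down to limit_row.

    Use limit_row = R3 for F2 and limit_row = R4 for F3.
    """
    bottoms: Dict[int, int] = {}
    for r, c in matra:
        bottoms[c] = max(bottoms.get(c, r), r)  # lowest Matra pixel per column

    feats = {}
    for c, bottom in bottoms.items():
        d_max = max(1, limit_row - bottom + 1)
        d = d_max  # assumed value when the column is empty in the region
        for r in range(bottom + 1, limit_row + 1):
            if img[r][c]:
                d = r - bottom
                break
        feats[c] = 1.0 - (d - 1) / d_max  # assumed linear decrease with distance
    return feats
```

Calling the routine twice, once with R3 and once with R4, gives the two feature maps. A column whose first pixel below the Matra lies between R3 and R4 is indistinguishable from an empty column under F2 but not under F3.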

To determine finally whether a black pixel on the Matra can be considered a segment terminal point, the average of its three feature values is compared with a predetermined threshold; the pixels whose average exceeds the threshold are considered segment terminal points. The threshold is fixed by taking the average of all three feature values over all the black pixel positions on the Matra of the word.
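The final decision rule can be sketched as follows, assuming the three features are already available as per-column maps. The function name `segment_columns` and the dictionary representation are hypothetical.

```python
from typing import Dict, List, Tuple


def segment_columns(f1: Dict[int, float], f2: Dict[int, float],
                    f3: Dict[int, float]) -> Tuple[List[int], float]:
    """Return (columns chosen as segment terminal points, threshold)."""
    cols = sorted(set(f1) & set(f2) & set(f3))
    if not cols:
        return [], 0.0
    # Average of the three feature values at each Matra column.
    avg = {c: (f1[c] + f2[c] + f3[c]) / 3.0 for c in cols}
    # Threshold: the mean of those averages over the whole Matra.
    threshold = sum(avg.values()) / len(avg)
    return [c for c in cols if avg[c] > threshold], threshold
```

Adjacent selected columns would normally be grouped into one cut position. That grouping step is not described in the text and is left out of this sketch.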

Fig. 7. Over segmentation is prevented at point P by including an additional feature F3 = µ3(d)

3. Results and Discussions

To evaluate the performance of the technique described here for segmentation of word images, various samples of handwritten Bangla text documents were collected from different persons. A total of 210 word images were selected from different documents to cover a variety of writing styles. The documents were digitized using a flatbed scanner at a resolution of 300 dpi. The digitized documents so prepared were finally binarized through simple thresholding.

Fig. 8. Some sample word images, properly segmented by the present technique

Some sample images of Bangla words that are properly segmented by the present technique are shown in Fig. 8. Some on which the technique fails at certain points are shown in Fig. 9. In Figs. 8 and 9, the pixel positions identified as potential segment points on the Matra are shown with light shading, whereas the other pixel positions on the same Matra are shown with darker shading. To evaluate the segmentation performance of the present technique, the following expression is used.

Success rate = (Ct / (Ct + Co + Cu)) × 100

where Ct = the number of segment terminal points producing true segmentation, Co = the number of segment terminal points producing over segmentation, and Cu = the number of segment terminal points producing under segmentation.

Whether a segment terminal point identified by the present technique produces true segmentation, under segmentation or over segmentation is determined here through visual observation. On this basis, the success rate of the present technique is computed to be 95.32% on the 210 word images. This is more or less satisfactory considering the complexity of Bangla script, which is further compounded by handwritten characters of varying shapes and sizes.
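The success rate expression translates directly into code. The counts in the usage example are hypothetical and chosen only to show the arithmetic; the paper does not report the individual values of Ct, Co and Cu.

```python
def success_rate(c_t: int, c_o: int, c_u: int) -> float:
    """Percentage of segment terminal points that produce true segmentation.

    c_t, c_o, c_u: counts of points giving true, over and under segmentation.
    """
    total = c_t + c_o + c_u
    if total == 0:
        raise ValueError("no segment terminal points were counted")
    return 100.0 * c_t / total


# Hypothetical counts, for illustration only: 90 true, 6 over, 4 under.
print(success_rate(90, 6, 4))  # 90.0
```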

It can be observed from Fig. 9 that wrongly identified segment points on the Matra may result in under segmentation or over segmentation of the constituent characters of the word. Bumpy Matras in particular make words prone to under segmentation when the technique is applied. The technique also fails to prevent over segmentation at points on the Matra where more than one component of a constituent character is attached. Further work is needed to overcome these limitations of the technique.

Fig. 9. Some sample word images with encircled portions showing where the technique fails

An attempt may be made in future to prevent the failure cases observed in this experimentation by intelligent recombination of over segmented character components appearing in a word. For this, the class memberships of the segmented character components need to be identified. A fuzzy rule based classification system may be employed for this. Over segmented character parts, if so identified, may be recombined to form true segments in some cases.

To prevent under segmentation, the membership function of Matra pixels may be made asymptotic along the distance axis with a bell shaped function.

Touching characters appearing in word segments are one of the major causes of under segmentation of word images. How touching characters in segmented word components can be located is another interesting direction of future work on segmentation of handwritten words. The work described in [12] can provide some hints in this regard.

The present technique is capable of dealing with 50 basic characters, i.e., the vowels and consonants of Bangla script, and 10 Modified shapes, representing some additional characters of Bangla script. More research effort is needed to extend this technique to deal with 250 complex shaped compound characters of Bangla script.

Apart from all these, the fuzzy word segmentation technique presented here for segmentation of handwritten Bangla words can be considered an important step towards the realization of a full-fledged Bangla OCR system.

Acknowledgements

The authors are thankful to the “Center for Microprocessor Application for Training Education and Research” and the “Project on Storage Retrieval and Understanding of Video for Multimedia” of the Computer Science & Engineering Department, Jadavpur University, for providing infrastructural facilities during the progress of the work.

References

[1] R.G. Casey et al., “A Survey of Methods and Strategies in Character Segmentation”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, pp. 690-706, 1996.

[2] R.M. Bozinovic et al., “Off-line Cursive Script Word Recognition”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 68-83, 1989.

[3] J.T. Favata, “Offline General Handwritten Word Recognition Using an Approximate BEAM Matching Algorithm”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 1009-1021, 2001.

[4] A.W. Senior et al., “An Off-line Cursive Handwriting Recognition System”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, pp. 309-321, 1998.

[5] P.K. Wong et al., “Off-line Handwritten Chinese Character Recognition as a Compound Bayes Decision Problem”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, pp. 1016-1023, 1998.

[6] A. Amin, “Off-Line Arabic Character Recognition: The State of the Art”, Pattern Recognition, vol. 31, no. 5, pp. 517-530, 1998.

[7] A. F. R. Rahman, R. Rahman, M.C. Fairhurst, “Recognition of Handwritten Bengali Characters: a Novel Multistage Approach”, Pattern Recognition, vol. 35, pp. 997-1006, 2002.

[8] T. K. Bhowmik, U. Bhattacharya and S. K. Parui, “Recognition of Bangla Handwritten Characters Using an MLP Classifier Based on Stroke Features”, in Proc. ICONIP, Kolkata, India, pp. 814-819, 2004.

[9] B. B. Chaudhuri and U. Pal, “A Complete Printed Bangla OCR System”, Pattern Recognition, vol. 31, no. 5, pp. 531-549, 1998.

[10] A. Bishnu, B. B. Chaudhuri, “Segmentation of Bangla Handwritten Text into Characters by Recursive Contour Following”, in Proc. 5th ICDAR, pp. 402-405, 1999.

[11] U. Pal, S. Datta, “Segmentation of Bangla Unconstrained Handwritten Text”, in Proc. 7th ICDAR, pp. 1128-1132, 2003.

[12] U. Garain, B. B. Chaudhuri, “Segmentation of touching characters in printed Devnagri and Bangla scripts using fuzzy multifactorial analysis”, IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 32, pp. 449-459, 2002.
