[IEEE 2013 IEEE Second International Conference on Image Information Processing (ICIIP) - Shimla,...

Identification of Kashmiri Script in a Bilingual Document Image

Rumaan Bashir Department of Computer Science

Islamic University of Science and Technology [email protected]

Smk Quadri Department of Computer Science

University of Kashmir [email protected]

Abstract— Script identification is an important field in pattern recognition and document image analysis. Commendable work has been proposed and implemented to recognize various common scripts in unilingual, bilingual and multilingual contexts. So far, little work has been presented on Kashmiri script identification. In this paper, we describe and experimentally test our approach for distinguishing Kashmiri script from English script in a text document image. Two simple features are used for identification of scripts: Horizontal Profile Coefficients (Peaks) & Horizontal Profile Valleys.

Keywords—Script Identification; Recognition; Kashmiri; English; Bilingual Document Image; Multilingual; Horizontal Projection; Profile Coefficients; Valleys; Peaks.

I. INTRODUCTION

Automatic pattern recognition is a field that has evolved tremendously in the past few decades. Many areas of this field have been heavily researched, including script identification. A script is a system of written letters and symbols used to write a particular language. It comprises the text of a manuscript or a document. It is also defined as any system or style of writing. Secondarily, it is defined as any of various typefaces that imitate handwriting. Languages are written using scripts, and a script itself is made up of symbols and characters. Generally, a script follows one of the following writing systems [1]:

Figure 1. Script writing systems.

Script identification involves identifying the particular script of a language in a document image. If a document image consists of one script, the task is simpler. As the number of scripts in a document image increases, processing and identification become more complex. This is because every script possesses unique characteristics that differentiate it from other scripts; e.g. English and Hindi are differentiated by the presence of an upper bar [11, 22] (sirorekha) in the text lines of Hindi, which is not present in English. Additionally, some scripts are very similar to each other, e.g. Urdu and Arabic. This is due to the shape of the symbols used to write them and the directionality of reading/writing (right to left).

Automatic script identification finds applications in automatic archiving & processing of multilingual documents, searching repositories and multilingual OCR. In India, bilingual documents are very common. They normally contain two scripts, English and a regional script [2]. Our paper focuses on document images that contain English & Kashmiri scripts.

This paper is organized as follows:

1. Section II describes the background of Kashmiri language.

2. Section III presents a review of recent techniques used in script identification.

3. Section IV presents the distinguishing features of the Kashmiri script.

4. Section V describes the proposed method for Kashmiri script identification.

5. Section VI presents the experimental results.

6. Finally, advantages & conclusions are summarized in Section VII.

II. BACKGROUND OF KASHMIRI LANGUAGE

The Kashmiri language is spoken primarily in the Kashmir Valley, located in the Jammu & Kashmir State in India. Kashmiri, also called Koshur (کأُشر), is a language belonging to the Dardic sub-group of the Indo-Aryan languages [3, 4, 5]. Kashmiri is one of the few languages printed on Indian currency notes. In India, there are approximately 12 official scripts [2]. Kashmiri is also an official language in India.

The Kashmiri language is listed as one of the 22 Indian languages to be developed by the Government of India in the Eighth Schedule to the Indian Constitution. It needs to be developed alongside the other regional languages of J&K state. Therefore, since November 2008, the Kashmiri language has been an obligatory subject in school education in the Kashmir Valley up to the secondary (10+2) level.

The descent of Kashmiri language is shown in the following figure:

[Figure 1 content: Script Writing Systems: Featural (e.g. Korean), Logographic (e.g. Chinese), Syllabic (e.g. Japanese), Alphabetic (e.g. Arabic).]

Proceedings of the 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013)

978-1-4673-6101-9/13/$31.00 ©2013 IEEE

Figure 2. Family tree of Kashmiri language.

Four orthographical systems are used to write the Kashmiri language, the last of them only informally:

1. Sharada script.

2. Devanagari script.

3. Perso-Arabic script.

4. Roman script (sometimes informally used to write Kashmiri online).

After the 8th century A.D., the Kashmiri language was traditionally written in the Sharada script. Today, the Devanagari script and the Perso-Arabic script (in a modified form) are used to write Kashmiri [6, 7]. Other Asian languages like Persian, Urdu and Pashto, and African languages like Nubi, Javi and Juba, are also written in Arabic script. Among the languages written in Perso-Arabic script, Kashmiri is one of the few that indicate all vowel sounds [8]. The figure below depicts the characters used to write Kashmiri.

Figure 3. Arabic Script for Kashmiri. (a) Consonants (b) Vowels (c) Numerals.

The Jammu and Kashmir Government has recognized the Perso-Arabic script with extra diacritical marks (now known as the Kashmiri script) as the official script for Kashmiri. This script is now widely used in publications in the language. However, it still lacks standardization [4]. The software commonly used for writing Kashmiri in this script is Inpage. The following figure shows a sample of Kashmiri script.

Figure 4. Sample Kashmiri script.

III. LITERATURE SURVEY

A great deal of work has been carried out in the field of script identification. Various scripts have been researched, prominently including English. Script recognition generally falls into three main categories on the basis of the method of writing:

Figure 5. Different Systems of Script Identification.

In addition to the above, based on the type of processing, three main types of script identification are found [9, 10]. These are shown below:

Figure 6. Different Systems of Script Identification.

Global analysis treats a text block as a single entity, whereas local analysis uses intrinsic features such as character shapes, including loops, tick features, diacritical marks, curvatures, concavities, etc. [11]. A large body of literature is available on script identification. A detailed analysis of script recognition and its methods is presented in [1].

Some recent advances are mentioned here. Script identification for six official scripts of India using fractal-based features, component-based features and topological features is presented in [2]. [9] discusses script identification based on Steerable Pyramid Features (SPF). Heuristic-based script identification of multilingual documents using seven features, which include tick components, top holes, vertical lines, bottom holes, horizontal lines, etc., has also been proposed [11]. Arabic font family & font size recognition at ultra-low resolution using stochastic approaches is discussed in [12]. Font identification for fast retrieval of script documents using curvature features & an SVM classifier has also been presented [13]. An automatic technique for distinguishing the text lines of 12 Indian scripts, including English, Devanagari, Bangla, Tamil, Telugu, Kashmiri, Urdu, Malayalam, Oriya, Punjabi, Gujarati, & Kannada, is presented in [14]. A rotation-robust script identification method based on bi-dimensional empirical mode decomposition and Local Binary Patterns has been proposed in [15]. Script identification using a multi-stage tree classifier is presented in [16]. An approach based on texture features, using the co-occurrence histograms of wavelet-decomposed images, has also been described [17].

[18] proposed recognizing South-Indian scripts using 37 structural features, extracted and reduced using wrapper & filter selection approaches, with K-Nearest Neighbor and Support Vector Machine (SVM) classifiers. The use of the horizontal projection profile for script identification of Hindi, English and Kannada has been proposed in [19]. Recognition of Han and Roman scripts has been proposed using directional features classified by a Gaussian kernel-based SVM [20]. [21] uses the top and bottom profiles of text lines to perform script identification. [22] proposes a novel method of script identification using morphological reconstruction in document images of Kannada, Hindi, Urdu & English scripts. [23] also uses top & bottom profiles for script identification on Kannada, Hindi and English scripts. [24] proposes recognition of Urdu script in Nastaliq and Naksh styles using a combination of water reservoir, contour & topological concepts. The use of horizontal projection profiles, moments, and run-length histograms for script identification of Arabic-English documents has also been proposed [25].

[Figure 2 content, by level: Aryan; Indo-Aryan, Eranian, Dardic; Kafir, Khowar, Dard; Shina, Kohistani, Kashmiri.]

[Figure 5 content: Script Identification Systems: Printed/Machine Written Systems, Handwritten Systems, Hybrid Systems.]

[Figure 6 content: Script Identification Systems: Global Analysis Based Systems, Local Analysis Based Systems, Hybrid Systems.]

Systems have been developed that recognize scripts like English [15, 19, 22, 25, 26], Japanese [15, 20], Roman [20], Arabic [10, 12, 25, 26], Chinese [15, 26], Russian [15], Korean [15], French [15], etc. India, being a multilingual country, has as many as 1,652 languages and dialects. Among Indian languages, many researchers have worked on Kannada [19, 22], Urdu [22, 24], Bangla [26], Devanagari [19, 22, 26], etc. A major contribution on Indian scripts is [27].

So far, little work has been reported on Kashmiri language document image analysis. This work is a pioneering step towards the recognition/identification of Kashmiri script in a bilingual environment.

IV. DISTINGUISHING FEATURES OF KASHMIRI SCRIPT

The Kashmiri script is, in form and shape, similar to the Arabic, Urdu & Persian scripts. Therefore, the features of these scripts are also applicable to the Kashmiri script. Like them, Kashmiri is a highly cursive script. Its rules for word composition are distinct. The shape of a character changes depending on its placement at the start, middle or end of a word [12, 24]. The script makes heavy use of diacritical marks. These marks are often placed at a little distance from the main text words and are not connected to them. They include dash marks above & below text words, and dot marks, singly or in groups of two or three. Further, special symbols like ‘tashkeel’, ‘pesh’, ‘zabbar’, ‘zaeer’, ‘tashdeed’, ‘jazm’, etc. are also an important part of this script. Unlike English, Kashmiri is written and read from right to left [9, 12, 24]. The concept of upper case and lower case does not exist in Kashmiri [9, 12].

V. PROPOSED METHOD

The method used for recognition is a very simple statistics-based approach built on the projection profile concept. The algorithms work on the horizontal projection profiles of the scanned images. Projection profile features are widely used in pattern recognition [19, 25]. The horizontal projection profile hpp(x) of a binary image f(x, y) is the number of black pixels in each row, i.e. the black pixels projected onto the vertical axis. The equation used for the horizontal projection profile in the algorithms below is as follows:

$$\mathrm{hpp}(x) = \sum_{i=1}^{\text{no. of columns}} v(i) \qquad \text{(a)}$$

where $v(i) = 0$ for a white pixel and $v(i) = 1$ for a black pixel, taken over the pixels $i$ of row $x$.
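As a minimal sketch of equation (a), the profile can be computed by counting black pixels row by row. The function names and the fixed global threshold of 128 below are assumptions made for illustration; the paper does not state which binarization method was used.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = black, 0 = white.

    The fixed global threshold is an assumption; the paper does not say
    which binarization method was used.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def horizontal_projection_profile(binary):
    """Return hpp, where hpp[x] is the number of black pixels in row x."""
    return [sum(row) for row in binary]


# Tiny 4x5 grayscale example: dark pixels (< 128) count as text.
gray = [
    [255, 255, 255, 255, 255],
    [0, 0, 255, 0, 255],
    [0, 0, 0, 0, 0],
    [255, 30, 255, 255, 255],
]
profile = horizontal_projection_profile(binarize(gray))
print(profile)  # [0, 3, 5, 1]
```

Rows with no text give a profile value of 0, which is what makes the profile usable for separating text lines.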

First, the document image is scanned and then binarized. The horizontal projection profile is then generated. From this profile, two features are observed: profile coefficients & profile valleys. Two methods have been proposed:

A. Profile Coefficient Method: When the projection profiles are generated for both English & Kashmiri, different peak values are observed in the profile of each text line; these peak values are termed the Profile Coefficients. The two highest Profile Coefficients (maxima) of each text line are used to discriminate between Kashmiri and English scripts.

Algorithm for Knowledge Base Creation:

Step 1. Start.
Step 2. Scan the Training Data Image.
Step 3. Binarize the Document Image.
Step 4. Generate Horizontal Projection Profiles.
Step 5. Calculate the Profile Coefficient for two highest peaks in a text line.
Step 6. Populate the Knowledge Base.
Step 7. Stop.

Algorithm for Script Identification:

Step 1. Start.
Step 2. Scan the Sample Image.
Step 3. Binarize the Document Image.
Step 4. Generate Horizontal Projection Profiles.
Step 5. Calculate the Profile Coefficient for two highest peaks in a text line.
Step 6. Compare with the Knowledge Base.
Step 7. Classify the Script.
Step 8. Stop.
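The two algorithms of the Profile Coefficient method can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names, the splitting of text lines at empty rows, the definition of a peak as a local maximum, the storage of a (min, max) range per script, and the tolerance of 10 are all assumptions. The toy profiles borrow their peak values from the experimental tables, but the remaining values are invented.

```python
def split_text_lines(profile):
    """Split a page profile into per-line segments at rows with no black pixels.

    Assumption: text lines are separated by at least one empty row.
    """
    lines, current = [], []
    for value in profile:
        if value > 0:
            current.append(value)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def profile_coefficients(line_profile):
    """Return the two highest peaks (local maxima) of one text line's profile."""
    padded = [0] + list(line_profile) + [0]
    peaks = [
        padded[i]
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]
    ]
    top = sorted(peaks, reverse=True)[:2]
    if not top:            # empty line profile
        return 0, 0
    if len(top) == 1:      # a line with a single peak: reuse it
        top.append(top[0])
    return top[0], top[1]


def build_knowledge_base(training):
    """Store the (min, max) average coefficient seen for each script.

    training maps a script name to a list of per-line profiles.
    """
    knowledge_base = {}
    for script, line_profiles in training.items():
        averages = [sum(profile_coefficients(p)) / 2 for p in line_profiles]
        knowledge_base[script] = (min(averages), max(averages))
    return knowledge_base


def identify_script(line_profile, knowledge_base, tolerance=10):
    """Return the script whose stored range, widened by tolerance, holds the average."""
    average = sum(profile_coefficients(line_profile)) / 2
    for script, (low, high) in knowledge_base.items():
        if low - tolerance <= average <= high + tolerance:
            return script
    return None


# Toy training data (invented profiles, hypothetical values).
training = {
    "English": [[40, 186, 90, 175, 30], [50, 200, 100, 191, 20]],
    "Kashmiri": [[10, 75, 30, 60, 12], [15, 62, 20, 55, 8]],
}
kb = build_knowledge_base(training)
print(identify_script([35, 194, 80, 176, 25], kb))  # English
print(identify_script([12, 72, 25, 54, 9], kb))     # Kashmiri
```

A line whose average falls outside every stored range returns `None` rather than being forced into a class; whether the original system rejected such lines is not stated in the paper.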


B. Profile Valley Method: When the horizontal projection profiles are generated for both English and Kashmiri, different types of valleys are observed. For English script lines, these valleys are found to be more uniformly placed, clear, & deep, whereas for Kashmiri script lines the opposite is observed. Thus, the morphological appearance of these valleys is used to discriminate between Kashmiri and English scripts.

Algorithm for Knowledge Base Creation:

Step 1. Start.
Step 2. Scan the Training Data Image.
Step 3. Binarize the Document Image.
Step 4. Generate Horizontal Projection Profiles.
Step 5. Evaluate the morphological features of the valleys of text lines (viz. depth, clarity & uniform placement).
Step 6. Populate the Knowledge Base.
Step 7. Stop.

Algorithm for Script Identification:

Step 1. Start.
Step 2. Scan the Sample Image.
Step 3. Binarize the Document Image.
Step 4. Generate Horizontal Projection Profiles.
Step 5. Evaluate the morphological features of the valleys of text lines (viz. depth, clarity & uniform placement).
Step 6. Compare with the Knowledge Base.
Step 7. Classify the Script.
Step 8. Stop.
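The paper gives no formulas for valley depth, clarity or uniform placement, so the following is only one hypothetical way to quantify two of them. Here depth is measured relative to the highest peak of the profile, and uniform placement as the coefficient of variation of the spacing between valleys. The function names, both measures, and the decision thresholds are assumptions, and the two example profiles are invented.

```python
from statistics import mean, pstdev


def find_valleys(profile):
    """Return the centre index of each interior local minimum (flat runs count once)."""
    valleys = []
    n = len(profile)
    i = 1
    while i < n - 1:
        if profile[i] < profile[i - 1]:
            j = i
            while j + 1 < n and profile[j + 1] == profile[i]:
                j += 1
            if j + 1 < n and profile[j + 1] > profile[i]:
                valleys.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return valleys


def valley_features(profile):
    """Return mean relative depth and spacing variation of the valleys."""
    valleys = find_valleys(profile)
    peak = max(profile) if profile else 0
    if not valleys or peak == 0:
        return {"count": 0, "mean_depth": 0.0, "spacing_variation": 0.0}
    depths = [1 - profile[v] / peak for v in valleys]
    gaps = [b - a for a, b in zip(valleys, valleys[1:])]
    variation = pstdev(gaps) / mean(gaps) if len(gaps) >= 2 else 0.0
    return {
        "count": len(valleys),
        "mean_depth": mean(depths),
        "spacing_variation": variation,
    }


def classify_by_valleys(features, min_depth=0.9, max_variation=0.1):
    """Deep, evenly spaced valleys suggest English; thresholds are placeholders."""
    deep = features["mean_depth"] >= min_depth
    uniform = features["spacing_variation"] <= max_variation
    return "English" if deep and uniform else "Kashmiri"


# Invented profiles: valleys reach zero at regular intervals in the first,
# and are shallow and irregularly spaced in the second.
english_like = [8, 9, 0, 8, 9, 0, 8, 9, 0, 8, 9]
kashmiri_like = [6, 9, 4, 7, 9, 3, 5, 6, 9, 5, 8, 9, 2, 7]

english = valley_features(english_like)
kashmiri = valley_features(kashmiri_like)
print(classify_by_valleys(english))   # English
print(classify_by_valleys(kashmiri))  # Kashmiri
```

In practice the thresholds would come from the knowledge base built on training images rather than being fixed by hand, and a real profile would likely need smoothing before local minima are meaningful.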

VI. EXPERIMENTAL RESULTS

In order to test the efficacy of these algorithms, experiments were carried out in which the algorithms were fed sample bilingual document images containing text created with Inpage 2009 software. The Kashmiri script was written in the Nastaliq style.

For the Profile Coefficient method, the two highest peaks (coefficients) of the horizontal projection profile were evaluated for each script line.

TABLE I: ENGLISH SCRIPT LINES - PROFILE COEFFICIENTS

English Script Text Line Images: Profile Coefficients

Line   Peak 1   Peak 2   Average
1      186      175      180.5
2      200      191      195.5
3      186      180      183
4      194      176      185
5      205      181      193

Table I shows five English script lines with the two highest peaks in the profile (coefficients). The two highest peaks in the horizontal projection profile were observed to range between 175 and 205. Similarly, Table II shows five Kashmiri script line images with the two highest peaks in the profile (coefficients). Here the two highest peaks were observed to range between 54 and 79.

The classification is done on the basis of the values of the coefficients. If the average coefficient is in the range (180-195) ± 10, the script line is identified as English script; else, if the average coefficient is in the range (55-75) ± 10, the script line is identified as Kashmiri script.

TABLE II: KASHMIRI SCRIPT LINES - PROFILE COEFFICIENTS

Kashmiri Script Text Line Images: Profile Coefficients

Line   Peak 1   Peak 2   Average
1      75       60       67.5
2      76       71       73.5
3      62       55       58.5
4      79       55       67
5      72       54       63
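The classification rule can be checked against the peak values reported in Tables I and II. The sketch below assumes that "(180-195) ± 10" means the range widened by 10 on each side (170 to 205), and likewise 45 to 85 for Kashmiri; the function name and the "Unknown" label for averages outside both ranges are additions made for illustration.

```python
ENGLISH_RANGE = (180, 195)
KASHMIRI_RANGE = (55, 75)
TOLERANCE = 10  # assumed to widen each range on both sides


def classify_average(average):
    """Apply the stated range rule to the average of the two highest peaks."""
    if ENGLISH_RANGE[0] - TOLERANCE <= average <= ENGLISH_RANGE[1] + TOLERANCE:
        return "English"
    if KASHMIRI_RANGE[0] - TOLERANCE <= average <= KASHMIRI_RANGE[1] + TOLERANCE:
        return "Kashmiri"
    return "Unknown"  # hypothetical label; the paper does not cover this case


# (Peak 1, Peak 2) pairs as reported in Table I and Table II.
table_1 = [(186, 175), (200, 191), (186, 180), (194, 176), (205, 181)]
table_2 = [(75, 60), (76, 71), (62, 55), (79, 55), (72, 54)]

for peak_1, peak_2 in table_1 + table_2:
    average = (peak_1 + peak_2) / 2
    print(peak_1, peak_2, average, classify_average(average))
```

Under this reading, all five lines of Table I fall in the English range and all five lines of Table II fall in the Kashmiri range, with a wide gap between the two.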

These experimental results confirm the feasibility of our proposed system.

For the Profile Valley method, a detailed analysis of the valleys of the horizontal projection profile was performed, yielding promising results. As shown in the following figure, the valleys in the English script part of the document image are deep, uniformly placed and clear, while the reverse is the case for the Kashmiri script part. This distinction drives the classification in this method.

Figure 7. Experimental Results for Horizontal Profile Valleys of English and Kashmiri Script.

Overall, 500 text lines were tested using the proposed techniques and 481 text lines were identified successfully, yielding 96.2% accuracy.



VII. CONCLUSION & FUTURE SCOPE

The work presented in this paper concerns the identification of Kashmiri script in a bilingual document image. The second script is English. Very little work has been reported to date on the recognition of Kashmiri script, and this is a pioneering attempt. The results of the proposed system are promising, and the proposed methods are simple to implement. With this work, we achieved a recognition system that is able to discriminate between English and Kashmiri scripts. Two simple features, Horizontal Profile Coefficients (the two maxima peaks) and Horizontal Profile Valleys, have been used for identification.

In future research, steps will be taken to improve the global & local feature extraction & recognition. Finally, our overall objective is to improve the method of Kashmiri script identification and to apply it to multilingual script identification in multilingual document images.

REFERENCES

[1] Debashis Ghosh, Tulika Dube, & Adamane P. Shivprasad, “Script Recognition – A Review”, IEEE, Trans. On PAMI Vol. 32 No. 12 pp 2142-2161 (2010)

[2] K. Roy, S. Kundu Das, Sk Md Obaidullah, “Script Identification from Handwritten Documents”, IEEE Proc. 3rd Intl. Conf on Computer Vision, Pattern Recognition, Image Processing and Graphics, (2011)

[3] "Kashmiri language". Encyclopædia Britannica.

[4] Omkar N. Koul, "Koshur: An Introduction to Spoken Kashmiri", Kashmir News Network: Language Section, (1996), www.koshur.org/Kashmiri/introduction.html

[5] S. S. Toshkhani, "Kashmiri Language: Roots, Evolution and Affinity". Kashmiri Overseas Association, Inc. (2008)

[6] en.wikipedia.org/wiki/Kashmiri_language

[7] www.omniglot.com - Kashmiri

[8] Daniels & Bright, “The World's Writing Systems”, (1996) pp. 753–754.

[9] Mohamed Benjelil, Remy Mullot, Adel M. Alimi, “Language and Script identification based on Steerable Pyramid Features”, IEEE, Intl. Conf. on Frontiers in Handwriting Recognition, (2012)

[10] Ikram Moalla, Abdelkarim Elbaati, Adel M. Alimi, Abdelmajid Benhamadou, “Extraction of Arabic from multilingual documents”, IEEE, (2002)

[11] M Swamy Das, D. Sandhya Rani, C. R. K. Reddy, “Heuristic based Script Identification from Multilingual Text Documents”, IEEE, 1st Intl. Conf on Recent Advances in Information Technology (RAIT), (2012)

[12] Faoud Slimane, Slim Kanoun, Jean Hennebert, Adel M. Alimi, Rolf Ingold, “A study on font family and font size recognition applied to Arabic word images at ultra-low resolution”, Elsevier, Pattern Recognition Letters 34 (2013), 209-218

[13] Sukalpa Chanda, Umapada Pal & Katrin Franke, “Font Identification – In Context of Indic Script”, 21st Intl. Conf. on Pattern Recognition (ICPR-2012), Japan

[14] U. Pal & B. B. Chaudhari, “Script Line Separation From Indian Multilingual Script Documents”, Proc 5th Intl. Conf. on Document Analysis and Recognition, IEEE Comp. Society Press pp 406-409 (1999)

[15] Jianjia Pan, Yuanyan Tang, “A Rotation-Robust Script Identification Based on BEMD and LBP”, IEEE, Intl. Conf on Wavelet Analysis and Pattern Recognition, (2011)

[16] Shamita Ghosh and Bidyut B. Chaudhuri, “Composite Script Identification and Orientation Detection for Indian Text Images”, IEEE Intl. Conf. on Document Analysis and Recognition, (2011)

[17] Hiremath P. S., Shivashankar S., Jagdeesh D Pujari & V. Mouneswara, “Script Identification in a handwritten document image using texture features”, IEEE, Proc. 2nd Intl. Advance Computing Conf. (2010)

[18] Rajesh Gopakumar, N. V. Subareddy, Krishnamoorthi Makkithaya & U. Dinesh Acharya, “Zone-based Structural feature extraction for Script Identification from Indian Documents”, IEEE, 5th Intl. Conf. on Industrial & Information Systems, (2010)

[19] Prakash K. Aithal, Rajesh G., Dinesh U. Acharya, Krisnamoorthi M. Subbareddy N.V, “Text Line Script Identification for a Trilingual Document”, IEEE, 2nd Intl. Conf. on Computing, Communication and Networking Technologies, (2010)

[20] Sukalpa Chanda, Umapada Pal, Katrin Franke & Fumitaka Kimura, “Script Identification- A Han and Roman Script Perspective”, IEEE, Intl. Conf. on Pattern Recognition, (2010)

[21] M.C. Padma, P. A. Vijaya, P. Nagabushan, “Language Identification from an Indian Multilingual Document Using profile features”, IEEE, Intl. Conf on Computer and Automation Engineering, (2009)

[22] B. V. Dhandra, P. Nagabushan, Mallikarjun hangarge, Ravindra Hegadi, V.S. Malemath, “Script Identification Based on Morphological Reconstruction in Document images”, IEEE, 18th Intl. Conf. on Pattern Recognition, (2006)

[23] P. A. Vijaya & M. C. Padma, “Text Line Identification from a Multilingual Document”, IEEE, Intl. Conf. on Digital Image Processing, (2009)

[24] U. Pal & Anirban Sarkar, “Recognition of Printed Urdu Script”, IEEE, 7th Intl. Conf. on Document Analysis & Recognition, (2003)

[25] Ahmed M Elgammal & Mohamed A. Ismail, “Techniques for Language Identification for Hybrid Arabic-English Document Images”, IEEE, (2001)

[26] U. Pal & B. B. Chaudhari, “Automatic Identification of English, Chinese, Arabic, Devanagri and Bangla Script Line”, IEEE, (2001)

[27] U. Pal, S. Sinha and B. B. Chaudhari, “Multi-Script identification from Indian documents”, IEEE, Proc. 7th Intl. Conf. on Document Analysis and Recognition, (ICDAR 2003) vol 2. pp 880-884

