Sinhala OCR (Digital, Handwritten, & Palm-leaf Text) - eAsia2009 ABS 387-Sinhala OCR

Sinhala OCR (Digital, Handwritten, & Palm-leaf Text)

Nikeshala Wickramaarachchi1

[email protected]

D. L. Anoj De Silva2

[email protected]

Tashila Kannangara3

[email protected]

G. Balachandran4

[email protected]

School of Computing, Asia Pacific Institute of Information Technology

(APIIT), Sri Lanka.

1.0 Abstract

OCR (Optical Character Recognition) for Sinhala script has become an area of interest in recent years, with a number of studies conducted on the subject. This paper uses multi-font, multi-size digital text, handwritten text and palm-leaf manuscripts as three (3) case studies to address Sinhala OCR. Each of the three (3) case studies addressed its problem domain by developing a demo tool. The demo tools were implemented mainly using Artificial Neural Networks.

Key Words – Sinhala Script, Optical Character Recognition, OCR, Artificial Neural Networks,

Feature Extraction, Image Processing.

2.0 Introduction

Sinhala is an official language of Sri Lanka, used primarily by its ethnic majority, the Sinhalese. Sinhala script is principally used as the writing system for the Sinhala language. Sinhala script derives its orthography from the Brahmi script. The Brahmic scripts are a family of abugidas (writing systems) used in South Asia, Southeast Asia, Tibet, Mongolia and Manchuria. Moreover, the Sinhala writing system was influenced by the Pallava Grantha script, which was in use around the 8th to 10th centuries.

The art of converting human-readable documents into machine-readable and editable ASCII and/or Unicode files is known as Optical Character Recognition (OCR). Most modern OCR engines for scripts such as Arabic, Latin, Chinese and Korean are capable of handling multi-font and multi-size characters. These engines handle font families such as serif and sans-serif and different font sizes such as 10, 12 and 14. Research shows that there is no reported OCR engine for Sinhala script that supports multiple fonts and multiple sizes. It therefore remains a challenging problem to develop a practical OCR system for multi-font and multi-size characters contained in a single document. Case study one (1) tries to address this problem.

1 Author of the Case Study 2

Case study two (2) addresses the handwritten Sinhala script. A large number of organizations

in Sri Lanka deal with data acquired in the form of Sinhala handwriting. Handwriting is a

major source of input to most organizations where data is collected using hand filled forms

such as registration forms, tax forms, visa forms and census forms. Currently all collected

data needs to be entered into information systems manually for the purposes of processing and

storing. The manual data entry process is extremely time-consuming and error-prone. These

organizations would benefit greatly from a system that could convert handwritten Sinhala

script directly to electronic text.

The third (3) case study addresses palm-leaf manuscripts. Much of Sri Lanka's historical data, such as medicinal potions, Buddhist dharma and astrological data, is written on palm leaves. Over the past two thousand years, most valuable data has been written in palm-leaf manuscripts. Most of these palm leaves are nearing the end of their natural lifetime or are facing destruction. Some OCR systems have been created for palm-leaf manuscripts, but not for Sinhala script.

3.0 Sinhala Script

Indic languages primarily belong to two major linguistic families, Indo-Aryan and Dravidian. In Sri Lanka the majority spoken language is Sinhala, which belongs to the Indo-Aryan family. Sinhala script uses the abugida system. An abugida is a writing system in which each basic letter represents a pure consonant accompanied by a specific vowel; other vowels are indicated by modification of the consonant sign, either by means of diacritics or through a change in the form of the consonant (Daniels and Bright, 1996, p4). To indicate a consonant with a different vowel, symbols are added around the base symbol: before, after, above or below it. In some cases, modifiers are placed on both sides of the consonant. This feature makes OCR a complex task for Indic scripts.
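To make the modifier positions concrete, the following minimal sketch composes one Sinhala consonant with five dependent vowel signs from the Unicode Sinhala block. The choice of consonant and signs, and the helper name `compose`, are illustrative assumptions and are not taken from the demo tools. Note that the encoded text always stores the vowel sign after the consonant, even when the sign is drawn before it or on both sides, which is one reason an OCR system has to reorder glyphs when producing Unicode output.

```python
import unicodedata

# Base consonant and a few dependent vowel signs from the Unicode Sinhala block.
KA = "\u0d9a"  # SINHALA LETTER ALPAPRAANA KAYANNA (carries an inherent vowel)
VOWEL_SIGNS = {
    "after": "\u0dcf",       # aela-pilla, drawn to the right of the consonant
    "above": "\u0dd2",       # ketti is-pilla, drawn above the consonant
    "below": "\u0dd4",       # ketti paa-pilla, drawn below the consonant
    "before": "\u0dd9",      # kombuva, drawn to the left of the consonant
    "both sides": "\u0ddc",  # kombuva haa aela-pilla, drawn left and right
}


def compose(consonant: str, sign: str) -> str:
    """Return the encoded syllable: the sign follows the consonant in memory,
    wherever it is drawn on the page."""
    return consonant + sign


if __name__ == "__main__":
    for position, sign in VOWEL_SIGNS.items():
        syllable = compose(KA, sign)
        print(f"{position:10s} {syllable}  ({unicodedata.name(sign)})")
```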

According to Gunasekara (1891, p.3), there are two types of alphabets in Sinhala. They are

the Elu alphabet and the Mixed Sinhala alphabet. The Sinhala alphabet in use at present

differs from both the Elu alphabet and the mixed alphabet. The contemporary Sinhala

alphabet consists of a total of 60 letters. It is made up of 18 vowels, 40 pure consonants and

the Anusvaraya and Visargaya. Some researchers consider that there are 41 pure consonants in

the contemporary Sinhala alphabet (Premaratne & Bigun 2002).

4.0 Optical Character Recognition (OCR)

An OCR system consists of many stages such as preprocessing, segmenting, feature

extraction and character recognition. The objective of the preprocessing stage is to enhance

the image quality and prepare the image for further processing. The output of the OCR system is

highly influenced by the preprocessing module. Activities such as thresholding, noise

removal, skew correction and background line removal are usually conducted within the

preprocessing stage. For OCR only the foreground image is required. Extracting the

foreground from the background of an image is known as thresholding (binarization). Noise

removal is usually conducted on the image after undergoing thresholding. Noise can be

defined as any unwanted information contained in a digital image. Noise in document images

can be caused by certain attributes of the scanner, improper tuning of scanning parameters,

texture of the source and type of implement used to produce the characters.
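Noise removal on a thresholded image, as described above, can be sketched with a small median filter. The paper does not state a window size for this stage, so the 3 x 3 window, the edge handling and the function name `median_filter_3x3` are assumptions made for illustration only.

```python
import numpy as np


def median_filter_3x3(image: np.ndarray) -> np.ndarray:
    """Replace each pixel by the median of its 3 x 3 neighbourhood.

    Border pixels are handled by replicating the edge rows and columns.
    """
    padded = np.pad(image, 1, mode="edge")
    height, width = image.shape
    # Stack the nine shifted views of the image, then take the per-pixel median.
    windows = np.stack(
        [padded[r:r + height, c:c + width] for r in range(3) for c in range(3)]
    )
    return np.median(windows, axis=0).astype(image.dtype)


if __name__ == "__main__":
    page = np.zeros((9, 9), dtype=np.uint8)
    page[2:6, 2:6] = 1  # a 4 x 4 blob of ink
    page[7, 7] = 1      # an isolated speck of noise
    print(median_filter_3x3(page))
```

An isolated one-pixel speck is removed because it is outvoted by its eight background neighbours. The same filter also rounds off sharp corners of strokes, which is why the window is usually kept small.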

The goal of segmentation is to break down a set of characters into smaller entities prior to the

recognition process. To reach the final output of segmentation, the group of characters should

be segmented into lines, words and finally individual characters. Character recognition should

not be directly conducted on raw segmented characters because characters of different sizes

and the large number of input variables can cause problems for pattern recognition systems.

Feature extraction is used after segmentation to transform raw character images into a

smaller and consistent number of variables known as features. After the feature extraction

stage character recognition is conducted. The objective of character recognition is to

successfully recognize a character using the extracted character features.

5.0 Case Study 1: Multi-font and Multi-size Sinhala OCR

Research shows that the only Sinhala OCR engine currently available and reported (as of 2009), developed by UCSC, is capable of single-font, single-size character recognition, but cannot handle multi-font and multi-size Sinhala character recognition at once. In reality, however, most documents contain at least two fonts and at least two font sizes. This case study presents a practical scheme for multi-font and multi-size character recognition for Sinhala script using an Artificial Neural Network (ANN), which supports the concept that multi-font and multi-size Optical Character Recognition can be applied to Sinhala script as well.

Optical images containing multi-font and multi-size Sinhala vowel characters are taken as the input to the system. Each image goes through image preprocessing techniques such as grayscale dilation and median filtering, and is then converted to a binary image using a global threshold value. As the first step, the image goes through a grayscale dilation process to reduce unwanted color details. Noise filtering techniques should then be applied to reduce noise to a certain extent. Noise reduction techniques such as median filtering, sharpening and smoothing are therefore applied to enhance the image quality, in order to obtain a reasonable overall output from the OCR system. The grayscale, noise-reduced image is then converted to a binary image, as shown in Figures 1, 2 and 3. This binary image consists of just two pixel values (black and white). An intensity value is chosen as the threshold; pixels with values higher than the chosen intensity are marked as 1 (black pixels) and pixels with lower values are marked as 0 (white pixels). This process helps to differentiate objects from the background of an image (Gonzalez & Woods, 2002).

Figure 1: RGB image with background color

Figure 2: Grayscale image

Figure 3: After thresholding
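The grayscale conversion and global thresholding steps described above can be sketched as follows. This is a minimal illustration rather than the demo tool's code: the function names, the luminance weights (the common ITU-R BT.601 values) and the threshold of 128 are assumptions. Which side of the threshold counts as ink depends on whether the text is darker or lighter than the background on the intensity scale in use, so the comparison direction is exposed as a parameter.

```python
import numpy as np


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Collapse an H x W x 3 RGB array to an H x W luminance array."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luminance weights
    return rgb[..., :3].astype(float) @ weights


def global_threshold(gray: np.ndarray, threshold: float,
                     ink_is_dark: bool = True) -> np.ndarray:
    """Return a binary image with 1 for text (ink) pixels and 0 for background.

    A single threshold is applied to every pixel of the image.
    """
    ink = gray < threshold if ink_is_dark else gray > threshold
    return ink.astype(np.uint8)


if __name__ == "__main__":
    rgb = np.full((2, 2, 3), 255, dtype=np.uint8)  # white background
    rgb[0, 0] = (0, 0, 0)                          # one black ink pixel
    print(global_threshold(to_grayscale(rgb), threshold=128))
```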

Once the image is preprocessed, segmentation has to be carried out. Most of the time an optical image contains multiple text lines. A Horizontal Projection Profile is therefore used to segment the text lines, as shown in Figures 4 and 5. “The projection profile gives valleys of zero height for these OFF pixels between the text lines. Segmentation of the image into separate lines is done at these valley points” (Reddy & Krishnamoorthi 2008).

Figure 4: Image with multiple text lines

Figure 5: Horizontal Projection Profile of Figure 4
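Line segmentation with a horizontal projection profile can be sketched as below, assuming a binary image in which ink pixels are 1 and background pixels are 0. The helper names `find_runs` and `segment_lines` are hypothetical. The profile is the ink count of each pixel row, and each maximal run of non-zero rows between two zero-height valleys is treated as one text line.

```python
from typing import List, Tuple

import numpy as np


def find_runs(profile: np.ndarray) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for index, value in enumerate(profile):
        if value > 0 and start is None:
            start = index                  # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, index))    # a valley ends the run
            start = None
    if start is not None:                  # run reaches the last entry
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary: np.ndarray) -> List[np.ndarray]:
    """Cut a binary page image into text lines at zero-height valleys."""
    profile = binary.sum(axis=1)           # ink pixels per row
    return [binary[top:bottom, :] for top, bottom in find_runs(profile)]


if __name__ == "__main__":
    page = np.zeros((7, 5), dtype=np.uint8)
    page[1:3, 1:4] = 1   # first text line
    page[4:6, 0:5] = 1   # second text line
    for line in segment_lines(page):
        print(line.shape)
```

This simple rule relies on at least one completely empty pixel row between lines, so skewed or overlapping lines would need additional handling.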

After the text lines of an optical image have been segmented, the characters/glyphs are segmented by applying Vertical Projection Profiles to each segmented text line, as shown in Figure 6.

Figure 6
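The vertical projection step is the same idea applied to columns. The sketch below assumes a binary text-line image with ink pixels equal to 1; the function name `segment_glyphs` is hypothetical. It finds run boundaries from the rising and falling edges of the column profile.

```python
from typing import List

import numpy as np


def segment_glyphs(line: np.ndarray) -> List[np.ndarray]:
    """Split one text-line image at columns that contain no ink."""
    has_ink = (line.sum(axis=0) > 0).astype(int)  # vertical projection, per column
    # Pad with zeros so every run of ink columns has a rising and a falling edge.
    edges = np.diff(np.concatenate(([0], has_ink, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)            # exclusive end columns
    return [line[:, start:end] for start, end in zip(starts, ends)]


if __name__ == "__main__":
    line = np.zeros((3, 6), dtype=np.uint8)
    line[:, 1:3] = 1   # a glyph two columns wide
    line[1, 4] = 1     # a glyph one column wide
    print([glyph.shape for glyph in segment_glyphs(line)])
```

With this rule, a modifier separated from its base consonant by a blank column becomes its own segment, and touching glyphs stay merged, so a later stage would have to group or split such pieces.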

Each segmented character is then extracted into a bounding box with the width and height of the respective character. The isolated characters/glyphs are then resized to a fixed image size (250 x 250) so that each contains the same amount of pixel data. This normalization step addresses the main objective of this case study, which is multi-font and multi-size character recognition. A size-invariant, shape-invariant, constant-size character/glyph image would be the ideal input for making feature extraction more generalized (Shatil A.S.M. & Khan, M., 2006).
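The normalization step can be sketched as cropping each glyph to its ink bounding box and resampling it to 250 x 250 pixels. The paper does not say which interpolation the demo tool uses, so nearest-neighbour sampling and the function names below are assumptions. The glyph is stretched to fill the square, so its aspect ratio is not preserved.

```python
import numpy as np


def crop_to_ink(glyph: np.ndarray) -> np.ndarray:
    """Crop a binary glyph image (1 = ink) to the bounding box of its ink."""
    rows = np.flatnonzero(glyph.any(axis=1))
    cols = np.flatnonzero(glyph.any(axis=0))
    if rows.size == 0:
        raise ValueError("glyph image contains no ink")
    return glyph[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def resize_nearest(image: np.ndarray, size: int = 250) -> np.ndarray:
    """Resample to size x size by nearest-neighbour index mapping."""
    height, width = image.shape
    row_index = np.arange(size) * height // size
    col_index = np.arange(size) * width // size
    return image[np.ix_(row_index, col_index)]


def normalize(glyph: np.ndarray, size: int = 250) -> np.ndarray:
    return resize_nearest(crop_to_ink(glyph), size)


if __name__ == "__main__":
    small = np.array([[1, 0], [0, 1]], dtype=np.uint8)
    large = np.kron(small, np.ones((3, 3), dtype=np.uint8))  # same shape, 3x bigger
    print(np.array_equal(normalize(small), normalize(large)))
```

The usage example feeds the same pattern at two sizes through `normalize`; both come out as the same 250 x 250 image, which is the size invariance the paragraph above relies on.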

The idea behind this feature extraction method is to create an abstract image of a character/glyph from the total pixel data (250 x 250 pixels) obtained after the normalization process. A sample abstract image created by the prototype is shown in Figure 8.

Figure 8: Before sampling (250 x 250 pixels) and after sampling to 25 x 25 pixels
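One plausible way to obtain a 25 x 25 abstract image from a 250 x 250 glyph is to average each 10 x 10 block of pixels, which gives 625 features per character. The paper only says that the image is sampled, so block averaging and the function name `downsample_features` are assumptions for this sketch.

```python
import numpy as np


def downsample_features(glyph: np.ndarray, out_size: int = 25) -> np.ndarray:
    """Reduce a square glyph image to out_size x out_size block means.

    Returns a flat feature vector of length out_size * out_size, where each
    value is the fraction of ink in the corresponding block.
    """
    size = glyph.shape[0]
    if glyph.shape != (size, size) or size % out_size != 0:
        raise ValueError("expected a square image whose side is a multiple of out_size")
    block = size // out_size
    # Axes after reshape: (block row, row inside block, block column, column inside block).
    cells = glyph.reshape(out_size, block, out_size, block).astype(float)
    return cells.mean(axis=(1, 3)).ravel()


if __name__ == "__main__":
    glyph = np.zeros((250, 250), dtype=np.uint8)
    glyph[0:10, 0:10] = 1   # one block fully covered with ink
    glyph[10:20, 0:5] = 1   # one block half covered
    features = downsample_features(glyph)
    print(features.shape, features[0], features[25])
```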

Finally, a feed-forward neural network trained with the back-propagation algorithm (supervised learning) is used for training and recognition.
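A feed-forward network trained by back-propagation can be sketched in a few lines. The architecture of the demo tool is not given in the paper, so everything here is an assumption: one tanh hidden layer of 16 units, a softmax output, a learning rate of 0.05, and three synthetic 25 x 25 patterns standing in for real glyph features.

```python
import numpy as np


class FeedForwardNet:
    """One hidden layer (tanh) and a softmax output, trained by back-propagation."""

    def __init__(self, n_inputs, n_hidden, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, x):
        hidden = np.tanh(x @ self.w1 + self.b1)
        logits = hidden @ self.w2 + self.b2
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return hidden, exp / exp.sum(axis=1, keepdims=True)

    def train_step(self, x, labels, learning_rate=0.05):
        """One full-batch gradient step; returns the loss before the update."""
        n = x.shape[0]
        hidden, probs = self.forward(x)
        loss = -np.log(probs[np.arange(n), labels]).mean()
        # Backward pass: propagate the output error back through both layers.
        d_logits = probs.copy()
        d_logits[np.arange(n), labels] -= 1.0
        d_logits /= n
        d_hidden = (d_logits @ self.w2.T) * (1.0 - hidden ** 2)  # tanh derivative
        self.w2 -= learning_rate * (hidden.T @ d_logits)
        self.b2 -= learning_rate * d_logits.sum(axis=0)
        self.w1 -= learning_rate * (x.T @ d_hidden)
        self.b1 -= learning_rate * d_hidden.sum(axis=0)
        return loss

    def predict(self, x):
        return self.forward(x)[1].argmax(axis=1)


def toy_patterns():
    """Three synthetic 25 x 25 patterns, flattened to 625 features each."""
    horizontal = np.zeros((25, 25))
    horizontal[12, :] = 1.0
    vertical = np.zeros((25, 25))
    vertical[:, 12] = 1.0
    diagonal = np.eye(25)
    x = np.stack([p.ravel() for p in (horizontal, vertical, diagonal)])
    return x, np.array([0, 1, 2])


if __name__ == "__main__":
    x, y = toy_patterns()
    net = FeedForwardNet(n_inputs=625, n_hidden=16, n_classes=3)
    for _ in range(500):
        loss = net.train_step(x, y)
    print(loss, net.predict(x))
```

In a real system the inputs would be the feature vectors of labelled character images, with one output unit per character class and a held-out set for testing.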

As mentioned above, the demo tool developed to recognize multi-font and multi-size Sinhala characters/glyphs supports the concept that multi-font and multi-size OCR for Sinhala script is possible. This achievement takes Sinhala OCR technology to a new level through the use of Artificial Neural Networks.

6.0 Case Study 2: Offline Sinhala Handwriting OCR

OCR and handwriting recognition for Sinhala script have attracted a significant amount of attention in recent years. Analysis of existing research reveals that most efforts focus on a limited subset of Sinhala characters and on recognizing constrained, well-defined handwriting. The significant lack of research on unconstrained Sinhala handwritten script recognition prevents existing research and development attempts from being useful in any realistic environment.

Handwriting recognition is the task of transforming a language represented in its spatial form

of graphical marks into its digital representation (Plamondon and Srihari 2000, p.64). The

ultimate handwritten script recognition system should be able to recognize unconstrained

writing produced by any writer, deal with different writing styles and languages and remain

unaffected by the size of the vocabulary. But developing such a system remains a challenge

due to the complex nature of handwriting. Some of the factors contributing to this complexity

are writer dependency, various writing styles, similar looking characters, nature of the input

signal and vocabulary.

In this research an offline Sinhala handwriting recognition system was developed, trained and tested using handwritten names collected from National Identity Cards (NICs) of Sri Lankan citizens that contain Sinhala script. Since names of individuals contain a majority of contemporary Sinhala characters, names were chosen as the domain to test the demo tool. Since NICs contain names written in unconstrained Sinhala script and are readily available, the NIC was selected as the medium for acquiring handwritten Sinhala names.

As described in case study 1, this research also uses thresholding and noise removal techniques for image preprocessing. After preprocessing, segmentation and feature extraction methods are applied. Finally, an ANN is used as the classifier for character recognition. The major steps are shown in Figure 9.

Figure 9: Major steps of the system

The use of an ANN provided a considerable level of accuracy for handwriting recognition, but the test results suggested room for improvement. Recognizing all Sinhala characters falls into the category of large-vocabulary problems. For such problems, utilizing knowledge of the lexicon is a recommended method of increasing the performance of the system. A technique such as a Hidden Markov Model (HMM) can be integrated into the system to increase its overall accuracy. Koerich et al. (2002, p.99) and Marinai et al. (2005, p.31) have used hybrid classifiers of ANN and HMM to increase the accuracy of handwriting recognition systems.
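The idea of combining a character classifier with sequence knowledge can be illustrated with Viterbi decoding, the standard way to find the most likely state sequence in an HMM. This is a simplified sketch and not the method of the cited hybrid systems: the classifier outputs are used directly as emission scores, and the three character classes, their scores and the transition probabilities are invented for illustration.

```python
import numpy as np


def viterbi(emission_logp: np.ndarray, transition_logp: np.ndarray,
            start_logp: np.ndarray) -> list:
    """Most likely class sequence given per-position scores and transitions.

    emission_logp:   T x K log scores of each class at each character position
    transition_logp: K x K, entry [i, j] is the log probability that j follows i
    start_logp:      K log probabilities of the first class
    """
    steps, _ = emission_logp.shape
    score = start_logp + emission_logp[0]
    back = np.zeros(emission_logp.shape, dtype=int)
    for t in range(1, steps):
        candidates = score[:, None] + transition_logp  # indexed [previous, current]
        back[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + emission_logp[t]
    path = [int(score.argmax())]
    for t in range(steps - 1, 0, -1):                  # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical classifier outputs for three character positions, three classes.
    emissions = np.log(np.array([[0.90, 0.05, 0.05],
                                 [0.10, 0.40, 0.50],
                                 [0.80, 0.10, 0.10]]))
    # Hypothetical statistics of which class tends to follow which.
    transitions = np.log(np.array([[0.1, 0.8, 0.1],
                                   [0.8, 0.1, 0.1],
                                   [0.3, 0.3, 0.4]]))
    start = np.log(np.full(3, 1.0 / 3.0))
    print("per-character argmax:", emissions.argmax(axis=1).tolist())
    print("Viterbi path:        ", viterbi(emissions, transitions, start))
```

In the usage example the classifier slightly prefers class 2 at the second position, but the transition statistics make class 1 far more likely after class 0, so the decoded sequence differs from the per-character choice.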

6.0 Case Study 3: Palm leaf manuscripts OCR

Palm leaves were once a popular writing medium, especially in the Asian region. These manuscripts are created by first carving characters into the dried palm leaf using a metal stylus. Next, the contrast, legibility and visibility of the carving are improved by applying lampblack with coconut oil or another aromatic oil that has insect-repellent qualities. The lifetime of palm-leaf manuscripts is not as long as that of artificial materials. They face destruction from causes such as dampness, fungus and insects. The destruction of palm-leaf manuscripts risks the loss of a wealth of ancient knowledge contained within them. OCR systems can be used to preserve the knowledge contained in these manuscripts more efficiently than a manual system.

The domain would be vast if the selected topic addressed an area such as medicine or dharma. To avoid those difficulties, the scope is narrowed to horoscopes written on palm leaves only.

Sinhala palm-leaf horoscopes have three main parts in which characters are written: two cages containing the Zodiac sign and Nawanshakaya; a description written in Sanskrit using Sinhala script, which describes the time at which the person was born and related details; and a description in Sinhala of the person's astronomical details. The proposed system considers only the two cages and the Sinhala description. Due to the difficulty of identifying and recognizing them, compound characters, touching characters and the Sanskrit description are not addressed within the domain.

Even though the basic system is complete, many enhancements are needed before it can be used as a product. Some are listed below.

- A capability of acquiring the data from the system (integrated scanning facility).

- A fully automated image preprocessing stage.

- A capability to segment overlapped lines and overlapped characters.

- Capability of matching with the possible words in a lookup table.
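The last enhancement, matching recognized text against a lookup table of possible words, could be sketched with approximate string matching from the Python standard library. This is only one possible approach; the romanized zodiac names in the table, the cutoff of 0.6 and the function name `match_to_lexicon` are illustrative assumptions, and a real table would hold the words in Sinhala script.

```python
import difflib
from typing import List

# Illustrative lookup table (romanized for readability).
LEXICON = ["mesha", "mithuna", "sinha", "makara", "meena"]


def match_to_lexicon(recognized: str, lexicon: List[str], cutoff: float = 0.6) -> str:
    """Return the closest lexicon word, or the input unchanged if none is close enough."""
    matches = difflib.get_close_matches(recognized, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else recognized


if __name__ == "__main__":
    print(match_to_lexicon("mithvna", LEXICON))  # one misrecognized character
    print(match_to_lexicon("xyz", LEXICON))      # nothing similar in the table
```

Returning the input unchanged when no entry is close enough keeps the recognizer's raw output available, rather than forcing every word onto the table.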

7.0 Conclusion

This paper uses multi-font, multi-size digital text, handwritten text and palm-leaf manuscripts as three case studies to address Sinhala OCR for the first time. All three (3) case studies addressed their problem domains by developing demo tools. Except for the palm-leaf OCR, the systems show satisfactory results. However, improvements could be made in the preprocessing, segmentation, feature extraction and character recognition stages to improve the overall accuracy of all three systems. Recognizing all Sinhala characters falls into the category of large-vocabulary problems. For such problems, utilizing knowledge of the lexicon is a recommended method.

8.0 Acknowledgements

Our heartfelt gratitude goes out to project supervisor Mr. Balachandran Gnanasekaraiyer for

the vital encouragement and guidance he provided us at all times. We would also like to thank

our project assessors Mr. Gamindu Hemachandra and Ms. Jina R. Daluwatta for the valuable

feedback they gave us during the important stages of this project. The support given by the

academic staff, lab administrators and library staff at APIIT is deeply appreciated.

9.0 References

Daniels, P. T., Bright, W., 1996. The World's Writing Systems. 1st ed. Oxford University Press, New York.

Gonzalez, R. C., Woods, R. E., 2002. Digital Image Processing. 2nd ed. Pearson Education, Delhi.

Gunasekara, A. M., 1891. A Comprehensive Grammar of the Sinhalese Language. Asian Educational Services, New Delhi.

Koerich, A. L., Leydier, Y., Sabourin, R., Suen, C. Y., 2002. A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models. In: Eighth International Workshop on Frontiers in Handwriting Recognition, August 6-8 2002, Ontario, Canada, pp. 99-104.

Marinai, S., Gori, M., Soda, G., 2005. ‘Artificial Neural Networks for Document Analysis and Recognition’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 23-35.

Plamondon, R., Srihari, S. N., 2000. ‘On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-84.

Premaratne, H. L., Bigun, J., 2002. ‘Recognition of Printed Sinhala Characters Using Linear Symmetry’, The 5th Asian Conference on Computer Vision, 23-25 January 2002, Melbourne, Australia.

Reddy, N. V. S., Krishnamoorthi, 2008. ‘Hierarchical Recognition System for Machine Printed Kannada Characters’, IJCSNS International Journal of Computer Science and Network Security, vol. 8, no. 11, pp. 44-53.

Shatil, A. S. M., Khan, M., c.2006. Minimally Segmenting High Performance Bangla Optical Character Recognition Using Kohonen Network. Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.

Biographical Notes

G. Balachandran

Bala is a lecturer at the School of Computing at the Asia Pacific Institute of Information Technology, Sri Lanka, and is a consultant to the ICT Agency of Sri Lanka for the Tamil language. He was responsible for the standardization of Tamil encoding, collation and keyboard. Moreover, he has been working in ICT localization for the last 4 years and is a member of the Local Language Working Group (LLWG) at the ICT Agency.

D. L. Anoj De Silva, Nikeshala Wickramaarachchi, and Tashila Kannangara

Anoj, Nikeshala and Tashila are graduates of the APIIT city campus, holding the B.Sc. (Hons) in Computing specializing in Software Engineering, a degree affiliated to Staffordshire University, UK. Currently Anoj is working as an Associate Software Engineer at Virtusa Corporation, Nikeshala as an intern at Unilever (Pvt) Ltd and Tashila as an E-Marketing Executive at Archmage (Pvt) Ltd.
