
Handbook of Statistics, Vol. 31, ISSN: 0169-7161, Chapter 18. Copyright © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/B978-0-444-53859-8.00018-7

Machine Learning in Handwritten Arabic Text Recognition

Utkarsh Porwal, Zhixin Shi, and Srirangaraj Setlur
University at Buffalo, The State University of New York, USA

Abstract

Automated recognition of handwritten text is one of the most interesting applications of machine learning. This chapter poses handwritten Arabic text recognition as a learning problem and provides an overview of the ML techniques that have been used to address this challenging task. The use of co-training for solving the problem of paucity of labeled training data and structural learning approaches to capture contextual information for feature enhancement are presented. A system for recognition of Arabic PAWs using an ensemble of biased learners in a hierarchical framework and the use of techniques such as Artificial Neural Networks, Deep Belief Networks, and Hidden Markov models within this hierarchical framework to improve text recognition are described. For completeness, the chapter also describes some of the features that have been successfully used for handwritten Arabic text classification, since the selection of discriminative features is critical to the success of the classification task.

Keywords: OCR, handwriting recognition, Arabic script, learning algorithms, features for text recognition, co-training, structural learning, HMM, ensemble of learners, neural networks, deep belief networks

1. Introduction

The field of automated handwriting recognition has achieved significant real world success in targeted applications such as address recognition on mail-pieces for sorting automation and reading of courtesy and legal amounts on bank checks, by using domain-dependent constraints to make the problem tractable. However, unconstrained handwritten text recognition is still an open problem which is attracting renewed interest as an active area of research due to the proliferation of smartphones and tablet devices, where handwriting with finger or stylus is likely to be a potentially convenient mode of input for these handheld devices. Much of the past research on handwritten text recognition has been focused on Latin and CJK (Chinese, Japanese, Korean) scripts while scripts such as Arabic and Devanagari are now seeing increased interest. In this chapter, we discuss machine learning approaches in the context of handwritten Arabic text recognition.

OCR of handwritten Arabic script, in particular, has proven to be a very challenging problem. This chapter poses Arabic text recognition as a learning problem and investigates the fundamental challenges for learning algorithms. Techniques to overcome such challenges are covered in detail along with the state-of-the-art methods proposed for handwritten Arabic text recognition. Handwritten text recognition is a hard problem due to several additional challenges such as variations within the writing of a single writer as well as between writers, and noise in the data. Systems developed for recognition of Latin script have not been entirely successful in the recognition of Arabic text due to script-specific challenges such as the highly cursive nature of Arabic script, heavy use of dots and other diacritic marks, and context dependent variations in the shape of some characters. This is partly due to the fact that features that are discriminative for Latin text are not necessarily discriminative for Arabic text. While the main focus of the chapter is learning algorithms, we also briefly describe features that have worked well for Arabic text recognition (Section 4). Language models have also been successfully used to augment recognition performance but have not been addressed in this chapter.

2. Arabic script—challenges for recognition

The Arabic language is widely spoken in many parts of the world and the Arabic script is used for representing Arabic and related languages such as Persian, Urdu, Pashto, Sindhi, and Kurdish, as well as many African languages. The Arabic script is cursive, even in its printed form, and hence the same character can be written in different forms depending on how it connects to its neighbors (ligatures). The Arabic script is written from right to left and the alphabet consists of 28 letters, each of which has between two and four context or position dependent shapes within a word or sub word. The shapes correspond to the letter occurring at the beginning, middle, or end of the (sub-)word, or in isolation. Figure 1 shows the list of all Arabic characters with the corresponding forms that they can take based on their position within the text.

However, a few letters do not have initial or medial forms and cannot be connected to the following letters. When these letters occur in the middle of a word, the word is divided into sub words, often known as parts of Arabic words or PAWs. Figure 2 shows Arabic words with a combination of sub words. Each sub word is usually a single connected component with small unconnected diacritical marks.

Additionally, other diacritics are used to represent short vowels, syllable endings, nunation, and doubled consonants (Lorigo and Govindaraju, 2006). Such diacritics (hamza and shadda) are shown circled in Fig. 3.

Arabic script exhibits a strong baseline along which it is written from right to left. Some letters have ascenders and descenders. Usually characters are joined along this baseline.


Fig. 1. Arabic alphabet.

Fig. 2. Arabic words with one, two, three, and four sub words.

Fig. 3. Handwritten words with diacritics—Hamza and Shadda.

Fig. 4. Handwritten Arabic words with ascenders and descenders.


Fig. 5. Sample handwritten Arabic documents.

Figure 4 shows Arabic words with ascenders and descenders along with the baseline.

Many of these characteristics of Arabic text make recognition a difficult task. Long strokes parallel to the baseline and the ligatures that also run along the baseline make it difficult to segment Arabic words into their component characters. As a result, techniques developed for Latin scripts often do not translate well to Arabic script. Figure 5 shows two sample handwritten Arabic documents that illustrate the characteristic features of Arabic script. Over the past two decades many approaches have been tried to recognize Arabic text. Simple rule based methods such as template matching do not work well with handwritten documents given the variability even within a single writer's handwriting. Writers have different writing styles and a few templates cannot capture all variations. In the last decade, the focus has been on trying to develop principled machine learning techniques to solve the problem of handwriting recognition. A typical approach is to formulate the text recognition task as a supervised learning problem in which a classifier is trained to distinguish between a list of classes. Data samples that are labeled (their classes are known in advance) are provided to the learning algorithm to train a classifier, and the trained classifier is then used to recognize the class or label of a test data sample. However, the performance of the classifier will depend on several factors, the key ones being (1) the amount of training data provided, (2) the similarity of the test data to the training data, (3) the inductive bias of the classifier (assumptions made by the classifier to discriminate between member classes), and (4) the quality and amount of information provided to the learning algorithm for learning (feature extraction and availability). Of these four factors, (2) is more of a requirement than a constraint. No model can succeed if it is tested on data that is very different from the data it was trained on. Therefore, it is required in machine learning problems that training and test data come from the same data distribution. However, some relaxation of these requirements is addressed in transfer learning (Pan and Yang, 2010).


One of the primary challenges in Arabic handwriting recognition is obtaining labeled data because the process of annotating the data is tedious and expensive. Annotating data requires human intervention which makes the whole process of labeling very slow. It is often difficult to collect sufficient data pertaining to each class (word or character) so that a classifier can be learned. Obtaining labeled data is a bottleneck for handwriting recognition. However, unlabeled data is easy to obtain as documents can be scanned at minimal cost. Therefore, it is prudent to investigate if unlabeled data can be used to improve the performance of the learning algorithm. This setting where unlabeled data is used along with labeled data for learning is referred to as semi-supervised learning in the literature, and some learning paradigms that are relevant to the recognition task are introduced in Section 3.

Another challenge is feature selection; good discriminative features are crucial for optimal performance of the learning algorithm. However, it is often difficult to capture all the information present in the data in terms of features, as some of the information is contextual or domain specific. Capturing such information is therefore a challenge, and the performance of the learning algorithm can be undermined if all the information present in the training data cannot be leveraged. Some of the features that have been effective in the recognition of handwritten Arabic documents are described in Section 4.

A third important factor is the selection of the model, as it plays an important role in the performance of the recognition or prediction system. Some learners are better suited for certain types of data, although this does not necessarily guarantee better performance. For instance, learners like HMM (Rabiner, 1990) and DTW (Puurula and Compernolle, 2010) are a natural fit for capturing temporal information (1D data), such as in the case of speech recognition, because of their assumptions and formulations. Likewise, techniques such as MRF (Li, 1995) and CRF (Wang and Ji, 2005) are ideally suited to capture spatial information (2D data) in images or videos. However, sometimes no single classifier may be good enough for the task and the selection of algorithms becomes non-trivial, as it is difficult to make any a priori assumptions about the structure of the data. In such instances, the selection of a model that will work with all constraints such as limited labeled data or missing information becomes a challenge. In this chapter, we seek to investigate techniques to address all of these issues, and specific models that have worked well in the domain of handwritten text recognition are described in Section 5.

3. Learning paradigms

A learning problem can be formally defined as finding the mapping function between the input vector x ∈ X and its output label y ∈ Y. In the training phase, a finite number of data points (x_i, y_i) are provided under the assumption that they are all drawn from some unknown probability distribution Δ in domain D. The function that takes an input vector x and produces the output label y is called the target concept. In the process of locating this target concept, a learner will output the hypothesis which is consistent with most of the training samples while minimizing the error. Therefore, formally a learner will output

$$h^* = \arg\min_{h \in H} \{\mathrm{err}(h)\}. \qquad (1)$$


Fig. 6. Selected hypothesis and actual target concept.

Any learner will explore the hypotheses space and output a hypothesis; therefore, selection of the most suitable learning algorithm is crucial in this regard. Consider the case shown in Fig. 6 where the learning algorithm selected is linear and the actual target concept is some nonlinear function. In this scenario, the learning algorithm can never locate the target concept regardless of the number of training samples provided to explore the hypotheses space, because the generated hypothesis is linear and the target concept is nonlinear. Therefore, selection of the right hypotheses space (learning algorithm) is very important for any learner to succeed. Domain knowledge and context-based assumptions can help in the selection of the right model and in optimizing the performance of the algorithm. Error due to a poor choice of hypotheses space is called approximation error.

It is also possible that no target concept exists that can correctly generate labels for all the training data samples. This might be due to inherent noise in the data. One data point can also have multiple labels due to noise, and therefore no function will be able to generate those labels. Even assuming that a target concept does exist, it can still be difficult to locate it for several reasons. The learner explores the hypotheses space by minimizing the error over a number of data samples. Therefore, more data samples should help in better exploration of the hypotheses space. However, it is often difficult to get enough data samples in practice, as target concepts can be very high dimensional complex functions and vast amounts of training data will be needed to explore the hypotheses space. Error due to a limited number of samples is called estimation error. Therefore, a limited number of data samples is the second challenge after selection of the right model.

A third issue of concern is the nonavailability of sufficient information. Here sufficient may refer to not only the quantity of information but also the right information. Imagine the actual target concept is an m-dimensional function and the feature extracted for classification is n-dimensional. If m > n then, in an attempt to locate the target function, the learner will wander in an n-dimensional space while the actual function lies in an m-dimensional space. Therefore, the best any learner can do is to approximate the target concept in the n-dimensional space, i.e., the projection of the actual function onto a lower dimensional space. However, if more features (information) can be provided, then it is possible to search for the target function in either the ideal feature space or in a space where the loss of information is minimal. It is also possible that m = n or m < n. Even if the dimensionality of the target function and the extracted feature are the same, it is possible that the target concept is in a different space. The additional information in this case will help in getting close to the space of the target concept (for example, by rotation in the presence of skew). A similar argument holds for m < n, where it is possible that the target concept is of a lower dimension but the feature is not capturing enough information about the dimensions of the target concept.

Although learning theory has many different schools of thought, one of the simplest yet most powerful paradigms is Probably Approximately Correct (PAC) learning (Valiant, 1984). In the PAC setting, any learner h from H is a PAC learner if for any ε where 0 < ε < 1/2, for any δ where 0 < δ < 1/2, and for any distribution Δ in domain D

$$\mathrm{Prob}[\mathrm{err}_\Delta(h) \le \varepsilon] \ge 1 - \delta, \qquad (2)$$

where the error is calculated over the data points {x_i, c(x_i)} for i ∈ [1, N] sampled from the distribution Δ, as shown below

$$\mathrm{err}_\Delta(h) := \mathrm{Prob}_{x \sim \Delta}[h(x) \ne c(x)]. \qquad (3)$$

However, the PAC model of learning has very strong assumptions and it is likely that in real-world applications such assumptions will not hold true. It assumes that a function that can generate labels for all the data points exists for any distribution in domain D and for any 0 < ε, δ < 1/2. This assumption is very strict, as this may not be true for every distribution, ε, or δ. Moreover, it is possible that such a target concept which generates all the labels may not exist because of the noise in the data, and even if there is such a function, locating it might be an NP-hard problem. To circumvent these limitations, the learning paradigm known as the Inconsistent Hypothesis Model (IHM) was proposed. In this model, training data points {x_i, y_i} for i ∈ [1, N] are sampled from some distribution Δ over D × {0,1} (considering only a two-class case), where x_i ∈ D and the output labels y_i ∈ {0,1}. The learning algorithm will select a hypothesis h from H such that this hypothesis minimizes the generalization error

$$\mathrm{err}_\Delta(h) := \mathrm{Prob}_{(x,y) \sim \Delta}[h(x) \ne y]. \qquad (4)$$

Any learner will output the optimal hypothesis:

$$h^* = \arg\min_{h \in H} \{\mathrm{err}(h)\}. \qquad (5)$$
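To make the formulation concrete, the following minimal sketch (an illustration added here, not from the original chapter) performs empirical risk minimization over a small finite hypothesis class of one-dimensional threshold classifiers, mirroring Eqs. (1) and (5); the data, the hypothesis class, and all names are hypothetical.

import numpy as np

def empirical_error(h, X, y):
    """Fraction of samples on which hypothesis h disagrees with the label."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return the hypothesis h* in H minimizing the empirical error (Eqs. (1), (5))."""
    errors = [empirical_error(h, X, y) for h in hypotheses]
    return hypotheses[int(np.argmin(errors))], min(errors)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=200)
    y = (X > 0.6).astype(int)              # unknown target concept c(x)
    y[rng.random(200) < 0.05] ^= 1         # label noise: no consistent hypothesis exists

    # finite hypothesis class: threshold classifiers h_t(x) = 1[x > t]
    H = [lambda X, t=t: (X > t).astype(int) for t in np.linspace(0, 1, 101)]
    h_star, err = erm(H, X, y)
    print("empirical error of h*:", err)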

Having formally described the learning algorithms, the next subsection addresses the challenge of limited data samples by using co-training, a PAC-style semi-supervised algorithm.

3.1. Co-training

Blum and Mitchell (1998) proposed co-training as a method which needs only a small amount of labeled training data to start. It requires two separate views of the data X = X_1 × X_2, and both of the views should be orthogonal to each other, i.e., each view should be sufficient for correct classification. Blum and Mitchell (1998) gave a PAC-style analysis of this algorithm where data samples x = {x_1, x_2} are drawn from some distribution D and there are two target concepts f_1 ∈ C_1 and f_2 ∈ C_2 corresponding to each view such that f_1(x_1) = f_2(x_2). Therefore, if f is considered as the combined target concept, then for any data point x drawn from distribution D with nonzero probability, f(x) = f_1(x_1) = f_2(x_2) = y (the label). Here, if f_1(x_1) ≠ f_2(x_2) for some x with nonzero probability assigned by D, then (f_1, f_2) are considered incompatible with distribution D. Therefore, co-training will use unlabeled data to find the best pair (h_1*, h_2*) compatible with the distribution D.

Therefore, two learners will be trained on the labeled data available initially for the two views and they will iteratively label some unlabeled data points. In each round of co-training, a cache is drawn from the unlabeled data set and all the data points in the cache are labeled. The learner labels all the data points with a certain degree of confidence and some of the data points are selected from this cache and added to the training set. Selection of these data points is crucial since the performance of the learner will increase only if the added points have correct labels. Newly added points should be such that they increase the confidence of the learner in making decisions about labels of data points in the next iteration. If incorrectly labeled data points are added, the performance of the learner will likely decrease. Given the snowballing effect in each iteration, the performance of the selection algorithm will decide the robustness of the overall algorithm. Therefore, co-training is an iterative bootstrapping method which seeks to increase the confidence of the learner in each round. It boosts the confidence of the score much as the Expectation Maximization (EM) method does, but works better than EM (Nigam and Ghani, 2000b). In EM, all the data points are labeled in each round and the parameters of the model are re-tuned. This is done till convergence is achieved, i.e., when the parameters do not change with new information, whereas in co-training a few of the data points are labeled in each round and the classifiers are then retrained. This helps build a better learner in each iteration, which in turn leads to better decisions and hence an increase in the overall accuracy of the system.
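A minimal sketch of this co-training loop is given below, assuming two pre-extracted feature views and scikit-learn-style classifiers. The naive Bayes base learners, cache size, confidence threshold, and selection rule are illustrative assumptions, not the settings used by the authors.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
            rounds=10, cache_size=200, per_round=20, conf_thresh=0.9):
    """Co-training with two views; in each round, confidently labeled points
    from a cache of unlabeled data are moved into the labeled set.
    Assumes class labels are encoded 0..K-1 so predict_proba columns align."""
    X1, X2, y = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    U1, U2 = X1_unlab.copy(), X2_unlab.copy()
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        if len(U1) == 0:
            break
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        # first cache_size points serve as the cache (a random draw could be used instead)
        cache = np.arange(min(cache_size, len(U1)))
        picked = []
        for clf, U in ((clf1, U1), (clf2, U2)):
            proba = clf.predict_proba(U[cache])
            conf = proba.max(axis=1)
            order = np.argsort(-conf)[:per_round]
            picked.extend(cache[i] for i in order if conf[i] >= conf_thresh)
        picked = np.unique(picked)
        if len(picked) == 0:
            break
        # average the two views' posteriors to assign labels to the picked points
        labels = (clf1.predict_proba(U1[picked]) +
                  clf2.predict_proba(U2[picked])).argmax(axis=1)
        X1 = np.vstack([X1, U1[picked]])
        X2 = np.vstack([X2, U2[picked]])
        y = np.concatenate([y, labels])
        keep = np.setdiff1d(np.arange(len(U1)), picked)
        U1, U2 = U1[keep], U2[keep]
    return clf1.fit(X1, y), clf2.fit(X2, y)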

Blum and Mitchell (1998) showed that co-training works best if the two views are orthogonal to each other and each of them is capable of classification independently. They showed that if the two views are conditionally independent and each learner trained on the views is a low error classifier, then the accuracy of the classifiers can be increased significantly. They proved that the error rates of both classifiers decrease during co-training because of the extra information added to the system in each round. This extra information directly depends on the degree of un-correlation between the views. This is because the system is using more information to classify data points. Since both views are independently sufficient for classification, this brings redundancy while producing more information. However, Abney (2002) later reformulated the metric for co-training in terms of the measure of agreement between learners over unlabeled data. Abney (2002) gave an upper bound on the error rates of the learners based on the measure of their disagreement. Furthermore, Nigam and Ghani (2000a) proved that completely independent views are not required for co-training and that it works well even if the two views are not completely uncorrelated.

3.1.1. Selection algorithm

Several selection algorithms have been proposed based on the kind of data and application. One approach is to select a set of data points randomly and add them to the labeled set (Clark et al., 2003). The system is retrained and its performance is tested on the unlabeled data. This process is repeated for some iterations and the performance on each set of points is recorded. The set of points resulting in the best performance is selected to be added to the labeled set and the rest are discarded. This method is based on the degree of agreement of both learners over unlabeled data in each round. Other methods that have been tried include choosing the top k elements from the newly labeled cache, selecting the ones with maximum scores, or choosing some maximum and some minimum scoring points (Wang et al., 2007). Other heuristics include checking the standard deviation or just choosing a fixed number of top points in every round. In all these cases, an empirically determined threshold is used for the selection criteria and, since this is not a very principled approach, the efficacy of the method is dependent on the kind of data or application. Porwal et al. (2012) proposed an oracle-based selection approach where selection was made without using any heuristics. The approach was based on learning the pattern of the distribution of scores for different classes or writers given by the learners. If this pattern can be learned, then for any unseen data point, the oracle would be able to predict the class or label from the score distribution generated by the learner. The advantage of this approach would be a robust selection algorithm that would work regardless of any specific data or application.

3.1.2. Oracle training

A validation set is used for the training of the oracle and the trained oracle is used for selecting data points after each round. Training of the oracle is done before co-training starts. A classifier is trained on the initial training set and its performance is tested on the validation set. Now, the score distribution over all classes is considered as the feature and an oracle classifier is trained using these features. Data points for which the predicted class matches the ground truth are assigned to the positive class and the rest of the data points are assigned to the negative class. The new features are score distributions over all classes and all data points are divided into two new classes, viz. positive and negative. The task is now narrowed down to a two-class problem where one class has all the points that meet the selection criteria and the other class has the data points which should be discarded. Here, the oracle classifier need not be the same as the learner used in the co-training process.

Once the oracle is trained, co-training begins. In each round of co-training, a cache is selected from the unlabeled dataset and is tested against the learner. Some of the data points with the best performance are added to the training set and the learners are retrained. These data points from the cache are selected by the oracle. After each round, a score distribution of all data points is generated. These scores are considered as features of a new test set for the oracle. This set is tested against the oracle and it labels all the data points with the two new classes, either positive or negative. All the data points classified as positive by the oracle are selected for addition to the training set. The learners are retrained, the next round of co-training resumes, and this process repeats until the unlabeled dataset is exhausted. Here the upper limit on the performance of the oracle depends on the accuracy of the learner. Even if the oracle selects all the data points correctly labeled by the learner, the accuracy is still dependent on the performance of the learner. Hence, the upper bound on the performance is determined by the accuracy of the learner.
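The oracle can be viewed as a binary meta-classifier over score distributions. The sketch below is illustrative: the use of logistic regression for the oracle and the helper names are assumptions; it only requires that class labels be encoded 0..K-1 so that the argmax of the score vector indexes the predicted class.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_oracle(base_clf, X_val, y_val):
    """Train the oracle on score distributions from a held-out validation set.
    Positive class = the base learner's prediction matched the ground truth."""
    scores = base_clf.predict_proba(X_val)          # score distribution over classes
    correct = (scores.argmax(axis=1) == y_val).astype(int)
    oracle = LogisticRegression(max_iter=1000)
    oracle.fit(scores, correct)
    return oracle

def oracle_select(base_clf, oracle, X_cache):
    """Return indices of cache points the oracle predicts were labeled correctly,
    together with the labels assigned by the base learner."""
    scores = base_clf.predict_proba(X_cache)
    accept = oracle.predict(scores) == 1
    return np.where(accept)[0], scores.argmax(axis=1)[accept]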

Co-training is useful for generating labeled samples for handwritten text recognition and addresses the first fundamental problem of learning algorithms, i.e., the availability of only a limited number of data samples. Using co-training, one can generate labels for unlabeled data and thus aid in enhancing the accuracy of the recognition system. However, even when sufficient labeled data is available, the task of Arabic handwritten text recognition is not trivial because of the variation present in the data. Handwriting of multiple writers can be clustered into some broad writing styles. The handwriting of an individual captures several nuances of the writer's personality and background (Shivram et al., 2012). There are factors influencing the handwriting of an individual which are abstract, such as the effect of the native language on the writing of nonnative languages, also known as the accent of an individual (Ramaiah et al., 2012). While features that capture these abstract notions are critical for applications such as writer identification, the challenge in handwritten text recognition is to find features that can tolerate the wide variation in handwriting while being reliably discriminative between the classes. Porwal et al. (2012) proposed a novel approach for feature enhancement in a semi-supervised framework using structural learning to capture information that is difficult to extract through regular feature extraction techniques.

3.2. Structural learning

Porwal et al. (2012) proposed a structural learning-based approach for feature enhancement where a target task is divided into several related auxiliary tasks (Ando and Zhang, 2005; Blitzer et al., 2006). These auxiliary tasks are then solved and a common structure is retrieved, which in turn is used to solve the original target task. This approach, also known as multi-task learning in the machine learning literature, is very effective in a semi-supervised framework where labeled data is limited and the original problem is complex and rich enough to be broken down into sub problems, with each candidate sub problem providing information useful for solving the target problem. As discussed above, any learner will explore the hypotheses space to approximate the target function using training data samples. Since usually only limited data points are available to explore any hypotheses space, selection of an appropriate and rich space is central to the performance of the learner. Often the learner fails to approximate the target function because it does not lie within the space that the learning algorithm is exploring. The central idea of structural learning is to select the most appropriate hypotheses space with the use of the finite labeled data available.

The key concept of structural learning is to break down the main task into several related tasks and then find a common low dimensional optimal structure which has high correspondence with every sub task. This structure is used to solve the main problem. The optimal structure corresponds to the scenario where the cumulative error of all the sub tasks is minimized. This optimal structure captures information that is domain specific and is very useful in solving the main task as it helps in the selection of the right hypotheses space. For example, the nuances of handwritten text captured by accent, styles, etc. can be considered as related sub tasks of the main task of handwritten text recognition. Since several such aspects of handwriting are abstract in nature, there needs to be a principled way to define these sub tasks. Ando and Zhang (2005) propose an approach to create such related sub tasks in a semi-supervised framework.

The efficacy of structural learning lies in the fact that in almost all real-world problems the hypothesis produced by an algorithm is a smooth discriminant function. This function maps points in the data domain to the labels. The smoothness of this function is enforced by a good hypotheses space. If any two points are close in the domain space then the mapping produced by the discriminant function will also be close in the target space. Therefore, if one can find such discriminant functions then this implies a good hypotheses space.

In structural learning, we find several such functions that correspond to the structure of the underlying hypotheses space. If the sub tasks are related, we get information about context embedded in the optimal structure. If the sub tasks are not related, the structural parameter still contains information about the smoothness of the hypotheses space. Therefore, breaking the main task into sub tasks is helpful even if they are not related, as the structure retrieved will still hold information about the smoothness of the space.

Formally, structural learning can be defined over a collection of T sub tasks indexed by t ∈ {1, . . . , T}, where each sub task has n_t samples over some unknown distribution Δ_t. All the sub tasks have their respective candidate hypotheses spaces H_{θ,t}, indexed by the parameter θ, which is common to all the sub tasks and encapsulates all the information that is useful for solving the primary task. The new objective function is to minimize the joint empirical error

$$h^*_{\theta,t} = \arg\min_{h \in H_{\theta,t}} \sum_{i=1}^{n_t} L(h(x_i^t), y_i^t), \qquad (6)$$

where L is the loss function.

The first step is to create the auxiliary tasks related to the main task. Auxiliary tasks can be formulated as capturing abstract aspects of the main task which cannot be well formulated but are still vital in the recognition of the handwritten text. One way to create auxiliary tasks in a semi-supervised framework is by making use of unlabeled data. However, the creation of an auxiliary task should address two issues. The first is label generation for the auxiliary tasks: the process should generate automatic labels for each auxiliary task. The second condition is relevancy among the auxiliary tasks: it is desirable that the auxiliary tasks are related to each other so that a common optimal structure can be retrieved. Ando and Zhang (2005) suggested a few generic methods to create auxiliary tasks that would satisfy these two conditions. We will cover one of those techniques in this chapter.

In this method two distinct features φ1 and φ2 are used. First, a classifier is trained for the main task using the feature φ1 over labeled data. The same feature is extracted from unlabeled data and the trained classifier is used to create auxiliary labels for the unlabeled data. The auxiliary task is to create binary classification problems for predicting the label assigned to each of the data points in the unlabeled data.


Algorithm 1 Structural Learning Algorithm

Require:
 1: X1 = [φ1(x)_t, y_t], t = 1 … T ← Labeled data, feature one
 2: X2 = [φ2(x)_t, y_t], t = 1 … T ← Labeled data, feature two
 3: U = [φ2(x)_j] ← Unlabeled data, feature two
 4: C ← Classifier
 5: Train C with X1
 6: Generate auxiliary labels by labeling U with C
 7: For an L-class problem, create L binary prediction problems as auxiliary tasks, y_l = h_l(φ2(x)), l = 1 … L
 8: for l = 1 … L do
 9:   w_{l,θ} = (φ2(x)^T φ2(x))^{-1} φ2(x)^T y_l
10: end for
11: W = [w_1 | … | w_L]
12: [U Σ V^T] = SVD(W)
13: Projection onto R^h: Θ = U[:, 1:h] = [θ_1 | … | θ_h]
14: New feature in R^(N+h) space: [φ2(x), Θ^T φ2(x)]

Therefore, for an n-class problem as the main task, n auxiliary tasks can be created as two-class problems. An auxiliary predictor will give label 1 if it can predict the correct auxiliary label, otherwise it will assign 0. Any auxiliary predictor can be written as

$$h_w(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \qquad (7)$$

and the goal is to reduce the empirical error as given by Eqs. (5) and (6). Therefore, the error can be written as

$$y = h(w, x) + \epsilon. \qquad (8)$$

In order to minimize this error, we take the least squares loss function and minimize the joint empirical error over all the data points

$$\mathrm{err}_\Delta(h) = \frac{1}{2} \sum_{i=1}^{n} (y_i - w^T x_i)^2. \qquad (9)$$

To minimize the joint empirical error, the algorithm seeks the optimal weight vector. Setting the gradient of the error function to zero will give the optimal weight vector

$$0 = \sum_{i=1}^{n} y_i x_i^T - w^T \left( \sum_{i=1}^{n} x_i x_i^T \right). \qquad (10)$$

Solving for w we obtain

$$w_{\mathrm{opt}} = \left( x^T x \right)^{-1} x^T y. \qquad (11)$$


This will give the optimal weights for one predictor of one auxiliary task. To get the optimal structure corresponding to all the sub tasks, this process should be repeated for all the auxiliary tasks. After the optimal w is calculated for all the auxiliary tasks, a big weight matrix W of all such weight vectors is created, whose columns are the weight vectors of the hypotheses of the auxiliary classes.

Once the big weight matrix W is calculated, it can be used to find the low dimensional common sub space. However, before doing dimensionality reduction, redundancy in the information is removed. Often the sub tasks are related to each other along with the main task and they capture information of the same nature. Thus, it may not add much to the discriminatory power of the algorithm to solve the main task. Since the information hidden in the weight vectors could be related, only the left singular vectors are picked from the singular value decomposition (SVD) of the W matrix. Therefore,

$$[U \Sigma V^T] = \mathrm{SVD}(W). \qquad (12)$$

Initially, the weight vectors are in the feature space R^N, but they can be projected onto some lower dimensional space R^h to capture the variance of the auxiliary hypotheses space in the best h dimensions. Therefore, the low dimensional feature mapping is θ^T x. This new feature mapping can be appended to the original feature vector to solve the main task in the R^(N+h) space. Thus, structural learning can be used to capture contextual information which is difficult to extract otherwise by regular feature extraction techniques.
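The pipeline of Algorithm 1 reduces to a few lines of linear algebra. The sketch below is an illustration of the least-squares formulation of Eqs. (9)-(12); the matrix shapes, the small ridge term added for numerical stability, and the function name are assumptions made here rather than the authors' exact implementation.

import numpy as np

def structural_features(Phi2, Y_aux, h, ridge=1e-6):
    """Phi2:  (n_samples, N) feature matrix phi_2(x) on unlabeled data.
    Y_aux: (n_samples, L) binary auxiliary labels, one column per auxiliary task.
    Returns Theta (N, h) and the augmented features [phi_2(x), Theta^T phi_2(x)]."""
    # Least-squares weight vector per auxiliary task (Eq. (11)); ridge added for stability
    A = Phi2.T @ Phi2 + ridge * np.eye(Phi2.shape[1])
    W = np.linalg.solve(A, Phi2.T @ Y_aux)           # columns are w_1 ... w_L
    # Common low dimensional structure: top-h left singular vectors of W (Eq. (12))
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    Theta = U[:, :h]                                  # (N, h)
    augmented = np.hstack([Phi2, Phi2 @ Theta])       # features in R^(N+h)
    return Theta, augmented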

The next section describes some of the features that have been successfully used for handwritten Arabic text recognition.

4. Features for text recognition

Determination of an appropriate set of features that can capture the discriminative characteristics of Arabic characters/words while allowing for variability in handwriting is a key challenge for handwritten Arabic OCR. Feature extraction typically involves the conversion of a two-dimensional image into a one-dimensional feature sequence or feature vector. Selection of good discriminative features is critical to achieving high recognition performance. Surveys evaluating features for hand-printed isolated characters can be found in Trier et al. (1995) and Liu et al. (2003). In this section, we describe a few features that have proven to be very effective in the recognition of handwritten cursive Arabic text.

4.1. Gradient-structural-concavity (GSC) features

The design philosophy underlying these features is to capture the characteristics of handwritten character images at multiple resolutions, from fine to coarse. The GSC features capture the local, intermediate, and global characteristics of a handwritten image (Favata et al., 1994). Specifically, the gradient captures the local stroke shape, the structural features capture the coarser trajectory of the stroke, and the concavity features encapsulate stroke relationships at an even coarser level (see Fig. 7). While the original features were binary in nature, a re-implementation using floating point numbers provides better performance.


Fig. 7. GSC features.

The gradient features are computed by applying a Sobel operator on the binary character/word images. The operator generates an approximation of the x and y derivatives at each image pixel. By definition, the gradient is a vector with its direction ranging from 0° to 359°. The range is split into 12 non-overlapping regions and the image is divided into bins (for example an m × n grid). A histogram of gradient directions is computed over the pixels within each bin. When an m × n grid is used, a vector of 12 × m × n floating point numbers represents the gradient feature of the image.
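A sketch of this gradient feature computation is given below. It is an illustration, not the authors' implementation: the scipy Sobel kernels, the normalization, and the way pixels are assigned to grid cells are assumptions.

import numpy as np
from scipy import ndimage

def gradient_features(img, m=4, n=4, n_dirs=12):
    """img: 2D binary (or grayscale) image. Returns a 12*m*n gradient direction
    histogram: the image is split into an m x n grid and, in each cell, gradient
    directions of pixels with nonzero gradient are accumulated into 12 bins of 30 degrees."""
    img = img.astype(float)
    gx = ndimage.sobel(img, axis=1)        # approximate d/dx
    gy = ndimage.sobel(img, axis=0)        # approximate d/dy
    angle = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    magnitude = np.hypot(gx, gy)
    bins = (angle // (360.0 / n_dirs)).astype(int)    # direction bins 0..11

    H, W = img.shape
    feat = np.zeros((m, n, n_dirs))
    rows = np.minimum((np.arange(H) * m) // H, m - 1)
    cols = np.minimum((np.arange(W) * n) // W, n - 1)
    for i in range(H):
        for j in range(W):
            if magnitude[i, j] > 0:
                feat[rows[i], cols[j], bins[i, j]] += 1.0
    total = feat.sum()
    return (feat / total).ravel() if total > 0 else feat.ravel()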

The structural features represent discriminatory patterns embedded within the gradient map. Several m × n window operators on the gradient map locate local strokes in the up, down, and diagonal directions. These strokes are combined into larger features using a set of rules. Other features that are encapsulated include corner-like shapes. When an m × n grid is used, 12 × m × n floating point numbers contribute to the total feature vector.

The coarsest set of features are the concavity features. The three types of concavity features include (i) coarse pixel density, which captures the general grouping of pixels in the image (m × n floating numbers when using an m × n grid), (ii) large strokes, which capture the prominent horizontal and vertical strokes (m × n × 2 floating numbers for an m × n grid), and (iii) up, down, left, right, and closed loop concavities detected by convolving the image with a star-like operator (m × n × 5 floating numbers for an m × n grid).

4.2. Chain code and critical point features

A chain code is a lossless image representation algorithm for binary images. The basic principle of chain codes is to separately encode each connected component in the image. For each such region, the points on the boundary are traced and the coordinates of the boundary points are recorded.


Fig. 8. Critical point features: the critical points are displayed on the image as colored pixels. Blue circles are the central locations of end critical points of a stroke, and red circles are the central locations of curvature critical points. Pink dots represent contours of small connected components, and inner loops are marked with green. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this book.)

The critical point features represent three types of stroke shape features. The changes in the direction of the strokes are calculated at each local boundary point by tracing the boundary of a stroke. The change of direction at each point is estimated by measuring the angle between the incoming direction and the outgoing direction. The incoming direction and outgoing direction are estimated using the three directly connected neighboring boundary points before and after the current point. A significant change in direction at a boundary point is determined by thresholding the change of direction (angle) at the point. Significant left (end critical point) and right (curvature critical point) turning points in the word image are identified. These two types of points are labeled as critical points. The actual computation of these points often results in small clusters of critical points, each made of several consecutive boundary points (see Fig. 8). The third type of critical points is the boundary points on small connected components, which are mostly the diacritic marks in Arabic characters.

An m × n grid is applied to the image to quantize the feature points into a feature vector of floating numbers. The ratios of the counts of each of the three types of critical points, relative to the total number of boundary points in the bin, are calculated for each of the bins. This provides three feature values in each bin, and the total contribution to the feature vector is 3 × m × n floating number features.
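The detection of turning points along a traced boundary can be sketched as follows; the neighborhood offset k, the angle threshold, and the function name are illustrative assumptions rather than the authors' exact procedure.

import numpy as np

def critical_points(boundary, angle_thresh_deg=45.0, k=3):
    """boundary: ordered (x, y) points along a traced stroke contour (chain code order).
    A point is marked critical when the angle between its incoming and outgoing
    directions, estimated from the k-th previous and k-th next boundary points,
    exceeds the threshold.  Returns indices of critical points."""
    pts = np.asarray(boundary, dtype=float)
    n = len(pts)
    critical = []
    for i in range(n):
        incoming = pts[i] - pts[(i - k) % n]      # direction into the point
        outgoing = pts[(i + k) % n] - pts[i]      # direction out of the point
        ni, no = np.linalg.norm(incoming), np.linalg.norm(outgoing)
        if ni == 0 or no == 0:
            continue
        cos_a = np.clip(np.dot(incoming, outgoing) / (ni * no), -1.0, 1.0)
        if np.degrees(np.arccos(cos_a)) > angle_thresh_deg:
            critical.append(i)
    return critical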

4.3. Directional profile features

The directional profile features are inspired by the description of handwriting strokes as stroke sections in a finite number of directions, e.g., horizontal stroke, vertical stroke, etc. When a continuous stroke changes direction along the writing trajectory, the stroke is naturally segmented into sections in different directions. Based on this concept, runlengths of black pixels are extracted along four directions: the horizontal, the vertical, and the two diagonals. Each of the text pixels is marked by a color representing the dominating direction of the runlength going through the pixel. The resultant color coded segmentation of the stroke sections can be seen in Fig. 9. The finite number of stroke directions provides a rich encapsulation of the stroke structure of the handwritten text.

Fig. 9. (a) The first strip shows a resizable window that helps visualize the features within the window. The second strip shows the predominant directional feature at each pixel in different colors. The next four strips show the intensity of the directional features in each of the four directions (0, 45, 90, 135) with darker shades indicating greater intensity. The last strip combines the intensities of the four directions. (b) The four directions for the features (0, 45, 90, 135). (c) Example using a configurable grid of 4 × 4 for the PAW image.

An m × n grid is overlaid on the image to convert the stroke directions into a feature vector. In each of the bins, the ratio of the total length of the directional runs in each direction going through the points in the bin, relative to the total length of all the runs going through the points in the bin, is calculated. Four feature values are calculated for each bin and the total contribution to the feature vector is 4 × m × n floating number features.
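A sketch of this directional profile feature extraction follows; the run-length measurement, the grid assignment, and the normalization are assumptions made for illustration.

import numpy as np

def directional_profile_features(img, m=4, n=4):
    """img: 2D binary image (1 = text pixel). For every text pixel the run lengths
    through it along the four directions (0, 45, 90, 135 degrees) are measured and
    the pixel contributes to its dominant direction.  Per grid cell, the ratio of
    accumulated run length in each direction to the cell total is returned (4*m*n values)."""
    img = (img > 0).astype(int)
    H, W = img.shape
    directions = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]   # 0, 45, 90, 135 degrees

    def run_length(i, j, di, dj):
        length = 1
        for sign in (1, -1):
            y, x = i + sign * di, j + sign * dj
            while 0 <= y < H and 0 <= x < W and img[y, x]:
                length += 1
                y, x = y + sign * di, x + sign * dj
        return length

    feat = np.zeros((m, n, 4))
    for i in range(H):
        for j in range(W):
            if not img[i, j]:
                continue
            runs = np.array([run_length(i, j, di, dj) for di, dj in directions])
            feat[min(i * m // H, m - 1), min(j * n // W, n - 1), runs.argmax()] += runs.max()
    totals = feat.sum(axis=2, keepdims=True)
    return np.divide(feat, totals, out=np.zeros_like(feat), where=totals > 0).ravel()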

4.4. Discrete symbol features

Psychological studies indicate that word shape plays a significant role in visual word recognition, inspiring the use of shape-defining high-level structural features in building word recognizers (Humphreys et al., 1990; Wheeler, 1970). Ascenders and descenders are prominently shape defining and have been widely used in holistic paradigms for Latin script (Madhvanath and Govindaraju, 2001). The oscillation handwriting model (Hollerbach, 1981) theorizes that the pen moves from left to right horizontally and oscillates vertically, and features near extrema are very important in character shape definition. These features include ascenders, descenders, loops, crosses, turns, and ends. Concepts such as ascender and descender are prominent when looking at a word image holistically, and the position of these structures is more relevant as a feature than the identity of the structures themselves. So, position is an important attribute of structural features. A structural feature can also have other attributes such as orientation, curvature, and size. Structural features can be utilized together with their attributes in defining the shape of characters and, thus, the shape of words, and these features can be used to construct a recognizer that simulates the human's shape discerning capability in visual word recognition (Xue and Govindaraju, 2006).


Table 1
Structural features and their attributes

Structural feature Position Orientation Angle Width

Short cusp, long cusp X X

Arc, left-ended arc, right-ended arc X X

Loop, circle, cross, bar X

Gap X

Fig. 10. Discrete symbol features.

Table 1 lists the structural features that are used in modeling handwritten characters/words. Loops, arcs, and gaps are simple features. Long cusps and short cusps are separated by thresholding their vertical length. Left/right-ended arcs are arcs whose stroke ends at its left/right side. Loops, arcs, and cusps are further divided into upward and downward classes. These features are illustrated in relation to Arabic text in Fig. 10.

4.5. Feature selection and combination

Practical systems for recognition will need to use a combination of features to obtain optimal performance. The usefulness of the features described in this section was evaluated by experiments using combinations of the implemented features.

The experiments were conducted on an image set containing 7346 PAW images of the 34 most frequent PAW classes extracted from the AMA Arabic 1.0 Data Set (2007). After noise removal (median filtering, slant correction, and ruling-line removal), the images were divided into a training set with 6498 PAW images and a test set with 848 PAW images. The experiments were conducted using the publicly available support vector machine classifier libSVM (Chang and Lin, 2011).
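The kind of experiment described above can be sketched with scikit-learn, whose SVC classifier wraps libSVM; the feature extractors, kernel, and C value below are placeholders for illustration, not the settings reported by the authors.

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def combine_features(images, extractors):
    """Concatenate the feature vectors produced by each extractor for every image."""
    return np.array([np.concatenate([f(img) for f in extractors]) for img in images])

# extractors such as gradient_features and directional_profile_features
# (sketched earlier in this section) would be plugged in here
def train_paw_classifier(train_images, train_labels, extractors):
    X = combine_features(train_images, extractors)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))  # SVC wraps libSVM
    clf.fit(X, train_labels)
    return clf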

The best performing feature set was the combination of the G and C features from the GSC feature set with the critical point and the directional profile features. Table 2 benchmarks the performance (top one recognition rate) on the 34-PAW data for each of the features and their combinations using the libSVM classifier.

The next section highlights recognition techniques proposed in the literature and discusses the issue of selecting the right model, as sometimes it is difficult to approximate or locate the target concept despite a sufficient number of data samples if the model selected is not appropriate.


Table 2
Performance of recognition of the 34 PAWs using libSVM (top choice)

Features                                        Top 1 recognition rate (%)
G in GSC                                        79.48
G + C                                           83.96
GSC                                             85.38
Critical point + directional profile            82.43
G + C + Critical point + directional profile    86.44

5. Models for recognition

The traditional approach for handwritten text recognition is to segment the document image into lines and further into smaller components such as characters or words. The basic units of recognition are characters or words. However, segmentation-based approaches face challenges in the case of Arabic scripts. Arabic script, be it handwritten or machine printed, is very cursive in nature. Hence, segmenting the words into characters is not easy and results in segment shapes which introduce confusion, thus making recognition difficult. Arabic script has a number of dots and diacritics. Placement of these small components changes the context of the word, so the same set of graphemes can create different words. Therefore, segmentation may result in loss of contextual information of the structure as a whole. The recognition models that are typically used for Arabic text recognition have words or parts of words (PAW, a single connected component) as the basic unit of recognition. Some approaches that depend heavily on language models use text lines as input. This section describes techniques that have been used for Arabic handwritten word recognition.

5.1. Ensemble of learners

The task of handwritten Arabic text recognition can be seen as a complex learning problem. Dietterich (2000) lists three primary reasons for the failure of learning algorithms, and an ensemble can be used to address these problems. The first reason is statistical in nature: the learning algorithm searches the hypotheses space to approximate the target concept, and if the amount of training data is not sufficient to search the entire space, the algorithm outputs a hypothesis which fits the training data best. If the training data is small, there can be multiple such hypotheses with none of them being a good approximation to the actual target concept. In such cases, the algorithm fails to learn the actual parameters of the distribution Δ and performs poorly on the test data samples. An ensemble of learners can be used to effectively address the issue of multiple consistent hypotheses, and voting can be done to approximate the target concept. The second reason for failure could be that the learning algorithm may fail to reach the target concept due to computational reasons. Often algorithms perform a local search to optimize cost functions and get stuck in local minima, as happens with gradient descent based algorithms such as Artificial Neural Networks. Likewise, decision trees greedily split the hypotheses space in half at each step and might fail to approximate or locate the target concept if the hypotheses space is not smooth, despite a sufficient number of training data samples. The choice of a good starting point is essential for the good performance of such algorithms. Using an ensemble of learners, one can run different learners from different starting points to get a good approximation of the target concept. A third potential reason for failure is representational, as it is likely that the correct target concept cannot be represented by any of the hypotheses. This could be because the learning algorithm stops searching the hypotheses space once it finds a good fit for the training samples. However, by using an ensemble of learners it is possible to get a better representation of the target concept by taking a weighted sum of the individual hypotheses, which in turn will expand the collective hypotheses space.

Porwal et al. (2012) proposed a system for recognition of Arabic parts of words (PAWs) using an ensemble of biased learners in a hierarchical framework. The main motivation was that limited training data hinders the search of the hypotheses space, and often just one kind of labeling is not enough to explore the space efficiently. Hence, an approach based on a hierarchical framework was proposed to reduce the complexity of the hypotheses space to be explored. In this method, training samples are clustered based on class labels to generate a new set of labels (according to the cluster assignment) for the data. The intuition behind doing this is to adopt a two-step approach to reduce the complexity of the task by first grouping PAWs which are similar in some way, followed by a separate algorithm to distinguish the member classes within these clusters of similar PAWs. This helps in reducing the complexity of the task, and the new label set helps in learning the natural structure of the data as it provides additional information. After clustering, each data point has two types of labels: one corresponding to the cluster and the other the actual PAW class that it belongs to. A separate classifier, which will be unbiased, is learned with the new set of labels generated after clustering. This is the first level of classification in the hierarchy.

In level two, an ensemble of biased learners along with an arbiter panel is used. The intuition for level two is derived from the work of Khoussainov et al. (2005), in which the focus is on individual classes instead of different regions of the instance space to find the optimal discriminant function. This can be done by constructing base learners such that they are biased toward individual classes. Learners are trained in a one vs. all fashion, where a two-class classification problem is created for each class by considering the rest of the classes as the second class. To make a learner biased towards one class, it is trained with a majority of data points from that particular class, with the data points from the rest of the classes forming the other class. In this approach, the total number of classifiers trained is equal to the total number of classes. In case of disagreement between the base learners, an arbiter panel is used to assign a label to the data point.
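One way to realize such an ensemble of biased one-vs-all learners is sketched below, here by over-weighting the target class; the class-weighting scheme, the SVM base learner, and the majority-vote arbitration are illustrative assumptions (the arbiter panel itself is described next).

import numpy as np
from sklearn.svm import SVC

def train_biased_ensemble(X, y, bias_weight=5.0):
    """One classifier per class, each biased toward 'its' class by over-weighting
    that class relative to the rest (one possible way to realize the bias)."""
    ensemble = {}
    for c in np.unique(y):
        target = (y == c).astype(int)                    # class c vs. all others
        clf = SVC(kernel="rbf", class_weight={1: bias_weight, 0: 1.0})
        ensemble[c] = clf.fit(X, target)
    return ensemble

def predict_with_panel(ensemble, arbiter_panel, x):
    """Accept the prediction when exactly one biased learner claims the point;
    otherwise defer to an arbiter panel of learners with different inductive biases."""
    claims = [c for c, clf in ensemble.items() if clf.predict(x.reshape(1, -1))[0] == 1]
    if len(claims) == 1:
        return claims[0]
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in arbiter_panel]
    return max(set(votes), key=votes.count)              # majority vote of the panel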

An arbiter panel is a group of classifiers with different kinds of inductive biases. Inductive bias is the set of assumptions made by the algorithm, in addition to the observed data, that allows its outputs to be derived as logical deductions (Mitchell, 1997). There are several types of inductive biases, such as maximum margin, nearest neighbor, or maximum conditional independence. The bias plays an important role when a learning algorithm outputs a hypothesis. If data points are difficult to classify even by using biased classifiers, then it is intuitive to change the basic assumption made by the learner in computing the discriminant function. Hence, in the arbiter panel, learners with different inductive biases are used to classify the data points on which the base learners disagree.



Fig. 11. Schematic of proposed biased learners-based approach in a hierarchical framework.

An overview of the proposed system is shown in Fig. 11. Some techniques that use a single classifier instead of an ensemble of classifiers have also been used successfully for Arabic text recognition; these include Artificial Neural Networks and Hidden Markov Models.

5.2. Artificial Neural Network (ANN)

Artificial Neural Networks (ANNs) have been used extensively in applications such as speech recognition, digit recognition, and object detection. Figure 12 (Pasero and Mesin, 2010) shows a schematic representation of an Artificial Neural Network. The fundamental working unit of an ANN is the neuron. It takes multiple inputs and generates an output a, which is a weighted combination of all the inputs. This output a is fed into a transfer function f to produce y. An ANN is a layered collection of several such neurons, as shown in Fig. 12. The weights of all the neurons are adjusted iteratively to minimize an objective such as the least squares error. This error is minimized using a gradient descent algorithm to find the optimal weights of the model. However, gradient descent algorithms often end up finding a local minimum rather than the global minimum. Therefore, different combinations of initial weights are tried, which translates to trying different starting points while exploring the hypothesis space.
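
The restart strategy can be sketched as follows, here with scikit-learn's MLPClassifier on a stand-in digits dataset; the layer size and the number of restarts are arbitrary choices, not those used in the chapter.

```python
# Sketch: restart gradient descent from different initial weights, keep the best net.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

best_net, best_acc = None, -1.0
for seed in range(5):  # each seed is a different starting point in weight space
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    net.fit(X_tr, y_tr)
    acc = net.score(X_va, y_va)
    if acc > best_acc:
        best_net, best_acc = net, acc
print(best_acc)
```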

Intuitively, Neural Networks change the feature representation at each level. This can be explained in terms of a complex and high-dimensional target concept that the learner is seeking. When features are extracted, they hold some meaning in terms of a property of the image, such as a structural feature or a global feature.




Fig. 12. Artificial Neural Network.

However, it is very likely that the sought-after target concept is a complex function of these features and may not be of the same dimensionality as the feature vector. Therefore, each layer of a Neural Network takes a simple representation of the data point and converts it into a more complex and abstract representation. Porwal et al. (2012) proposed a method using Deep Belief Networks (DBNs) inspired by this reasoning.

5.3. Deep Belief Networks (DBNs)

The key benefit of DBNs, which are probabilistic generative models, is to enhance the feature representation in order to better approximate the target function. The primary hypothesis in the work of Porwal et al. (2012) is that a target function that can classify all the characters or words present in the lexicon will be a very high-dimensional and complex function. To approximate such a function, the hypothesis space explored by the learner must also be rich enough. Using Deep Belief Networks (DBNs), the simple features provided to the first layer can be mapped into a more complex and abstract representation of the data. The working of DBNs is motivated by the way humans analyze and identify objects in the real world, starting from higher-level distinctions and then breaking them down into finer details. Likewise, a DBN takes features from the lowest pixel level and forms a distributed representation that can discriminate at higher levels such as edges, contours, and eventually words of the handwritten text.

Any input vector x can be represented in different forms (Bengio, 2009). For instance, an integer i ∈ {1, 2, . . . , N} can be represented as a vector r(i) of N bits in which the bit at the ith position is 1 and the rest are 0. This is called a local representation. Likewise, it can also be given a distributed representation, in which the vector r(i) has M bits with M = log₂ N. It is to be noted that the distributed representation is very compact and can be exponentially more compact than the local representation. Therefore, for any input vector x, r(x) is an M-way classification which partitions the x space into M regions. These different partitions can be combined to form different configurations of r(x).



Fig. 13. Graphical model of Deep Belief Networks with one input layer and three hidden layers (left) and random samples of Arabic PAWs generated from the trained DBN.

Therefore, the distributed representation is a more compact way of keeping all the information of an input feature intact. This representation is employed in deep architectures at multiple levels, where higher levels are more abstract and can represent more complex forms of an input feature vector.
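
A small worked example of the two representations for an integer i ∈ {0, ..., N−1} is given below; it is a toy illustration, not tied to the chapter's features.

```python
# Local (one-hot) versus distributed (binary code) representation of an integer.
import numpy as np

def local_representation(i, N):
    r = np.zeros(N, dtype=int)      # N bits, exactly one of them set
    r[i] = 1
    return r

def distributed_representation(i, N):
    M = int(np.ceil(np.log2(N)))    # only about log2(N) bits are needed
    return np.array([(i >> k) & 1 for k in range(M)], dtype=int)

print(local_representation(5, 8))        # [0 0 0 0 0 1 0 0]
print(distributed_representation(5, 8))  # [1 0 1]  (binary 101, least significant bit first)
```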

We have seen that structural learning can be used for feature enhancement and that it is possible to increase the dimensionality of the feature space. The goal of structural learning is to get closer to the space of the target concept irrespective of the dimensionality of the features or the target concept. However, this is only helpful if the classes are separable in the higher manifold. Converging on the target space may not be sufficient, since the class regions may overlap and not be easily separable. Structural learning helps approach the space of the target concept but does not guarantee good discriminability. The information gained might be sufficient for purposes such as reconstruction or compression, but if the classes are not separable in that manifold, then the learning algorithm will not be able to discriminate between the member classes. DBNs can be used in such settings, where the feature representation can be changed in each layer such that discrimination occurs in a space where the classes are well partitioned.

A Deep Belief Network (DBN) with one input layer and three hidden layers is shown in Fig. 13. The model is generative; the nodes in consecutive layers are fully connected, but there are no connections between nodes in the same layer. All the nodes in a DBN are stochastic latent variables. The connection between the top two layers of the DBN is undirected, while the rest of the connections are directed. If a DBN has L layers, then the joint probability distribution of the generative model can be written as

\[
P(x, h^1, \ldots, h^L) = P(h^{L-1}, h^L) \left( \prod_{k=1}^{L-2} P(h^k \mid h^{k+1}) \right) P(x \mid h^1). \tag{13}
\]

As can be observed from Fig. 13, the training of the directed layers (i.e., the bottom three layers) is difficult: it is intractable to obtain p(h^1 | x) (or p(h^2 | h^1)) since, once x is observed, all the nodes in layer h^1 become correlated. However, it is relatively easy to train the undirected layer, and therefore a Restricted Boltzmann Machine (RBM) is used for layer-wise pre-training.

In the pre-training phase, an RBM is used to obtain the parameters of the first hidden layer from the input feature vector. Using the output of the first layer, the parameters of the second layer are calculated with an RBM again. An unsupervised learning algorithm can unravel the salient features of the input data distribution.



Fig. 14. Graphical model of restricted Boltzmann machines.

Once the complete network is trained, the hidden layers constitute the distributed representation of the data, and each layer is more abstract and represents higher-level objects than the ones below it. Afterwards, all the parameters of the DBN are fine-tuned according to the objective criteria (Hinton and Salakhutdinov, 2006). Figure 13 illustrates samples of Arabic parts of words (PAWs) generated from a trained DBN.
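
A minimal sketch of this greedy layer-wise pre-training, using scikit-learn's BernoulliRBM as the building block, is shown below; the dataset, layer sizes, and learning rate are placeholder choices rather than the values used in the chapter.

```python
# Sketch: stack RBMs so the output of each trained layer feeds the next one.
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

X, _ = load_digits(return_X_y=True)
X = minmax_scale(X)                 # BernoulliRBM expects values in [0, 1]

layer_sizes = [256, 128, 64]
layers, h = [], X
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=10, random_state=0)
    h = rbm.fit_transform(h)        # output of one layer is the input to the next
    layers.append(rbm)
# 'h' is now the top-level distributed representation; a supervised
# fine-tuning stage would adjust all parameters jointly.
```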

5.3.1. Restricted Boltzmann machines (RBMs)

RBMs are used in the layer-wise pre-training of DBNs to estimate the parameters of each hidden layer using the layer below it. The graphical model for RBMs is shown in Fig. 14, where all units within a layer are independent of each other. The only interaction is between the hidden layer and the observed layer.

The joint probability of the binary hidden and binary observed variables can be defined through the energy function as

\[
P(X = x, H = h) \propto e^{x^\top b + h^\top c + h^\top W x}. \tag{14}
\]

Inference in the case of RBMs is exact because the explaining-away phenomenon is eliminated. The conditional probability of an observed variable given the hidden variables can be written as

\[
P(X_j = 1 \mid H = h) = \mathrm{sigmoid}\left( b_j + \sum_i h_i W_{ij} \right). \tag{15}
\]

Likewise, the conditional probability of a hidden variable given the observed variables can be written as

\[
Q(H_i = 1 \mid X = x) = \mathrm{sigmoid}\left( c_i + \sum_j W_{ij} x_j \right). \tag{16}
\]

During training, RBMs approximate the gradient of the log-likelihood, and the parameter gradients can be calculated as

\[
\frac{\partial \log P(x; \theta)}{\partial W} = \mathbb{E}_{P_{\mathrm{data}}}\!\left[ x h^\top \right] - \mathbb{E}_{P_{\mathrm{model}}}\!\left[ x h^\top \right], \tag{17}
\]
\[
\frac{\partial \log P(x; \theta)}{\partial b} = \mathbb{E}_{P_{\mathrm{data}}}[x] - \mathbb{E}_{P_{\mathrm{model}}}[x], \tag{18}
\]
\[
\frac{\partial \log P(x; \theta)}{\partial c} = \mathbb{E}_{P_{\mathrm{data}}}[h] - \mathbb{E}_{P_{\mathrm{model}}}[h]. \tag{19}
\]



Fig. 15. Graphical representation of HMM.

Here, the expectation under the data distribution can be calculated easily, while the expectation under the model is approximated with k-step Contrastive Divergence (CD-k), in which data points are sampled using Gibbs sampling (Bengio, 2009).
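
The update can be sketched in a few lines of NumPy; this is a generic CD-1 step following Eqs. (15)-(19) (with W stored as hidden x visible), not the authors' implementation, and the learning rate is a placeholder.

```python
# Sketch of one CD-1 parameter update for a binary RBM.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr=0.01):
    # Positive phase: expectations under the data distribution, Eq. (16).
    h_prob = sigmoid(c + X @ W.T)
    h_samp = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step approximates the model expectation, Eq. (15).
    x_recon = sigmoid(b + h_samp @ W)
    h_recon = sigmoid(c + x_recon @ W.T)
    # Gradient estimates, Eqs. (17)-(19), averaged over the mini-batch.
    n = X.shape[0]
    W += lr * (h_prob.T @ X - h_recon.T @ x_recon) / n
    b += lr * (X - x_recon).mean(axis=0)
    c += lr * (h_prob - h_recon).mean(axis=0)
    return W, b, c
```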

5.4. Hidden Markov Models (HMM)

Hidden Markov Models (HMMs) have been used extensively for handwritten text recognition. Figure 15 shows a generic graphical representation of an HMM, where the X are hidden states and the O are the observed variables. It is based on the Markov property that any state is generated from the last few states (one in this case); the figure therefore represents a first-order HMM. In an HMM, each observation is generated by some state and the observations are independent of each other. Any HMM can be defined by five parameters, (N, M, A, B, π), where N is the number of hidden states. This parameter is selected empirically and is usually based on the application and the data. M is the number of observation symbols for each hidden state, i.e., the size of the observation alphabet. A, B, and π are learned at the time of training. Here, A is the state transition probability distribution and B is the observation symbol probability distribution within a state, which captures how each state models the observed symbols. Lastly, π is the initial state probability distribution.

In a lexicon-based approach using HMMs, a separate HMM is learned for each word in the lexicon. The features used in training are extracted from the test word image, and every HMM generates a score; the test word is identified as the word whose model generates the highest confidence. This is an instance of the evaluation problem in HMMs, where, given an observation sequence O and a model λ, the probability of the observation being generated by the model, i.e., P(O|λ), is computed. Rabiner (1990) provides a good tutorial on the use of HMMs for recognition applications.
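
A minimal sketch of this lexicon-driven evaluation step, assuming feature sequences have already been extracted and using hmmlearn's GaussianHMM, might look as follows; all names and the number of states are hypothetical.

```python
# Sketch: one HMM per lexicon word; pick the word whose model scores the test sequence highest.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(training_sequences, n_states=6):
    # training_sequences: {word: list of (T_i x d) feature arrays}
    models = {}
    for word, seqs in training_sequences.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        models[word] = GaussianHMM(n_components=n_states, n_iter=20).fit(X, lengths)
    return models

def recognize(test_sequence, models):
    # Evaluation problem: choose the word whose model gives the highest log P(O | lambda).
    return max(models, key=lambda w: models[w].score(test_sequence))
```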

In Chen et al. (1995), a variable-duration Hidden Markov model based approach was proposed for handwritten word recognition. In this model, the number of states is equal to the number of characters, thereby restricting the number of states. Discrete state duration probabilities were proposed as an additional element of the model. So, in addition to (N, M, A, B, π), the model also defines D and Γ, with D = P(d | qᵢ), where qᵢ is the state corresponding to a letter and d is the number of segments per character; the maximum number of durations per state was 4. The following equations summarize the calculation of the model parameters:



\[
\pi_i = \frac{\text{number of words beginning with } \varphi(q_i)}{\text{total number of words in the dictionary}}, \tag{20}
\]
\[
A_{ij} = \frac{\text{number of transitions from } \varphi(q_i) \text{ to } \varphi(q_j)}{\text{number of transitions from } \varphi(q_i)}, \tag{21}
\]
\[
\Gamma_j = \frac{\text{number of words ending with } \varphi(q_j)}{\text{total number of words in the dictionary}}, \tag{22}
\]
\[
P(d \mid q_i) = \frac{\text{number of times that } \varphi(q_i) \text{ is split into } d \text{ parts}}{\text{total number of times that } \varphi(q_i) \text{ appears}}. \tag{23}
\]

Here, the function ϕ maps a state to its character.
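
As a toy illustration of these counting estimates, the snippet below computes Eqs. (20)-(22) from a made-up three-word dictionary; Eq. (23) would additionally require per-character segmentation counts, which are not modeled here.

```python
# Toy counting estimates for the initial, transition, and final probabilities.
from collections import Counter

dictionary = ["cat", "car", "bat"]   # placeholder dictionary

# Eq. (20): initial state probabilities from the first letter of each word.
first = Counter(w[0] for w in dictionary)
pi = {ch: n / len(dictionary) for ch, n in first.items()}

# Eq. (21): transition probabilities from letter bigram counts.
bigrams = Counter((a, b) for w in dictionary for a, b in zip(w, w[1:]))
outgoing = Counter()
for (a, _b), n in bigrams.items():
    outgoing[a] += n
A = {(a, b): n / outgoing[a] for (a, b), n in bigrams.items()}

# Eq. (22): final state probabilities from the last letter of each word.
last = Counter(w[-1] for w in dictionary)
gamma = {ch: n / len(dictionary) for ch, n in last.items()}

print(pi)     # {'c': 0.67, 'b': 0.33} approximately
print(A)      # e.g. P('a' -> 't') = 2/3, P('a' -> 'r') = 1/3
print(gamma)  # {'t': 0.67, 'r': 0.33} approximately
```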

Kim and Govindaraju (1997) used an over-segmentation approach for cursive handwritten words, together with the notion of variable duration obtained from segmentation statistics, to maximize the efficiency of lexicon-driven approaches for word recognition. The variable duration probability is used to determine the size of the matching window and thereby improve accuracy. Character confusions were reduced by matching within a window whose size is controlled by the segmentation statistics, the number of characters in a lexicon entry, and the number of segments of the word image.

An HMM can emit output either from transitions (a Mealy machine) or from states (a Moore machine). Xue and Govindaraju (2006) describe the combination of discrete symbols and continuous attributes as structural features for cursive handwritten text recognition, which are then modeled using state-emitting and transition-emitting HMMs. This approach works very well on lexicons with sizes in the hundreds. In all of these HMM-based approaches, the Viterbi algorithm or some variation of it is used for decoding.

In a lexicon-free approach using HMMs, each state models one character and the observations are the segmented characters from the test image. This is an instance of the decoding problem of HMMs: given the observation sequence and a model, compute the sequence of states most likely to have generated the observation sequence. Since the states are characters, a word is generated by computing the optimal character sequence using dynamic programming. Although this approach does not need a lexicon, it is prone to errors because any state can transition to any other state; hence the chances of error are higher than in a lexicon-based approach. Another limitation of this approach is that it needs characters as observations, which is a very difficult requirement in the context of Arabic text recognition, since segmenting text into characters is hard.
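
The dynamic-programming step is the standard Viterbi recursion; a generic sketch is given below, with placeholder probability tables rather than a full recognizer.

```python
# Sketch of Viterbi decoding: states are characters, obs are segmented-character symbols.
import numpy as np

def viterbi(obs, pi, A, B):
    # pi: (S,) initial probs, A: (S, S) transition probs, B: (S, V) emission probs,
    # obs: sequence of observation symbol indices (all probabilities assumed nonzero).
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # best predecessor for each state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack the most likely character (state) sequence.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```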

6. Conclusion

In this chapter, the task of handwritten Arabic text recognition is formulated as a learning problem for a machine learning algorithm. The fundamental challenges faced by learning algorithms in general, and the additional challenges posed by the domain of recognition of handwritten Arabic script in particular, are outlined. A discussion of the learning paradigms, features, and classification methods that have been used for handwritten Arabic text recognition is presented to illustrate the potential for machine learning approaches in tackling the challenging problem of unconstrained handwritten text recognition.



References

Abney, S., 2002. Bootstrapping. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 360–367.
Ando, R.K., Zhang, T., 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853. <http://dl.acm.org/citation.cfm?id=1046920.1194905>.
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2 (1), 1–127.
Blitzer, J., McDonald, R., Pereira, F., 2006. Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 120–128. <http://dl.acm.org/citation.cfm?id=1610075.1610094>.
Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, COLT '98. ACM, New York, NY, USA, pp. 92–100. http://doi.acm.org/10.1145/279943.279962.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27. <http://www.csie.ntu.edu.tw/cjlin/libsvm>.
Chen, M.-Y., Kundu, A., Srihari, S.N., 1995. Variable duration hidden Markov model and morphological segmentation for handwritten word recognition. IEEE Trans. Image Process. 4 (12), 1675–1688.
Clark, S., Curran, J.R., Osborne, M., 2003. Bootstrapping POS taggers using unlabelled data. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, CONLL '03, vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 49–55. http://dx.doi.org/10.3115/1119176.1119183.
Dietterich, T.G., 2000. Ensemble methods in machine learning. In: Multiple Classifier Systems. Springer, pp. 1–15.
Favata, J.T., Srikantan, G., Srihari, S., 1994. Handprinted character/digit recognition using a multiple feature/resolution philosophy. In: International Workshop on Frontiers in Handwriting Recognition, pp. 47–56.
Hinton, G., Salakhutdinov, R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504–507.
Hollerbach, J.M., 1981. An oscillation theory of handwriting. Biol. Cybern. 39 (2), 139–156.
Humphreys, G.W., Evett, L.J., Quinlan, P.T., 1990. Orthographic processing in visual word identification. Cognitive Psychol. 22 (4), 517–560. <http://view.ncbi.nlm.nih.gov/pubmed/2253455>.
Applied Media Analysis, Inc., 2007. Arabic dataset 1.0. Downloaded from <http://appliedmediaanalysis.com/Datasets.htm>.
Khoussainov, R., He, A., Kushmerick, N., 2005. Ensembles of biased classifiers. In: Proceedings of the 22nd International Conference on Machine Learning, ICML '05. ACM, New York, NY, USA, pp. 425–432. http://doi.acm.org/10.1145/1102351.1102405.
Kim, G., Govindaraju, V., 1997. A lexicon driven approach to handwritten word recognition for real-time applications. IEEE Trans. Pattern Anal. Mach. Intell. 19, 366–379.
Li, S.Z., 1995. Markov Random Field Modeling in Computer Vision. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Liu, C.-L., Nakashima, K., Sako, H., Fujisawa, H., 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recogn. 36 (10), 2271–2285.
Lorigo, L.M., Govindaraju, V., 2006. Offline Arabic handwriting recognition: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 28 (5), 712–724. http://dx.doi.org/10.1109/TPAMI.2006.102.
Madhvanath, S., Govindaraju, V., 2001. The role of holistic paradigms in handwritten word recognition. IEEE Trans. Pattern Anal. Mach. Intell. 23 (2), 149–164. http://dx.doi.org/10.1109/34.908966.
Mitchell, T.M., 1997. Machine Learning. McGraw Hill Series in Computer Science, McGraw-Hill.
Nigam, K., Ghani, R., 2000a. Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management, CIKM '00. ACM, New York, NY, USA, pp. 86–93. http://doi.acm.org/10.1145/354756.354805.
Nigam, K., Ghani, R., 2000b. Understanding the behavior of co-training. In: Proceedings of KDD-2000 Workshop on Text Mining.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), 1345–1359.



Pasero, E., Mesin, L., 2010. Artificial neural networks for pollution forecast. InTech. <http://www.intechopen.com/books/air-pollution/artificial-neural-networks-for-pollution-forecast>.
Porwal, U., Rajan, S., Govindaraju, V., 2012. An oracle-based co-training framework for writer identification in offline handwriting. In: Document Recognition and Retrieval.
Porwal, U., Ramaiah, C., Shivram, A., Govindaraju, V., 2012. Structural learning for writer identification in offline handwriting. In: International Conference on Frontiers in Handwriting Recognition, pp. 415–420.
Porwal, U., Shivram, A., Ramaiah, C., Govindaraju, V., 2012. Ensemble of biased learners for offline Arabic handwriting recognition. In: Document Analysis Systems, pp. 322–326.
Porwal, U., Zhou, Y., Govindaraju, V., 2012. Handwritten Arabic text recognition using deep belief networks. In: International Conference on Pattern Recognition.
Puurula, A., Compernolle, D., 2010. Dual stream speech recognition using articulatory syllable models. Int. J. Speech Technol. 13 (4), 219–230. http://dx.doi.org/10.1007/s10772-010-9080-2.
Rabiner, L.R., 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In: Waibel, A., Lee, K.-F. (Eds.), Readings in Speech Recognition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 267–296. <http://dl.acm.org/citation.cfm?id=108235.108253>.
Ramaiah, C., Porwal, U., Govindaraju, V., 2012. Accent detection in handwriting based on writing styles. In: Document Analysis Systems, pp. 312–316.
Shivram, A., Ramaiah, C., Porwal, U., Govindaraju, V., 2012. Modeling writing styles for online writer identification: a hierarchical Bayesian approach. In: International Conference on Frontiers in Handwriting Recognition, pp. 385–390.
Trier, O.D., Jain, A.K., Taxt, T., 1996. Feature extraction methods for character recognition—a survey. Pattern Recogn. 29 (4), 641–662.
Valiant, L.G., 1984. A theory of the learnable. Commun. ACM 27 (11), 1134–1142. http://doi.acm.org/10.1145/1968.1972.
Wang, W., Huang, Z., Harper, M., 2007. Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 4, pp. IV-137–IV-140.
Wang, Y., Ji, Q., 2005. A dynamic conditional random field model for object segmentation in image sequences. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1. IEEE Computer Society, Washington, DC, USA, pp. 264–270. http://dx.doi.org/10.1109/CVPR.2005.26.
Wheeler, D., 1970. Processes in word recognition. Cognit. Psychol. 1 (1), 59–85. http://dx.doi.org/10.1016/0010-0285(70)90005-8.
Xue, H., Govindaraju, V., 2006. Hidden Markov models combining discrete symbols and continuous attributes in handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 28 (3), 458–462.