
Expert Systems with Applications 38 (2011) 3499–3513

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

An associative memory-based learning model with an efficient hardware implementation in FPGA

Ali Ahmadi a,⇑, Hans Jürgen Mattausch b, M. Anwarul Abedin d, Mahmoud Saeidi c, Tetsushi Koide b

a Electrical and Computer College, Khajeh-Nasir University of Technology, Shariati St., Tehran, Iran
b Research Institute for Nanodevice & Bio System (RNBS), Hiroshima University, Higashi-Hiroshima, Japan
c Education & Research Institute for Information & Communication Technologies, Tehran, Iran
d Department of Electrical & Electronic Engineering, Dhaka University of Engineering & Technology, Bangladesh

Article info

Keywords:
FPGA
Handwritten characters
Hardware prototyping
Learning model
Machine learning

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.08.138

⇑ Corresponding author. Tel.: +98 21 22361217. E-mail address: [email protected] (A. Ahmadi).

Abstract

In this paper we propose a learning model based on a short- and long-term memory and a ranking mechanism which manages the transition of reference vectors between the two memories. Furthermore, an optimization algorithm is used to continuously adjust the reference vector components as well as their distribution. Compared to other learning models such as neural networks, the main advantage of the proposed model is that a pre-training phase is unnecessary and it has a hardware-friendly structure which makes it implementable by an efficient LSI architecture without requiring a large amount of resources. A prototype system is implemented on an FPGA platform and tested with real data of handwritten and printed English characters, delivering satisfactory classification results.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The learning capability of artificial systems has attracted a great amount of attention in recent years, and many different approaches have been developed for improving machine learning in the literature (Dietterich, 1997, 2000; Mitchell, 1997). For a learning model, one of the main issues is the feasibility of learning online in a real-time application with an incremental learning function. However, in most of the currently existing models a pre-training process is necessary. For example, in the case of neural networks, a well-known connectionist learning model, it is known that in most practical cases the system can only be trained properly when the entire dataset is made available to the network in a prior training procedure (Harris, 1991; Haykin, 1999). The next issue in a learning model is a hardware-friendly structure. Specialized hardware for a learning system offers practical advantages such as higher speed in processing repetitive calculations, lower cost by lowering the total component count, and increased reliability in the sense of a reduced probability of equipment failure. In this paper, we first review the learning methods applied in the field of character recognition and investigate the advantages and disadvantages of the main methods. Then we propose a novel learning model which is capable of learning from input samples constantly and adjusting the reference pattern set whenever necessary. The underlying concept of the learning algorithm is based on a short- and long-term memory and a ranking mechanism which manages the inclusion and elimination of reference patterns as well as their transition between the two memories.



Also, the reference vector magnitudes as well as their distribution are adjusted continuously by means of an optimization algorithm. The main advantage of the proposed algorithm compared to other learning methods is that it combines real-time learning with a hardware-friendly structure which can be easily implemented in lower-level hardware as an LSI architecture without requiring a large amount of hardware resources.

In order to enhance the pattern-matching speed, the classification process is designed on the basis of a parallel associative memory. The associative memory prototype we use here has already been designed (Abedin, Tanaka, Ahmadi, Koide, & Mattausch, 2007; Mattausch, Gyohten, Soda, & Koide, 2002; Yano, Koide, & Mattausch, 2002) and has a mixed digital-analog fully parallel architecture for nearest Hamming/Manhattan-distance search. The proposed model was successfully implemented on an FPGA platform of the Altera Stratix family. The system has a pipelined structure at the level of both the processing blocks and the processing elements within each block, which provides a maximum capability for parallel processing. In order to evaluate the system performance, it was used in the real application of character recognition. A number of handwritten data samples were used for testing the system, and the results confirmed the efficiency of the classification, with a high pattern-matching speed as well as the learning functionality.

The organization of this paper is as follows: in Section 2 we give the state of the art of learning methods in the OCR field. Section 3 contains the model description, including the core concept of the learning algorithm as well as detailed explanations about each block's performance in the different steps of preprocessing, classification, and learning.


Section 4 deals with the hardware implementation methods and the results of FPGA programming. Section 5 is concerned with the evaluation of the system performance based on experimental results from different aspects. In Section 6 we discuss the advantages and the limitations of the proposed model with a practical comparison with other existing models. Finally, Section 7 gives some concluding remarks.

2. State of the art

In the field of character recognition, as a real application of learning models, the main well-known methods applied for classification include statistical methods (e.g. the k-nearest-neighbor rule (Smith, 1994)), artificial neural networks (Liu & Gader, 2002) (including the multilayer perceptron (MLP), radial basis function network (RBF), and learning vector quantization (LVQ)), kernel methods (e.g. support vector machines (SVMs); Burges, 1998), stochastic models (e.g. hidden Markov models (HMMs); Saon, 1999), and multiple classifier combination (Rahman & Fairhurst, 2003).

Recognition accuracies have been improved significantly by the use of such models compared to the conventional methods of template selection and tuning, and some excellent results have already been reported (LeCun, Bottou, Bengio, & Haffner, 1998; Suen, Kiu, & Strathy, 1999). However, the problem is still far from solved, as recognition accuracies on either machine-printed characters in degraded images or freely handwritten characters are insufficient, and the existing learning methods do not work well on huge and ever-increasing sample data.

As Liu and Fujisawa (2005) have reported, the characteristics of the above classifiers can be investigated using three main metrics: classification accuracy, training complexity, and storage and execution complexity. A summary of this comparison is as follows.

When trained with enough samples, discriminative classifiers (i.e. classifiers based on minimum error training, including neural networks and SVMs) give higher accuracy than statistical classifiers. Among discriminative classifiers, SVMs have demonstrated superior classification accuracy to neural classifiers in many experiments.

In the sense of training complexity, discriminative classifiers do not support incremental training, and adding new classes or new samples requires re-training with all samples.

As for execution complexity, statistical classifiers are generally more expensive in storage consumption and execution than discriminative classifiers, since they need to adjust more computational parameters, and in some cases, like k-NN, all training samples need to be stored and compared each time, which makes them impractical for real-time applications. Among discriminative classifiers, neural networks use far fewer parameters than SVMs and therefore need less storage and computation. On the other hand, the performance of neural networks is sensitive to the size of the structure, while this is much less influential in the case of SVMs.

We will discuss the above issues again through experimental results in the next sections. Concerning HMMs, they are generally considered a well-suited method for the recognition of sequential ambiguous patterns and hence, in the OCR field, they are mainly used in word-level rather than character-level classification. Due to their complexity, they are generally not intended for hardware implementation.

Regarding the hardware implementation of learning methods, most of the works in the literature deal with neural network hardware implementations. Surveys of hardware implementations of neural networks can be found in Glesner and Pochmuller (1994), Lindsey and Lindblad (1994), and Heemskerk (1995), and a more recent one is given by Liao (2001).

In Liao (2001) the author reviews the major categories of neural network hardware architectures to date, including accelerator boards, neurocomputers built from general-purpose processors, and neurochips. Two well-known examples of neurohardware, CNAPS (McCartor, 1991) and SYNAPSE-1 (Ramacher et al., 1993), are described in detail and their advantages and disadvantages are discussed.

In addition, some real-world applications of neural network hardware, for instance in OCR and speech recognition, are reported. As an example, a powerful OCR tool, OCR-on-a-chip developed by Ligature Ltd, is introduced; it was embedded for the first time in WizCom's Quicktionary (WizCom Technologies Ltd), a hand-held pen-scanner for online translation of texts. Another example is an application of the ANNA chip (Boser, Sackinger, Bromley, LeCun, & Jackel, 1991) in a character recognition task (Sackinger, Boser, Bromley, LeCun, & Jackel, 1992). We will discuss these two products in Section 6. A list of commercial hardware realized for neural networks can be found at http://neuralnets.web.cern.ch/NeuralNets/nnwlnHepHard.html.

As Moerland and Fiesler (1996) have remarked, the key problems for all hardware realizations of neural networks are the inaccuracy and imperfections of the hardware components. These range from quantization of the weight values and component-to-component variations to stuck-at faults of weights and neurons (Beiu, 1998; Hollis & Paulos, 1994).

The main motivation in this research has been to propose a new learning model which takes advantage of the strengths of existing models and at the same time overcomes their shortcomings. In short, this can be described as a system offering high classification accuracy, real-time learning capability, and a hardware-friendly structure.

3. Model description

As described in Section 1, the main feature of the proposed model is a dynamic learning function which makes it useful in real-time applications like video detection, online text recognition, intelligent systems, etc. With its hardware-friendly structure, the model can be easily implemented on different hardware platforms. We show that by using parallelism techniques such as a pipelined structure, the system performance can be sped up significantly. When compared with other learning models like neural networks, the main advantage of the proposed model is that a pre-training phase is unnecessary and the model has a hardware-friendly structure. The core part of the model is the learning procedure, but the model includes two further blocks, preprocessing and classification, prior to the learning. More explanations of the performance of each block are given in the following.

3.1. Learning procedure

The core concept of learning in the model is based on a short/long-term memory which is very similar to the memorizing procedure in the human brain. For this, the memorized reference patterns are classified into two areas according to their learning ranks. One is a short-term storage area where new information is temporarily memorized, and the other is a long-term storage area where a reference pattern can be memorized for a longer time without receiving the direct influence of incoming input patterns. The transition of reference patterns between the short-term and long-term storage is carried out by means of a ranking algorithm. Besides, an optimization algorithm is applied to update the reference patterns and optimize their distribution as well as the threshold values used for classification and ranking. The flowchart of Fig. 1 shows an outline of the learning procedure.
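To make the organization described above concrete, the following minimal Python sketch lays out the data the learning procedure operates on. The class and field names are our own illustrative choices and the Nth default is a placeholder; the border rank s_rank = 180, the rank-jump values, and the initial distance threshold of 10 are the settings reported later in Section 5.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReferencePattern:
    image: List[int]                 # 256-bit binarized 16 x 16 character bitmap
    features: List[float]            # normalized moment-based feature vector
    d_th: float = 10.0               # local distance threshold (initial value used in Section 5)
    ref_mean: List[float] = field(default_factory=list)   # running mean of matched inputs
    dth_mean: float = 10.0           # running mean of winner-input distances
    n: int = 0                       # number of inputs matched so far

@dataclass
class LearningMemory:
    ranking: List[int] = field(default_factory=list)  # index = rank, content = reference address
    s_rank: int = 180                # border rank between short-term (below) and long-term (above)
    j_short: int = 3                 # rank jump for a short-term winner (5 for a reliable match)
    j_long: int = 5                  # rank jump for a long-term winner (7 for a reliable match)
    n_th: int = 8                    # inputs per mean-commit in the optimization block (placeholder)
```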


Fig. 1. Flowchart of the learning procedure. C is a constant value selected experimentally based on the data condition.


As can be seen from the flowchart, by performing a nearest-matching over the reference pattern memory we obtain the winner address and the winner-input distance (Dw-i). The winner-input distance is then compared with a local threshold value corresponding to each reference pattern to decide whether the classification is acceptable or the input pattern should be considered as a new reference pattern. If the winner-input distance is lower than the threshold Dth and there is also a reasonable margin between the winner and the nearest loser, the classification is considered reliable and we give a high rank-jump to the winner pattern in the ranking memory; otherwise the winner gets a low rank-jump.

In summary, the purpose of the main blocks used in the learning procedure can be described as follows: the reliability-check block is used for improving the reliability of the classification. It takes the difference between the winner and the nearest loser to initialize the appropriate rank-jump values for the ranking process. The ranking block controls the lifetime and elimination time of reference patterns by managing their assigned ranks in a ranking memory. The optimization block is used for constantly adjusting the reference patterns and distance thresholds. For this, it stores the mean values of recent samples and the distance thresholds in two side memories. The distance thresholds have a key role in this learning procedure as they are used to decide whether a new reference pattern should be added. Details of the two main blocks, ranking and optimization, are described below.
• Ranking block: Each reference pattern in the system is given a "unique rank" showing the occurrence level of the pattern. A ranking memory is used for the ranks, where the rank is the index of the memory and the reference pattern's address (its address in the reference memory) is saved as the content. The rank is increased by a predefined jump value in case of a new occurrence (when a new input is matched with the current reference pattern), and reduced gradually when there is no matching occurrence and other reference patterns get higher ranks and shift up to higher positions. The reference patterns are classified into the long-term and short-term memory according to their rank, where a specific rank level s_rank is defined as the border between the short-term and long-term memory. If the rank of a pattern gets higher than s_rank, it enters the long-term memory, and conversely, if its rank drops below s_rank, it falls into the short-term memory. Fig. 2 shows the flowchart of the ranking process. As can be seen from the flowchart, if Dw-i < Dth (i.e. a known Ref pattern case) we first search for the existing rank of the winner in the ranking memory.
If the winner belongs to the short-term memory (rank < s_rank) the rank advancement is JS, and if the winner belongs to the long-term memory (rank > s_rank), the rank advancement becomes JL (JL > JS). Then the rank of each pattern between the old and the new winner rank is reduced by one (Fig. 3). The transition between short- and long-term memory happens through these changes in rank.
In the case of Dw-i ≥ Dth, the system considers the input and winner pattern to be different and takes the input pattern as a new reference pattern. The top rank of the short-term memory is given to this new reference pattern (if the long-term memory is not yet full, the lowest rank in the long-term memory is assigned instead), subsequently the rank of each of the other reference patterns in the short-term memory is moved down by one, and the reference pattern with the lowest rank is erased from the memory (Fig. 4). A software sketch of this ranking update is given after the Optimization block description below.


Fig. 2. Flowchart of the ranking process block.

Fig. 3. Rank advancement for a currently existing reference pattern.

Fig. 4. Giving the top rank of the short-term memory to a new input sample considered as a new reference pattern.


It is worth noting that, as shown in Figs. 3 and 4, two different memories are used in the ranking process. One is the memory for storing the reference patterns, which is the same as the main associative memory of the system, and the other is a ranking memory which is used for ranking the reference patterns and contains their addresses.

• Optimization block: This block is used for renewing the reference patterns continuously according to the input data variation. The main purpose is to improve the reliability of the classification. We take two main updating steps, covering the reference pattern magnitudes and the distance thresholds. Fig. 5 shows the flowchart of the optimization process used for the hardware implementation; a software sketch of this flow is given below. As can be seen from the flowchart, the first step is to decide whether a new reference pattern is generated or the nearest-matching pattern can be considered as the winner. The decision maker here is the local distance threshold Dth corresponding to each reference pattern, which is itself renewed during the optimization process. If the winner-input distance is greater than Dth, which is the case of generating a new reference pattern, initial values are given to the new Dth and Ref pattern. Also, the Dth(mean) and Ref(mean) memories as well as the input counter memory are set to initial values. The Dth(mean) and Ref(mean) memories are used later for updating the Dth and Ref patterns. In the case of a known reference pattern, i.e. the winner-input distance is smaller than or equal to Dth, we update the mean values of Dth and the Ref pattern magnitude (Dth(mean) and Ref(mean)) for the current winner. The updating process is explained in the following. If the counter number, i.e. the number of inputs already assigned to the current winner, is larger than a predefined threshold Nth, then the Dth and Ref memories are updated with the last mean values and the counter is reset to 1.
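A software sketch of the ranking update described in the Ranking block above. The ranking memory is modeled as a Python list in which the index is the rank and the content is a reference-pattern address, so popping and re-inserting reproduces the one-position shifts of Figs. 3 and 4; it builds on the LearningMemory sketch in Section 3.1, the +2 adjustment for a reliable match uses the JL/JS pairs reported in Section 5, and the handling of a not-yet-full memory follows our reading of the text rather than the Verilog design.

```python
def rank_known_winner(mem, winner_addr, reliable=False):
    """Advance the rank of a matched reference pattern (Fig. 3)."""
    i = mem.ranking.index(winner_addr)
    jump = mem.j_long if i >= mem.s_rank else mem.j_short
    if reliable:
        jump += 2                                # 7/5 instead of 5/3 for a reliable classification
    new_i = min(i + jump, len(mem.ranking) - 1)
    mem.ranking.pop(i)                           # every pattern between the old and the new rank
    mem.ranking.insert(new_i, winner_addr)       # drops by exactly one position

def rank_new_reference(mem, new_addr, capacity):
    """Insert a newly generated reference pattern into the ranking memory (Fig. 4)."""
    if len(mem.ranking) < capacity:
        # memory not yet full: occupy the lowest available rank of the long-term area
        mem.ranking.insert(min(mem.s_rank, len(mem.ranking)), new_addr)
        return
    mem.ranking.pop(0)                           # the lowest-ranked pattern is erased
    mem.ranking.insert(mem.s_rank - 1, new_addr) # top rank of the short-term area
```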
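Similarly, the decision flow of the Optimization block (Fig. 5) can be sketched as below, using the ReferencePattern structure from the earlier sketch; the input is treated as a single concatenated vector of image bits and features, the running-mean updates anticipate relations (3) and (5) of Section 3.1.1, and the commit-and-reset behaviour follows the flowchart description. This is a software illustration, not the implemented RTL.

```python
def optimization_step(refs, winner_idx, d_wi, x_vec, n_th=8):
    """One pass of the optimization block for an input vector x_vec whose nearest
    match is refs[winner_idx] at distance d_wi. Returns the index of the pattern
    the input ends up associated with."""
    w = refs[winner_idx]
    if d_wi > w.d_th:
        # unknown input: generate a new reference pattern with initial values
        new = ReferencePattern(image=list(x_vec[:256]), features=list(x_vec[256:]))
        new.ref_mean, new.dth_mean, new.n = list(x_vec), new.d_th, 1
        refs.append(new)
        return len(refs) - 1
    # known reference pattern: update the running means of Ref and Dth for the winner
    w.n += 1
    w.ref_mean = [m - (m - xi) / w.n for m, xi in zip(w.ref_mean, x_vec)]  # relation (3)
    w.dth_mean -= (w.dth_mean - d_wi) / w.n                                # relation (5)
    if w.n > n_th:
        # commit the running means into the working Ref vector and local threshold
        w.image, w.features = w.ref_mean[:256], w.ref_mean[256:]
        w.d_th = w.dth_mean
        w.n = 1
    return winner_idx
```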


Fig. 5. Flowchart of the optimization block. Ref, Ref(mean), Dth, Dth(mean), n, and Nth stand for the reference pattern magnitude, the mean of the last n inputs matched with a reference pattern, the threshold value for the winner-input distance, the mean of the last n winner-input distances, the number of matching occurrences for a reference pattern, and a predefined threshold value for n, respectively.

Fig. 6. (a) Using labels of preceding neighbor pixels (L1, L2, L3, L4) to identify the current pixel's label. (b) Equivalent labels stored in two buffers SLB1 and SLB2 (1 ≡ 2 and 2 ≡ 3).


3.1.1. Updating process
• Reference pattern magnitude: A reference pattern vector is a combination of a reference image and its feature vector, as described in the classification section. For each Ref vector we take the mean vector of the input patterns matched with it as

$$\mathrm{Ref} = \frac{1}{n}\sum_{i} x_i \qquad (1)$$

where $x_i$ is the ith input vector and Ref is the mean vector of the last n inputs assigned to the specific reference pattern. This mean vector is updated at each incoming input as follows

$$\mathrm{Ref}_n = \big((n-1)\,\mathrm{Ref}_{n-1} + x_n\big)/n \qquad (2)$$

where $\mathrm{Ref}_{n-1}$ is the previous value of the mean vector before input $x_n$ entered. In order to simplify the instructions for the hardware implementation, relation (2) can be written as

$$\mathrm{Ref}_n = \mathrm{Ref}_{n-1} - (\mathrm{Ref}_{n-1} - x_n)/n \qquad (3)$$

where the division by n is performed only when n is a multiple of 8, using a simple right-shift operator. The replacement of the reference vector with this updated mean vector is carried out after a specific number of incoming inputs (Nth).
• Distance thresholds: In addition to the reference patterns, the threshold values for the winner-input distance are updated in this block. For each reference vector we take a local distance threshold based on the distribution of the local data. Similarly to relation (1) we have

$$D_{th} = \frac{1}{n}\sum_{k} D^{k}_{wi} \qquad (4)$$

where $D^{k}_{wi}$ is the winner-input distance of the kth input sample and $D_{th}$ is the mean value of the last n values of $D_{wi}$. Similarly to (3), we use the following relation for updating this mean value.

$$D^{n}_{th} = D^{n-1}_{th} - \big(D^{n-1}_{th} - D^{n}_{wi}\big)/n \qquad (5)$$

It should be noted that in relation (5) $D^{n}_{wi}$ is not only the winner-input distance of the inputs falling inside the current Dth but also that of the input samples that are outside Dth but have been matched to the current reference pattern (even though they are subsequently treated as new reference patterns). This prevents the distance threshold Dth from becoming continuously smaller. The replacement of the threshold Dth with this updated mean value is carried out after a specific number of incoming inputs (Nth).
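The remark on relation (5) can be isolated in a small helper: the running threshold mean is fed with every winner-input distance, including those of inputs that fall outside the current threshold (and are therefore also added as new reference patterns), which is what keeps Dth from shrinking continuously. The FPGA realizes the division with a right shift when n is a multiple of 8; plain division is used in this hedged sketch.

```python
def update_dth_mean(dth_mean, d_wi, n):
    """Relation (5): D_th(n) = D_th(n-1) - (D_th(n-1) - D_wi(n)) / n.

    Called for every input whose nearest match is the given reference pattern,
    even when d_wi exceeds the current threshold; in hardware the division is
    approximated by a right shift (applied when n is a multiple of 8)."""
    return dth_mean - (dth_mean - d_wi) / n
```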

3.2. Preprocessing steps

Preprocessing steps are intended to read the input data and prepare it for the classification task. Since in the current work the proposed learning model is applied to a character recognition task, the preprocessing steps are designed to provide the necessary preparation of the character input data prior to the classification step. The preprocessing steps include the following blocks: reading, noise removal, binarizing, labeling, segmentation, feature extraction, and normalizing.
• Reading: Contains a reading device (line-scan sensor) moving along each line of the text with an appropriate speed, scanning the data continuously as a sequence of thin frames. The frames between each two word spaces are collected and form a larger frame, a gray-scale bitmap array which holds all the characters of the word.

• Noise removal: As most of the noise appearing in the texts is of the pepper & salt type, we apply a median filter with a neighborhood size of 3 × 3 for noise removal. Fig. 19 (Appendix A) shows the tree diagram we use for finding the median of each set of nine neighboring pixels (Smith, 1998). The method is hardware-friendly and can be implemented in a pipelined structure.

• Binarizing: In this block the input image frame is binarized to a simple black & white bitmap by taking a local threshold value extracted via a mean filter with a neighborhood size of 7 × 7. The structure of the mean filter is similar to that of the median filter explained above in terms of how the neighborhood pixels are registered.

• Labeling: The purpose of the labeling process is to identify the binary connected segments in the image by giving them distinct labels.

We scan all the input frame pixels sequentially while keeping the labels of the preceding neighboring pixels (four pixels' labels, Fig. 6(a)) in four registers, and decide through sequential conditions whether the current pixel belongs to one of the preceding labels or gets a new label. The labels of all image pixels are extracted in this way and saved in a label memory. When equivalent labels occur, that is, when different labels are used in the same connected segment, as in the case of character W (Fig. 7), the label equivalences are recorded in two buffers SLB1 and SLB2 (Fig. 6(b)). A detailed flowchart of the labeling process is shown in Fig. 20 (Appendix A); a software sketch of this pass is given after the segmentation step below.


Fig. 7. Labeling process for character W. (a) Binary image. (b) Labeled image using three equivalent labels.


• Segmentation: Once the labeling process has terminated, the image memory is scanned once again for the segmentation task. Fig. 8 depicts the algorithm used for segmentation.

We scan the labeled image N times, each time searching for label Li (i = 1, ..., N, where N is the total number of labels generated in the labeling process). The label read from memory is searched within a lookup table (SLBF), which has already been created from the data of the SLB1 and SLB2 buffers, and is replaced with the final equivalent label. During the scanning process, the addresses of pixels having label Li are written into a new segment memory (SM) to form the distinct segment Li. Next, the boundaries of segment Li are identified and a new 2-D segment vector with binary values 0 and 1 is generated based on the addresses in the SM memory. This is the output segment vector representing a distinct character in the image. The scanning process is repeated until all the distinct labeled segments have been distinguished and output from the system.
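The labeling pass described in the Labeling block above can be modeled compactly in software: each foreground pixel takes a label from its four previously scanned neighbours or a new one, conflicting labels are recorded as equivalences (the role of the SLB1/SLB2 buffers), and a second pass resolves every label to its final equivalent (the role of the SLBF lookup table). The code below is an illustrative sketch of that scheme, not the sequential-condition logic of Fig. 20.

```python
def label_components(img):
    """Two-pass connected-component labeling of a binary image (list of rows of 0/1)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}                              # equivalence table: label -> representative
    next_label = 1

    def find(l):
        while parent[l] != l:
            l = parent[l]
        return l

    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                continue
            # labels of the four previously scanned neighbours (west, NW, N, NE)
            neigh = []
            if x > 0 and labels[y][x - 1]:
                neigh.append(labels[y][x - 1])
            if y > 0:
                for dx in (-1, 0, 1):
                    if 0 <= x + dx < w and labels[y - 1][x + dx]:
                        neigh.append(labels[y - 1][x + dx])
            if not neigh:
                parent[next_label] = next_label          # open a new label
                labels[y][x] = next_label
                next_label += 1
            else:
                roots = {find(l) for l in neigh}
                keep = min(roots)
                labels[y][x] = keep
                for r in roots:                          # record label equivalences
                    parent[r] = keep

    for y in range(h):                                   # second pass: final equivalent labels
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels
```

The distinct segments can then be cut out by collecting, for each final label, the addresses of its pixels and their bounding box, exactly as the segmentation step above describes.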

• Feature extraction: In order to have a robust classification, some characteristic features of the input pattern are extracted and grouped in a feature vector. The features are selected so that they have minimum dependency on the size and variation of the data. Given a segmented (isolated) character, we extract its moment-based features as follows. The 2-D moment of the character is defined as

$$m_{pq} = \sum_{x=0}^{W-1}\sum_{y=0}^{H-1} (x-\mu_x)^{p}\,(y-\mu_y)^{q}\, f(x,y) \qquad (6)$$

where p and q are the moment orders in the x and y directions, and $\mu_x$ and $\mu_y$ are the mean values of x and y. To simplify the calculations we define vertical and horizontal 1-D moments as

$$m_x = \sum_{x=0}^{W-1} (x-\mu_x)^{k}\, f(x,c) \qquad (7)$$
$$m_y = \sum_{y=0}^{H-1} (y-\mu_y)^{k}\, f(r,y) \qquad (8)$$

where k is the moment order, and c and r are the specific column and row. f is taken as the pixel intensity, which for a binary image is 0 or 1. From the moments we compute the following features:
• Total mass (number of pixels in a binarized character).
• Centroid.
• Elliptical parameters:
  – Eccentricity (ratio of major to minor axis).
  – Orientation (angle of major axis).
• Skewness.

In principle, skewness is defined as the third standardized moment of a distribution,

$$\gamma = \mu_3/\sigma^{3} \qquad (9)$$

but to simplify the calculations we take the simpler measure of Karl Pearson (Hildebrand, 1986), defined as

$$\gamma = 3\,(\mathrm{mean} - \mathrm{median})/\mathrm{standard\ deviation} \qquad (10)$$

and we calculate the horizontal and vertical skewness separately.
All of the above six features, i.e. total mass, centroid, eccentricity, orientation, horizontal skewness, and vertical skewness, are then normalized as described in the next item to generate the feature vector (a software sketch of the feature computation is given after this list).

• Normalizing: Each segmented character as well as its feature vector is normalized prior to the classification. The segmented character bitmap is rescaled to 16 × 16 pixels using a bilinear interpolation technique. As for the feature vector, each feature value is normalized using the minimum and maximum of that feature in the memory, and thus takes a value between 0 and 1.
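For concreteness, the sketch below computes the six features listed above for a binary character bitmap. The eccentricity and orientation formulas are the standard second-central-moment expressions for the elliptical parameters, used here because the paper does not spell them out, and the skewness follows relation (10); the subsequent min-max normalization against the stored extrema is omitted.

```python
import math

def character_features(img):
    """Moment-based features of a binary character bitmap (list of rows of 0/1):
    total mass, centroid, eccentricity, orientation, horizontal and vertical skewness."""
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    if not pts:
        return [0.0] * 7
    mass = len(pts)
    mu_x = sum(x for x, _ in pts) / mass
    mu_y = sum(y for _, y in pts) / mass

    # central second moments -> elliptical parameters (standard expressions,
    # assumed here since the paper does not give them explicitly)
    m20 = sum((x - mu_x) ** 2 for x, _ in pts) / mass
    m02 = sum((y - mu_y) ** 2 for _, y in pts) / mass
    m11 = sum((x - mu_x) * (y - mu_y) for x, y in pts) / mass
    common = math.sqrt((m20 - m02) ** 2 + 4 * m11 ** 2)
    lam_major = (m20 + m02 + common) / 2
    lam_minor = (m20 + m02 - common) / 2
    eccentricity = math.sqrt(lam_major / lam_minor) if lam_minor > 0 else 0.0
    orientation = 0.5 * math.atan2(2 * m11, m20 - m02)   # angle of the major axis

    def pearson_skew(values):
        # relation (10): 3 * (mean - median) / standard deviation
        n = len(values)
        mean = sum(values) / n
        median = sorted(values)[n // 2]
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        return 3 * (mean - median) / std if std > 0 else 0.0

    skew_h = pearson_skew([x for x, _ in pts])
    skew_v = pearson_skew([y for _, y in pts])
    return [mass, mu_x, mu_y, eccentricity, orientation, skew_h, skew_v]
```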

3.3. Classification block

The classification task is carried out using a nearest-matching method with a hybrid distance measure: a Hamming distance DH(image) for the main image vector, which comes as a 256-bit vector after size normalization, and a Euclidean distance DE(feature) for the feature vector. Because of the different magnitudes of the two distance measures, the values have to be weighted. We use weighting factors as follows.

$$D = \delta\, D_H(\mathrm{image}) + \lambda\, D_E(\mathrm{feature}) \qquad (11)$$

where in our experiments we found δ = 0.25 and λ = 1 to be the most effective values.

Accordingly, the reference pattern which gives the minimum matching distance D is considered the winner. The details of the hardware design for the classification block are described in Section 4. An associative memory (Mattausch et al., 2002) is intended to be used for fully parallel pattern matching in order to enhance the system search time even further.
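A minimal software model of the nearest-matching step with the hybrid distance of relation (11); the reference set is a plain list here, whereas the actual system performs the search in parallel over an associative memory, and the weights default to the δ = 0.25 and λ = 1 reported above.

```python
import math

def classify(input_bits, input_features, references, delta=0.25, lam=1.0):
    """Nearest matching with D = delta * D_H(image) + lambda * D_E(feature).

    input_bits     : 256-element 0/1 list (normalized 16 x 16 bitmap)
    input_features : normalized feature vector
    references     : list of (ref_bits, ref_features) pairs
    Returns (winner_index, winner_distance)."""
    best_idx, best_d = -1, float("inf")
    for idx, (ref_bits, ref_feat) in enumerate(references):
        d_h = sum(a != b for a, b in zip(input_bits, ref_bits))                       # Hamming
        d_e = math.sqrt(sum((a - b) ** 2 for a, b in zip(input_features, ref_feat)))  # Euclidean
        d = delta * d_h + lam * d_e
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d
```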

4. Hardware implementation

The proposed learning algorithm is intended to be implementable at the lower level of the hardware structure with higher speed and minimum resources. As described earlier in this paper, the main motivations for a hardware implementation are to speed up the system performance by using parallelism techniques, to reduce the system cost by lowering the total component count, especially in the case of a specific application, and finally to improve the system reliability in the sense of a reduced probability of device failure.


Fig. 8. Flowchart of the segmentation procedure. SLBF is a lookup table indicating the final label value according to the equivalent-label buffers SLB1 and SLB2, and SM is a temporary memory for saving the addresses of pixels with label Li.

Fig. 9. Block diagram of the system implemented in the FPGA.


As for prototyping of the whole model, including the preprocessing and classification blocks, we have chosen an FPGA platform. Fig. 9 shows the block diagram of the system implemented in the FPGA.


Fig. 10. Simplified RTL block diagram of the learning block implemented in the FPGA.

Table 1
Route & placement results for the learning block implemented in an Altera Stratix FPGA.

Fitter status: Successful
Quartus II version: 4.2 full version
Top-level entity name: Learning
Family: Stratix
Device: EP1S80B956C7
Timing models: Final
Total logic elements: 1970/79,040 (1%)
Total pins: 170/692 (24%)
Total virtual pins: 0
Total memory bits: 258,668/7,427,520 (3%)
DSP block 9-bit elements: 0/176 (0%)
Total PLLs: 1/12 (8%)
Total DLLs: 0/2 (0%)


The reconfigurability of the FPGA can be used to make the hardware flexible enough to perform different tasks with the same hardware. The FPGA capability to implement pipelined structures is used to share cell units between different parallel processing blocks. A pipelined architecture is used for the main data flow between processing blocks. Furthermore, each block itself is designed in a pipelined structure. As for the main data flow of the system, the pipeline is used to improve the parallel performance so that different blocks perform different processing steps at the same time. As can be seen from Fig. 9, a number of intermediate memories are employed as storage for pipeline data registering. The control signals generated by the main controller are used to activate and deactivate the successive blocks.

An Altera Stratix family FPGA device (EP1S80) with a resource capacity of 79 K logic cells and 7.4 Mbits of RAM memory is used. As the reading device we use the line-scan sensor ELIS-1024 of Panavision Co., with a resolution of 1024 × 1 pixels, equipped with a 16 mm optical lens. The system was first designed in Verilog HDL and, after synthesis and functional simulation, we performed route & placement for the FPGA programming. The route & placement is done with the Altera Quartus II software.

Having pre-supplied PLL blocks on the FPGA board, we could employ several PLLs for precise clock generation and timing management. An average clock frequency of 20 MHz is selected for all processing blocks; however, a higher clock speed is also applicable. Fortunately, the whole system could be fitted into the same FPGA board without the need for any external memory or additional resources. The following are explanations of the mapping and implementation of the main blocks of the system.

Fig. 10 shows a simplified RTL block diagram of the Learning block, including Ranking & Optimization, implemented in the FPGA, and the results of route & placement and resource usage for this block are listed in Table 1. We have also completed an ASIC full-custom design for a simple model of the Ranking block only, and implemented it in an LSI architecture (Shirakawa, Mizokami, Koide, & Mattausch, 2004).

As for the implementation of the classification block, in order to model the fully parallel functionality of the associative memory which is intended to be used as the main classifier of the system, we applied four dual-port SRAM memory blocks, each containing 32 data words. With this structure and a memory I/O bus of 256 bits width, we can perform 16 parallel comparisons within one clock cycle. Fig. 11 depicts a simple schematic of the design. The segmented character arrives as a 256-bit binary vector once the preprocessing steps and size normalization have been carried out. The matching of this data vector against the reference patterns is achieved by using a number of parallel XOR gates followed by ones-adder blocks. The minimum-distance selection (winner search) is then accomplished through a set of parallel comparators, after which the winner distance as well as the winner address are identified and output to the I/O bus.
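The winner-search logic of this block can be checked with a small software model: the stored words are processed in groups of 16 "per cycle", and each comparison is an XOR followed by a ones count, mirroring the XOR-matching and ones-adder stages of Fig. 11. The grouping constant comes from the description above; everything else (names, the representation of words as Python integers) is an illustrative model rather than the Verilog design.

```python
def parallel_winner_search(input_word, memory_words, per_cycle=16):
    """Software model of the FPGA classification block.

    input_word   : 256-bit integer holding the normalized character bitmap
    memory_words : list of 256-bit integers (the four 32-word RAM blocks flattened)
    Returns (winner_address, winner_distance, cycles_used)."""
    winner_addr, winner_dist = -1, 257          # any Hamming distance is below 257
    cycles = 0
    for base in range(0, len(memory_words), per_cycle):
        cycles += 1                             # one group of parallel comparisons
        for offset, word in enumerate(memory_words[base:base + per_cycle]):
            dist = bin(input_word ^ word).count("1")   # XOR matching + ones adder
            if dist < winner_dist:                     # parallel comparators / selectors
                winner_addr, winner_dist = base + offset, dist
    return winner_addr, winner_dist, cycles
```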


Fig. 11. Schematic of the classification block mapped with four dual-port RAM blocks in an FPGA architecture. A total of 16 parallel matchings are performed within each clock cycle.

Table 2
Route & placement results for the whole system implemented in an Altera Stratix FPGA.

Fitter status: Successful
Quartus II version: 4.2 full version
Top-level entity name: OCR
Family: Stratix
Device: EP1S80B956C7
Timing models: Final
Total logic elements: 15,451/79,040 (19%)
Total pins: 281/692 (40%)
Total virtual pins: 0
Total memory bits: 2,854,174/7,427,520 (38%)
DSP block 9-bit elements: 28/176 (15%)
Total PLLs: 1/12 (8%)
Total DLLs: 0/2 (0%)

Fig. 12. Some handwritten samples from different writers used as input data.


The resource usage for the whole system is reported in Table 2. The DSP blocks listed in the table are used for the noise removal block in the preprocessing steps. From Table 2 we can see that only a reasonably small amount of logic cells (19%) and memory (38%) is needed for the whole system implementation. The main system clock can also be increased to higher values if faster I/O devices are used.

5. Experimental results and evaluation

We examined the system performance in a real application of character recognition. The system was applied to the recognition of both printed and handwritten English characters. As for printed characters, a total of 25 datasets, including data of five different fonts (Times, Arial, Monotype, Symbols, Comic), noisy data, colored-background data, slightly rotated data, and data with different resolutions, were gathered and tested. Each set contained 26 characters. The classification results were very satisfactory and, compared to the results for handwritten data, showed far fewer misclassifications. Herein, we only report the results of experiments on handwritten data. A total of 20 datasets of English characters written by five different writers were used for the experiments.

To limit the variation of the data for this step of the test, the writers were asked to adhere to a standard writing style and not to use a complicated writing manner. Fig. 12 shows some gathered data samples. Additionally, we have made some tests on more generalized datasets like MNIST in order to make a comparative study with other methods in the literature.

We evaluated the system performance at three levels: classification, learning, and hardware efficiency. Details of each level are described in the following.
• Classification results: In order to have an accurate evaluation, we selected a 10-fold cross-validation method for data sampling. The classification results during and after the learning process are reported in Tables 3 and 4. The Table 4 results are the average of 10-time sampling. It is worth noting that the system started learning without any initial reference patterns or a predefined dataset. The data are given to the system in a random order. As can be seen from the Table 3 results, the number of patterns added as new references as well as the misclassification rate is high at the beginning of the process.


Table 3
Classification results for five datasets from different writers during the learning period (taking no initial Ref patterns).

Writer      Dataset A (52 samples)       Dataset B (52 samples)       Dataset C (52 samples)       Dataset D (52 samples)
            Misclassify  New Ref added   Misclassify  New Ref added   Misclassify  New Ref added   Misclassify  New Ref added
1           9            42              12           25              11           19              19           14
2           11           36              9            26              16           19              10           31
3           10           36              6            20              10           21              12           12
4           11           39              14           20              13           22              5            20
5           15           33              9            8               15           18              9            15
Total (%)   21.5         71.6            19.2         38.1            25           38.1            21.2         35.4

Table 4
Classification results after a period of learning.

Test dataset   No of samples   Misclassify   New Ref added
1              104             3             11
2              104             7             10
3              104             5             12
4              104             5             10
5              104             6             16
6              104             8             21
7              104             5             10
8              104             6             15
9              104             8             20
10             104             8             15
Total          1040            5.9%          13.5%

Table 5
Classification results for data of the MNIST database after a period of learning.

Dataset     Misclassify   New Ref added
Dataset 1   1             2
Dataset 2   1             1
Dataset 3   0             1
Dataset 4   1             3
Total (%)   2.5           5.8

Fig. 13. ROC curve of the system performance (true positive rate vs. false positive rate), showing the measured ROC, the random-guessing diagonal, and the isocost lines.


But as the learning goes on and the system reaches a more stable condition by continuously adjusting the reference pattern memory and distance thresholds, the misclassification rate drops dramatically. In Table 4 we can see the classification results after a short period of learning (1040 samples). The average classification rate of the system is 94.1%, which is reasonable for this type of application.

The classification results for data of the MNIST database are reported in Table 5. The results are based on four test datasets of handwritten digits, each containing 30 samples. The average misclassification rate for these data is 2.5%, which will be discussed in the next section.

We have created the ROC (receiver operating characteristic) curve (Metz, 1978) based on the confusion matrix obtained from the classification results to show the system performance as a trade-off between selectivity and sensitivity. Each entry of the confusion matrix, comprising TPR, FPR, TNR, and FNR,1 is determined as an average of the corresponding entries of the confusion matrices of all the different classes. The various points on the ROC curve are obtained by taking different initial values (six values between 0 and 20) for the distance thresholds (Dth). Fig. 13 illustrates the ROC curve of the system performance. The ROC curve can be used to choose the best operating point, i.e. the point at which the classifier gives the best trade-off between the cost of failing to detect positives and the cost of raising false positives.

1 TPR, FPR, TNR, and FNR stand for true positive rate, false positive rate, true negative rate, and false negative rate, respectively.

The average expected cost of classification at a point (x, y) in the ROC space is defined as

$$C = (1-p)\,\alpha\, x + p\,\beta\,(1-y) \qquad (12)$$

where α is the cost of a false positive, β is the cost of a false negative, and p is the proportion of positive cases among all cases. Since in the model we use, the cost of missing positive cases (misclassifying a character as another one) outweighs the cost of missing negative cases (not classifying a true character), we set the weights as α = 10 and β = 1; in the test datasets we used, with 26 different classes, p is 0.02 (2/104). Intersecting the isocost lines (the yellow straight lines in the plot of Fig. 13)2 with the ROC graph, we find the optimal operating point (the crossing point closest to the north-west corner), which corresponds to Dth = 13.
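The operating-point selection can be reproduced numerically from relation (12) with the weights just given; the ROC point used in the example call below is a placeholder, since the measured (FPR, TPR) values are only available graphically in Fig. 13.

```python
def expected_cost(fpr, tpr, alpha=10.0, beta=1.0, p=0.02):
    """Relation (12): C = (1 - p) * alpha * x + p * beta * (1 - y),
    with x the false positive rate and y the true positive rate."""
    return (1 - p) * alpha * fpr + p * beta * (1 - tpr)

# evaluated at a hypothetical operating point; on the measured curve the point
# with the lowest cost corresponds to the initial threshold D_th = 13
print(expected_cost(fpr=0.05, tpr=0.95))   # -> 0.491
```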

As for the accuracy rate, we could consider the area under the ROC curve (AROC) as an alternative to the usual accuracy definition, but since we intend to make a comparative study with other methods for which ROC curves are not available, we prefer to use the usual definition, i.e. the proportion of correctly classified samples in the total population, or in other words 1 − error_rate. Therefore, based on the data in Table 4, the accuracy rate of the system is 94.1%.

The histogram of the winner-input distance for the classification of the handwritten datasets is shown in Fig. 14. From this histogram it can be seen that the number of accurate classifications, that is, classifications with a low winner-input distance, is very large, which implies a high reliability of the classification.

Fig. 15 depicts the misclassification rate during the learning process. The first plot shows the misclassification rate versus the number of input samples and the second one versus the number of classified samples (i.e. input samples excluding the samples added as new reference patterns).

2 For interpretation of color in Figs. 1, 3, 4, 6, 7, 9–11, and 13–20, the reader is referred to the web version of this article.


Fig. 14. Histogram of the winner-input distance for the classification of handwritten characters.

Fig. 15. Changes in the misclassification rate during the learning process (plotted vs. input samples and vs. classified samples).

Fig. 16. Learning speed in the experimental tests on handwritten data (learning approach 1 vs. learning approach 2).

Fig. 17. Change in the learning function for different values of rank-jump and s_rank (curves for s_rank = 180 with JL = 3, JS = 1; JL = 5, JS = 3; JL = 7, JS = 5; and s_rank = 220 with JL = 3, JS = 1).


• Learning results: As mentioned above, the model was applied in a learning task without initial reference patterns and we let the new patterns be learned over time. By a "learned" pattern we mean a pattern that, after a rank-jump process, has entered the long-term memory and can be considered a stable reference pattern. To investigate the efficiency of learning, we tested two different approaches for ranking a new reference pattern. In the first approach, the new reference pattern is initially given the highest unoccupied rank in the long-term memory as long as unoccupied ranks are available, whereas in the second approach it gets the top rank in the short-term memory from the beginning. The graphs of Fig. 16 show the learning progress for both approaches. The graphs are based on the handwritten data listed in Table 3. As can be seen, the learning process is faster in the first approach; however, approach two is more reliable. This can be explained as follows: in approach two we do not let new unknown samples enter the long-term memory from the beginning, and therefore a rare sample without frequent occurrence will be removed from the memory through the shifting-down procedure in the short-term memory. This lets the more important reference patterns shift up to the long-term memory and remain there. In other words, we keep only the most significant samples in the long-term memory, which makes the classification more reliable.

As described earlier in the explanation of the ranking algorithm, the key parameters for ranking up a reference pattern and shifting it to higher locations in the long-term memory are JL and JS, which are currently taken as 7 and 5 for a reliable classification case and 5 and 3 for other cases. By taking other values for JL and JS, depending on the data distribution, the learning speed changes significantly. Also, the border rank (s_rank) separating the short-term and long-term memory is currently predefined as 180, but it could be defined dynamically based on a statistical calculation; in that case the learning speed would vary during the learning process. We have tested the system learning function with different values of s_rank and rank-jump. The graphs of Fig. 17 show the change in the learning function. From the graphs we can see that taking high values for the rank-jump (in particular for JS) speeds up the learning but, on the other hand, leads to an early memory saturation in the long-term area, which subsequently causes existing patterns to be shifted down from the long-term memory to the short-term one.

Fig. 18 displays the changes in the average winner-input distance as well as the distance threshold values (Dth) during the learning process. The distance thresholds take an initial value of 10 and on average become smaller during the optimization process, which implies a more accurate classification.
• Hardware efficiency: Using a pipelined structure in the hardware, we could save hardware resources as well as processing time. As we also use a pipelined architecture between the main processing blocks, the block with the longest processing time is the bottleneck of the system, and its processing time can be considered the overall pipeline time. We found that the segmentation block, with a processing time of roughly n(P + p)τ, is the system bottleneck, where P is the number of pixels of the input frame read by the sensor, p is the number of pixels of each segmented character, τ is the clock period (1/f), and n is the average number of characters in a word.


Fig. 18. Distribution of the average winner-input distance and the threshold values (Dth) during the learning process.

Table 6
A citation of error rates (%) on the MNIST test set for fine image features (Liu et al., 2003).

Feature        Pixel   PCA     Grad-4 (a)   Grad-8 (b)
k_NN           3.66    3.01    1.26         0.97
MLP            1.91    1.84    0.84         0.60
RBF            2.53    2.21    0.92         0.69
LVQ            2.79    2.73    1.23         1.05
PC (e)         1.64    N/A     0.83         0.58
SVC-poly (c)   1.69    1.43    0.76         0.55
SVC-rbf (d)    1.41    1.24    0.67         0.42

(a) 4-direction gradient. (b) 8-direction gradient. (c) Support vector machine with polynomial function. (d) Support vector machine with radial basis function. (e) Polynomial classifier.

Table 7
Classification times (ms) on fine image features of the MNIST database (Liu et al., 2003).

Feature        Pixel   PCA     Grad-4 (a)   Grad-8 (b)
k_NN           98.66   14.85   16.67        37.75
MLP            0.80    0.17    0.24         0.44
RBF            1.03    0.23    0.34         0.58
LVQ            1.43    0.13    0.31         0.78
PC (e)         0.87    N/A     0.71         0.76
SVC-poly (c)   16.1    6.41    3.88         5.90
SVC-rbf (d)    62.8    17.7    12.5         21.9

(a) 4-direction gradient. (b) 8-direction gradient. (c) Support vector machine with polynomial function. (d) Support vector machine with radial basis function. (e) Polynomial classifier.


Given a main clock of f = 20 MHz and taking average values of 15,000, 1500, and 5 for the parameters P, p, and n, respectively, we obtain a pipeline processing time of 82.5 µs per character and 0.41 ms per word, which is very satisfactory for this application. The main system clock, and consequently the overall processing speed, can still be increased to higher values if faster I/O devices are used.

It is very hard to make a comparison with other similar works, since there are only a few hardware implementations of such models in the literature and the results depend strongly on the system parameters and application conditions.

Fig. 19. A fast algorithm for implementing the median filter. P1–P9 are the neighboring pixels in a scanning window and the black nodes are simple comparators outputting the higher and lower values of each pair of input pixels. The pixel intensity values are numbers between 0 and 255.

6. Discussion

In the field of character recognition, the comparison of methods is very difficult because many processing steps (preprocessing, feature extraction, classification) are involved. Even in experiments using the same training and test data, researchers often compare the performance at the system level, that is, the final recognition rate obtained by integrating all techniques. As for classification accuracy and speed, a comparative study of the state-of-the-art algorithms in character recognition has already been performed by Liu, Nakashima, Sako, and Fujisawa (2003). Tables 6 and 7 show their results for a test dataset of the MNIST database. It should be noted that all of those results were obtained on a computer with a 1.5 GHz CPU and not on a hardware chip. The classification results of our proposed model for similar test data selected from MNIST are listed in Table 5. As can be seen from Table 6, the highest accuracy rates are achieved by the support vector machine with radial basis function (0.42% error rate) and the polynomial classifier (0.58% error rate), both using the 8-direction gradient feature vector. We can see that the accuracy rate of the model proposed herein, when applied to the classification of similar data (97.5%, that is, a 2.5% error rate), is comparable with the highest rates of the software algorithms in the literature. But as a significant advantage, the classification speed of the developed model is much faster than that of the corresponding algorithms: 0.083 ms vs. 0.76 ms for the PC classifier and 21.9 ms for the SVC-rbf classifier.

A fair comparison of the hardware developed herein with similar hardware in the literature is rather difficult, as there are only a few hardware implementations of such learning models aimed at OCR applications. Some of the demonstrated models are not stand-alone hardware and come as hybrid hardware-software systems. Some models have no learning capability and their usage is limited to specific applications. And for some models no detailed information is available about the hardware specifications to make a precise comparison possible. Among the hardware structures realized for character recognition, we have selected three practical cases for a comparative study:


Fig. 20. The flowchart of sequential conditions used for the labeling procedure. x is the pixel address, Ix is the pixel intensity (0 or 1), Lx is the label of the scanning pixel to be determined, L1, L2, L3, and L4 are the labels of previously scanned neighboring pixels (located in the current row and the upper row) as shown in Fig. 7(a). SLB1 and SLB2 are two buffers for recording equivalent labels.
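For orientation, the fragment below sketches a generic raster-scan labeling pass in Python. It follows the usual scheme of assigning each foreground pixel the smallest label among the previously scanned neighbors L1–L4 and recording label equivalences for a later merge step; the exact branch conditions and the detailed roles of SLB1 and SLB2 are those of Fig. 20 and are only approximated here:

```python
def first_pass_labels(img):
    """One raster-scan labeling pass over a binary image (0/1 pixels).
    Generic sketch of the procedure in Fig. 20: each foreground pixel takes
    the smallest label among its already-scanned neighbors (left, upper-left,
    upper, upper-right); when two different labels meet, the pair is recorded
    as equivalent (the role played by the SLB buffers in the hardware)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    equivalences = []          # recorded label pairs, merged in a later pass
    next_label = 1
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                continue
            # previously scanned neighbors: L1..L4 in the paper's notation
            neigh = []
            if x > 0 and labels[y][x - 1]:
                neigh.append(labels[y][x - 1])
            if y > 0:
                for dx in (-1, 0, 1):
                    if 0 <= x + dx < w and labels[y - 1][x + dx]:
                        neigh.append(labels[y - 1][x + dx])
            if not neigh:
                labels[y][x] = next_label      # start a new component
                next_label += 1
            else:
                m = min(neigh)
                labels[y][x] = m
                for l in set(neigh) - {m}:     # note equivalent labels
                    equivalences.append((m, l))
    return labels, equivalences
```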

3 The reject rate is defined as the number of patterns with low classification confidence that have to be rejected in order to achieve a desired error rate, for example 1%.
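To make this definition concrete (illustrative code, not part of the paper's implementation), the snippet below computes, for a list of (confidence, correct) classification outcomes, the smallest fraction of low-confidence patterns that must be rejected so that the error rate among the accepted patterns stays below a target such as 1%:

```python
def reject_rate(outcomes, target_error=0.01):
    """outcomes: list of (confidence, is_correct) pairs, one per test pattern.
    Returns the smallest fraction of low-confidence patterns to reject so
    that the error rate on the remaining (accepted) patterns is at most
    target_error. Mirrors the footnote's definition; illustrative only."""
    ranked = sorted(outcomes, key=lambda o: o[0])   # lowest confidence first
    n = len(ranked)
    for k in range(n + 1):                          # reject the k least confident
        accepted = ranked[k:]
        if not accepted:
            return 1.0
        errors = sum(1 for _, ok in accepted if not ok)
        if errors / len(accepted) <= target_error:
            return k / n
    return 1.0
```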



The ANNA-chip is a five-layer feed-forward neural network implemented in a 0.9 µm CMOS technology with 4096 physical synapses and a resolution of 6 bits for the synapse weights and 3 bits for the inputs/outputs. The chip is equipped with an additional DSP board and a controller for re-training the weight values. The classification results for handwritten digits (10 classes) without using any feature extraction step are reported in Sackinger et al. (1992). The average execution time for each character, taking a 20 MHz clock rate, is calculated as 966 µs. The minimum error rate and reject rate (footnote 3) after the re-training process are given as 5% and 10.9%, respectively. The degradation in accuracy is considered to be mainly caused by the imprecision introduced by quantization of the weight and state values to a limited number of bits (6 and 3 bits).
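To illustrate the quantization effect mentioned above (a generic sketch, not the ANNA-chip's actual arithmetic), the following fragment maps floating-point weights onto 2^n uniformly spaced levels and reports the mean representation error for 6-bit and 3-bit resolution:

```python
import numpy as np

def quantize(values, n_bits, v_min=-1.0, v_max=1.0):
    """Uniform quantization of values to 2**n_bits levels in [v_min, v_max].
    Illustrative only: shows the precision loss of few-bit weights/states."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / (levels - 1)
    q = np.round((values - v_min) / step) * step + v_min
    return np.clip(q, v_min, v_max)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=4096)          # e.g. 4096 synaptic weights
for bits in (6, 3):
    err = np.abs(w - quantize(w, bits)).mean()
    print(f"{bits}-bit quantization, mean absolute error: {err:.4f}")
```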

The I1000 is an analog VLSI chip that uses a linear template classifier with fixed weights in silicon and, when combined with the associated software, can optically read the E13B font. The input to the classifier is an 18 by 24 pixel image, and the output of the chip is a 2-bit confidence level which is sent to the software layer for a post-processing step. The system performance is evaluated on a test set of 1500 real check data and the accuracy rate is reported as 99% (Platt & Allen, 1996).

The Quicktionary is a commercial OCR system which comes as a pen-scanner for reading printed texts and translating single words based on the data of an embedded dictionary. A built-in text-to-speech feature is also available in the system. The system is claimed to recognize over 300,000 words written in different fonts and sizes with an accuracy of 97% (WizCom Technologies Ltd). The recognition speed is not specified, but it is limited by the scanning speed of the user's hand.

Among the three systems described above, the ANNA-chip is the only learning model that can be applied to handwritten characters, although its current application is limited to handwritten digits. Given the same clock rate of 20 MHz and taking test data from the standard MNIST database, we can see that our developed system is superior to the ANNA-chip in both recognition speed (82.5 µs vs. 966 µs per character) and error rate (2.5% vs. 5%). One of the main disadvantages of the ANNA-chip is the lack of incremental learning during real-time operation. Regarding the I1000-chip, it cannot be considered a stand-alone hardware, as it uses a software layer for post-processing. Nor can it be applied to the recognition of handwritten texts. Therefore, a precise comparison of recognition speed and accuracy rate is not possible. Concerning the Quicktionary, the main point is that it has no learning feature for the recognition of unknown inputs. It can only be applied to the recognition of specified printed characters (although new words can be added to the embedded dictionary), and therefore its accuracy rate and recognition speed are not comparable with those of a learning model working with new, unknown, and variable samples.

As for a comparison of hardware efficiency and optimal chip size, no detailed information is available about the existing OCR systems, so unfortunately we cannot make a practical comparison.

In summary, by implementing the proposed learning model on an FPGA platform as a digital structure with 8-bit precision for weights and inputs/outputs, a pipelined structure, and a reasonable amount of memory and logic cells, we could fully profit from hardware parallelism to enhance both the learning time and the recognition speed compared to other existing models, without any significant degradation of the system's accuracy rate. However, there are still some limitations of the hardware implementation, such as the low bit precision (for floating-point values), the limited vector size (currently 256 bits maximum), and timing conflicts caused by higher clock rates, which should be overcome in our future work.

7. Conclusion

In this paper we have proposed a learning model based on a short/long-term memory and an optimization algorithm for constantly adjusting the reference patterns. The system was implemented on an FPGA platform and tested with real data samples of handwritten and printed English characters. The classification results showed an acceptable performance of classification and learning. In order to enhance the search speed in the classification block, we are planning to use a fully-parallel associative memory implemented in an LSI architecture as the main classifier of the system via an ASIC design. We also intend to improve the learning performance by taking more dynamic parameters into the ranking process. Compared to other learning models, this prototype is not yet robust enough, but it is advantageous in terms of a simple learning algorithm, high classification speed, and a hardware-friendly structure.

Appendix A

A.1. Additional illustrations

• Median filter implementation (Fig. 19).
• Flowchart of labeling algorithm (Fig. 20).

References

A database of handwritten digits which is a subset of a larger set available from NIST. MNIST data are available on the web at: http://yann.lecun.com/exdb/mnist/.

A handheld OCR dictionary produced by WizCom Technologies Ltd. Link on the web: http://www.wizcomtech.com/Wizcom/products/products.asp?fid=77.

Abedin, M. A., Tanaka, Y., Ahmadi, A., Koide, T., & Mattausch, H. J. (2007). Mixed digital-analog associative memory enabling fully-parallel nearest Euclidean distance search. Japanese Journal of Applied Physics (JJAP), 46(4B), 2231–2237.

Beiu, V. (1998). VLSI complexity of discrete neural networks. Newark, N.J: Gordon and Breach.

Boser, B., Sackinger, E., Bromley, J., LeCun, Y., & Jackel, L. D. (1991). An analog neural network processor with programmable network topology. IEEE Journal of Solid-State Circuits, 26(12), 2017–2025.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2), 1–43.

Dietterich, T. G. (1997). Machine learning research: Four current directions. AI Magazine, 18(4), 97–136.

Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the first international workshop on multiple classifier systems. Lecture notes in computer science (pp. 1–15). Italy.

Glesner, M., & Pochmuller, W. (1994). An overview of neural networks in VLSI. London: Chapman & Hall.

Harris, C. (1991). Parallel distributed processing models and metaphors for language and development. Ph.D. dissertation, University of California, San Diego.

Haykin, S. (1999). Neural networks. New Jersey: Prentice Hall.
Heemskerk, J. N. H. (1995). Overview of neural hardware: Neurocomputers for brain-style processing. Design, implementation and application. PhD thesis, Leiden University, Netherlands.

Hildebrand, D. K. (1986). Statistical thinking for behavioral scientists. Boston: Duxbury.

Hollis, P. W., & Paulos, J. J. (1994). A neural network learning algorithm tailored for VLSI implementation. IEEE Transactions on Neural Networks, 5(5), 784–791.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Liao, Y. (2001). Neural networks in hardware: A survey. University of California, Department of Computer Science. Available on the web.

Ligature Ltd. Address on the web: http://www.ligatureltd.com/.
Lindsey, C., & Lindblad, T. (1994). Review of hardware neural networks: A user's perspective. In Proceedings of the 3rd workshop on neural networks: From biology to high energy physics. Italy.

Liu, C., & Fujisawa, H. (2005). Classification and learning for character recognition: Comparison of methods and remaining problems. In Proceedings of the first IAPR TC3 NNLDAR workshop (pp. 1–7). Seoul, Korea.

Liu, J., & Gader, P. (2002). Neural networks with enhanced outlier rejection ability for off-line handwritten word recognition. Pattern Recognition, 35, 2061–2071.

Liu, C., Nakashima, K., Sako, H., & Fujisawa, H. (2003). Handwritten digit recognition: Benchmarking of state-of-the-art techniques. Pattern Recognition, 36(10), 2271–2285.

Mattausch, H. J., Gyohten, T., Soda, Y., & Koide, T. (2002). Compact associative-memory architecture with fully-parallel search capability for the minimum Hamming distance. IEEE Journal of Solid-State Circuits, 37, 218–227.

McCartor, H. (1991). A highly parallel digital architecture for neural network emulation. In VLSI for artificial intelligence and neural networks (pp. 357–366). New York: Plenum Press.

Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4), 283–298.

Mitchell, T. (1997). Machine learning. USA: McGraw Hill.
Moerland, P., & Fiesler, E. (1996). Neural network adaptations to hardware implementations. In Handbook of neural computation (pp. E1.2:1–13). New York: Institute of Physics Publishing and Oxford University Publishing.

Platt, J. C., & Allen, T. C. (1996). A neural network classifier for the I1000 OCR chip. NIPS 8 (pp. 938–944).



Rahman, A. F. R., & Fairhurst, M. C. (2003). Multiple classifier decision combination strategies for character recognition: A review. International Journal on Document Analysis and Recognition, 5(4), 166–194.

Ramacher, U. et al. (1993). Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1. In Proceedings of the 3rd international conference on microelectronics for neural networks (pp. 227–231).

Sackinger, E., Boser, B., Bromley, J., LeCun, Y., & Jackel, L. D. (1992). Application of the ANNA neural network chip to high speed character recognition. IEEE Transactions on Neural Networks, 3(3), 498–505.

Saon, G. (1999). Cursive word recognition using a random field based hidden Markov model. International Journal on Document Analysis and Recognition, 1, 199–208.

Shirakawa, Y., Mizokami, M., Koide, T., & Mattausch, H. J. (2004). Automatic pattern-learning architecture based on associative memory and short/long term storage concept. In Proceedings of SSDM'2004 (pp. 362–363). Japan.

Smith, S. (1994). Handwritten character classification using nearest neighbor in large databases. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 915–919.

Smith, J. (1998). Xilinx design hints and issues. Address on the web: http://www.xilinx.com/xcell/xl23/xl23_16.pdf.

Suen, C. Y., Kiu, K., & Strathy, N. W. (1999). Sorting and recognizing cheques and financial documents. In S.-W. Lee & Y. Nakano (Eds.), Document analysis systems: Theory and practice, LNCS 1655 (pp. 173–187). Springer.

Yano, Y., Koide, T., & Mattausch, H. J. (2002). Fully parallel nearest Manhattan-distance search memory with large reference-pattern number. In Extended abstracts of the International Conference on Solid State Devices and Materials (SSDM'2002) (pp. 254–255).