
ELSEVIER    Microprocessing and Microprogramming 41 (1996) 757-769

Mixing floating- and fixed-point formats for neural network learning on neuroprocessors

Davide Anguita a,*, Benedict A. Gomes b

a Univ. of Genova, D.I.B.E., via Opera Pia 11a, 16145 Genova, Italy
b Int. Comp. Science Inst., 1947 Center St., Berkeley, CA, USA

Received 1 March 1995; revised 11 October 1995; accepted 18 January 1996

Abstract

We examine the efficient implementation of back-propagation (BP) type algorithms on T0 [3], a vector processor with a fixed-point engine designed for neural network simulation. Using Matrix Back Propagation (MBP) [2] we achieve asymptotically optimal performance on T0 (about 0.8 GOPS) for both the forward and backward phases, which is not possible with the standard on-line BP algorithm. We use a mixture of fixed- and floating-point operations in order to guarantee both high efficiency and fast convergence. Though the most expensive computations are implemented in fixed-point, we achieve a rate of convergence that is comparable to the floating-point version. The time taken for conversion between fixed- and floating-point is also shown to be reasonably low.

Keywords: Neural networks; Neuroprocessors; Fixed-point format

1. Introduction

Among the large number of dedicated VLSI architectures for neural networks developed in recent years, several of the most successful proposals have concerned digital implementations. Most of these dedicated processors are oriented toward the efficient execution of various learning algorithms, with a strong accent on back-propagation (BP). Some well-known examples in this field are CNAPS [13], Lneuro [18], MA-16 [20], and SPERT [28]: they are the building blocks for larger systems that exploit massive parallelism to achieve performance orders of magnitude greater than conventional workstations [21,4]. The common characteristic of these processors is the use of a fixed-point engine, typically 16 bits wide or less, for fast computation.

* Corresponding author. Email: [email protected]. Email: [email protected]

0165-6074/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0165-6074(96)00012-9


Table 1
The MBP algorithm

Pseudo-code                                           # of operations       Point

/* Feed-forward */
for l := 1 to L
    S_l := S_{l-1} W_l                       (1.1)    2 N_P N_l N_{l-1}     fixed
    S_l := S_l + b_l                         (1.2)    N_P N_l               fixed
    S_l := f{S_l}                            (1.3)    N_P N_l k_1           fixed

/* Error back-prop */
Δ_L := T - S_L                               (2.1)    N_P N_L               floating
Δ_L := Δ_L × f'{S_L}                         (2.2)    N_P N_L (1 + k_2)     floating
for l := L-1 to 1
    Δ_l := Δ_{l+1} W_{l+1}^T                 (2.3)    2 N_P N_{l+1} N_l     fixed
    Δ_l := Δ_l × f'{S_l}                     (2.4)    N_P N_l (1 + k_2)     fixed

/* Weight variation */
for l := 1 to L
    ΔW'_l := S_{l-1}^T Δ_l                   (3.1)    2 N_P N_l N_{l-1}     fixed
    Δb'_l := Σ_p (Δ_l)_p                     (3.2)    N_P N_l               fixed
    ΔW_l := η ΔW'_l + α ΔW_l^old             (3.3)    3 N_l N_{l-1}         floating
    Δb_l := η Δb'_l + α Δb_l^old             (3.4)    3 N_l                 floating

/* Weight update */
for l := 1 to L
    W_l := W_l + ΔW_l                        (4.1)    N_l N_{l-1}           floating
    b_l := b_l + Δb_l                        (4.2)    N_l                   floating

The drawback for the final user who wants to implement an algorithm for neural network learning on this kind of processor is the fixed-point format, which requires greater attention during the implementation compared with a conventional floating-point format. This is not a new problem; in fact, both analog and digital implementations of neural networks suffer from constraints due to physical limitations. For this reason, the effect of discretization on feed-forward networks and back-propagation learning received some attention shortly after the introduction of the algorithm [10,5,15]. Most of the results indicate that a 16-bit fixed-point representation is reliable enough to obtain reasonable results with on-line back-propagation. On the other hand, despite this general agreement, there has been some effort to reduce the precision needed during the computation [14,22], mainly because the effect of discretization during learning is not completely understood and it seems to be both problem and algorithm dependent. In fact, there are many variations of the BP algorithm and each of them can show a different sensitivity to the approximations caused by fixed-point arithmetic, leading to different convergence problems. Some theoretical results on the precision issue have been found [1,23], but they often rely on difficult-to-predict parameters (e.g. the number of iterations to convergence).

One solution to overcome these limitations is to mix conventional floating-point operations with fixed-point operations where required. An example of this approach is [12], where the feed-forward and the backward phases of the algorithm are computed in fixed- and floating-point format respectively. However, this solution does not address the efficiency issue, because the most computationally expensive part of the algorithm (the backward phase) is still performed in floating-point format, losing all the advantages of a fast fixed-point engine.


We show here a mixed floating/fixed-point implementation of Matrix Back Propagation (MBP) [2] that isolates the most computationally expensive steps of the algorithm and implements them efficiently in fixed-point format. Other parts of the algorithm, with lower demands in terms of computational power but more critical needs in terms of accuracy, are implemented in conventional floating-point format. The target architecture is the neuroprocessor T0, but the method is of general validity.

Despite the need for conversions between the two formats and the software simulation of the floating-point operations, good performance is obtainable with reasonably large networks, showing high efficiency in exploiting the T0 hardware.

The following section describes the learning algorithm implemented. Section 3 summarizes the main characteristics of T0. Section 4 describes the mixed floating/fixed-point approach. Section 5 shows the implementation details and performance evaluation, and Section 6 compares the effect of the mixed approach with the standard algorithm.

2. Matrix back propagation

In Table 1 the MBP algorithm is summarized. It can be used to represent several BP learning algorithms with adaptive step and momentum [26,27]. The second column of the table contains the number of operations needed by each step. The third column indicates whether the computation for each step is performed in fixed- or floating-point format (this choice is explained in Section 4). Bold letters indicate vectors or matrices.

We assume that our feed-forward network is composed of L layers of N_l neurons each, with 0 <= l <= L. The weights for each layer are stored in matrices W_l of size N_l × N_{l-1} and the biases in vectors b_l of size N_l.

The learning set consists of N_P patterns. Input patterns are stored in matrix S_0 in row order and target patterns similarly in matrix T. The order of storage is particularly important for the efficiency of the implementation: if the patterns are stored in row order, the elements of each pattern lie in consecutive memory locations and can be accessed with no performance penalty on the vast majority of current processor architectures, including T0. Matrices S_1, ..., S_L contain the output of the corresponding layer when S_0 is applied to the input of the network. The size of S_l is N_P × N_l and the size of T is N_P × N_L.

The back-propagated error is stored in matrices Δ_l of size N_P × N_l, and the variations of weights and biases computed at each step are stored respectively in matrices ΔW_l of size N_l × N_{l-1} and vectors Δb_l of size N_l. For simplicity, connections between non-consecutive layers are not considered.
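To make the notation of Table 1 concrete, the following is a minimal NumPy sketch of one MBP iteration written entirely in floating point; the fixed/floating split of Section 4 and any T0-specific kernels are deliberately omitted, and the sigmoid activation and the default eta, alpha values are illustrative assumptions rather than values from the paper.

    import numpy as np

    def f(x):
        # sigmoid activation (on T0 step (1.3) uses a look-up table instead)
        return 1.0 / (1.0 + np.exp(-x))

    def mbp_step(S0, T, W, b, dW_old, db_old, eta=0.1, alpha=0.5):
        """One MBP iteration (Table 1), plain floating point.
        W[i] has shape (N_i, N_{i+1}) so that S @ W[i] works; L = len(W)."""
        L = len(W)
        S = [S0]
        for l in range(L):                                   # (1.1)-(1.3) feed-forward
            S.append(f(S[l] @ W[l] + b[l]))
        D = [None] * L
        D[L - 1] = (T - S[L]) * S[L] * (1.0 - S[L])          # (2.1)-(2.2) output error
        for l in range(L - 2, -1, -1):                       # (2.3)-(2.4) back-propagation
            D[l] = (D[l + 1] @ W[l + 1].T) * S[l + 1] * (1.0 - S[l + 1])
        dW = [eta * (S[l].T @ D[l]) + alpha * dW_old[l] for l in range(L)]   # (3.1), (3.3)
        db = [eta * D[l].sum(axis=0) + alpha * db_old[l] for l in range(L)]  # (3.2), (3.4)
        W = [W[l] + dW[l] for l in range(L)]                 # (4.1)-(4.2) weight update
        b = [b[l] + db[l] for l in range(L)]
        return W, b, dW, db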

The total number of operations of MBP is

    n_op = 2 N_P ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 )                                   (5)
         + (3 + k_1 + k_2) N_P Σ_{l=1}^{L} N_l + 4 Σ_{l=1}^{L} N_l N_{l-1} - N_P N_L       (6)
         + 4 Σ_{l=1}^{L} N_l,                                                              (7)

where k_1 and k_2 are respectively the number of operations needed for the computation of the activation function of the neurons and of its derivative. If the activation function is the usual sigmoid, then k_2 = 2.
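For reference, (5)-(7) can be evaluated for a concrete architecture with a small helper like the one below; the layer-size list convention [N_0, ..., N_L] and the default k_1 = 6 (the value used later, following [11]) are assumptions of this sketch.

    def mbp_ops(layers, n_patterns, k1=6, k2=2):
        """Total MBP operation count, Eqs. (5)-(7); layers = [N_0, N_1, ..., N_L]."""
        conn = sum(layers[l] * layers[l - 1] for l in range(1, len(layers)))  # sum_l N_l N_{l-1}
        neur = sum(layers[1:])                                                # sum_l N_l
        n_op = 2 * n_patterns * (3 * conn - layers[1] * layers[0])            # Eq. (5)
        n_op += (3 + k1 + k2) * n_patterns * neur + 4 * conn - n_patterns * layers[-1]  # Eq. (6)
        n_op += 4 * neur                                                      # Eq. (7)
        return n_op

For example, the speech network of Table 6 (layers [234, 1000, 69]) with N_P = 1000 patterns gives roughly 1.4 × 10^9 operations per epoch under this count.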

On a conventional RISC, if each operation is completed in a single cycle, the total computational time is T ∝ n_cycles = n_op. On vector or multi-ALU processors like T0, the expected time is T ∝ n_cycles = n_op / P, where P is the number of ALUs. Obviously, the implicit assumptions are: (a) there is no additional cost to load or store the data in memory, (b) one instruction can be issued every cycle, and (c) the order in which the operations are issued allows a complete exploitation of the ALUs. It has already been shown [2] that with a relatively small effort these constraints can be satisfied reasonably well on some RISCs. In Section 5 we will address this problem for T0.

3. The neuroprocessor T0

T0 belongs to the family of neuroprocessors with fast fixed-point capabilities and it will be the first implementation of the Torrent architecture [3]. It is tailored for neural-network calculations and inherits some of the features of a previous neuroprocessor [28]. The next implementation (T1) will be the building block for a massively parallel neuro-computer [4].

In particular, T0 is composed of a standard MIPS-II RISC engine [17] with no floating-point unit but with a fixed-point vector unit that can execute up to two operations per cycle on 8-word vectors, or, in other words, compute 16 results in a single cycle. This translates to a peak performance of 0.8 GOPS (Giga Operations Per Second) if the processor is clocked at 50 MHz, or approximately 0.2 GCUPS (Giga Connection Updates Per Second) for one-hidden-layer networks, a result comparable to supercomputer implementations. Fig. 1 summarizes the architecture of the vector unit. The two 8-word ALUs, VP0 and VP1, are connected to the 32-bit vector register bank. Each vector register contains 32 elements, therefore each ALU can execute an operation on a complete vector in 4 cycles. The data path to/from memory is 128 bits wide, allowing the loading/storing of eight 16-bit words in a single cycle.

4. The mixed format algorithm

We explain here in detail the choice of format for each step of the algorithm. The main idea is to perform the most computationally expensive part of the algorithm in fixed-point and to resort to floating-point only where the computation must be particularly accurate.

Using Table 1 we can observe that the most expensive steps are (1.1), (2.3) and (3.1). They require O(n^3) operations (where n is, in general, the size of the problem), therefore they will be performed in fixed-point. Note that matrix S_0, which contains the input patterns, is likely to be already in fixed-point format in real-world applications, deriving, for example, from an A/D conversion. Step (1.2) can easily be computed in the same way.

Step (1.3) requires a function computation. With the fixed-point format, this can be replaced by an indexed load from a table where the values of the function are pre-stored.

Before starting the error back-propagation, we can translate the output of the network to floating-point in order to have an accurate computation of the error (2.1) and of its derivative (2.2). The interesting side-effect of performing these operations in floating-point is that after step (2.2) we know the numeric range of the error, therefore it is possible to choose a good fixed-point representation for the subsequent steps. The next conversion is performed before steps (3.3) and (3.4) in order to compute the variation of the weights and biases of the network with great accuracy. Note that both η (the learning step) and α (the momentum term) are in general floating-point variables.

To summarize the algorithm: the conversion from fixed- to floating-point format must be performed at the end of the forward phase on matrix S_L and at the end of the backward phase on ΔW_l and Δb_l. The conversion from floating- to fixed-point format must be performed at the beginning of the forward phase on each W_l and b_l and at the beginning of the backward phase on Δ_L.
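A minimal sketch of the two conversions, assuming a 16-bit two's-complement representation whose most significant bit has weight 2^E (so the resolution is 2^{E-15}); the helper names are illustrative and are not T0 library calls.

    import numpy as np

    def float_to_fixed(x, E):
        """Quantize floating-point values onto the 16-bit grid of resolution 2**(E - 15)."""
        step = 2.0 ** (E - 15)
        q = np.round(np.asarray(x, dtype=np.float64) / step)
        return np.clip(q, -2 ** 15, 2 ** 15 - 1).astype(np.int16)

    def fixed_to_float(q, E):
        """Map 16-bit fixed-point values back to floating point."""
        return q.astype(np.float64) * 2.0 ** (E - 15)

In the scheme summarized above, these helpers would be applied to each W_l and b_l before the forward phase, to S_L after it, to Δ_L before the fixed-point backward steps, and to ΔW'_l and Δb'_l before steps (3.3)-(3.4).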

5. Optimal implementation on the T0 neuroprocessor

If the implementation of an algorithm on T0 is optimal, in the sense that it can completely exploit its hardware, we can expect to have n_cycles ≈ n_op / 16. For this reason, we will refer to an algorithm as asymptotically optimal for T0 (or simply optimal) if the efficiency E of its implementation goes to 1 as the size of the problem (N_P, N_l) grows. In other words: E = n_op / (16 n_cycles) → 1.

Fig. 1. Simplified architecture of the vector unit of T0.

Our purpose is to show that MBP can be implemented optimally in this sense, even though some of the computations are done in floating-point and must be simulated in software.


Table 2
Scalar and vectorized matrix products

S_l = S_{l-1} · W_l
    Scalar:
        for i := 0 to N_P - 1
            for j := 0 to N_l - 1
                for k := 0 to N_{l-1} - 1
                    s^l_{i,j} += s^{l-1}_{i,k} * w^l_{k,j}
    Vectorized:
        for j := 0 to N_l - 1 step V_L
            for i := 0 to N_P - 1 step U
                for k := 0 to N_{l-1} - 1
                    s^l_{i,[j,j+V_L]}     += s^{l-1}_{i,k}     * w^l_{k,[j,j+V_L]}
                    ...
                    s^l_{i+U-1,[j,j+V_L]} += s^{l-1}_{i+U-1,k} * w^l_{k,[j,j+V_L]}

Δ_l = Δ_{l+1} · W_{l+1}^T
    Scalar:
        for i := 0 to N_P - 1
            for j := 0 to N_l - 1
                for k := 0 to N_{l+1} - 1
                    δ^l_{i,j} += δ^{l+1}_{i,k} * w^{l+1}_{j,k}
    Vectorized:
        for i := 0 to N_P - 1 step V
            for j := 0 to N_l - 1 step V
                for k := 0 to N_{l+1} - 1 step V_L
                    δ^l_{i,j}         += δ^{l+1}_{i,[k,k+V_L]}     · w^{l+1}_{j,[k,k+V_L]}
                    ...
                    δ^l_{i+V-1,j+V-1} += δ^{l+1}_{i+V-1,[k,k+V_L]} · w^{l+1}_{j+V-1,[k,k+V_L]}

ΔW_l = S_{l-1}^T · Δ_l
    Scalar:
        for i := 0 to N_{l-1} - 1
            for j := 0 to N_l - 1
                for k := 0 to N_P - 1
                    Δw^l_{i,j} += s^{l-1}_{k,i} * δ^l_{k,j}
    Vectorized:
        for j := 0 to N_l - 1 step V_L
            for i := 0 to N_{l-1} - 1 step U
                for k := 0 to N_P - 1
                    Δw^l_{i,[j,j+V_L]}     += s^{l-1}_{k,i}     * δ^l_{k,[j,j+V_L]}
                    ...
                    Δw^l_{i+U-1,[j,j+V_L]} += s^{l-1}_{k,i+U-1} * δ^l_{k,[j,j+V_L]}

As mentioned before, the computational load belongs to steps (1.1), (2.3) and (3.1). To compute these steps, three matrix multiplications must be performed: (1.1) is a conventional matrix product, (2.3) is a matrix product with the second matrix transposed, and (3.1) is a matrix product with the first matrix transposed.

The three operations are shown in pseudo-code in Table 2, in both scalar and vectorized form. V_L is the vector register length (32 in the current implementation of T0) and U, V are the unrolling depths needed to fill the processor pipelines.

Increasing the unrolling depth shifts the balance of the loop from memory-bound to CPU-bound, so extra cycles are available for the memory port to load (store) the operands while the processor is computing the arithmetic operations. The unrolling depth is limited by the number of registers available for storing intermediate results: in our case U = 8 and V = 2.

As can easily be noted, the vectorized version performs its vector references to each matrix in row order to exploit the memory bandwidth of T0. In fact, the use of stride-1 access to memory allows the processor to load an entire 8-word vector (of 16-bit elements) in a single cycle, while a generic stride-n access to memory (n > 1) requires one cycle per element.
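As a rough illustration of the blocking in Table 2 (not of T0's actual instruction set), the following NumPy sketch computes S_l = S_{l-1} W_l with the result columns blocked by a hypothetical register length V_L, so that every statement in the inner loop touches only contiguous, stride-1 slices:

    import numpy as np

    def blocked_matmul(S_prev, W, VL=32):
        """Blocked product S = S_prev @ W touching only V_L-wide, stride-1 slices.
        Floating-point inputs assumed; on T0 the 16-bit data go through wider accumulators."""
        NP, N_out = S_prev.shape[0], W.shape[1]
        S = np.zeros((NP, N_out))
        for j in range(0, N_out, VL):        # one column block of S_l / W_l (a vector register)
            for i in range(NP):              # on T0 this loop is unrolled U times
                for k in range(S_prev.shape[1]):
                    S[i, j:j+VL] += S_prev[i, k] * W[k, j:j+VL]   # contiguous (row-order) slices
        return S

For floating-point inputs, np.allclose(blocked_matmul(A, B), A @ B) holds; on T0 this loop structure is what keeps every memory access at stride 1.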

In the following we assume that all the matrix dimensions are multiples of V_L. If this is not the case, there is some overhead due to under-utilization of the vector unit, but it does not affect the asymptotic behavior of the implementation.


Table 3
Number of cycles for MBP on T0 in the general case (per-step cycle counts for steps (1.1)-(4.2))

Table 4
Number of cycles for optimized matrix multiplications (leading terms)

Step    n_cycles
(1.1)   (1/8) N_P N_l N_{l-1} + (1/8) N_P N_{l-1}
(2.3)   (1/8) N_P N_l N_{l+1} + O(N_P N_{l+1})
(3.1)   (1/8) N_P N_l N_{l-1} + O(N_l N_{l-1})

For an exact computation of the number of cycles in the general case, the reader can refer to Table 3.

Table 4 shows the number of cycles needed by T0 to perform the optimized matrix multiplications.

Step (1.1) requires four cycles in the inner loop to compute a single vector multiplication/addition and four cycles to store each result back in memory at the end of the loop. The load of element s^{l-1}_{i,k} can be overlapped with the computation thanks to the unrolling of the external loop. It is easy to prove the optimality of (1.1):

    E(1.1) = n_op(1.1) / (16 n_cycles(1.1))
           = 2 N_P N_l N_{l-1} / [ 16 ( (1/8) N_P N_l N_{l-1} + (1/8) N_P N_{l-1} ) ]
           = 1 / (1 + 1/N_l) → 1.                                                    (8)

The second product (2.3) can be seen as a sequence of dot-products. This operation is not directly implemented on T0 and needs about 20 cycles for a vector of V_L = 32 words. This problem is known and could eventually be solved in future releases of the processor [3]. In any case, the overhead due to the absence of the dot-product is not particularly troublesome when dealing with matrix products: partial dot-products of length V_L can be kept in vector registers and the final result computed at the end of the inner loop. Note that matrix-vector products (used in the standard BP algorithm) would suffer from a bigger overhead; for matrix products the absence of an implemented dot-product appears only in the second-order term and becomes negligible for large problems.
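A toy illustration of this trick (again NumPy, not T0 code): rather than reducing each length-V_L slice as it is produced, a V_L-wide partial accumulator is updated across the inner loop and reduced only once at the end.

    import numpy as np

    def dot_with_partials(a, b, VL=32):
        """Dot product using a V_L-wide partial accumulator, reduced once at the end."""
        acc = np.zeros(VL)                      # would live in a vector register on T0
        for k in range(0, len(a), VL):
            chunk_a, chunk_b = a[k:k + VL], b[k:k + VL]
            acc[:len(chunk_a)] += chunk_a * chunk_b
        return acc.sum()                        # the single, comparatively expensive reduction

dot_with_partials(x, y) agrees with np.dot(x, y) for float vectors; the point is that the expensive reduction is amortized over the whole inner loop instead of being paid per slice.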

The third product (3.1) is similar to (1.1), but with a different ordering of the loops.

Other steps performed in fixed-point format are: the computation of the output of each neuron through its activation function (1.3), the computation of its derivative in the internal layers (2.4), the bias addition in the feed-forward phase (1.2) and the bias computation in the backward phase (3.2).

The computation of the activation function is quite expensive if it is done using a floating-point math library [11], and it would incur a large penalty on T0 due to the absence of a floating-point unit. However, if (1.3) is performed in fixed-point format, the activation function can easily be computed using a look-up table of size 2^B, where B is the number of bits of the fixed-point format [6]. The vector unit of T0 provides a vector instruction to perform indexed loads, so the number of cycles needed to compute the value using the table is only about 1.5 per element.
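A minimal sketch of such a table-driven activation, assuming B = 16 and the quantization convention sketched in Section 4; the input/output exponents E_in and E_out are illustrative choices, not values from the paper.

    import numpy as np

    B, E_in, E_out = 16, 3, 0                       # example format: 16 bits, input range ~[-8, 8)
    idx = np.arange(-2 ** (B - 1), 2 ** (B - 1))    # every representable fixed-point input
    sig = 1.0 / (1.0 + np.exp(-idx * 2.0 ** (E_in - 15)))
    table = np.clip(np.round(sig / 2.0 ** (E_out - 15)), -2 ** 15, 2 ** 15 - 1).astype(np.int16)

    def activation_fx(s_fx):
        """Step (1.3) as an indexed load: s_fx is an int16 array of pre-activations."""
        return table[s_fx.astype(np.int32) + 2 ** (B - 1)]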

The pseudo-code for steps (1.2), (2.4) and (3.2) is shown in Table 5. The three loops are memory-bound, therefore the number of cycles is easy to compute (assuming sufficient unrolling). All the other steps are performed in floating-point format.

Table 5
Other vectorized operations

Step    Pseudo-code
(1.2)   for i := 0 to N_P - 1
            for j := 0 to N_l - 1 step V_L
                s^l_{i,[j,j+V_L]} += b^l_{[j,j+V_L]}

(2.4)   for i := 0 to N_P - 1
            for j := 0 to N_l - 1 step V_L
                δ^l_{i,[j,j+V_L]} := δ^l_{i,[j,j+V_L]} × s^l_{i,[j,j+V_L]} × (1 - s^l_{i,[j,j+V_L]})

(3.2)   for j := 0 to N_l - 1 step V_L
            for i := 0 to N_P - 1
                Δb'^l_{[j,j+V_L]} += δ^l_{i,[j,j+V_L]}

Let us consider now the overhead due to the conversion of the matrices from floating- to fixed-point format and vice versa. The scalar conversion takes about 46 cycles per element on T0, but it is possible to lower this number using the vector unit. For vectors of between 100 and 1000 elements, the translation from floating-point to fixed-point format requires only k_fx = 2.6 to 1.8 cycles per element, and the inverse conversion k_xf = 3.6 to 2.5 cycles per element (these figures have been measured experimentally).

The total number of cycles needed for the conversions is

    n^c_cycles = (k_xf + k_fx) [ Σ_{l=1}^{L} N_l (N_{l-1} + 1) + N_P N_L ].        (9)

T0 does not implement the floating-point unit of the MIPS architecture, so floating-point operations must be simulated in software. Currently the RISC core is used to perform the simulation, but an IEEE-compatible floating-point library that uses the vector unit is under development, and the expected performance is in the range of 10 to 50 cycles per element. The number of cycles for the floating-point steps of the algorithm is then n^f_cycles = k_f n^f_op, with k_f ∈ [10, 50].
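Using the same [N_0, ..., N_L] convention as the earlier operation-count sketch, Eq. (9) can be evaluated as follows; the default k_fx and k_xf values are the measured figures quoted above, used here purely as illustrative inputs.

    def conversion_cycles(layers, n_patterns, k_fx=2.6, k_xf=3.6):
        """Fixed/floating-point conversion overhead of Eq. (9); layers = [N_0, ..., N_L]."""
        net_terms = sum(layers[l] * (layers[l - 1] + 1) for l in range(1, len(layers)))
        return (k_xf + k_fx) * (net_terms + n_patterns * layers[-1])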

We now have all the elements to compute the number of cycles needed by T0 to execute MBP:

    n_cycles = (1/8) N_P ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 )                       (10)
             + [second-order terms, including the conversion overhead of Eq. (9)]      (11)
             + (4 k_f + k_fx + k_xf) N_P N_L + [further second-order terms]            (12)
             + [first-order terms proportional to L and to Σ_{l=1}^{L} N_l].           (13)

If we compare the O(n^3) term (10) with the corresponding term of n_op, we can easily deduce the optimality of this implementation of MBP.

Obviously, the asymptotic behavior of MBP on T0 is not of primary importance when dealing with real-world applications. It is therefore interesting to analyze the second-order terms (11), (12) and the first-order term (13) of the above expression.

First of all, we note that the overhead due to the conversions from fixed- to floating-point and vice versa depends mainly on the size of the network and only marginally on the size of the training set, as can be seen from the conversion terms in (11) and (12). The dependence on the size of the training set is controlled by the number of neurons of the output layer (N_L), so we expect better performance when dealing with networks with a small number of outputs (e.g. classification problems, as opposed to encoding problems [8]). If this is not the case, some techniques to reduce the number of output neurons in classification problems can be applied [19].

There is also an explicit dependence of the first-order term (13) on the number of layers of the network (L). This term is of small importance, being of first order, but we can expect an increase of overhead in networks with a very large number of layers. However, this is not a common case, as a large number of layers is not theoretically justified [9] and practical applications seldom require more than four layers (see, for example, [16] for a real problem that requires such an architecture).

Fig. 2. Efficiency and performance (MCUPS) of MBP on T0.


Table 6
Some real-world applications

Name               N_0   N_1    N_2   N_3   Description
NETtalk [24]       203   80     26    -     Pronunciation of text
Neurogammon [25]   459   24     24    1     Backgammon player
Speech [7]         234   1000   69    -     Speech recognition

To sketch the behavior of MBP on T0, we can simplify both the expressions for n_op and n_cycles assuming N_l ≈ N, and plot the efficiency and the performance in MCUPS (Fig. 2) as functions of the size of the training set (N_P) and of the network (N).

We assume k_1 = 6 to compute n_op (as suggested in [11]) and the worst case for the floating-point and conversion routines on T0 (k_f = 50, k_fx = 4, k_xf = 3) to compute n_cycles. The asymptotic performance is 160 MCUPS; by comparison, the asymptotic performance of a generic RISC processor with the same clock and only one FPU would be 10 MCUPS.

Fig. 2 allows us to easily understand the behavior of the implementation, but it is of little practical use due to the artificial network architecture. For this reason we show here the performance of MBP on T0 for networks that have been used in some real-world applications (Table 6).

Fig. 3 summarizes the performance for the applications mentioned above. It is interesting to note that, for all problems, the number of patterns for which half of the peak performance is attained (n_{1/2}) is reasonably small (N_P ≈ 500).

6. Learning with the mixed format algorithm

To test the effectiveness of the mixed format algorithm we chose the speech recognition problem described in the previous section.

Fig. 4 shows the learning behavior on a subset of the speech database with different ranges of the fixed-point variables. In particular, E is the exponent of the most significant digit of the fixed-point format. With 16-bit words we can represent values in the range [-2^E, 2^E - 2^{E-15}].

It is clear that the error back-propagation is quite sensitive to the range of the fixed-point format. If the fixed-point representation is too coarse (e.g. E = 2), the algorithm tends to get stuck due to the underflow of the back-propagated error. However, thanks to the use of the mixed format, it is possible to choose a good range for the fixed-point variables before starting the error back-propagation, because the error computation in the last layer is done in floating-point format. The correct range can easily be chosen by looking at the largest floating-point value. In this case, learning with the mixed format is comparable to learning in floating-point format in terms of the number of learning steps but, of course, far more efficient from a computational point of view.
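A small sketch of this range selection: taking E from the largest floating-point magnitude is an assumed formalization of "looking at the largest floating-point value", and the quantization grid is the one sketched in Section 4.

    import numpy as np

    def choose_exponent(x):
        """Pick E so that max|x| fits (approximately) in [-2**E, 2**E - 2**(E - 15)]."""
        return int(np.ceil(np.log2(np.max(np.abs(x)))))

    # Toy demonstration of the underflow effect: with E = 2 the resolution is 2**-13,
    # so the smallest back-propagated errors quantize to zero and learning stalls.
    errors = np.array([2.0e-3, -7.0e-4, 5.0e-5])
    for E in (2, choose_exponent(errors)):           # choose_exponent(errors) gives E = -8
        print(E, np.round(errors / 2.0 ** (E - 15)).astype(int))
    # E =  2 -> [16, -6, 0]          (the smallest error underflows)
    # E = -8 -> [16777, -5872, 419]  (all three errors remain representable)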

7. Conclusions

We have detailed here an efficient implementation of a back-propagation algorithm on T0. The use of the mixed fixed/floating-point mode in the implementation shows good performance with real-world networks, both in terms of computational efficiency and in terms of convergence rate. The limited precision supported by the hardware is not a problem, provided the range is appropriately chosen. The mixed model computes the output-layer error in floating-point and uses the floating-point values to determine an appropriate range for the subsequent fixed-point computations.

This work shows that digital neuroprocessors, and particularly T0, can be efficient test beds for various BP-type algorithms, even when limited by fixed-point formats.


Fig. 3. Performance for some real-world applications (in MCUPS).

Fig. 4. Learning behavior for different fixed-point ranges.

Acknowledgements

Thanks to David Johnson for providing the emulation routines for fixed- and floating-point math and for several interesting discussions on T0, to Naghmeh Nikki Mirghafori for providing the speech database, and to Professor Nelson Morgan for suggestions on the learning algorithm. We would also like to thank two anonymous reviewers for their suggestions on how to improve this paper. This work was developed while D. Anguita was a visiting researcher at ICSI, Berkeley, USA, under a grant from CNR - Consiglio Nazionale delle Ricerche, Italy.

References

[1] C. Alippi and M.E. Negri, Hardware requirements for digital VLSI implementations of neural networks, Int. Joint Conf. on Neural Networks, Singapore (1991) pp. 1873-1878.
[2] D. Anguita, G. Parodi and R. Zunino, An efficient implementation of BP on RISC-based workstations, Neurocomputing 6 (1994) 57-65.
[3] K. Asanović, J. Beck, B. Irissou, B. Kingsbury, N. Morgan and J. Wawrzynek, The T0 vector microprocessor, Hot Chips VII Symposium, Stanford Univ. (13-15 Aug. 1995).
[4] K. Asanović, J. Beck, J. Feldman, N. Morgan and J. Wawrzynek, Designing a connectionist network supercomputer, Int. J. Neural Systems 4(4) (Dec. 1993) 317-326.
[5] K. Asanović and N. Morgan, Experimental determination of precision requirements for back-propagation training of artificial neural networks, in Proc. of 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (16-18 Oct. 1991) pp. 9-15.
[6] V. Bochev, Distributed arithmetic implementation of artificial neural networks, IEEE Trans. on Signal Processing 41(5) (May 1993).
[7] H. Bourlard and N. Morgan, Continuous speech recognition by connectionist statistical methods, IEEE Trans. on Neural Networks 4(6) (Nov. 1993) 893-909.
[8] S. Carrato, A. Premoli and G.L. Sicuranza, Linear and nonlinear neural networks for image compression, in Digital Signal Processing, V. Cappellini and A.G. Constantinides, eds. (Elsevier, Amsterdam, 1991) pp. 526-531.
[9] G. Cybenko, Approximation by superposition of a sigmoidal function, Math. of Control, Signals, and Systems 2 (1989) 303-314.
[10] D.D. Caviglia, M. Valle and G.M. Bisio, Effect of weight discretization on the back propagation learning method: Algorithm design and hardware realization, Proc. of IJCNN 90, San Diego, USA (17-21 June 1990) pp. 631-637.
[11] A. Corana, C. Rolando and S. Ridella, A highly efficient implementation of back-propagation algorithm on SIMD computers, in High Performance Computing, Proc. of the Int. Symp., Montpellier, France (22-24 March 1989), J.-L. Delhaye and E. Gelenbe, eds. (Elsevier, Amsterdam, 1989) pp. 181-190.
[12] E. Fiesler, A. Choudry and H.J. Caulfield, A universal weight discretization method for multi-layer neural networks, IEEE Trans. on SMC, to appear.
[13] D. Hammerstrom, A VLSI architecture for high-performance, low-cost, on-chip learning, Proc. of the IJCNN 90, San Diego, USA (17-21 June 1990) pp. 537-544.
[14] M. Hoehfeld and S.E. Fahlman, Learning with limited numerical precision using the cascade-correlation algorithm, IEEE Trans. on Neural Networks 3(4) (July 1992) 602-611.
[15] P.W. Hollis, J.S. Harper and J.J. Paulos, The effect of precision constraints in a backpropagation learning network, Neural Computation 2(3) (1990).
[16] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE J. 37(2) (Feb. 1991) 233-243.
[17] G. Kane and J. Heinrich, MIPS RISC Architecture (Prentice Hall, Englewood Cliffs, NJ, 1992).
[18] N. Mauduit, M. Duranton, J. Gobert and J.A. Sirat, Lneuro 1.0: a piece of hardware LEGO for building neural network systems, IEEE Trans. on Neural Networks 3(3) (May 1992) 414-421.
[19] N. Morgan and H. Bourlard, Factoring networks by a statistical method, Neural Computation 4(6) (Nov. 1992) 835-838.
[20] U. Ramacher et al., eds., VLSI Design of Neural Networks (Kluwer Academic, Dordrecht, 1991).
[21] U. Ramacher et al., SYNAPSE-X: a general-purpose neurocomputer, Proc. of the 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (Oct. 1991) pp. 401-409.
[22] S. Sakaue, T. Kohda, H. Yamamoto, S. Maruno and Y. Shimeki, Reduction of required precision bits for back-propagation applied to pattern recognition, IEEE Trans. on Neural Networks 4(2) (March 1993) 270-275.
[23] J.A. Sirat, S. Makram-Ebeid, J.L. Zorer and J.P. Nadal, Unlimited accuracy in layered networks, IEEE Int. Conf. on Artificial Neural Networks, London (1989) pp. 181-185.
[24] T.J. Sejnowski and C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987) 145-168.
[25] G. Tesauro and T.J. Sejnowski, A neural network that learns to play backgammon, in Neural Information Processing Systems, D.Z. Anderson, ed. (1987) pp. 442-456.
[26] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural Networks 3(5) (1990) 561-573.
[27] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Accelerating the convergence of the back-propagation method, Biological Cybernetics 59 (1989) 257-263.


[28] J. Wawrzynek, K. Asanović and N. Morgan, The design of a neuro-microprocessor, IEEE Trans. on Neural Networks 4(3) (May 1993) 394-399.

Davide Anguita obtained the laurea degree in Electronic Engineering from Genoa University in 1989. He worked at Bailey-Esacontrol in the field of wide-area distributed control systems, then joined the Department of Biophysical and Electronic Engineering (DIBE) of Genoa University, where he received the Doctorate in Computer Science and Electronic Engineering. After a one-year visit to the International Computer Science Institute, Berkeley, CA, he is currently a postdoctoral research assistant at DIBE. His research activities cover neurocomputing and parallel architectures, including applications and implementation of artificial neural networks and the design of parallel and distributed systems.

Benedict Gomes received the B.S. degree in Computer Engineering from Case Western Reserve University, Cleveland, OH, and his M.A. in Computer Science from U.C. Berkeley. He is currently working on his PhD at U.C. Berkeley. His research centers on mapping structured connectionist networks onto general-purpose parallel machines.