A VLSI Implementable Learning Algorithm


TABLE OF CONTENTS

1. Introduction .......... 1
1.2. Objective .......... 3
1.3. Contributions of the dissertation .......... 5
1.4. Organization of the dissertation .......... 6
2. Background information
2.1. On hardware implementation of neural networks .......... 7
2.2. On top-down design methodology .......... 10
3. The learning algorithms
3.1. Deterministic learning algorithm .......... 11
3.2. Original stochastic learning algorithm .......... 15
3.3. Modifications made for VLSI implementation .......... 25
3.3.1. Digital Sigmoid .......... 27
3.3.2. Modifications of the original weight adjustment mechanism .......... 32
3.3.3. Data representation .......... 33
4. Top-down design methodology .......... 37
5. Top-down design with Alopex
5.1. Choice of data format .......... 59
5.2. First Step: C language implementation .......... 65
5.3. Second Step: HDL functional description .......... 65
5.4. Third Step: preliminary module partitioning .......... 66
5.5. Module description
5.5.1. Weight adjustment .......... 71
5.5.2. Output calculation units .......... 78
5.5.3. Control unit .......... 83
5.5.4. Noise serial, clock generator and power-on-reset modules .......... 90
5.5.5. Operations and machine cycles used by the control unit and the neural array during one training iteration .......... 91
5.6. Synthesis Step .......... 95
5.7. Placement and routing .......... 99
6. Conclusions and future work
6.1. Summary .......... 100
6.2. Conclusions .......... 102
6.3. Directions for future work .......... 106

7. Appendix A - Software listings
A.1. Perceptron HDL behavioral/structural descriptions & results .......... 107


A.2. Sigmoid HDL behavioral description .......... 122
A.3. Alopex C language implementation and results .......... 125
A.4. Alopex HDL behavioral/structural description .......... 145
8. Appendix B
B.1. Design, synthesis and analysis tools .......... 160
B.2. Tutorials .......... 164
9. Bibliography .......... 169
10. Author's vita .......... 172


LIST OF TABLES

Table 5.1. Actual weight adjustments for 8 iterations .......... 77
Table 5.2. Error calculations in a typical iteration .......... 89
Table 5.3. Machine cycles for one training iteration .......... 91

LIST OF FIGURES

Figure 3.1: The Perceptron .......... 12
Figure 3.2: Hierarchical module partitioning .......... 13
Figure 3.3: Error/weight change probabilities .......... 17
Figure 3.4: Flowchart for Alopex C language implementation .......... 19
Figure 3.5: Digital sigmoid implementation .......... 25
Figure 3.6: a) Digital sigmoid transfer function for several values of; b) timing waveforms of the behavioral HDL description; c) synthesized schematic; and d) chip layout .......... 30
Figure 3.7: Integer/fractional multiplication comparison .......... 36
Figure 4.1: Top-down design steps .......... 40
Figure 4.2: Sample run for results of 13 iterations of perceptron algorithm .......... 43
Figure 4.3: Top-level schematic .......... 48
Figure 4.4: Training module partitioning .......... 50
Figure 4.5: Verilog XL timing waveforms .......... 52
Figure 4.6: Training and testing patterns .......... 53
Figure 4.7: Gate level schematic and reports generated by Synergy synthesis tools .......... 54
Figure 4.8: Verilog XL timing waveforms of the gate level simulation .......... 57
Figure 4.9: Perceptron (training module) final chip layout .......... 58
Figure 5.1: Verilog code for signed fractional multiplier .......... 61
Figure 5.2: Data representation for Alopex implementation .......... 63
Figure 5.3: F(net) for present implementation .......... 64
Figure 5.4: HDL structural description of network architecture .......... 67
Figure 5.5: System block diagram .......... 70
Figure 5.6: Timing waveforms of a weight unit updating its value .......... 74
Figure 5.7: Timing waveforms of output calculation units .......... 81
Figure 5.8: Portion of sample run showing how neurons calculate f(net) = Σ wj * xj .......... 82
Figure 5.9: Timing waveforms for control unit .......... 86
Figure 5.10: Portion of a sample run showing actual calculated values for partial and total errors .......... 89
Figure 5.11: Timing waveforms for Alopex .......... 93
Figure 5.12: Gate level schematic of output calculation unit (neuron) .......... 96
Figure 5.13: Gate count, total area and maximum delay reports for a) weight unit, b) output calculation unit .......... 97
Figure 5.14: Final layout for output calculation unit .......... 99
Figure B.1: Data flow in Cadence Design Framework II Environment .......... 161


Chapter 1 INTRODUCTION

1.1 Motivation

During the last two decades, several artificial neural network architectures have been proposed to address the issue of computational intelligence (Werbos, 1974; Kohonen, 1988; Hopfield, 1985; Grossberg, 1987; Fukushima, 1984; Minsky & Papert, 1988). A fundamental characteristic of artificial neural networks is their learning capability. They implement this by performing simple calculations, such as product sums and nonlinear functions, using local operators on a large number of interconnected processing units, trying to emulate biological neurons (Freeman, 1992). Learning is realized as various adaptation rules that define the way synaptic weights are to be modified. After learning, neural networks automatically reproduce the relation implicitly contained in new test data. The large number of processing elements and the large number of interconnections among them make neural network simulation on traditional hardware dramatically slow (Lehman, 1993). Most of these applications are merely software implementations in which both the learning and the testing are simulated in a sequential, single-processor machine. Other applications rely on training the networks (i.e., adjusting the weights during several iterations) using a high-level programming language such as Pascal, C or C++. After training is complete, the weights are downloaded to the hardware for testing (Sackinger, 1992). This actually implements an accelerator chip, which is not learning but merely performing fast calculations after it was trained off-chip. In some cases, training may take several hours or even days, since the CPU must decode each instruction and update each weight individually following a learning algorithm. The fundamental drawback is that the inherent parallelism of a neural net architecture is lost entirely or in part when simulated in a traditional sequential machine, a vector computer, workstation or even transputer arrays (Ramacher, 1991). By implementing a neural network architecture in hardware, with on-chip learning capabilities, or by simulating its implementation using the design and analysis tools now available, an appreciable reduction in computing time is possible. Thus, the parallel nature of the interconnected


neurons can be fully exploited (Sanchez, 1993; Linares, 1993). These simulations are essential for providing a reasonable assurance that a proposed computational paradigm will function as intended in hardware. Simulating networks can help assure manufacturability if the components are modeled to exhibit their characteristics, usually switching speed, at both extremes of their tolerances (Reid, 1991). Several researchers (Ramacher, 1991; He, 1993; Melton, 1992; Mumford, 1992; Lehman, 1993; Macq, 1993; Harrer, 1992; Linares, 1993; et al.) have successfully attempted hardware implementations of learning algorithms. Generally, they approach the design process from a bottom-up perspective and do not present a design methodology that others may follow and adapt to other applications. In addition to a hardware implementation of neural network algorithms, needed to truly exploit the parallel nature of these powerful architectures, today's high demand for shorter turnaround times and increasingly complex electronic designs requires a new design methodology. Top-down design with Hardware Description Languages (HDLs), including the use of automatic synthesis, chip layout tools and backannotation at various stages for checking correctness, is the way to design today. This methodology is likely to displace bottom-up, traditional, schematic-based design techniques. For many years, logic schematics served as the basis for circuit designs. However, in today's complex systems, such a gate-level, bottom-up design methodology would produce schematics with so many interconnections that the functionality of the system would be lost in a web of wires (Sternheim, 1993). Thus, integrated circuit and digital logic designers are now using a different approach to circuit design: a top-down design methodology using HDLs that keeps the architecture and functionality of the complete system at the highest level of abstraction, hiding the details of implementation of lower level modules until the system performs as close as possible to the specifications.


The concept of top-down hardware design correlates to the concept of structured programming in software design. As explained in (Comer, 1983), Niklaus Wirth, the developer of the Pascal programming language, provides the following definition of structured programming: "Structured programming is the formulation of programs as hierarchical, nested structures of statements and objects of computation." (Wirth, 1974) This implies the partitioning of a problem into simpler, easier to handle parts or modules. This refining process continues until each module is a task that can be implemented by program statements, minimizing the interaction between modules and resulting in minimum propagation of changes or errors to other parts of the system. All these modules working together accomplish the overall system function defined in the specifications. These concepts can be extended to hardware top-down design. The complete system is characterized as a "box" with inputs, outputs and a set of specifications. Only these and the overall function of the system are known at the beginning of the design process. This may be a neural network architecture for which only its functional behavior is well understood, since it may have already been extensively tested in software simulations using a high level language such as C. During several iterations, the designer divides this system into functional modules as independent of each other as possible. Further iterations continue partitioning these modules into submodules, creating a hierarchical architecture until the lowest level modules contain simple hardware elements found in well defined libraries of components. As each iteration is performed, the refined modules are tested within the complete system for timing and functional verification. In this way, the designer proceeds further into the design process with the assurance that the final product will perform to specifications.

1.2 Objective

Several researchers have discussed the top-down design approach for evolving the next generation of computers (Comer, 1983; Franca, 1994; Wolf, 1992; Sandige, 1992; Chen, 1993). All these papers address traditional computer architecture examples such as CPU design, microprocessor systems, ALUs, ROM,


RAM, decoders, etc. They assume that their audience has a substantial background or training in computer organization, assembly language programming and digital logic design. However, today researchers from a variety of fields such as physics, psychology, computational neuroscience, mathematics, etc., are proposing novel architectures and computational paradigms such as neurocomputers, genetic algorithms, fuzzy controllers, etc. This thesis is targeted to such an audience in order to introduce them to this hierarchical design automation approach. For the researcher who is experimenting with a learning algorithm, the refinement process of module partitioning may stop before a purely structural description (i.e., a description with its modules consisting only of submodule instantiations, library components and state machines) is achieved, since in this case module partitioning may not be as clear cut as in the case of a well understood digital system architecture such as a microprocessor. The objective in this case is to produce a parallel system architecture that is synthesizable and implementable in silicon, i.e., for which a gate level architecture can be automatically produced by the software synthesis tools and automatically placed and routed by the layout tools. By backannotating the information produced at each level, i.e., at the synthesis and layout levels, the design can be verified with parameters that are closer to the actual ones to be encountered when the chip is finally fabricated. In this way, the learning algorithm is embedded in hardware and the parallel nature of the artificial neural networks can be fully studied. The objective of this dissertation is to develop a top-down design methodology for synthesizing neural networks using a standard CAD toolset. The hypothesis is demonstrated by using the most versatile neural network, namely the single-layer perceptron. Then the methodology is extended to an advanced neural net architecture, Alopex.


1.3 Contributions of the dissertation

It is the first time ever that a top-down design methodology has been proposed for the design of neural network architectures. Top-down design concepts have been discussed in the context of well known digital computer architectures such as CPUs and I/O interface devices. These applications do not require such an exhaustive iterative technique of backannotation between different levels throughout the design process as the one required to implement computational algorithms, such as a neural network architecture. After each design step was completed, knowledge gained by performing extensive simulations was used to go back up one or several steps in the design process, modify the algorithm, and continue down. Several iterations at each design step were required, since each step advanced the implementation further into the hardware level, uncovering new limitations and secondary effects which had been transparent at a previous, higher level of abstraction. Neural network hardware implementations have been successfully attempted by several researchers. They usually approach the design using a bottom-up technique, which produces a specific architecture that cannot be improved or optimized to address all the issues that come up during the design process and does not provide guidance for other researchers to continue or optimize

their designs. This dissertation, however, applies a top-down design methodology to the hardware implementation of neural network architectures. With such a methodology, a designer may undertake a different design, learning algorithm or application, using the modules developed in this dissertation and optimizing, improving or easily adapting them to suit other implementations. The approach proposed in this dissertation allows for the extensive modifications in the computational algorithm that are necessary for hardware implementability. The methodology and examples shown in this dissertation are general enough for any machine learning algorithm implementable as a neural network architecture. This is shown with an example of a most versatile neural network architecture that can be used as a building block for other architectures, since the modules are cascadable. This is then applied, and results are shown,


for a novel architecture.

1.4 Organization of the dissertation

This dissertation is organized as follows: Chapter 2 provides an overview of previous work done in the areas of hardware implementation of neural networks and design methodologies. Chapter 3 describes both the learning algorithms (i.e., the deterministic perceptron and the stochastic Alopex) and the modifications made to make them synthesizable, i.e., VLSI implementable. Chapter 4 describes the steps taken to map the perceptron learning algorithm to hardware while presenting results. Chapter 5 extends the methodology to a novel learning architecture implementation, the Algorithm for pattern extraction (Alopex). Chapter 6 concludes this dissertation with some thoughts recommending additional work to continue the research. Software listings are included in Appendix A. Appendix B includes an overview of the software tools used, and a step-by-step tutorial on how to use these tools in an actual university laboratory environment. The bibliography and the vita of the author are also included.


Chapter 2 BACKGROUND INFORMATION

2.1 On hardware implementation of neural networks

Hardware implementation of neural networks has been investigated by several researchers. Several of them implement the Backpropagation (BP) algorithm. However, according to Grossberg (Grossberg, 1987), this popular learning algorithm is not biologically plausible and is inherently difficult to implement in hardware, since there is no evidence that biological synapses can reverse direction to propagate the calculated errors to previous layers, nor that neurons can compute derivatives (Hinton, 1989). In BP, the calculation of the weights depends on differentials with respect to other parameters, which in turn depend on differentials with respect to previously calculated parameters. That is, the BP algorithm recursively modifies the synaptic weights between neuronal layers. This algorithm first modifies the synapses between the output field and the penultimate field of hidden neurons. The algorithm then uses this information to modify the synapses between the next two levels back of hidden neurons, and so on all the way back to the synapses between the first hidden field and the input field (Kosko, 1992). To realize this in hardware requires an impractical amount of interconnections among neurons to propagate the information forward and backward in order to calculate the error.
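The recursion just described can be made concrete with a purely illustrative C sketch of the standard textbook backpropagation update for one hidden layer. It is not code from this dissertation; all layer sizes, weights and the learning rate eta are made-up placeholders, and a logistic activation is assumed.

#include <stdio.h>
#include <math.h>

#define N_IN  3
#define N_HID 4
#define N_OUT 2

static double sigmoid(double net) { return 1.0 / (1.0 + exp(-net)); }

int main(void)
{
    /* made-up data and weights, just to exercise one training iteration */
    double x[N_IN]   = { 0.5, -0.25, 1.0 };
    double t[N_OUT]  = { 1.0, 0.0 };
    double w_ih[N_HID][N_IN]  = {{0.1,0.2,-0.1},{0.05,-0.2,0.3},{-0.1,0.1,0.1},{0.2,0.0,-0.3}};
    double w_ho[N_OUT][N_HID] = {{0.1,-0.2,0.3,0.05},{-0.1,0.2,0.1,0.0}};
    double eta = 0.5;

    /* forward pass through both layers */
    double net_h[N_HID], h[N_HID], net_o[N_OUT], y[N_OUT];
    for (int j = 0; j < N_HID; j++) {
        net_h[j] = 0.0;
        for (int i = 0; i < N_IN; i++) net_h[j] += w_ih[j][i] * x[i];
        h[j] = sigmoid(net_h[j]);
    }
    for (int k = 0; k < N_OUT; k++) {
        net_o[k] = 0.0;
        for (int j = 0; j < N_HID; j++) net_o[k] += w_ho[k][j] * h[j];
        y[k] = sigmoid(net_o[k]);
    }

    /* backward pass: the output-layer deltas are computed first ... */
    double delta_o[N_OUT], delta_h[N_HID];
    for (int k = 0; k < N_OUT; k++)
        delta_o[k] = (t[k] - y[k]) * y[k] * (1.0 - y[k]);

    /* ... and the hidden-layer deltas need those output deltas propagated
       backwards through the output-layer weights */
    for (int j = 0; j < N_HID; j++) {
        double s = 0.0;
        for (int k = 0; k < N_OUT; k++) s += w_ho[k][j] * delta_o[k];
        delta_h[j] = s * h[j] * (1.0 - h[j]);
    }

    /* only now can both weight layers be adjusted */
    for (int k = 0; k < N_OUT; k++)
        for (int j = 0; j < N_HID; j++) w_ho[k][j] += eta * delta_o[k] * h[j];
    for (int j = 0; j < N_HID; j++)
        for (int i = 0; i < N_IN; i++)  w_ih[j][i] += eta * delta_h[j] * x[i];

    printf("outputs before the update: y[0]=%f y[1]=%f\n", y[0], y[1]);
    return 0;
}

Even in this small sketch, the hidden-layer update cannot proceed until the output-layer deltas are available and routed backwards, which is exactly the interconnection burden that makes BP unattractive for direct hardware implementation.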

Other learning algorithms implemented in hardware that are found in the literature include the Kohonen feature map (He, 1993; Melton, 1992; Mumford, 1992) and Hopfield type of networks (Lehman, 1993) both of which are architectures that have limited applicability to solving a variety of real world problems.

The proposed stochastic learning algorithm to be implemented in hardware was initially investigated by Harth in 1976 (Harth, 1976) in relation to the problem of ascertaining the shapes of visual receptive


fields. He originally proposed a model of visual perception as a stochastic process where sensory messages received at a given level are modified to maximize responses of central pattern analyzers. Later, computer simulations were carried out using feedback generated by a single scalar response, and very simple neuronal circuits in the visual pathway were shown to be capable of carrying out this algorithm (Harth, 1987). Harth and Pandya showed that the process represented a new method of approaching a classical mathematical problem, that of optimizing a scalar function f(x₁, x₂, ..., xₙ) of n parameters xᵢ, i = 1, 2, ..., n, where n is a large number (Harth & Pandya, 1988). Herman et al. have discussed this algorithm in the context of pattern classification, in particular using piecewise linear classifications (Herman, 1990). Rosenfield refers to it in a survey of picture processing algorithms (Rosenfield, 1987).

The algorithm works as a cross-correlation between the synaptic weight changes and the global errors calculated in the previous two iterations, at any given layer. These simple calculations can be done concurrently by all the neuronal processing elements. In addition, the proposed algorithm, due to its stochastic nature, does not suffer from another inherent problem of the BP implementation: that it converges to a local error minimum, if it converges at all (Kosko, 1992).
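As an illustration of the kind of update this implies, the following C sketch implements one commonly published form of a correlation-driven stochastic rule: each weight takes a fixed step delta whose sign is drawn from a probability computed from the product of the previous weight change and the previous error change, with a temperature T controlling the randomness. This is only a hedged sketch of the idea; the exact rule, probability function and data formats used in this dissertation are the ones defined in section 3.2, and the names alopex_step, delta and T are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_WEIGHTS 8

static double w[N_WEIGHTS];    /* synaptic weights                        */
static double dw[N_WEIGHTS];   /* weight changes made in the last iteration */
static double err_prev;        /* global error measured in the last iteration */

/* One correlation-driven stochastic iteration (Alopex-style sketch). */
static void alopex_step(double err, double delta, double T)
{
    double d_err = err - err_prev;               /* error change over the previous two measurements */
    for (int i = 0; i < N_WEIGHTS; i++) {
        double corr = dw[i] * d_err;              /* cross-correlation of weight change and error change */
        double p = 1.0 / (1.0 + exp(-corr / T));  /* high when the last step made the error worse        */
        double step = ((double)rand() / RAND_MAX < p) ? -delta : +delta;
        w[i]  += step;                            /* every weight is updated concurrently in hardware    */
        dw[i]  = step;
    }
    err_prev = err;
}

/* Dummy global error (sum of squared weights), only to exercise the update. */
static double global_error(void)
{
    double e = 0.0;
    for (int i = 0; i < N_WEIGHTS; i++) e += w[i] * w[i];
    return e;
}

int main(void)
{
    for (int i = 0; i < N_WEIGHTS; i++) { w[i] = 1.0; dw[i] = 0.0; }
    err_prev = global_error();
    for (int n = 0; n < 200; n++)
        alopex_step(global_error(), 0.01, 0.001);
    printf("final error: %f\n", global_error());
    return 0;
}

Note that the per-weight work is only a multiplication, a comparison against a random number and an addition, with the single scalar error broadcast to all units, which is why the rule maps naturally onto a parallel hardware array.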

Most of the work done in hardware implementation of neural networks is in the analog field (Macq, 1993; Harrer, 1992; Linares, 1993; Sackinger, 1992; Sanchez, 1993). These researchers believe that an analog implementation best resembles the biological learning experience, that the final chip/die size is much smaller than a digital implementation would produce, and that it is faster. However, the relation between accuracy and chip area is a major design problem, since the more precisely a designer wishes to control the matching of the analog components, the larger the chip area that is required. Another drawback of an analog implementation is the difficulty of storing the weights, which is usually done as charge stored in capacitors. The charges leak and thus require a refreshing mechanism, adding to the overall chip area and affecting accuracy. The chip area is also affected by the need for low power consumption. This requires low currents, which in turn require large resistors in the circuits that implement the synaptic weights. The


resistors, or switched capacitors, that implement the weights do not allow for high precision, thus limiting the complexity of the pattern that can be reliably processed with an analog net (Ramacher, 1991).

The technology available today allows for very large component integration. Thus, for a final die size of reasonable dimensions, very powerful computations can be implemented using a digital implementation. Storage of weight information and user defined parameters can be easily accomplished in digital architectures, and speed of operation can be comparable to an analog implementation. In a digital implementation, the application and modeling of the net characteristics can be made independent of circuit design. The word length, and thus the computation accuracy, can be determined for each module before the implementation by software simulation. The use of floating point numbers can provide high accuracy and eliminates the problem of limit cycles. According to (Ramacher, 1991), "If the information must be processed with high precision (not less than 8 bits) and the learning is to be supported on-chip, digital circuitry is the right candidate for implementing a neural net. Conversely, for applications which do not need hardware support for learning and for less severe requirements in computation precision, analog design seems to dominate."

The stochastic nature of one of the proposed learning algorithms to be implemented for this dissertation, Alopex, and the simplicity of its weight adjustment calculations require a rather high level of accuracy in parameter precision. In addition, the learning algorithm is to be implemented on-chip. Thus, a digital implementation was selected.

Other authors have implemented hybrid architectures by designing parts of the neural network architecture using analog techniques and other parts using digital techniques, in an attempt to find the ideal architecture that will incorporate the best features of both worlds: ease of information storage, ease of interfacing, and high precision of a digital implementation with the speed and compactness of an analog implementation (De Yong, 1992; Sackinger, 1992).


2.2 On top-down design methodology

Industry has developed top-down design methodologies for its product development process flow with the goal of achieving a substantial reduction in design cycle time. These corporations internally publish design methodology guides, covering the entire process of creating an integrated circuit from specifications onward, that their design engineers must follow. Usually, these documents are proprietary to the companies that develop them and are not available to outside researchers. Several publications in the field of engineering education have addressed this topic as well. However, their treatment of the top-down design methodology for circuit design is limited to the teaching of courses in a university environment, which generally uses non-commercial software tools developed by other universities, such as the Magic VLSI CAD package developed at the University of California at Berkeley (Williams, 1991; Wolfe, 1992; Berkes, 1991), or at other universities (Reid, 1991), or tools that are not readily available or used in the industry today (Aylor, 1986; Soma, 1988; Rucinsky, 1988; Sait, 1992). The application examples are also limited to a few well known digital architectures, such as microprocessors, and do not give a general, step-by-step methodology that others can follow to produce the desired product. Other authors describe a simplistic top-down design methodology for logic equation implementation with a PLD, considering the topmost level to be the writing of a Boolean equation, the next level a list of signals, followed by the design documentation, then the functional logic diagrams, the realizable logic diagrams and finally, at the bottom level, the detailed logic diagrams (Sandige, 1992). Gander (1994) presents a top-down design methodology for designing a multi-chip hardware system, with software tools including schematic capture, circuit simulation and PC board layout capabilities. This dissertation shall define a more exhaustive top-down design technique than the ones that can be found in the literature, without being as specific as proprietary documentation would be. This will be accomplished by using today's popular industry standards, such as the Verilog XL, Synergy and Epoch software tools, with the goal of producing a single VLSI chip that implements the desired functions.


Chapter 3 THE LEARNING ALGORITHMS

3.1 Deterministic learning algorithm

The perceptron learning algorithm will be used in the next chapter to illustrate the design methodology approach to mapping a learning algorithm to hardware. By following the design steps with a simple example, the methodology thus studied can then be applied to the design of a network that implements a more complex learning algorithm. The simple but elegant perceptron learning algorithm is used to design a training module in such a fashion that each submodule adjusts one weight only. This allows for a completely cascadable design that can be used as a building block to form larger networks such as multilayer perceptrons. Frank Rosenblatt developed a large class of artificial neural networks called perceptrons in 1958 (Rosenblatt, 1962). The typical perceptron consisted of a layer of input neurons (the retina) connected by paths with weights to a second layer called associator neurons (see Figure 3.1). The weights on the connection paths were adjusted by following a learning rule called the perceptron training rule, which uses an iterative weight adjustment that is very powerful. For each training input, the net calculates the response of the output unit (the calculated output) by performing the sum of the products of the weights times the inputs. Then the net determines whether an error occurred for the individual pattern (by comparing the calculated output with the target or desired value). In this design, for example, if the desired (target) output is 1 and the calculated output is negative or zero, then the weights are adjusted by adding the input value, i.e., increasing the weights as given by

wᵢ(new) = wᵢ(old) + xᵢ     (3.1)


where xᵢ is the corresponding input for the given pattern. On the other hand, if the desired output is 1 and the calculated output is positive, the weights are not changed, i.e., they are correct. Similarly, if the desired output is -1 and the calculated output is positive, the weights are adjusted by subtracting the input value as shown below:

wᵢ(new) = wᵢ(old) - xᵢ     (3.2)

On the other hand, if the desired output is -1 and the calculated output is negative, then the weights are correct and are not adjusted. Figure 3.2 shows the hierarchical module partitioning of the neural architecture, automatically generated by the software tools when the Verilog files were imported from the Unix to the Design Framework II environment.

Figure 3.1: The Perceptron (adapted from Khanna, 1990)


Figure 3.2: Hierarchical module partitioning


The goal of the trained neural net is to classify each input pattern as belonging, or not, to a particular class. Belonging is signified by the output unit giving a positive-valued response; not belonging is indicated by a negative-valued response. The net is trained to perform this classification by the iterative technique described below (for a detailed discussion see Fausett, 1994, or Pandya and Macy, 1995):

Step 0: Initialize weights (in this design, weights are initialized to 0).
Step 1: While weights change, and iterations do not exceed a maximum allowed, do Steps 2-6.
Step 2: For each training pattern, do Steps 3-5.
Step 3: Read the pattern into the input units, xᵢ, and the desired output, t.
Step 4: Compute the response of the output unit, y_in = Σ xᵢwᵢ, and calculate the activation function, y = sigmoid(y_in).
Step 5: Update the weights if an error occurred for this pattern:
If t = -1, then if y > 0, wᵢ(new) = wᵢ(old) - xᵢ; else wᵢ(new) = wᵢ(old).
Else if t = 1, then if y ≤ 0, wᵢ(new) = wᵢ(old) + xᵢ; else wᵢ(new) = wᵢ(old).
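A minimal C sketch of Steps 0-5, using the update rules of equations 3.1 and 3.2, may help tie the steps together. The training patterns, sizes and iteration limit below are made-up placeholders, plain floating point is used instead of the fixed-point formats discussed later, and the sign of y_in is tested directly in place of the sigmoid output.

#include <stdio.h>

#define N_INPUTS   3
#define N_PATTERNS 2
#define MAX_ITER   100

int main(void)
{
    /* hypothetical training set: one row per pattern, t[] holds the +1/-1 targets */
    double x[N_PATTERNS][N_INPUTS] = { {  0.25, -0.125,  0.125 },
                                       { -0.25,  0.125, -0.125 } };
    int    t[N_PATTERNS] = { +1, -1 };
    double w[N_INPUTS]   = { 0.0, 0.0, 0.0 };          /* Step 0: weights = 0 */

    for (int iter = 0; iter < MAX_ITER; iter++) {      /* Step 1 */
        int changed = 0;
        for (int p = 0; p < N_PATTERNS; p++) {         /* Steps 2-3 */
            double y_in = 0.0;
            for (int i = 0; i < N_INPUTS; i++)          /* Step 4: y_in = sum(x*w) */
                y_in += x[p][i] * w[i];
            /* Step 5: adjust the weights only when the pattern is misclassified */
            if (t[p] == +1 && y_in <= 0.0) {
                for (int i = 0; i < N_INPUTS; i++) w[i] += x[p][i];   /* eq. 3.1 */
                changed = 1;
            } else if (t[p] == -1 && y_in > 0.0) {
                for (int i = 0; i < N_INPUTS; i++) w[i] -= x[p][i];   /* eq. 3.2 */
                changed = 1;
            }
        }
        if (!changed) break;                            /* weights stopped changing */
    }
    for (int i = 0; i < N_INPUTS; i++) printf("w[%d] = %f\n", i, w[i]);
    return 0;
}

In the hardware version described in chapter 4, the inner weight-adjustment loop disappears: one cascadable submodule is instantiated per weight, so all weights are updated in parallel.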

3.3.3 Data representation

For the 8-bit weights, the format selected was s i i i . f f f f, which allows weight values between +7.9375₁₀ (7f₁₆) and -7.9375₁₀ (81₁₆). For the 8-bit inputs, the format selected was the following: s . f f f f f f f, with the bit positions carrying the weights 2⁻¹ 2⁻² 2⁻³ 2⁻⁴ 2⁻⁵ 2⁻⁶ 2⁻⁷.

This allows the inputs and outputs of the neurons to take on values between +0.9921875₁₀ (7f₁₆) and -0.9921875₁₀ (81₁₆). If 16-bit representations are used, for example, then the inputs and outputs would have values between +0.9999695₁₀ and -0.9999695₁₀ for the format shown below:

s . f f f f f f f f f f f f f f f

with the bit positions carrying the weights 2⁻¹ through 2⁻¹⁵.

It is easy to see that a simple truncation of the lower 8 bits of a 16-bit number will produce an 8-bit number that is very close in value to the original 16-bit number, since the most significant bits are preserved. This is not the case if the bit positions are for integers instead of fractional data. For example, a 16-bit result of multiplying two 8-bit numbers may be 7fff₁₆ = 0.9999695₁₀ using the above


format. Truncating this 16-bit value to an 8-bit number, so that it can be output to the next neuronal layer and used as an 8-bit input, results in 7f₁₆ = 0.9921875₁₀ using the above format (a 0.7% error). If integer representation is used for the neuronal inputs and outputs, then 7fff₁₆ = 32767₁₀, and merely truncating the lower (or upper) 8 bits will result in an unacceptably large error, since 7f₁₆ = 127₁₀. Obviously, a different mechanism, more costly in timing and silicon area, would be needed to maintain calculation accuracy. When numbers are multiplied and added, such as when the weights are multiplied by the inputs and their results are accumulated in a given register, this register has additional extension bits to allow for word growth. For example, when multiplying an 8-bit weight times an 8-bit input, a 16-bit register is needed to store the result. If several of these multiplications are to be accumulated in a register, several extension bits may be needed to avoid overflow. The accumulator registers chosen for the Alopex implementation are 19-bit registers, i.e., they allow for 3 bits of extension, since 16 bits are needed to store the result of multiplying two 8-bit numbers. This is important when rounding and truncation are required, for example when a 19-bit number, the output of the neuron, has to be reduced to 8 bits to be used as the input of the next level of neurons, since the data path is 8 bits wide. If any of these extension bits is on, indicating that the summation exceeds the maximum number representation, then limiting arithmetic is used. That is, the value of the register is clipped at the maximum or minimum allowed (the saturation value of the neurons), consistent with the size of the register to be stored and the sign of the accumulator register (the most significant bit of the extension register). Please refer to chapter 5 for specific implementation issues and examples.
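The two mechanisms just described, truncation of fractional data and limiting (saturating) arithmetic, can be sketched in C as follows. This is an illustrative model with assumed helper names and Q-format labels (Q0.15 for s.fff...f with 15 fraction bits, Q0.7 for the 8-bit form), not the dissertation's implementation.

#include <stdint.h>
#include <stdio.h>

/* Q0.15 -> Q0.7: keep the upper 8 bits; an arithmetic right shift preserves the sign. */
static int8_t truncate_q15_to_q7(int16_t v) { return (int8_t)(v >> 8); }

/* Limiting arithmetic: clip a wide accumulator to the 8-bit output range.
   The clip values correspond to the neuron saturation values of +/-0.9921875. */
static int8_t saturate_to_q7(int32_t acc)
{
    if (acc >  127) return  127;    /* +0.9921875 */
    if (acc < -127) return -127;    /* -0.9921875 */
    return (int8_t)acc;
}

int main(void)
{
    int16_t a = 0x7fff;                         /* 0.9999695 in Q0.15           */
    int8_t  b = truncate_q15_to_q7(a);          /* 0x7f = 0.9921875, small error */
    printf("%f -> %f\n", a / 32768.0, b / 128.0);
    printf("saturated: %d\n", saturate_to_q7(1000));   /* clipped to 127 */
    return 0;
}

Truncating the same bit pattern interpreted as an integer (32767 to 127) would lose almost all of the value, which is the point of the fractional representation argued above.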

A signed multiplication circuit was also implemented to handle signed operands and the scaling needed when handling fractional data. An example of two numbers being multiplied together is given in Figure 3.7 below. The difference between signed integer and signed fractional data multiplication is that hardware multipliers that handle integer data use the extra sign bit as a duplicate sign bit (there is an extra sign bit because two sign bits exist before the multiplication and only one is needed in the result). Hardware multipliers use this extra sign bit as a sign extension bit. In fractional multiplication, however, this extra sign bit is appended after the least significant bit as a zero bit. In the hardware implementation of the learning algorithms, this is accomplished by a shift left of the result of the multiplication with a zero fill in the least significant bit position. Thus, the same hardware multiplier is used as for integer multiplication, but the final result is shifted left to produce a correct fractional value.


Figure 3.7: Integer/fractional multiplication comparison
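A C model of this integer-versus-fractional distinction may make it concrete. The sketch below assumes the 8-bit weight format s i i i . f f f f and input format s . f f f f f f f described in this chapter, multiplies them as integers, and restores the fractional alignment with the extra left shift; the function name frac_mult and the sample values are illustrative only.

#include <stdint.h>
#include <stdio.h>

/* Multiply a weight (s iii.ffff, value = raw/16) by an input (s.fffffff,
   value = raw/128).  The raw integer product carries a duplicated sign bit;
   multiplying by 2 (the "shift left with zero fill") removes it and gives a
   16-bit s iii.ffff ffff ffff fractional result (value = raw/4096).
   The single corner case of both operands being the most negative value is
   ignored in this sketch. */
static int16_t frac_mult(int8_t w_q3_4, int8_t x_q0_7)
{
    int32_t raw = (int32_t)w_q3_4 * (int32_t)x_q0_7;  /* integer product        */
    return (int16_t)(raw * 2);                        /* drop duplicate sign bit */
}

int main(void)
{
    int8_t  w = (int8_t)0xff;      /* 1111.1111 = -0.0625 */
    int8_t  x = (int8_t)0x60;      /* 0.1100000 = +0.75   */
    int16_t p = frac_mult(w, x);   /* expected ff40 hex   */
    printf("product = %04x = %f\n", (unsigned)(uint16_t)p, p / 4096.0);
    return 0;
}

Run as an integer multiply without the final doubling, the same operands would leave the result misaligned by one bit position, which is exactly the discrepancy Figure 3.7 illustrates.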


Chapter 4 TOP-DOWN DESIGN METHODOLOGY

Currently, a top-down design methodology is preferred over a bottom-up design methodology. The latter resembles the steps involved in breadboarding the hardware system. It gives a poor functional view of the entire system, is time consuming and does not allow the engineering team to work concurrently, because individual pieces of the design have to be completed first (Thomas, 1991). On the other hand, top-down design begins with an HDL functional model of the top level system. Software design, simulation, synthesis, analysis and layout tools allow the designer to functionally describe, simulate and test a complete design architecture at the highest level of abstraction, i.e., specifying only a set of inputs, outputs, and its functionality. Similar to the concept of structured programming in software, the hardware system is partitioned into modules, as independent of each other as possible, after the overall system behavior is satisfactory. Each module is then optimized with respect to speed, behavior and area, tested and plugged back into the overall system architecture for behavioral verification. As each module is modified it can be incorporated back into the system architecture for verification at the system level. This design methodology allows for system performance verification early in the design process, saving many hours of work in achieving the desired performance. It also allows the engineering team to work concurrently since, once the desired system specifications are agreed upon and a preliminary module partitioning is made, each team member can take on the design of one of the modules. In this way, the complete design is developed concurrently.

Verilog is a Hardware Description Language for both behavioral and structural modeling that is becoming a standard among commercial users. Descriptions using Verilog, in some cases, result in code that is much more compact than VHDL, the other prominent hardware description language currently in use (Sternheim, 1993).


Verilog can be used to model a digital hardware system at many levels of abstraction, ranging from the algorithmic level (similar to a high level programming language implementation such as C or Pascal) to the gate level or to the switch (transistor) level. This model specifies the external view of the device and one or more internal views. The internal view of the device specifies its functionality or structure, while the external view specifies the interface, or connectivity, of the device through which it communicates with the other modules within the system (Thomas, 1991). The HDL can describe the functionality, connectivity and timing of a hardware circuit. Using HDLs makes the design of complex systems more manageable, since the designer only needs to change the HDL description, instead of reconfiguring and re-wiring a hardware prototype, if changes or upgrades are called for. The designer can try different design options quickly and easily, so the time and cost of fixing a design problem are significantly reduced.

As in a software environment, in which there is a section of code that contains the higher level control statements, one of these partitions will be the control unit module. The control module is not decomposed further until all other modules are successively decomposed into simpler, more independent structures and no further refinement is possible, i.e., until all submodules are ideally implemented with other module instantiations and basic library components (Comer, 1983). This may not be possible in all cases, especially if the system under study is a learning algorithm for which module partitioning may remain at a higher functional level of abstraction, due to the highly abstract nature of the algorithm itself or because the researcher is not very familiar with computer architectures. Once the control and timing signals are defined, and the system performs satisfactorily, the control unit is optimized by implementing it as, for example, a state machine.

It is at this stage of the design process that the concepts of top-down structured programming and top-down hardware design become different. In software, the simulations are generally run using a single CPU, thus executing in sequence the statements included within each module, and each module is put on wait while another is executed. In a hardware simulation environment, such as with the Verilog HDL simulator we used, all the modules appear to operate in parallel, controlled by the control unit which generates the necessary timing and control signals for each module. This simulator is event driven, i.e., only those elements that might cause a change in the circuit state (about 2 to 10 percent of circuit components at any given time) are evaluated and simulated, as opposed to other


simulators that are time driven and evaluate each element at each point in time, producing a new circuit state at each point in time. Event driven simulation has the concept of a time wheel embedded, i.e., time advances only when every event scheduled at that time has been executed (simulating a concurrent execution environment), and the time wheel can only advance forward (simulating the real life passage of time). This allows for a more accurate simulation of a hardware implementation.
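A toy C sketch of an event wheel may clarify the idea of event-driven time; it is unrelated to the Verilog XL implementation, and every name in it is made up.

#include <stdio.h>
#include <stdlib.h>

typedef struct event {
    long          time;              /* simulated time at which the event fires */
    void        (*action)(void);     /* what to evaluate at that time           */
    struct event *next;
} event_t;

static event_t *wheel = NULL;        /* pending events, kept ordered by time    */

static void schedule(long t, void (*action)(void))
{
    event_t *e = malloc(sizeof *e), **p = &wheel;
    e->time = t;
    e->action = action;
    while (*p && (*p)->time <= t) p = &(*p)->next;   /* insert in time order */
    e->next = *p;
    *p = e;
}

static void run(void)
{
    long now = 0;
    while (wheel) {
        event_t *e = wheel;
        wheel = e->next;
        now = e->time;               /* time only jumps forward, event by event */
        e->action();
        free(e);
    }
    printf("simulation finished at t=%ld\n", now);
}

static void toggle_clock(void) { printf("clock edge\n"); }

int main(void)
{
    schedule(10, toggle_clock);
    schedule(20, toggle_clock);
    run();
    return 0;
}

Only scheduled elements are ever evaluated, and nothing is recomputed for the time points in between, which is why an event-driven simulator scales to large circuits in which only a small fraction of components change at any instant.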

This approach is in contrast to the traditional bottom-up design methodology in which each system module is designed, implemented as a prototype, tested, modified, etc. After it performs satisfactorily, it is included as a building block upon which a larger module is built. For example, a one-bit full adder is made up of half adder units. Several one-bit full adders are put together to build a larger adder, and so on. Not until the complete ALU is finished can the designer verify its overall performance. With a top-down design methodology, a block diagram of an ALU is described in a high level HDL, simulated, tested, and then partitioned into modules, such as an adder, a multiplier, a shifter, etc. Each individual module is refined and optimized, plugging it back into the system until the desired behavior and performance is achieved. Several iterations may be needed at each design step, with the designer going back and forth between the product specifications and the design. Once a satisfactory performance is


achieved, the design is automatically synthesized. The gate level schematic thus produced is verified, comparing its behavior to the functional description. Finally, the design is automatically placed and routed by the tools. After the design is backannotated and verified, it is sent for fabrication. The fabricated chip is then tested. The availability of these sophisticated software tools has made it possible to complete a design in less time, and the cost of building, testing, verifying and rebuilding prototypes has been all but eliminated, resulting in a faster, cheaper design process, shortening the design turnaround time and allowing increasingly complex architectures to be automatically implemented in a VLSI circuit.

Figure 4.1 shows the flow chart of a complete top-down design process, from the design concept to the chip fabrication stage.

Figure 4.1: Top-down design steps


The following is a suggested step-by-step method for mapping a learning algorithm to hardware, exemplified by the implementation of a single-layer perceptron. The architecture is rather simple, but it serves the purpose of illustrating the design steps. These steps are just a guide for the beginner or the inexperienced designer to easily design a hardware implementation of a general purpose, fully connected neural network architecture, with the goal of investigating the performance and behavior of a learning algorithm. This work does not attempt to replace the expertise of an experienced digital circuit designer; it merely attempts to guide the novice designer through the steps in the design process, with the hope that the truly parallel nature of a neural architecture with on-chip learning capabilities can be studied.

First step: C language implementation: This step is optional but highly recommended, since through a high level language implementation the designer can understand every detail of the algorithm and initiate the process of identifying the block of the program that contains the control statements. This portion will become the control unit of the hardware implementation in the fourth step.

Second step: HDL description: The complete HDL behavioral description is achieved by following the programming logic developed using the C language and adapting it to satisfy constraints imposed by Verilog or any other HDL that is chosen. Since the final objective is to produce a gate level description, this step may require the designer to limit the choices of language constructs and data types to those that the synthesis tools understand. This step may require some minor, or major, adjustments to the original algorithm, especially for algorithms that use floating point numbers and hyperbolic transfer functions. For these cases, chapter 3 described some modifications made to a learning algorithm to make it synthesizable and VLSI implementable. In cases where this is not possible, the algorithm may be tested at this level of abstraction, with nonsynthesizable blocks simulated


at the behavioral level and blocks that may be synthesized simulated at the gate level. Mixed-mode simulation is a powerful analysis tool in cases such as this. Still, the event-driven simulator is capable of simulating concurrent processing of parallel blocks of code, allowing the researcher to experiment with the learning algorithm even though a final chip layout could not be automatically produced by the synthesis and layout tools available today. The complete software listing developed in Verilog for the perceptron algorithm is in Appendix A. Please

refer to portions of it during the following discussion. Figure 4.2 shows a sample run of the behavioral description, with the steps taken during a typical iteration. A set of two patterns is tested, giving the correct classification. The format of the data used for the perceptron implementation is the following:

weights = w[3:0] = s i . f f, with bit positions (sign) 2⁰ . 2⁻¹ 2⁻²
inputs = x[3:0] = s . f f f, with bit positions (sign) . 2⁻¹ 2⁻² 2⁻³

Thus, the values for the hexadecimal numbers shown in the sample run of Figure 4.2 are the following:

input, xᵢ = (hex) 2 = (fract) +0.25
input, xᵢ = (hex) f = (fract) -0.125
input, xᵢ = (hex) e = (fract) -0.25
weight, wᵢ = (hex) 2 = (fract) +0.5
input, xᵢ = (hex) 1 = (fract) +0.125
weight, wᵢ = (hex) f = (fract) -0.25
weight, wᵢ = (hex) e = (fract) -0.5

The accumulator register where the sum of products (xᵢ * wᵢ) is to be stored is a 12-bit register. This allows for accumulation without overflowing. It has the following format:

s s s i i . f f f f f 0



Figure 4.7 (continued): gate level schematic and reports generated by Synergy synthesis tools.


Ninth step: Verilog XL simulation of the synthesized schematic: Using the same test fixtures used for the functional simulation done before the synthesis step, the gate level schematic is simulated and the waveforms compared. Simulating the gate level schematic is important because the actual library cells are used. Mixed-level simulation is also possible, allowing the designer to simulate modules for which a gate level implementation exists, either as a product of the synthesis tools or manually entered directly at the gate level, mixed with blocks of Verilog functional descriptions. This permits the simulation of isolated modules at the gate level within the context of the complete system, assuring correct functionality at the system level. After the design has been functionally verified at the gate level and a standard delay file is generated by the software tools, timing verification is done by backannotating the delay information into the timing analysis. The critical delay path will then contain information about estimated interconnect delays, giving a more accurate critical path analysis. Figure 4.8 shows the timing waveforms produced by the gate level simulation.

Figure 4.8: Verilog XL timing waveforms of the gate level simulation


Tenth step: integrated circuit layout: Using Epoch(1), the design is physically implemented. Please refer to Appendix B for tutorials describing the step-by-step process for using the automatic placement and routing tools. The Epoch compilation includes automatic placement, routing, buffer sizing and power estimation. Figure 4.9 shows the final training module layout automatically produced by this tool. The synthesized training module of Figure 4.3, including the manually designed multiplier, was placed and routed (i.e., the out_unit, train and add modules were not placed and routed individually at this stage). Post-layout verification is run at this stage so the behavior of the circuit can be simulated including the effect of routing parasitics.

Cadence Design Systems, Inc.

(1) A product of Cascade Design


Time unit: 1 ns. Precision: 10 ps.
MODULE: training
Maximum clock frequency = 37.333 MHz; longest path delay = 26.63 ns.

Figure 4.9: Perceptron (training module) final chip layout


Chapter 5 TOP-DOWN DESIGN with ALOPEX

In this chapter, the top-down methodology described will be applied to a more advanced learning algorithm capable of addressing real world applications. The algorithm was described in chapter 3, together with the suggested modifications for hardware implementation. The stepwise methodology developed in chapter 4, which resulted from the successful development effort for the deterministic perceptron learning algorithm, will be followed and applied to the Alopex stochastic learning algorithm.

5.1. Choice of data format

As explained in chapter 3, a fractional data format was selected for implementing the neuronal inputs and outputs, and a mixed representation for the weights, i.e., the weights may take on values larger than 1, so they have to include an integer portion, as shown below:

1) For the 8-bit weights, the format used is s iii.ffff. The bits represent the following powers of two: (sign) 2² 2¹ 2⁰ . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴. This data representation allows for a range of weights from -7.9375₁₀ (81₁₆) to +7.9375₁₀ (7f₁₆).

2) For the 8-bit neuronal inputs and outputs, the format used is s.fffffff, since they are between -1 and +1, i.e., they are squashed by the sigmoid between these values. Each bit represents the following powers of two: (sign) . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴ 2⁻⁵ 2⁻⁶ 2⁻⁷, giving a range of values between -0.9921875₁₀ (81₁₆) and +0.9921875₁₀ (7f₁₆). Negative numbers are represented using 2's complement format.
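For reference, the fragment below decodes these two 8-bit formats into real numbers; it is a non-synthesizable sanity-check sketch, and the module and function names are illustrative rather than part of the design.

    module format_decode_demo;

      // s iii.ffff weight format: two's complement with 4 fractional bits
      function real weight_to_real(input [7:0] w);
        weight_to_real = $signed(w) / 16.0;      // divide by 2^4
      endfunction

      // s.fffffff input/output format: two's complement with 7 fractional bits
      function real io_to_real(input [7:0] x);
        io_to_real = $signed(x) / 128.0;         // divide by 2^7
      endfunction

      initial begin
        $display("weight 7f = %f", weight_to_real(8'h7f));  // +7.9375
        $display("weight 81 = %f", weight_to_real(8'h81));  // -7.9375
        $display("input  7f = %f", io_to_real(8'h7f));      // +0.9921875
        $display("input  81 = %f", io_to_real(8'h81));      // -0.9921875
      end
    endmodule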

The above described data formats for the weights and the inputs naturally cause the following data representation formats for the accumulator registers storing the results of their multiplication:


3) The implementation of the multiplication of the weights and the inputs performs signed multiplication of two 8-bit numbers. However, these numbers are of different formats, so the resulting product has the format shown below:

wi (weights) = s iii.ffff
xj (inputs)  = s.fffffff

wi * xj = s iii.fffffffffff0

Since the data has a mixed format, i.e., it has integer and fractional bits, the result of the multiplication is corrected by shifting it left (with a zero fill in the least significant bit). As an example, a calculation done by one of the neurons follows (a sample run is included in section 5.3):

input #1 = 60₁₆ = 0.1100000₂ = (sign) 2⁻¹ + 2⁻² = 0.75₁₀

weight #1 = ff₁₆ = 1111.1111₂ = (-) 000.0001 (2's complement) = -2⁻⁴ = -0.0625₁₀

The result of the multiplication is a 16-bit number with the following format:

result = ff40₁₆ = 1111.111101000000₂ = s iii.ffffffffffff =

(-) 000.000011000000 (2's complement) = -(2⁻⁵ + 2⁻⁶) = -0.046875₁₀

which is the correct result of multiplying 0.75 * (-0.0625) = -0.046875. The HDL description that implements this multiplication is shown in Figure 5.1:

    function [15:0] mult;
      input [7:0] x;
      input [7:0] w;
      reg sign, signw, signx;
      reg [7:0]  W;
      reg [7:0]  X;
      reg [15:0] temp;
      begin
        temp[15:0] = 16'h0;
        W[7:0] = w[7:0];
        signw  = W[7];
        X[7:0] = x[7:0];
        signx  = X[7];
        sign   = signw ^ signx;
        if (signw == 1'b1)
          begin
            $display("W=%h ", W);
            W[7:0] = ~W[7:0] + 1'b1;            // take the magnitude of w
            $display("W=%h signw=%b \n", W, signw);
          end
        if (signx == 1'b1)
          begin
            $display("X=%h ", X);
            X[7:0] = ~X[7:0] + 1'b1;            // take the magnitude of x
            $display("X=%h signx=%b\n", X, signx);
          end
        temp[15:0] = (W[7:0] * X[7:0]) << 1;     // shift left, zero fill in the LSB
        $display("temp[15:0]=%h\n", temp);
        $display("sign=%b\n", sign);
        if (sign == 1)
          temp[15:0] = ~temp[15:0] + 1'b1;       // restore the sign (2's complement)
        // $display("temp[15:0]=%h \n", temp);
        mult[15:0] = temp[15:0];
      end
    endfunction

Figure 5.1: Verilog code for signed fractional multiplier


The shift left statement after the multiplication is performed effectively adjusts the integer result for consistency with fractional data representation.
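The example can be checked in simulation with a fragment such as the one below, which replaces the sign-magnitude steps of Figure 5.1 with an ordinary signed multiply followed by the same left shift; the module and function names are invented for this check and are not part of the design.

    module mult_check;

      // signed multiply of s iii.ffff (w) by s.fffffff (x); the single left
      // shift places the result in the s iii.ffffffffffff format
      function [15:0] frac_mult(input [7:0] w, input [7:0] x);
        reg signed [15:0] p;
        begin
          p = $signed(w) * $signed(x);   // 8 x 8 signed product
          frac_mult = p <<< 1;           // zero fill in the least significant bit
        end
      endfunction

      initial begin
        // 0.75 * (-0.0625) = -0.046875, i.e. the ff40 hex result shown above
        $display("60 * ff = %h (expect ff40)", frac_mult(8'h60, 8'hff));
      end
    endmodule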

4) When the complete summation is performed by the neuron, the accumulator register contains 4 extension bits to allow for word growth, since adding several product terms may result in an overflow condition. This is shown in Figure 5.2.

[Figure 5.2 diagrams the data formats: each weight wi = s iii.ffff (8 bits), each input ini = s.fffffff (8 bits), each partial term Neti = wi * ini = s iii.ffffffffffff (16 bits); the 16-bit partial terms are sign-extended and summed into a 20-bit Net = s iiiiiii.ffffffffffff.]

Figure 5.2: data representation for the Alopex implementation

The four additional extension bits effectively move the sign bit position to bit #19, the MSB (i.e., before the partial product terms (Neti) are summed up, they are sign-extended to 20 bits), and allow the sum to grow up to values of around 128. The additional bits allow for accumulation of partial sums in the case that several product terms are being summed up. The final result, after taking f(net), is then rounded and truncated back to 8 bits as required by the data path. Excessive errors would be introduced in the calculations if the product terms were truncated before the final summation.
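A minimal sketch of this sign extension and accumulation is shown below; the module and signal names are illustrative, the first term is the product computed earlier, and the second term is an arbitrary example value rather than one taken from the design.

    module accum_demo;
      reg [15:0] net1, net2;    // partial terms, s iii.ffffffffffff
      reg [19:0] net;           // accumulated sum, s iiiiiii.ffffffffffff

      initial begin
        net1 = 16'hff40;        // -0.046875 (the product computed above)
        net2 = 16'h0c00;        // +0.75 (an arbitrary second term)
        // sign-extend each 16-bit term to 20 bits, then add
        net  = {{4{net1[15]}}, net1} + {{4{net2[15]}}, net2};
        $display("net = %h", net);   // expect 00b40 = +0.703125
      end
    endmodule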

In addition, whenever a result cannot be stored in the destination register, e.g., an output that becomes larger than +0.9921875 (hex 7f), it is saturated to the maximum value the register can hold. For example,

in1 = 20₁₆ = 0.0100000₂ = (sign) 2⁻² = +0.25₁₀
weight 1 = 0b₁₆ = 0000.1011₂ = (sign) 2⁻¹ + 2⁻³ + 2⁻⁴ = +0.6875₁₀
in2 = 70₁₆ = 0.1110000₂ = (sign) 2⁻¹ + 2⁻² + 2⁻³ = +0.875₁₀
weight 2 = 10₁₆ = 0001.0000₂ = (sign) 2⁰ = +1.0₁₀


net1 = (in1 = 20₁₆ = 0.25₁₀) * (weight 1 = 0b₁₆ = 0.6875₁₀) + (in2 = 70₁₆ = 0.875₁₀) * (weight 2 = 10₁₆ = 1.0₁₀) = 0.171875₁₀ + 0.875₁₀ = 1.046875₁₀,

which is greater than 0.9921875, the largest number that can be represented in this 8-bit fractional data format. The HDL code must check for these overflow conditions and make the necessary corrections. This will be illustrated in sections 5.2 and 5.3.
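The fragment below sketches such an overflow check for the neuron output; the thresholds follow from the formats above, but the function name, the bit selection (simple truncation rather than rounding) and the negative saturation value are assumptions made for illustration, not the dissertation's actual code.

    module saturate_demo;

      // clip a 20-bit net value (s iiiiiii.ffffffffffff) into the 8-bit
      // s.fffffff output range before f(net) is applied
      function [7:0] clip_to_output(input signed [19:0] net);
        begin
          if      (net >=  20'sd4096) clip_to_output = 8'h7f;     // >= +1.0, clamp to +0.9921875
          else if (net <= -20'sd4096) clip_to_output = 8'h81;     // <= -1.0, clamp to -0.9921875
          else                        clip_to_output = net[12:5]; // sign + 7 fraction bits (truncated)
        end
      endfunction

      initial begin
        // 0.25 * 0.6875 + 0.875 * 1.0 = 1.046875, which saturates to 7f
        $display("clip(1.046875) = %h", clip_to_output(20'sd4288));
      end
    endmodule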

The current version of the neuron applies the transfer function, f(net), shown in Figure 5.3 to the summed-up outputs:

Figure 5.3: f(net) for present implementation


5.2. First Step: C language implementation

The complete software listing is included in Appendix A. The program was run for several problems, including the exclusive-or, several linearly separable pattern recognition problems, and encoder problems. The network converged for several error measures, with results consistent with those found in (Pandya, 1994). Chapter 3 contains the pseudo code for one implementation of the architecture that has a single layer of hidden neurons. The program in Appendix A implements a two-hidden-layer architecture. For the hardware implementation, a network with two inputs, one hidden layer of two neurons, and a single output neuron was designed. This architecture is sufficient to solve the exclusive-or and linearly separable problems. The main concern was to scale down the architecture to solve several issues related specifically to a hardware implementation and still obtain good functional performance. The modules designed may be cascaded to build larger networks.

5.3. Second Step: HDL functional description

The complete system was written in Verilog HDL to study its behavior in a different environment. All the high-level constructs of the HDL were used, without regard to obtaining code that was synthesizable. The objective was to have an intermediate step between the high-level C language implementation and the final HDL description that would be synthesized. This system description using the complete HDL set included the use of real numbers, random number generator system calls, reading input patterns from a Unix file, etc., none of which are synthesizable by the software tools available.
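The listing itself is not reproduced here; the fragment below merely illustrates the kind of non-synthesizable constructs this step relies on (real-valued variables, the $random system call, and reading patterns from a file). The file name, sizes and variable names are hypothetical and do not come from the dissertation's listing.

    module alopex_behavioral_demo;
      real      w1, w2;              // real-valued weights: not synthesizable
      reg [7:0] pattern [0:3];       // training patterns read from a Unix file
      integer   seed;

      initial begin
        seed = 5;
        $readmemh("patterns.hex", pattern);          // read input patterns
        w1 = ($random(seed) % 128) / 128.0;          // random initial weights
        w2 = ($random(seed) % 128) / 128.0;
        $display("w1 = %f  w2 = %f  pattern0 = %h", w1, w2, pattern[0]);
      end
    endmodule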


5.4. Third Step: preliminary module partitioning

To implement this step, a subset of Verilog is used that will be recognized by the synthesis tools. Thus, when partitioning the system into modules, each one is re-written using a modeling style that can later be used to produce a gate-level schematic. No system calls such as $random can be used, so a random number generator was designed that simulates the NM810 (which is made up of 8 RBG1210 random bit generators), as explained in chapter 3. This generator will not be synthesized but, since the Verilog XL environment allows for mixed-level simulation, the non-synthesizable blocks can still be present throughout the design steps to assure a final design that is functionally correct. The current implementation will have an off-chip random number generator such as the NM810; the neurons will read the necessary random numbers from its outputs when needed.
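The random number generator module itself is not shown in this section; purely as an illustration of such a mixed-level-simulation stand-in, a behavioral noise source could look like the sketch below. The module, port and variable names are invented for this example, and this is not the dissertation's RBG1210 model.

    module noise_source_stub(
      input            clk,
      output reg [7:0] noise_serial    // one pseudo-random bit per stream per clock
    );
      integer seed;
      initial seed = 1;

      always @(posedge clk)
        noise_serial <= $random(seed);  // keep the low 8 bits of a new random word
    endmodule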


The following modules were designed:

1) A top-level structural module, net_top, which instantiates the lower level modules and describes how they are interconnected. This top-level module is the one from which the simulation, synthesis and layout tools are called. As each module is optimized, it is tested within this system module so that the functionality at the system level can be verified. Mixed-level simulation allows some modules to be simulated using their functional Verilog descriptions, mixed with modules for which a gate-level description exists and needs to be tested, either because it was automatically generated by the synthesis tools or because it was manually entered gate by gate. The structural description is shown in Figure 5.4:

    module neural_net_top(in);

    // March 14, 1996; March 29; April 1
    // Laura Ruiz   error11.v; linearly separable
    // with overflow check and negative weights and inputs
    // weights - s iii.ffff   inputs - s.fffffff
    // without initializing weights;
    // multiplier designed by hand

    input in;

    wire clk, enable_train, reset, delta_E_ready,
         noise_serial_1, noise_serial_2, noise_serial_3, noise_serial_4,
         noise_serial_5, noise_serial_6, noise_serial_7, noise_serial_8,
         not_noise_serial_5, not_noise_serial_7,
         done_w1, done_w2, done_w3, done_w4, done_w9, done_w10,
         en_1, en_2, en_5, ready_n1, ready_n2, ready_n5,
         enable_1, enable_2, enable_5, finished_train, enable_test,
         ready_n1_and_n2, finished_train_and_enable_test;

    wire [7:0]  w_new_1, w_new_2, w_new_3, w_new_4, w_new_9, w_new_10;
    wire [7:0]  in_1, in_2, out_n1, out_n2, out_n5;
    wire [15:0] delta_E;

    // power-on reset
    power_on_reset power_on(reset);

    // system clock
    clock_generator ...

[Figure 5.4 continues with the instantiation of the remaining modules.]

The weight adjustment unit first determines the sign of the incoming error measure (whether delta_E > 0), stores it in a register, sign_deltaE, for later processing, and takes the absolute value of the 16-bit input delta_E. It then calculates the feedback, x = ΔE * Δw. Register x is 21 bits, since it is the result of multiplying the contents of register delta_E (ΔE), which is 16 bits, by 5 bits of register delta_W (Δw). The data formats are as follows:

delta_E[15:0] = s.fffffffffffffff
delta_W[4:0]  = 0.ffff

The sign bit of x is calculated (sign_x = sign_deltaE xor sign_delta_W), and the two's complement is taken if the sign bit is 1 (indicating a negative number); thus the format of x is:

x[20:0] = s.ffffffffffffffffffff

Both delta_E[15:0] and delta_W[4:0] are fractional numbers, positive when multiplied, their original signs having been stored in registers sign_deltaE and sign_delta_W. This guarantees that x will also be all fractional. The assumption that delta_W[7:0] contains no significant digits in its integer portion is valid since the weights change by very small steps; thus their difference, delta_W = W_new - W_old, will never have any ones in its integer bit positions. Recall that the format for W_new = W_old = weight format = s iii.ffff; thus the format for delta_W[7:0] is also s iii.ffff, but only bits [4:0] are used to calculate x.


After calculating x, the weights are adjusted following the rule shown below: W_new = W_old - x + noise

The code then checks for saturation: if the new value of the weight (W_new) exceeds the range that can be represented in the format s iii.ffff, it is clipped (to hexadecimal 7f or 81). Then a new delta_W is calculated for the next iteration, the new value of the weight is stored as W_old, and finally the weight unit waits for the signal delta_E_ready from the control unit to start a new weight adjustment iteration. Figure 5.6 shows the weight adjustment mechanism. All weight units operate in parallel, updating their values concurrently with information available in their local memories (w_old) and a single measure of the global performance of the network (delta_E) broadcast by the control unit to all weight units. The connectivity of the Alopex network is simpler than that of the network implementing the perceptron algorithm, since in the latter the control unit needs information on weight values to determine when training finishes. Results of the behavioral simulation of Alopex show that a weight unit takes 2.1 microseconds to update its value.
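Putting the pieces together, the fragment below sketches a single update step with the saturation check; it is a standalone demo, not the weight unit's actual code. The x[20:16] slice anticipates the update expression given later in this section, and the noise value is an arbitrary example (in the real unit the noise is read serially from the external generator).

    module weight_update_demo;
      reg [7:0]        w_old, w_new, noise;
      reg [20:0]       x;            // s.ffffffffffffffffffff feedback value
      reg signed [9:0] temp;         // intermediate sum with two guard bits

      initial begin
        w_old = 8'hf5;               // -0.6875 in s iii.ffff
        x     = 21'h012ce0;          // the sample feedback, already shifted left by 4
        noise = 8'h02;               // example noise value of +0.125

        // sign-extend the operands, subtract the aligned slice of x, add the noise
        temp = {{2{w_old[7]}}, w_old} - {{5{x[20]}}, x[20:16]} + {{2{noise[7]}}, noise};

        // saturate to the s iii.ffff range: 7f = +7.9375, 81 = -7.9375
        if      (temp >  10'sd127) w_new = 8'h7f;
        else if (temp < -10'sd127) w_new = 8'h81;
        else                       w_new = temp[7:0];

        $display("w_new = %h", w_new);   // expect f6 = -0.625
      end
    endmodule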


[Verilog XL waveform display of one weight unit: Noise[15:0], W_new[7:0], W_old[7:0], W_temp[9:0], deltaE[15:0], delta_W[7:0], weight_old[7:0], x[20:0] and reset during one update.]

Figure 5.6: timing waveforms of a weight unit updating its value

As will be explained in the section related to the control unit module, the format for the error calculated by the control unit at the end of the nth iteration is as follows:

error(n)[19:0] = 0 iiii.fffffffffffffff

delta_error(n) = error(n) - error(n-1) = s iiii.fffffffffffffff,

which will be clipped, using saturation arithmetic, between the maximum/minimum values that can be represented in an all-fractional 16-bit number (i.e., the error will not vary greatly between iterations, so the difference, delta_error, will be small, usually less than one). This measure of global performance will be broadcast to all the weight units at the end of each iteration.
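A minimal sketch of this clipping step is shown below; the function name and the exact negative limit are assumptions made for illustration, and the test value reuses the sample-run numbers that follow.

    module clip_delta_demo;
      // error(n) is 0 iiii.fffffffffffffff (20 bits); delta_E is s.fffffffffffffff (16 bits)
      function [15:0] clip_delta(input signed [19:0] d);
        begin
          if      (d >  20'sd32767) clip_delta = 16'h7fff;  // saturate positive
          else if (d < -20'sd32768) clip_delta = 16'h8000;  // saturate negative
          else                      clip_delta = d[15:0];   // value already fits
        end
      endfunction

      initial begin
        // delta_error = 0b891 - 0af2a = 00967 hex = +0.0734558
        $display("delta_E = %h", clip_delta(20'sh0b891 - 20'sh0af2a));
      end
    endmodule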

The following is a portion of a sample run showing the weight adjustment mechanism of the weight units. These fractional data values are shown in the sample run:

error_current[19:0] = 0b891₁₆ = 0000 1.011 1000 1001 0001₂ = 1.4419251₁₀

error_n1[19:0] = 0af2a₁₆ = 0000 1.010 1111 0010 1010₂ = 1.3684692₁₀

delta_error[19:0] = 00967₁₆ = 0000 0.000 1001 0110 0111₂ = 0.0734558₁₀

The truncated/saturated value broadcast to the weight units is

delta_E[15:0] = 0967₁₆ = 0.000 1001 0110 0111₂ = 0.0734558₁₀

The error has increased, thus sign_delta_E is 0 (a positive number). Each weight unit has its delta_weight value and sign_delta_W stored in its local memory. The feedback, x, is calculated by each weight unit as seen in the sample run. As an example,

delta_E * delta_W = 0967₁₆ * 01₁₆ = 0.0734558 * 0.0625 = 0.004591

x[20:0] = 0012ce₁₆ = 0.0000 0001 0010 1100 1110₂ = 0.004591₁₀

This value of x corresponds to the weight that, in the previous iteration, changed by +01 (recall that the format for the weights is s iii.ffff, so 01₁₆ = 0000.0001₂ = 0.0625₁₀). The printout shows a value for x that has been shifted 4 positions to the left (i.e., multiplied by 16) to move the 1s to more significant positions, since x is a 21-bit fractional number that will be added to an 8-bit number (the weight) which has a sign bit, 3 integer bits and 4 fractional bits. Not shifting the resultant value of x would have no impact whatsoever on the weight adjustment, since truncating it before the addition would only add 0s to the weight (recall that x is a very small value, the result of multiplying two small fractional numbers, delta_error and delta_weight). The calculation of the new weight for this particular example is:

W_new[7:0] = w_old[7:0] - x[20:16] + Noise;

W_new, W_old and x are sign-extended before the previous operation is performed. Saturation arithmetic is applied if the result is greater than 7.9375, the largest number that the weights can take on given the s iii.ffff data format. The following table shows actual weight and error values during 8 iterations:

Table 5.1: actual weight adjustments for 8 iterations

[Table 5.1 lists, for each of 8 training iterations, the six weight values (in hexadecimal, s iii.ffff format) and the 20-bit error, with (+), (-) and (=) marks indicating whether each value increased, decreased or stayed the same relative to the previous iteration.]