
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

ASIC implementation of LSTM neural network algorithm

MICHAIL PASCHOU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


ASIC implementation of LSTM neural network algorithm

Michail Paschou

July 31, 2018


Abstract

LSTM neural networks have been used for speech recognition, image recognition and other artificial intelligence applications for many years. Most applications perform the LSTM algorithm and the required calculations on cloud computers. Off-line solutions include the use of FPGAs and GPUs, but the most promising solutions include ASIC accelerators designed for this purpose only. This report presents an ASIC design capable of performing multiple iterations of the LSTM algorithm on a unidirectional neural network architecture without peepholes. The proposed design provides arithmetic-level parallelism options, as blocks are instantiated based on parameters. The internal structure of the design implements pipelined, parallel or serial solutions depending on which is optimal in each case. The implications of these decisions are discussed in detail in the report. The design process is described in detail and the evaluation of the design is also presented to measure the accuracy and error of the design output.

This thesis work resulted in a complete synthesizable ASIC design implementing an LSTM layer, a Fully Connected layer and a Softmax layer which can perform classification of data based on trained weight matrices and bias vectors. The design primarily uses a 16-bit fixed point format with 5 integer and 11 fractional bits, but increased precision representations are used in some blocks to reduce the output error. Additionally, a verification environment has been designed that is capable of performing simulations, evaluating the design output by comparing it with results produced from performing the same operations with 64-bit floating point precision in a SystemVerilog testbench, and measuring the encountered error. The results concerning the accuracy and the design output error margin are presented in this thesis report. The design went through Logic and Physical synthesis and successfully resulted in a functional netlist for every tested configuration. Timing, area and power measurements on the generated netlists of various configurations of the design show consistency and are reported in this report.

Keywords: Neural networks, LSTM, Long Short Term Memory, ASIC, VLSI



Sammanfattning

LSTM neural networks have been used for speech recognition, image recognition and other artificial intelligence applications for many years. Most applications perform the LSTM algorithm and the required calculations in the cloud. Off-line solutions include the use of FPGAs and GPUs, but the most promising solutions involve ASIC accelerators designed for this purpose only. This report presents an ASIC design capable of performing multiple iterations of the LSTM algorithm on a unidirectional neural network architecture without peepholes. The proposed design provides arithmetic-level parallelism options, as blocks are instantiated based on parameters. The internal structure of the design implements pipelined, parallel or serial solutions depending on which alternative is optimal in each case. The implications of these decisions are discussed in detail in the report. The design process is described in detail and the evaluation of the design is also presented to measure the accuracy and error margin of the design output.

The result of this work is a complete synthesizable ASIC design that implements an LSTM layer, a Fully Connected layer and a Softmax layer which can perform classification of data based on trained weight matrices and bias vectors. The design primarily uses a 16-bit fixed point format with 5 integer and 11 fractional bits, but increased precision representations are used in some blocks to reduce the error margin. In addition, a verification environment has been designed which can perform simulations, evaluate the design output by comparing it with results produced from performing the same operations with 64-bit floating point precision in a SystemVerilog testbench, and measure the encountered error margin. The results concerning the accuracy and the error margin of the design output are presented in this report. The design went through Logic and Physical synthesis and successfully resulted in a functional netlist for every tested configuration. Timing, area and power measurements on the generated netlists of various configurations of the design show consistency and are reported in this report.

Nyckelord: Neural networks, LSTM, Long Short Term Memory, ASIC, VLSI



Contents

List of Figures
List of Tables

1 Introduction
   1.1 Background
      1.1.1 Recurrent Neural Networks
      1.1.2 LSTM Neural Networks
   1.2 Problem
   1.3 Goal

2 LSTM Neural Network Structure & Algorithm
   2.1 LSTM Structure
      2.1.1 The LSTM algorithm
      2.1.2 Functionality and Procedure of Arithmetic Operations
      2.1.3 Data Dependencies
   2.2 Other LSTM Designs
      2.2.1 Peephole version
      2.2.2 Coupled forget and input gate
      2.2.3 Gated Recurrent Unit
      2.2.4 Classification Algorithm and Architecture

3 Hardware Design and ASIC Implementation
   3.1 Requirements and Goals
   3.2 Mathematical Operation Unit
      3.2.1 Matrix-to-Vector Multiplication
      3.2.2 Elementwise Vector Multiplication
      3.2.3 Multiplication Unit
      3.2.4 Addition Unit
      3.2.5 Register File
      3.2.6 Functions, LUT and Taylor implementation
      3.2.7 Accumulator Unit and Division Unit
      3.2.8 Mathematical Operation Unit Flow Control
   3.3 Memory Units
   3.4 Top Level Architecture-Complete Design
      3.4.1 Bottom Memory Units Control-SRAMs
      3.4.2 Top Controller
      3.4.3 Bottom Controller
      3.4.4 Top Memory Unit-DRAM

4 Performance Results
   4.1 Verification Testbench Structure
   4.2 Error and Performance

5 Synthesis Results
   5.1 Synthesis Preparation
   5.2 Results
   5.3 Conclusions
   5.4 Future Work


List of Figures

1.1 A representation of a standard RNN.

2.1 A repeating module in a standard RNN.
2.2 The structure of an LSTM neural network.
2.3 The sigmoid function.
2.4 The hyperbolic tangent function.
2.5 A data dependency graph for all operations in the LSTM algorithm.
2.6 An LSTM cell extended with peepholes.
2.7 An LSTM cell with coupled forget and input gates.
2.8 A gated recurrent unit cell.
2.9 Structure of an LSTM cell followed by a Fully Connected Layer and a Softmax Layer.

3.1 Block diagram of the Mathematical Operation Unit and all its internal blocks. Interactions between the blocks are not presented here for simplicity and will be described in the next sections.
3.2 Structure of the Multiplication Unit. M is equal to OUsize. The blue line signals are FSM control signals. The red line color is used to show that during matrix-to-vector operations the same element from vector B is multiplied with the elements of vector A.
3.3 FSM of the Multiplication Unit. Conditions are represented with black letters while value assignments with red.
3.4 The register file structure with input and output signals. R is equal to regNumber (the number of registers generated) and M is equal to OUsize (the size of each vector in the architecture). Registers are clocked on positive edges and get asynchronously reset.
3.5 LUT error shown on the graph of the sigmoid function.
3.6 Ranges with the corresponding method of computation for the hyperbolic tangent function.
3.7 Ranges with the corresponding method of computation for the sigmoid function.
3.8 Ranges with the corresponding method of computation for the exponential function.
3.9 Block diagram of the Accumulator Unit. A is equal to Asize.
3.10 Block diagram of the Division Unit. D is equal to Dsize.
3.11 Structure of the simulated Memory Units. M is equal to OUsize.
3.12 The top level architecture of the design. M is equal to OUsize.
3.13 A simplified version of the Top Controller FSM.
3.14 A simplified version of the Bottom Controller FSM.

4.1 Structure of the verification testbench.
4.2 Error graphs for the error in the hidden layer vector.
4.3 Error graphs for the error in the Fully Connected layer vector.
4.4 Error graphs for the error in the Softmax layer vector.


List of Tables

5.1 Configuration parameters
5.2 Energy and area measurements
5.3 Throughput and memory access measurements



List of Abbreviations

AI Artificial Intelligence

ANN Artificial Neural Network

ASIC Application Specific Integrated Circuit

BMU Bottom Memory Unit

CPU Central Processing Unit

DRAM Dynamic Random Access Memory

DUT Design Under Test

FPGA Field Programmable Gate Array

FSM Finite State Machine

GPU Graphics Processing Unit

GRU Gated Recurrent Unit

LSTM Long Short Term Memory

LUT Look Up Table

RNN Recurrent Neural Network

SRAM Static Random Access Memory

TMU Top Memory Unit

VHDL VHSIC Hardware Description Language



Chapter 1

Introduction

1.1 Background

Technologies based on Neural Networks have become more popular after many successful applications on a variety of subjects. Artificial Neural Networks (ANNs), inspired by biological neural networks as their name states, are computing systems that can be trained to recognize specific patterns and act accordingly. These networks are trained by going through examples of data structures that are similar to the data they will be receiving when they are deployed. There is a great number of applications implementing neural network algorithms and schemes, especially now that many companies experiment with Artificial Intelligence (AI) applications. Neural networks became popular mostly for image recognition and speech recognition, but their use is not limited to those. However, not all neural networks are fit for all kinds of applications. Different types of Neural Networks have different ways to approach a problem, which makes each type of NN better fitting for specific applications.

This thesis will focus on a specific kind of Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) neural networks, introduced by Hochreiter and Schmidhuber in [1] and extended by Felix Gers' team in [2]. LSTM neural networks are most commonly used for speech recognition, as discussed in [3][4], as well as handwriting recognition, like in [5]. Both RNN and LSTM architectures differ from traditional neural networks, as they use their previous outputs and cell states.

1.1.1 Recurrent Neural Networks

Traditional neural networks do not process previous results or states to make decisions. Their computations require no feedback of previous outputs or any kind of short or long range context. A traditional neural network is trained to match two identical inputs to the same output. This is very efficient in many cases where past events must not alter the network's perception of the current situation. The typical example is image recognition, where two identical images must produce the same output. Recurrent Neural Networks differ in this sense as they are designed to recognize sequences of data inputs. Two identical data structures will not produce the same output if the previous inputs were not identical as well.

Recurrent neural networks, as their name states, use recurrent paths (loops) to insert previous outputs as inputs for the current iteration. The produced output is matched to this iteration's input combined with the past iteration's output. Figure 1.1 shows a representation of an RNN. Using past outputs as inputs, the neural network makes decisions based on the previous result and the current input combined.

Figure 1.1: A representation of a standard RNN.

The recurrent connections (loops) make RNNs very accurate in analyzing sequences. As a result they can be very effective for language modeling, like in [6]. These networks can detect the next word in a sentence based on a sequence of the previous words. Generally, RNNs have proven to have a lot of advantages over other algorithms, like Hidden Markov Models, as Graves states in [7]. Their limitation is that this mechanism provides access to the most recent events only. As stated in [8], traditional RNNs are incapable of dealing with long term dependencies.

1.1.2 LSTM Neural Networks

Long Short-Term Memory neural networks are designed to handle those long term dependencies that RNNs suffer from. LSTM neural networks also use the same loop mechanism to handle short term dependencies. Their architecture, though, allows the cell state to be adjusted based on less recent events. Their computation complexity is much greater than that of a standard RNN, as each LSTM neuron consists of multiple gate layers. Gate layers and their function are explained in Chapter 2. These layers allow the neurons to forget and update the state of the cell accordingly. The cell state adjusts to changes happening in the most recent events but also retains information about events that took place many iterations earlier. There are multiple architectures that differ in order to offer additional advantages, like the ones discussed and compared in [9]. Additionally, LSTM neural networks are capable of learning to time and count by adding peephole connections, as in the work of Gers and Schmidhuber in [10]. Therefore, LSTM neural networks can identify sequences and patterns and produce the corresponding results.

Although LSTM neural networks combine many advantages, their complexity makes it harder to efficiently train or deploy them. Training can be simplified by using backpropagation through time, as presented by Graves and Schmidhuber in [11]. Generally, applications are usually cloud based to deal with the high complexity. Data servers integrate hardware accelerators alongside CPUs to achieve significant execution efficiency; these data servers are extremely energy efficient and provide high performance. Efficient offline implementations of such systems are difficult to accomplish without the use of hardware accelerators. ASIC architectures are generally expected to deliver the best efficiency in solving such complex algorithms, as in [8], where CPU, GPU, FPGA and ASIC are compared in terms of efficiency when implementing a Gated Recurrent Unit algorithm, with the ASIC architecture reaching the highest score.

1.2 Problem

Most applications use cloud-based systems to implement LSTM functionality. Implementing the LSTM algorithm in FPGAs, CPUs or GPUs is also possible, but the results are not promising in terms of cost and energy efficiency. An ASIC implementation, on the other hand, can offer a system that is dedicated to only function as an LSTM neural network. Such an accelerator can achieve the best possible area, delay and energy consumption, as it is application specific.

1.3 Goal

This thesis aims to examine the LSTM algorithm and structure and go through the design of a proposed ASIC architecture capable of functioning as an LSTM neural network accelerator. Other ASIC architectures that aim to function as LSTM neural networks have been designed and tested in the past; [8] presents such an example, an ASIC implementation of a Gated Recurrent Unit that is compared against implementations on other platforms. The goal is not only to design a system that can complete the necessary mathematical operations with decent accuracy, but also to achieve this in the most efficient way regarding delay, energy and area. A design-space exploration is essential to understand all complications of the LSTM algorithmic design and deliver a unit that can efficiently accelerate the algorithmic process and offer solid results.

This report explains and evaluates all aspects of the proposed design. First, the LSTM algorithm and various versions of it will be presented in detail. Next, the methods of designing an ASIC implementation of this algorithm will be discussed. The different approaches at each step will be compared and explained. Additionally, a verification process used to evaluate the proposed design will be presented along with the precision, accuracy and error margins that the design achieves. Finally, the design will go through both logic and physical synthesis and the resulting configurations will be compared with each other in terms of execution speed, area and energy efficiency.



Chapter 2

LSTM Neural Network Structure & Algorithm

2.1 LSTM Structure

LSTM neural networks differ from RNNs when it comes to long term memory. RNNs can handle situations where only short term dependencies apply, but not when it is necessary to make decisions based on events that took place multiple iterations ago. As Graves describes in [4], the LSTM neural network architecture, introduced by Hochreiter and Schmidhuber in [1], is not limited by that factor as it is designed to adjust to both recent events and long range context. In order to understand how this is achieved it is necessary to examine their structure. This chapter will start by going into how the LSTM mechanism works, and the final paragraphs will discuss how the output of the LSTM cells can be used to classify the given input sequence.

2.1.1 The LSTM algorithm

Every LSTM neuron has a hidden layer output h_t, a cell state C_t and three inputs: the input frame x_t and the previous time step's hidden layer output and cell state, h_{t-1} and C_{t-1}. The network consists of M such neurons and the input frame is a vector of N elements. The entire network behaves according to the following equations.

f_t = \sigma(W_{fx} \times x_t + W_{fh} \times h_{t-1} + b_f)

i_t = \sigma(W_{ix} \times x_t + W_{ih} \times h_{t-1} + b_i)

c_t = \tanh(W_{cx} \times x_t + W_{ch} \times h_{t-1} + b_c)

C_t = f_t \circ C_{t-1} + i_t \circ c_t

o_t = \sigma(W_{ox} \times x_t + W_{oh} \times h_{t-1} + b_o)

h_t = o_t \circ \tanh(C_t)

W are the corresponding weight matrices (the W_{*x} matrices are of size M × N and the W_{*h} matrices of size M × M), x is the input frame vector of size N, h is the output vector of size M, C is the cell state vector of size M and b is the corresponding bias vector of size M. There is no dependency between M and N. Over the next chapters the implications concerning M and N will be further discussed. The variable subscripts t and t − 1 refer to the iteration (time step). It is important to note that × stands for matrix-to-vector multiplication and ◦ stands for element-wise vector multiplication, also known as the Hadamard product.
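For reference, the behaviour described by these equations can be captured in a short floating-point model. This is only an illustrative NumPy sketch (the function and dictionary key names are chosen here, not taken from the thesis), but it mirrors the four gate layers and the two element-wise products that the hardware must implement.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following the equations above.
    W['fx'], W['ix'], W['cx'], W['ox'] are the M x N input weight matrices,
    W['fh'], W['ih'], W['ch'], W['oh'] the M x M recurrent weight matrices,
    and b['f'], b['i'], b['c'], b['o'] the M-sized bias vectors."""
    f_t = sigmoid(W['fx'] @ x_t + W['fh'] @ h_prev + b['f'])  # forget gate
    i_t = sigmoid(W['ix'] @ x_t + W['ih'] @ h_prev + b['i'])  # input gate
    c_t = np.tanh(W['cx'] @ x_t + W['ch'] @ h_prev + b['c'])  # memory cell
    o_t = sigmoid(W['ox'] @ x_t + W['oh'] @ h_prev + b['o'])  # output gate
    C_t = f_t * C_prev + i_t * c_t       # element-wise (Hadamard) products
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t
```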

Standard RNNs usually have only one neural network layer. In LSTM neural networks there are four neural network layers. The equations for f_t, i_t, c_t and o_t describe these four layers and they represent the forget gate, the input gate, the memory cell and the output gate correspondingly. Figure 2.1 shows the structure of a standard RNN module while figure 2.2 shows how an LSTM neural network is structured. This four-layer structure and the way these layers interact with each other is also the key to dealing with long-term dependencies. Not only by observing the path from C_{t-1} to C_t in figure 2.2 but also from the equation of C_t described above it becomes clear that the cell state adjusts to every operation in the neural network. However, this adjustment is limited as there are only some minor interactions with the forget gate, the input gate and the memory cell in its path. This way the cell state is controlled and protected by the three gates interacting with it. It is important to note that any mention of layers in this thesis will refer to the internal gate layers of a single neuron (seen in figure 2.2) unless explicitly stated otherwise.

Figure 2.1: A repeating module in a standard RNN.



Figure 2.2: The structure of an LSTM neural network.

2.1.2 Functionality and Procedure of Arithmetic Operations

In order to design an ASIC architecture that can efficiently function as an LSTM neural network it is important to understand the sequence of arithmetic operations, their number and their dependencies. This section goes through the concept of the neural network layers, their functionality, their mathematical equivalent and the arithmetic operations that formulate the entire algorithm.

For the following calculations it is assumed that the input frame x_t \in \mathbb{R}^n is a vector of size n, the previous output h_{t-1} \in \mathbb{R}^m is a vector of size m, the weight matrices for x, W^x \in \mathbb{R}^{m \times n}, have a size of m × n, the weight matrices for h, W^h \in \mathbb{R}^{m \times m}, have a size of m × m, and the bias b \in \mathbb{R}^m is a vector of size m. For ease of operations, x_t and h_{t-1} can be concatenated into one vector [x_t, h_{t-1}] \in \mathbb{R}^{n+m} of size n + m, and the corresponding matrices can be concatenated into single matrices W = [W^x, W^h] \in \mathbb{R}^{m \times (m+n)} with a size of m × (m + n).

The first thing that comes up in the algorithm is the forget gate layer. The forget gate layer was introduced in [2] and allows the network to reset its own state. This layer is responsible for throwing away any old information of the previous cell state that is irrelevant and keeping any existing information that is valid. The forget gate produces a vector of m numbers, as many as in the cell state vector, ranging from "0", which stands for "forget this completely", to "1", which stands for "remember this completely". Any number in between describes at what level the cell state must keep or discard this information. The equation for f_t represents how the forget gate layer operates.

Before examining the next step of the algorithm it is important to understand the arithmetic operations applied to solve this equation. This is not only necessary for understanding how the equation corresponds to the gate functionality, but also to count how many arithmetic operations are required to reach a final result, which is needed for the ASIC design. The first step in the forget gate equation f_t is to apply a matrix-to-vector multiplication between the concatenated weight matrix W_f and the concatenated [x_t, h_{t-1}] vector.

W_f \times [x_t, h_{t-1}] =
\begin{pmatrix}
wx_{1,1} & wx_{1,2} & \cdots & wx_{1,n} & wh_{1,1} & \cdots & wh_{1,m} \\
wx_{2,1} & wx_{2,2} & \cdots & wx_{2,n} & wh_{2,1} & \cdots & wh_{2,m} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
wx_{m,1} & wx_{m,2} & \cdots & wx_{m,n} & wh_{m,1} & \cdots & wh_{m,m}
\end{pmatrix}
\times
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ h_1 \\ \vdots \\ h_m \end{pmatrix}
=
\begin{pmatrix} wxh_1 \\ wxh_2 \\ \vdots \\ wxh_m \end{pmatrix}

To complete this matrix-to-vector multiplication it is necessary to multiply each element of the [x, h] vector with all the elements of the corresponding column of the W_f matrix and then add all products for each of the m lines. This equals m * (n + m) multiplications and m * (n + m − 1) additions. It is important to notice that this calculation can be broken into smaller operations, which can prove valuable when it comes to designing the control of the arithmetic units and exploring design-space parallelism. For example, we can achieve the same result if we only use one multiplier and one adder:

wxh_1 = (\cdots((wx_{1,1} \times x_1) + (wx_{1,2} \times x_2)) + \cdots) + (wh_{1,m} \times h_m)
\vdots
wxh_m = (\cdots((wx_{m,1} \times x_1) + (wx_{m,2} \times x_2)) + \cdots) + (wh_{m,m} \times h_m)

By multiplying and adding each parenthesis one after another we will still get the same result. Because of that, the design can be highly pipelined. More details on this topic will be discussed in the following chapters.
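The parenthesised expansion above translates directly into a multiply-accumulate loop. The sketch below restates it in Python under the assumption that W is the concatenated [W_x, W_h] matrix and v the concatenated [x_t, h_{t-1}] vector (names chosen only for illustration); each output element is built up with a single multiplier and a single adder.

```python
def matvec_mac(W, v):
    """Matrix-to-vector product computed as a chain of multiply-accumulate
    steps, one matrix line at a time, as in the expansion above."""
    result = []
    for row in W:                      # one output element per matrix line
        acc = 0.0
        for w, x in zip(row, v):       # n+m multiply-accumulate steps
            acc = acc + w * x          # one multiplier and one adder suffice
        result.append(acc)
    return result
```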

The next step is to add the bias and finally saturate the result with the sigmoid function.

f_t = \sigma\left( \begin{pmatrix} wxh_1 \\ \vdots \\ wxh_m \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \right)

This requires m more additions but also demands an implementation of the sigmoid function \sigma(x). The sigmoid function formula is

\sigma(x) = \frac{1}{1 + e^{-x}}



and figure 2.3 shows the corresponding graph.

Figure 2.3: The sigmoid function.

If the sigmoid function were directly implemented in hardware it would cost a lot in terms of area, energy and efficiency, as it would require performing a division and an exponential function with increased precision representations in order to receive meaningful results. In order to avoid this and also reduce execution time, this function can be implemented using a Look Up Table (LUT). More details on this will be discussed in the following chapters.
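As a rough illustration of the LUT idea only: the sketch below samples the sigmoid over a limited input range and saturates outside it. The range, table size and rounding are assumptions made for this sketch; the tables actually used in the design are described in section 3.2.6.

```python
import numpy as np

# Assumed sketch parameters: 256 entries covering the input range [-8, 8).
LUT_BITS = 8
X_MIN, X_MAX = -8.0, 8.0
STEP = (X_MAX - X_MIN) / (1 << LUT_BITS)
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-(X_MIN + STEP * np.arange(1 << LUT_BITS))))

def sigmoid_lut(x):
    """Piecewise-constant LUT approximation of the sigmoid; inputs outside
    the table range are saturated to 0 or 1."""
    if x < X_MIN:
        return 0.0
    if x >= X_MAX:
        return 1.0
    return float(SIGMOID_LUT[int((x - X_MIN) / STEP)])
```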

The total number of arithmetic operations for calculating f_t is m * (n + m) multiplications and m * (n + m) additions, without taking into account the sigmoid function, as it should not be directly implemented in the final design.

The next step of the algorithm is to decide what new information will be stored into the new cell state. Responsible for that are both the input gate layer i_t and the memory cell layer c_t. The equation for i_t is solved following the same principles as for f_t, using the corresponding weight matrices [W_{ix}, W_{ih}] and bias b_i. The equation for c_t is also solved according to the same principles using the corresponding weight matrices [W_{cx}, W_{ch}] and bias b_c, but instead of saturating the result with a sigmoid function a hyperbolic tangent function tanh(x) is used. Figure 2.4 shows the graph of the hyperbolic tangent function.

Both i_t and c_t require a total of m * (n + m) multiplications and m * (n + m) additions, not considering the sigmoid and the hyperbolic tangent function, same as for f_t.

Figure 2.4: The hyperbolic tangent function.

The input gate layer i_t returns which of the new values from the input frame and the previous output (to account for short term dependencies) will be used to update the cell state. The memory cell c_t returns the factor that will decide how much of this new information must be added into the cell state. The vector-to-vector elementwise multiplication i_t ◦ c_t returns the new scaled-down values used to update the cell state. This multiplication produces an m-sized vector l.

l = \begin{pmatrix} l_1 \\ l_2 \\ \vdots \\ l_m \end{pmatrix} =
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_m \end{pmatrix} \circ
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} =
\begin{pmatrix} i_1 * c_1 \\ i_2 * c_2 \\ \vdots \\ i_m * c_m \end{pmatrix}

Respectively, the elementwise multiplication f_t ◦ C_{t-1} produces the old cell state values adjusted by the forget gate, forcing the cell state to forget some of the old values by a factor. This operation produces an m-sized vector k.

k = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_m \end{pmatrix} =
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_m \end{pmatrix} \circ
\begin{pmatrix} C_1 \\ C_2 \\ \vdots \\ C_m \end{pmatrix} =
\begin{pmatrix} f_1 * C_1 \\ f_2 * C_2 \\ \vdots \\ f_m * C_m \end{pmatrix}



By adding k and l the new updated cell state is produced.

C_t = k + l =
\begin{pmatrix} f_1 * C_1 \\ f_2 * C_2 \\ \vdots \\ f_m * C_m \end{pmatrix} +
\begin{pmatrix} i_1 * c_1 \\ i_2 * c_2 \\ \vdots \\ i_m * c_m \end{pmatrix} =
\begin{pmatrix} f_1 * C_1 + i_1 * c_1 \\ f_2 * C_2 + i_2 * c_2 \\ \vdots \\ f_m * C_m + i_m * c_m \end{pmatrix}

Solving the C_t equation requires 2 * m multiplications and m additions.

The last step of the algorithm is to output a result. The output is produced by the elementwise multiplication of the output gate layer o_t with the cell state C_t. The output gate layer shows which of the cell state values must be returned to the output and to what degree. The equation that produces o_t is solved in the same way as f_t, using the corresponding matrices [W_{ox}, W_{oh}] and bias b_o. Finally, the cell state is filtered by a hyperbolic tangent function to saturate it between [−1, 1]. The equation for h_t produces the final output of the neural network.

h_t = o_t \circ \tanh(C_t) =
\begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_m \end{pmatrix} \circ
\tanh\begin{pmatrix} C_1 \\ C_2 \\ \vdots \\ C_m \end{pmatrix} =
\begin{pmatrix} o_1 * \tanh(C_1) \\ o_2 * \tanh(C_2) \\ \vdots \\ o_m * \tanh(C_m) \end{pmatrix}

This equation requires m multiplications to return a result, not considering the hyperbolic tangent function.
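The operation counts derived in this section can be tallied in one helper function; the numbers below simply restate what the text establishes (the sigmoid and hyperbolic tangent evaluations are excluded, as above).

```python
def lstm_iteration_op_count(n, m):
    """Multiplications and additions for one LSTM iteration:
    four gate layers of m*(n+m) each, plus the C_t and h_t updates."""
    gate_muls = 4 * m * (n + m)        # f_t, i_t, c_t and o_t
    gate_adds = 4 * m * (n + m)        # including the bias additions
    ct_muls, ct_adds = 2 * m, m        # C_t = f_t ∘ C_{t-1} + i_t ∘ c_t
    ht_muls = m                        # h_t = o_t ∘ tanh(C_t)
    return gate_muls + ct_muls + ht_muls, gate_adds + ct_adds

# Example: n = 64 inputs and m = 128 hidden units per time step.
muls, adds = lstm_iteration_op_count(64, 128)
```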

2.1.3 Data Dependencies

It is crucial to determine data dependencies throughout the entire algorithm. Identifying data dependencies allows parallelizing operations in the ASIC implementation without any risk of error. Figure 2.5 shows data dependencies throughout the arithmetic operations in the LSTM algorithm.

By observing the graph in figure 2.5 it becomes clear that all gate layer operations for f_t, i_t, c_t and o_t can take place at the same time. An ASIC architecture that allows these calculations to be executed in parallel will save a lot of time in each iteration.

2.2 Other LSTM Designs

The design discussed in the previous section is not the only version of LSTM neural networks. There are other designs that operate in similar ways but extend the abilities of the LSTM or simplify some aspects of the algorithm shown in the previous section.

2.2.1 Peephole version

This version of LSTM neural networks was proposed by Gers and Schmidhuber in [10] to extract additional information from the time intervals between relevant events.



Figure 2.5: A data dependency graph for all operations in the LSTM algorithm.

An LSTM neural network can handle time lags by extending the cells with peephole connections. Peephole connections provide access to the Constant Error Carousel, as described in [12], and give feedback on the current internal state of the cell throughout the procedure. Usually, the previous cell state vector C_{t-1} is used instead of the previous output vector h_{t-1}. Figure 2.6 shows a peephole version where x_t, h_{t-1} and C_{t-1} are all used as inputs to generate the cell state and the output.

This version increases the complexity of the design as the peepholes require additional weights and operations. Below are the equations for the forget gate, input gate and output gate layers for this design.

f_t = \sigma(W_{fC} \times C_{t-1} + W_{fx} \times x_t + W_{fh} \times h_{t-1} + b_f)

i_t = \sigma(W_{iC} \times C_{t-1} + W_{ix} \times x_t + W_{ih} \times h_{t-1} + b_i)

o_t = \sigma(W_{oC} \times C_t + W_{ox} \times x_t + W_{oh} \times h_{t-1} + b_o)

Figure 2.6: An LSTM cell extended with peepholes.

2.2.2 Coupled forget and input gate

This variant couples the forget gate and input gate together. The decisions for forgetting and updating new information in the design discussed in Section 2.1 were made separately. By coupling the forget and input gates these decisions are made together. The cell forgets something only when an updated value is being put in its place. The state is only updated when another value is forgotten. These changes can be applied by changing the C_t equation to the one below.

C_t = f_t \circ C_{t-1} + (1 - f_t) \circ c_t

Instead of i_t, we multiply 1 − f_t with c_t. The rest of the cell remains the same. The structure of an LSTM cell with coupled forget and input gates is shown in figure 2.7.

Figure 2.7: An LSTM cell with coupled forget and input gates.

2.2.3 Gated Recurrent Unit

The Gated Recurrent Unit (GRU), introduced by Cho in [13], is another version of LSTM neural networks that combines the forget and input gate into one update gate and also changes some other layers of the cell. These changes are shown in figure 2.8. The algorithm for this design is based on the following equations.

z_t = \sigma(W_z \times [h_{t-1}, x_t])

r_t = \sigma(W_r \times [h_{t-1}, x_t])

\tilde{h}_t = \tanh(W \times [r_t \circ h_{t-1}, x_t])

h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t

The versions of LSTM neural networks are not limited to the ones discussed above. There are many variants that offer different advantages. This thesis aims to deliver an ASIC implementation of the algorithm presented in section 2.1. The next chapter will go through the design methodology, explain the functionality of the proposed architectural schemes, and discuss and compare different possible variations of some units.

2.2.4 Classification Algorithm and Architecture

The LSTM architecture has been described in detail in the previous paragraphs. So far there has been no mention of classification, though. That is because the LSTM itself is not responsible for that. The LSTM algorithm receives n inputs and delivers m hidden layer outputs. The number of hidden layer outputs is not the same as the number of classes, though. In order to apply classification, two more stages of computation are necessary: first a Fully Connected Layer and then a Softmax Layer.

Fully Connected Layer

The Fully Connected Layer is responsible for taking the hidden layer output and delivering an output vector with j elements, where j is the number of classes in the network. In this way the class scores become available and the entire procedure is one step closer to finishing. The Fully Connected Layer is implemented by the following function:

y_t = W_y \times h_t + b_y

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_j \end{pmatrix} =
\begin{pmatrix}
Wy_{1,1} & Wy_{1,2} & \cdots & Wy_{1,m} \\
Wy_{2,1} & Wy_{2,2} & \cdots & Wy_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
Wy_{j,1} & Wy_{j,2} & \cdots & Wy_{j,m}
\end{pmatrix}
\times
\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_m \end{pmatrix}
+
\begin{pmatrix} by_1 \\ by_2 \\ \vdots \\ by_j \end{pmatrix}

Figure 2.8: A gated recurrent unit cell.

This function is similar to f_t, i_t, c_t and o_t but there is no sigmoid or hyperbolic tangent function to squash the result. This means the result is not limited to the ranges [0, 1] or [−1, 1]. This is necessary as this stage produces class scores and these scores must not be saturated if they exceed a specific value.

Softmax Layer

The Softmax Layer is responsible for producing the probability of the input sequence belonging to the corresponding class. It takes place right after the Fully Connected Layer is complete. This layer is implemented by the softmax function seen below:

prob_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

, where x is a vector of size n.

The Fully Connected Layer has produced j outputs and each of them is the total score of the corresponding class. By feeding the y vector to the softmax function we can produce the probability of the input sequence belonging to one of the j classes. The result will be as follows.

\begin{pmatrix} prob_1 \\ prob_2 \\ \vdots \\ prob_j \end{pmatrix} =
\begin{pmatrix} e^{y_1}/S \\ e^{y_2}/S \\ \vdots \\ e^{y_j}/S \end{pmatrix}
, \quad \text{where } S = \sum_{i=1}^{j} e^{y_i}

This result means that the probability of the input sequence belonging to class 1 is equal to prob_1, the probability of it belonging to class 2 is equal to prob_2, and so on. The sum of all probabilities must be equal to 1. With the result of the Softmax Layer the classification of the input sequence is complete. Figure 2.9 shows the complete structure of an LSTM cell connected with a Fully Connected Layer and a Softmax Layer to deliver the expected probability results.

Figure 2.9: Structure of an LSTM cell followed by a Fully Connected Layer and a Softmax Layer.
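The two classification stages can be summarized in a few lines of floating-point reference code; this is only a sketch with illustrative argument names. The exponentials are shifted by the maximum score before the division, a common numerical safeguard that leaves the resulting probabilities unchanged.

```python
import numpy as np

def classify(h_t, W_y, b_y):
    """h_t: m-sized hidden layer output, W_y: j x m weight matrix,
    b_y: j-sized bias. Returns the j class probabilities."""
    y = W_y @ h_t + b_y            # Fully Connected Layer: raw class scores
    e = np.exp(y - np.max(y))      # shift for numerical stability
    return e / np.sum(e)           # Softmax Layer: probabilities sum to 1
```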

This concludes the background on the neural network structure and the LSTM algorithm. The next chapters will discuss the design of an ASIC implementation capable of realizing the previous mathematical operations and delivering results that match the behavior of the algorithm presented in this chapter.



Chapter 3

Hardware Design and ASIC Implementation

This chapter will go into the implications of designing an LSTM neural network for ASIC using VHDL. As the entire design is too complex to look at all at once, it will be broken down into separate blocks following the same order in which they were designed during the development stage. Each of the following sections will provide a detailed description of the corresponding block along with an explanation of its components and why this topology was finally chosen among others.

3.1 Requirements and Goals

Before going into details, the requirements and goals of the entire design must be discussed. First of all, the entire architecture is limited to a 16-bit fixed point representation with 1 sign bit, 4 integer bits and 11 fractional bits. Some exceptions were made where necessary and the reasons will be explained in detail later on. This limitation exists because the goal is to minimize area and energy consumption. Additionally, it is essential to examine the error in the results delivered by the final design when the 16-bit fixed point representation is used. Furthermore, memory elements will only be simulated for the needs of this thesis work. Both SRAM and DRAM behavior will be simulated and their functionality will be taken for granted; their delay and energy consumption values will, however, be added to the final calculations. Finally, the entire architecture must support parametrized functionality, both in the sense of instantiation based on generics and to allow variable input size, variable hidden layer size and variable number of classes. Generally, the main focus is energy consumption, but only as long as performance is not hindered to a level that makes the entire design useless.
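The 16-bit fixed-point format (1 sign bit, 4 integer bits, 11 fractional bits) can be modelled as integers scaled by 2^-11. The helper below is only a behavioural sketch used to reason about range and quantization, not part of the VHDL design; the saturation behaviour is an assumption of the sketch.

```python
FRAC_BITS = 11
SCALE = 1 << FRAC_BITS                      # 2**11 = 2048
RAW_MAX, RAW_MIN = 2**15 - 1, -2**15        # 16-bit two's complement range

def to_fixed(x):
    """Quantize a real value to the 16-bit fixed-point format, saturating
    at the ends of the representable range [-16, 16 - 2**-11]."""
    raw = int(round(x * SCALE))
    return max(RAW_MIN, min(RAW_MAX, raw))

def to_float(raw):
    return raw / SCALE

# Example: one quantization step is 2**-11, so the rounding error is at most 2**-12.
print(to_float(to_fixed(3.14159)))          # 3.1416015625
```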

The next paragraphs will go through the design steps and explain how the architecture was structured, based on the requirements and goals mentioned before, to establish the functionality of the LSTM algorithm that was examined in Chapter 2. The first section will describe the Mathematical Operation Unit and all its components, at the end of which the basic concept of the design will be clear. The next sections will go through the control logic components and how they are structured to guide the Mathematical Operation Unit to deliver the expected results. This chapter aims at presenting the architecture of the design capable of performing multiple iterations of the LSTM algorithm and the entire algorithmic flow, in order to understand how the entire design can produce the final result and classify an input sequence.

3.2 Mathematical Operation Unit

The first thing to consider before designing an ASIC architecture for LSTM neural networks is the mathematical complexity of the required computations. It is necessary to provide a block capable of completing all computations and at the same time figure out the area-delay-energy consumption trade-off. As these parameters are not known from the start, the design was made as versatile as possible to allow changes upon instantiation but also parameter-dependent behavior.

The following paragraphs of this section will go through each operation and present how they were designed, introduce a proposed structure, discuss the complications that were considered and explain why this particular design was chosen among others. By the end of this section all blocks in figure 3.1 will have been examined in detail and the functionality of the entire Unit will be clarified.

3.2.1 Matrix-to-Vector Multiplication

One of the most common and resource consuming operations of the LSTM algorithm is the matrix-to-vector multiplication. This operation requires a great number of multiplications and additions that depends on the size of the matrix and the vector. A matrix with m lines and n columns multiplied with a vector of n elements requires m × n multiplications and m × (n − 1) additions. Therefore this operation can be very time consuming. There are a number of things to consider before designing an ASIC implementation for this operation. First, we need to provide a number of multipliers and adders that can work in parallel and possibly pipeline them to increase throughput and reduce the time it takes to produce a final result. Second, the size of the matrix and the vector can vary, as the final design must be able to adapt to different sizes of the input sequence, hidden layer and probability vectors. Third, apart from the computation itself, this operation requires data to be streamed from a separate memory unit to a register file, which means that reloading the same data must be avoided as much as possible.

Figure 3.1: Block diagram of the Mathematical Operation Unit and all its internal blocks. Interactions between the blocks are not presented here for simplicity and will be described in the next sections.

In order to address the first two problems it is important to keep in mind that the final design must provide the possibility of parametrized instantiation, which allows larger but faster execution blocks or pipelined operations, so that it can be changed to fit the needs of the user. In order to achieve that, it was decided that a vector architecture approach would make more sense as it allows parallel operations to complete faster. Additionally, a generic integer OUsize was used to define the size of the vectors in the Operation Unit. This means that the entire architecture of the Mathematical Operation Unit is instantiated based on this generic. Also, the number of multipliers in the Multiplication Unit and the number of adders in the Addition Unit are equal to OUsize. The problem that remains unsolved is that the size of the matrices and vectors being multiplied during the LSTM algorithm process must not be limited to a size equal to OUsize. Generally it is expected that their sizes will be much larger than OUsize. In order to deal with that, the matrix multiplication can be broken down into smaller chunk computations. For example, if a user wants to instantiate a design with an OUsize equal to 6 but the matrix-to-vector multiplications include 10 × 8 matrices multiplied by 8-element vectors, then the sizes do not match. In this case the design must be capable of performing multiple sequences until the entire operation is complete. There are 2 different ways of performing matrix-to-vector multiplication, either column-by-column or line-by-line.

Column-by-column multiplication is performed by loading each column of the matrix and multiplying it with the corresponding vector element. One such computation can be examined below.

R = A \times V

\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix} a_{1,1} \\ a_{2,1} \\ \vdots \\ a_{m,1} \end{pmatrix} \times v_1 +
\begin{pmatrix} a_{1,2} \\ a_{2,2} \\ \vdots \\ a_{m,2} \end{pmatrix} \times v_2 + \cdots +
\begin{pmatrix} a_{1,n} \\ a_{2,n} \\ \vdots \\ a_{m,n} \end{pmatrix} \times v_n

This method allows the computation to be broken down into steps, where at each step we can accumulate with the previous result. Assuming we are also limited to OUsize-sized vectors, this computation is further broken down into the following form.

The result vector R is partitioned into OUsize-sized chunks, each of which is computed separately:

\begin{pmatrix} r_1 \\ \vdots \\ r_o \end{pmatrix} =
\begin{pmatrix} a_{1,1} \\ \vdots \\ a_{o,1} \end{pmatrix} \times v_1 +
\begin{pmatrix} a_{1,2} \\ \vdots \\ a_{o,2} \end{pmatrix} \times v_2 + \cdots +
\begin{pmatrix} a_{1,n} \\ \vdots \\ a_{o,n} \end{pmatrix} \times v_n

\begin{pmatrix} r_{o+1} \\ \vdots \\ r_{2o} \end{pmatrix} =
\begin{pmatrix} a_{o+1,1} \\ \vdots \\ a_{2o,1} \end{pmatrix} \times v_1 +
\begin{pmatrix} a_{o+1,2} \\ \vdots \\ a_{2o,2} \end{pmatrix} \times v_2 + \cdots +
\begin{pmatrix} a_{o+1,n} \\ \vdots \\ a_{2o,n} \end{pmatrix} \times v_n

\vdots

\begin{pmatrix} r_{ko+1} \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix} a_{ko+1,1} \\ \vdots \\ a_{m,1} \end{pmatrix} \times v_1 +
\begin{pmatrix} a_{ko+1,2} \\ \vdots \\ a_{m,2} \end{pmatrix} \times v_2 + \cdots +
\begin{pmatrix} a_{ko+1,n} \\ \vdots \\ a_{m,n} \end{pmatrix} \times v_n

, where o is equal to OUsize and k = roundDown(m/o) − 1.

The line-by-line approach also allows the computation to be broken down into separate chunks, but this time at each step you produce the result for the multiplication of a matrix line with the entire vector. This computation can be seen below.

R = A \times V

\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix}
a_{1,1} \times v_1 + a_{1,2} \times v_2 + \cdots + a_{1,n} \times v_n \\
a_{2,1} \times v_1 + a_{2,2} \times v_2 + \cdots + a_{2,n} \times v_n \\
\vdots \\
a_{m,1} \times v_1 + a_{m,2} \times v_2 + \cdots + a_{m,n} \times v_n
\end{pmatrix}

When the limitation of OUsize is applied, the same result is returned by changing the computation to take the following form.

\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix}
[a_{1,1} \times v_1 + \cdots + a_{1,o} \times v_o] + \cdots + [a_{1,ko+1} \times v_{ko+1} + \cdots + a_{1,n} \times v_n] \\
[a_{2,1} \times v_1 + \cdots + a_{2,o} \times v_o] + \cdots + [a_{2,ko+1} \times v_{ko+1} + \cdots + a_{2,n} \times v_n] \\
\vdots \\
[a_{m,1} \times v_1 + \cdots + a_{m,o} \times v_o] + \cdots + [a_{m,ko+1} \times v_{ko+1} + \cdots + a_{m,n} \times v_n]
\end{pmatrix}

, where o is equal to OUsize and k = roundDown(m/o) − 1.

These two versions of the same computation can both be applied to design a matrix-to-vector multiplication unit, but it is important to consider which of them is more beneficial to the final product. From a delay perspective both versions will take the same number of cycles, assuming there is always available data, as the number of multiplications and additions is the same in both versions. The column-by-column method will produce OUsize results in parallel every cycle, which will eventually accumulate to an OUsize-sized part of the result vector R. By repeating the same process multiple times the entire result vector R is calculated. The line-by-line method will produce one element of the final result vector R each cycle. After m cycles all elements of the result vector R will have been calculated. The column-by-column method has one advantage over the line-by-line method: it returns a vector. It is useful to acquire an OUsize-sized vector as a result and further apply computations on it while the next OUsize-sized part of the result vector R is computed. The line-by-line method delivers the final result for a matrix line multiplied with the vector before going over to the next matrix line, which has no essential benefit if the next block requires a vector input. Additionally, the entire behavior of the unit is simplified if a common vector architecture is used as much as possible. Based on these choices and facts the column-by-column method was preferred.
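A behavioural sketch of the preferred column-by-column schedule is given below; within each OUsize-sized chunk of the result vector it accumulates one matrix column, scaled by the matching vector element, per step, which corresponds to the OUsize parallel multipliers and adders of the design. The function and variable names are illustrative.

```python
import numpy as np

def matvec_column_by_column(A, v, ou_size):
    """Column-by-column matrix-to-vector multiplication performed in
    OUsize-sized chunks of the result vector, as described above."""
    m, n = A.shape
    result = np.zeros(m)
    for row0 in range(0, m, ou_size):          # one OUsize-sized row block
        rows = slice(row0, min(row0 + ou_size, m))
        acc = np.zeros(min(row0 + ou_size, m) - row0)
        for j in range(n):                     # one matrix column per step
            acc += A[rows, j] * v[j]           # OUsize multiplications in parallel
        result[rows] = acc
    return result

# The chunked schedule matches the plain product, e.g. for the 10 x 8 example:
A, v = np.random.rand(10, 8), np.random.rand(8)
assert np.allclose(matvec_column_by_column(A, v, 6), A @ v)
```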

Finally, the loading/storing of data in the mathematical unit must be structured in a way that improves energy consumption and delay. As the entire unit works with vectors to reduce delay and add to the parallelism of the design, the memory should work in the same way. Instead of sending and receiving one element of a vector at a time, the memory is designed to send and receive vectors of OUsize elements. Even though this functionality saves a lot of cycles and allows parallelism, it also limits our design in some ways. First, loading and storing data becomes more complex and there is a need for more control logic when it comes to sending and receiving data from memory. Second, when choosing OUsize for the design the user must also keep in mind that this number is directly related to the size of the memory IO ports. If an SRAM is used, this can lead to potentially high energy consumption. More on this will be discussed in the section about memory.

Another matter to consider is the loading order of data. Assuming we only have a limited number of OUsize-sized registers, the following column-by-column loading/storing schedule minimizes reloading of the same data.

load \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_o \end{pmatrix} \rightarrow
load \begin{pmatrix} a_{1,1} \\ a_{2,1} \\ \vdots \\ a_{o,1} \end{pmatrix} \rightarrow
load \begin{pmatrix} a_{1,2} \\ a_{2,2} \\ \vdots \\ a_{o,2} \end{pmatrix} \rightarrow
load \begin{pmatrix} a_{1,3} \\ a_{2,3} \\ \vdots \\ a_{o,3} \end{pmatrix} \rightarrow \cdots \rightarrow
load \begin{pmatrix} a_{1,o} \\ a_{2,o} \\ \vdots \\ a_{o,o} \end{pmatrix} \rightarrow
load \begin{pmatrix} v_{o+1} \\ v_{o+2} \\ \vdots \\ v_{2o} \end{pmatrix} \rightarrow \cdots

, where o is equal to OUsize.

It is worth mentioning that the line-by-line method has a schedule of equal performance, but it is not presented as this method is not used. The loading order above allows the unit to compute the multiplication of each vector element with the corresponding matrix column while the next matrix column is fetched from memory. This way we also limit reloading of the same data to reloading the vector V when the first OUsize lines have finished their computation. Another alternative is to load all parts of a matrix column before moving on to the next one, but this requires more reloading of the same data, as we would need to additionally load the previous result after every computation in order to accumulate the sum. Therefore, the proposed schedule above is better fitting for this design approach.
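The schedule above can be written down as a small generator that yields the load requests in order; it only restates the text (one chunk of V, then the matching columns of the current row block, with V reloaded once per row block), and the tuple labels are chosen for illustration.

```python
def load_schedule(m, n, ou_size):
    """Yield memory loads for the column-by-column method: for every
    OUsize-sized row block, stream a chunk of vector V followed by the
    corresponding matrix columns of that block."""
    o = ou_size
    for row0 in range(0, m, o):                     # row block of the matrix
        for col0 in range(0, n, o):                 # chunk of the vector V
            yield ('V', col0, min(col0 + o, n))     # load v[col0 .. col0+o-1]
            for col in range(col0, min(col0 + o, n)):
                yield ('A', row0, col)              # column col of this row block

# Example with the mismatched sizes used earlier: a 10 x 8 matrix, OUsize = 6.
schedule = list(load_schedule(10, 8, 6))
```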

3.2.2 Elementwise Vector Multiplication

Before going into more details on the multiplication and addition units it is important to examine another operation that is necessary for the LSTM algorithm and whose logic can be combined with the matrix-to-vector multiplication. The elementwise vector multiplication, otherwise known as the Hadamard product, is applied in many gate layers throughout the LSTM algorithm. Elementwise multiplication will follow the same concept as the previous architecture to design a unit that applies elementwise parallelism. An example of elementwise vector multiplication can be seen below.

R = A \circ B

\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix} a_1 \times b_1 \\ a_2 \times b_2 \\ \vdots \\ a_m \times b_m \end{pmatrix}

, where A and B are m-sized vectors.

This operation can be executed with the same multipliers used in the matrix-to-vector multiplication operations, as long as it is not necessary to perform both operations at the same time. This requires much less area than creating a separate unit for elementwise parallelism and reduces energy consumption, as we use fewer multiplication elements in the design.

In order to use the same multiplication unit, the elementwise multiplication must also be broken down into OUsize-sized chunks. The operation will take the following form.


\begin{pmatrix} r_1 \\ \vdots \\ r_o \\ r_{o+1} \\ \vdots \\ r_{2o} \\ \vdots \\ r_{ko+1} \\ \vdots \\ r_m \end{pmatrix} =
\begin{pmatrix} a_1 \times b_1 \\ \vdots \\ a_o \times b_o \\ a_{o+1} \times b_{o+1} \\ \vdots \\ a_{2o} \times b_{2o} \\ \vdots \\ a_{ko+1} \times b_{ko+1} \\ \vdots \\ a_m \times b_m \end{pmatrix}

, where o is equal to OUsize and k = roundDown(m/o) − 1.

The loading/storing of data is much simpler in this case, as only one part of the first vector and one part of the second vector are required to produce a part of the result vector. No reloading of the same data occurs during this computation if the following loading order is applied.

load (a1, a2, ..., ao) → load (b1, b2, ..., bo) → · · · → load (a(ko+1), a(ko+2), ..., am) → load (b(ko+1), b(ko+2), ..., bm)

3.2.3 Multiplication Unit

The LSTM algorithm requires many multiplication operations. Executing one multiplication per cycle would increase the delay of the computation considerably, as there can be cases where more than 1000 multiplications are required for a single matrix-to-vector multiplication. With the vector architecture already discussed, and the column-by-column method of computing the matrix-to-vector multiplication together with the elementwise vector multiplication presented, they can all be implemented effectively in the ASIC design. The Multiplication Unit structure can be seen in figure 3.2.


Figure 3.2: Structure of the Multiplication Unit. M is equal to OUsize. The blue line signals are FSM control signals. The red line color is used to show that during matrix-to-vector operations the same element from vector B is multiplied with the elements of vector A.

Instead of a single multiplier, OUsize multipliers are used in order to deliver multiple results in the same cycle. The FSM block is responsible for controlling the multiplier inputs. The op signal provides information on the operation requested from the mathematical operation unit. It can either be 0 for a matrix-to-vector multiplication or 1 for an elementwise vector multiplication. The valid input notifies the multiplication unit that the input data is valid and that the produced result should be counted as valid. This is important for the matrix-to-vector multiplication, as the next multiplication must be between the given vector A and the next element of vector B. Therefore, valid indicates that the index pointing at the element of B should be increased by 1. The skip input indicates that a matrix-to-vector multiplication should be interrupted even though it might not have been completed. This can occur in cases where the size of the vector, which equals the number of columns in the matrix, is smaller than OUsize. In these cases the computation is interrupted by the mathematical operation unit. The FSM of the Multiplier can be seen in figure 3.3.

From the two previous figures the functionality of the Multiplication Unit becomes clear. The Mathematical Operation Unit is responsible for feeding data to the Multiplication Unit. By setting the op signal to 0, a matrix-to-vector multiplication procedure begins. All the elements of the first vector are multiplied with the indexed element of the second vector. The Multiplication Unit will continue to perform matrix-to-vector multiplication until it is interrupted with


Figure 3.3: FSM of the Multiplication Unit. Conditions are represented with black letters while value assignments are shown in red.

a positive skip signal or until it completes OUsize iterations, meaning it has completed the multiplication for a matrix with OUsize lines. If the actual matrix is bigger, the Mathematical Operation Unit must issue another matrix-to-vector multiplication to complete the operation, following the column-by-column method of computation described before. If the op signal is set to 1, a valid input indicates an elementwise vector multiplication, in which case the result is returned immediately. Importantly, the Mathematical Operation Unit is responsible for feeding the correct data into the Multiplication Unit. The Multiplication Unit holds no registers, as it contains single cycle logic.
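The following VHDL sketch illustrates the multiplier array and the shared B-index described above. It is written for this report only: the entity, generic and signal names are not taken from the thesis code, the FSM is reduced to the B-index counter, and the full-width products are output directly instead of being re-formatted to the 16-bit fixed point representation.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_unit_sketch is
  generic (
    OU_SIZE : positive := 4
  );
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;
    op    : in  std_logic;                        -- '0': matrix-to-vector, '1': elementwise
    valid : in  std_logic;                        -- input data is valid this cycle
    skip  : in  std_logic;                        -- abort the current matrix-to-vector pass
    a_vec : in  signed(OU_SIZE*16 - 1 downto 0);  -- OU_SIZE packed 16-bit elements of A
    b_vec : in  signed(OU_SIZE*16 - 1 downto 0);  -- OU_SIZE packed 16-bit elements of B
    r_vec : out signed(OU_SIZE*32 - 1 downto 0)   -- OU_SIZE full-width products
  );
end entity;

architecture rtl of mult_unit_sketch is
  signal b_index  : natural range 0 to OU_SIZE - 1 := 0;
  signal b_shared : signed(15 downto 0);
begin
  -- B-index counter: advances on every valid matrix-to-vector step,
  -- wraps after OU_SIZE steps or when skip is asserted
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      b_index <= 0;
    elsif rising_edge(clk) then
      if skip = '1' or (valid = '1' and op = '0' and b_index = OU_SIZE - 1) then
        b_index <= 0;
      elsif valid = '1' and op = '0' then
        b_index <= b_index + 1;
      end if;
    end if;
  end process;

  -- select the currently indexed element of B for matrix-to-vector mode
  process (b_vec, b_index)
  begin
    b_shared <= b_vec(15 downto 0);
    for j in 0 to OU_SIZE - 1 loop
      if j = b_index then
        b_shared <= b_vec((j + 1)*16 - 1 downto j*16);
      end if;
    end loop;
  end process;

  -- OU_SIZE multipliers working in parallel
  gen_mult : for i in 0 to OU_SIZE - 1 generate
    r_vec((i + 1)*32 - 1 downto i*32) <=
      a_vec((i + 1)*16 - 1 downto i*16) * b_shared when op = '0' else
      a_vec((i + 1)*16 - 1 downto i*16) * b_vec((i + 1)*16 - 1 downto i*16);
  end generate;
end architecture;

In the real design, the surrounding FSM additionally produces the valid-output flag and hands the products to the Addition Unit described next.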

There is one thing that has not been mentioned in the paragraphs above, and that is the accumulation of multiplication results over time. This functionality is necessary for matrix-to-vector multiplication and is not covered by the blocks described above. In order to account for it, an Addition Unit is required.

3.2.4 Addition Unit

This unit is responsible for performing vector addition. It is far simpler than the Multiplication Unit. The Addition Unit has to perform the following operation.

R = A + B

(r1, r2, ..., rm)^T = (a1, a2, ..., am)^T + (b1, b2, ..., bm)^T

,where m is the size of vectors R, A and B.

The same pattern of breaking down the operation into smaller chunks can be applied again. Instead of performing single element additions, it is preferable to add multiple elements at the same time by instantiating multiple adders. This way the vector architecture principle is further applied to this unit. The operation then takes the following form.

(r1, r2, ..., ro, r(o+1), r(o+2), ..., r(2o), ..., r(ko+1), r(ko+2), ..., rm)^T =
(a1 + b1, a2 + b2, ..., ao + bo, a(o+1) + b(o+1), a(o+2) + b(o+2), ..., a(2o) + b(2o), ..., a(ko+1) + b(ko+1), a(ko+2) + b(ko+2), ..., am + bm)^T

,where o is equal to OUsize and k = roundDown(m/o) − 1.

It now becomes obvious that this pattern allows the matrix-to-vector multiplication to be pipelined. The Mathematical Operation Unit FSM can feed the result of the Multiplication Unit straight to the Addition Unit and accumulate the results over time in a register. When OUsize lines of the matrix have finished being multiplied with the vector, the result is returned to memory and the next OUsize lines of the matrix begin the multiplication procedure. In all cases of matrix-to-vector multiplication in the LSTM algorithm the result is added to a bias. Therefore, the Mathematical Operation Unit can simply add the corresponding bias part to the partial result of the multiplication before sending it back to memory. This functionality saves cycles and requires fewer load/store operations.

This concludes the examination of the Addition Unit structure. Before presenting any further operation units, it is essential to set a standard for holding the temporary data used in computations. The next section presents the structure of the register file and its functionality.

3.2.5 Register File

Only simple operations have been discussed so far. Up to this point the Mathematical Operation Unit can apply a series of different operations, including matrix-to-vector multiplication, elementwise vector multiplication and vector addition. In order to complete these operations, intermediate results must be saved temporarily.

The first idea was to put input and output registers on each separate block. This method, however, is not ideal, as in many cases the result from one operation must be fed directly to another block for further computation. One such example is the matrix-to-vector multiplication, where the multiplication result is directed to the Addition Unit for accumulation. If this method were implemented, additional bypass paths would be necessary to save cycles, and the control logic would be much more complex for each block. Furthermore, the number of registers instantiated with this method is much higher than the minimum required to perform all operations.

In order to reduce area and complexity and to simplify the entire design, the Mathematical Operation Unit was enhanced with a Register File. It can be accessed by any block in the Unit. The Mathematical Operation Unit FSM simply has to direct the correct register inputs/outputs to the corresponding block inputs/outputs. Additionally, this method reduces the total number of registers, as the unit can apply all operations with only 5 registers. Figure 3.4 shows the structure of the Register File.

The number of registers in the register file is generated based on a generic integer regNumber. Each register can store OUsize × 16 bits, which means that each register can hold an entire vector of OUsize elements. Almost all operations are vector operations, so this way of holding data is much more efficient. There is a regNumber-bit enable signal that indicates whether the input of the corresponding register must be saved or not. The valid signal is 1 if the vector was saved in the previous cycle and 0 otherwise. This signal is not necessary but simplifies the control logic of the Mathematical Operation Unit.

Figure 3.4: The register file structure with input and output signals. R is equal to regNumber (the number of registers generated) and M is equal to OUsize (the size of each vector in the architecture). Registers are clocked on positive edges and are asynchronously reset.
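A minimal VHDL sketch of such a register file is shown below for illustration. The entity, generic and signal names are not from the thesis code, a single shared data input is assumed for brevity, and the contents of all registers are simply exposed as one packed output vector.

library ieee;
use ieee.std_logic_1164.all;

entity register_file_sketch is
  generic (
    REG_NUMBER : positive := 5;
    OU_SIZE    : positive := 4
  );
  port (
    clk     : in  std_logic;
    rst_n   : in  std_logic;                                  -- asynchronous reset
    enable  : in  std_logic_vector(REG_NUMBER - 1 downto 0);
    data_in : in  std_logic_vector(OU_SIZE*16 - 1 downto 0);  -- one OUsize-sized vector
    valid   : out std_logic_vector(REG_NUMBER - 1 downto 0);
    regs    : out std_logic_vector(REG_NUMBER*OU_SIZE*16 - 1 downto 0)
  );
end entity;

architecture rtl of register_file_sketch is
begin
  gen_regs : for i in 0 to REG_NUMBER - 1 generate
    process (clk, rst_n)
    begin
      if rst_n = '0' then
        regs((i + 1)*OU_SIZE*16 - 1 downto i*OU_SIZE*16) <= (others => '0');
        valid(i) <= '0';
      elsif rising_edge(clk) then
        if enable(i) = '1' then
          regs((i + 1)*OU_SIZE*16 - 1 downto i*OU_SIZE*16) <= data_in;
        end if;
        valid(i) <= enable(i);  -- flags a write in the previous cycle
      end if;
    end process;
  end generate;
end architecture;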

3.2.6 Functions, LUT and Taylor implementation

Simple operation functionality has been established with the Multiplication and the Addition Units. However, the LSTM algorithm also requires some more complex operations. The forget gate ft, input gate it, memory cell ct, output gate ot and hidden layer ht equations require the use of the sigmoid and the hyperbolic tangent functions to squash vectors within the range (0, 1) or (−1, 1). Their corresponding graphs were introduced in Chapter 2. Furthermore, there is another function required in the Softmax gate layer, namely the exponential function e^x. The implementation of such functions in an ASIC is problematic, as they cost a lot in terms of area and energy. This is because these functions are estimated with other methods (for example the Taylor series theorem) and require multiple cycles, multipliers and adders. In order to reduce complexity, area and energy consumption, there are other ways to get the same results instead of a direct implementation of these functions. Over the next paragraphs there will be no mention of vectors. The following discussion is based on fixed point numbers instead of vectors consisting of fixed point numbers. The transition from numbers to vectors is explained at the end of this section.

LUT

A simple way of producing the results for these functions is to create simple LUT logic. The LUT simply matches the input with the corresponding output, and each of the three functions can be implemented with a separate LUT. However, there are some issues to consider. First of all, the precision in this architecture is set by the 16-bit fixed point representation, which is broken down into 1 sign bit, 4 integer bits and 11 fractional bits. This means that each LUT would need to account for 2^16 = 65536 pairs, ranging over inputs from −16 to about 15.9995, which are the minimum and maximum numbers that can be represented with this precision. There will be some cases where multiple inputs map to the same output and the synthesis tool will optimize by removing extra multiplexers. With a precision of 11 fractional bits, though, the number of such cases is limited. Therefore, the area resulting from these LUTs would be very large. Even if the sigmoid and hyperbolic tangent LUTs are halved, as their behavior is symmetric for positive and negative inputs, too many pairs still remain.

In order to reduce the number of pairs that need to be represented, the first attempt was to simply reduce the precision of the inputs to the LUT. The inputs to the LUT were truncated to 8 bits: 1 sign bit, 1 integer bit and 6 fractional bits. This method reduced the number of pairs that needed to be represented to 256 for each LUT. In the case of the sigmoid and the hyperbolic tangent this was halved to 128 by adding some extra control logic, derived from symmetry, to account for negative inputs. Additionally, smaller changes in the output could no longer be represented, so the number of different inputs producing the same output increased, leaving a very small number of pairs that needed to be placed in the LUT (less than 250 pairs).
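As an illustration of the symmetry trick mentioned above, the sketch below shows how a sigmoid LUT could store only its positive half and recover negative inputs through sigm(−x) = 1 − sigm(x). The entity name and the handful of table entries are illustrative only (the entries were computed for this example and are not taken from the thesis LUT), and the input/output formats follow the 8-bit and 16-bit fixed point representations described above.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sigmoid_lut_sketch is
  port (
    x_in  : in  signed(7 downto 0);   -- truncated input: 1 sign, 1 integer, 6 fractional bits
    y_out : out signed(15 downto 0)   -- output: 1 sign, 4 integer, 11 fractional bits
  );
end entity;

architecture rtl of sigmoid_lut_sketch is
  constant ONE   : signed(15 downto 0) := to_signed(2048, 16);  -- 1.0 with 11 fractional bits
  signal   x_abs : signed(7 downto 0);
  signal   y_pos : signed(15 downto 0);
begin
  -- fold negative inputs onto the positive half of the table
  x_abs <= x_in when x_in >= 0 else -x_in;

  -- table for the positive half only; a handful of entries are shown here,
  -- the real table holds one entry per represented positive input
  with to_integer(x_abs) select
    y_pos <= to_signed(1024, 16) when 0,       -- sigm(0)    = 0.5
             to_signed(1151, 16) when 16,      -- sigm(0.25) ~ 0.562
             to_signed(1275, 16) when 32,      -- sigm(0.5)  ~ 0.622
             to_signed(1497, 16) when 64,      -- sigm(1.0)  ~ 0.731
             to_signed(1801, 16) when others;  -- placeholder for the remaining entries

  -- symmetry: sigm(-x) = 1 - sigm(x), so only positive inputs are stored
  y_out <= y_pos when x_in >= 0 else ONE - y_pos;
end architecture;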

Even though this method reduced the size of the LUTs, it also increased the error. The values received from the LUT had an increased error, which resulted in an average error of more than 10% in the final computation of the probability in the first iteration. This is natural, as the smallest absolute value that can be represented with the proposed 8-bit fixed point representation is 2^−6 = 0.015625, while the smallest absolute value that can be represented with the initial precision was 2^−11 = 0.00048828125. Furthermore, the error increased further through iterations, as the design has a feedback mechanism. By observing the behavior of the LUTs, the error was much higher where the output changed more drastically compared to the input. For example, in the case of the sigmoid function the error reached a maximum for inputs closer to 0. Figure 3.5 shows this behavior.

This effect appears because the output changes faster closer to 0, which means more pairs need to be represented near zero to reduce the error. One alternative to deal with this problem was to apply linear interpolation. This could minimize the error, especially for inputs close to 0. The problem is that linear interpolation would take more area and would require multiple computation steps. Additionally, it would require a division unit of its own. The other alternative was to implement the three functions using Taylor series. All three functions can be represented using Taylor series, and the cost in terms of area, delay and energy consumption can be adjusted according to how many terms are used.

Figure 3.5: LUT error shown on the graph of the sigmoid function.

Taylor Series Implementation

The Taylor series for the sigmoid function, the hyperbolic tangent function and the exponential function can be seen below. Only the terms used in the final implementation are shown.

σ(x) = 1/2 + x/4 − x^3/48 + x^5/480 − ...

tanh(x) = x − x^3/3 + 2x^5/15 − 17x^7/315 + ...

e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + x^5/120 + x^6/720 + x^7/5040 + x^8/40320 + x^9/362880 + ...

Directly implementing those polynomials in VHDL is not recommended, as x^7, for example, requires a chain of multiplications in the same cycle. Instead, the same result can be obtained by rewriting them in a form that suits the needs of the design.

σ(x) = 1/2 + x(1/4 − x^2(1/48 − x^2 · 1/480))

tanh(x) = x(1 − x^2(1/3 − x^2(2/15 − x^2 · 17/315)))

e^x = 1 + x(1 + x(1/2 + x(1/6 + x(1/24 + x(1/120 + x(1/720 + x(1/5040 + x(1/40320 + x · 1/362880))))))))

This form can be implemented by creating a pipeline of stages that each include up to 2 multipliers and 1 adder. This way data can be streamed into the Taylor Unit and the result is available after a fixed number of cycles. The sigmoid requires 3 cycles, the hyperbolic tangent requires 4 cycles and the exponential requires 9 cycles to produce results.
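The sketch below shows one such pipeline stage in VHDL, computing a single Horner step acc_out = COEFF + x · acc_in in the 32-bit fixed point format with 27 fractional bits that is introduced in the next paragraph. The entity, port and generic names are illustrative and not taken from the thesis code; for the exponential, nine of these stages would be chained, with the innermost coefficients fed in first, while the sigmoid and hyperbolic tangent stages would use x^2 as the Horner variable.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity horner_stage_sketch is
  generic (
    COEFF : integer := 0   -- stage coefficient, pre-scaled by 2**27
  );
  port (
    clk     : in  std_logic;
    rst_n   : in  std_logic;
    x_in    : in  signed(31 downto 0);  -- x in the 32-bit, 27-fractional-bit format
    acc_in  : in  signed(31 downto 0);  -- partial Horner result from the previous stage
    x_out   : out signed(31 downto 0);
    acc_out : out signed(31 downto 0)
  );
end entity;

architecture rtl of horner_stage_sketch is
begin
  process (clk, rst_n)
    variable prod : signed(63 downto 0);
  begin
    if rst_n = '0' then
      x_out   <= (others => '0');
      acc_out <= (others => '0');
    elsif rising_edge(clk) then
      -- multiply in full precision, then drop 27 fractional bits to stay in format
      prod    := x_in * acc_in;
      acc_out <= to_signed(COEFF, 32) + resize(shift_right(prod, 27), 32);
      -- x is forwarded unchanged so every stage sees the original input
      x_out   <= x_in;
    end if;
  end process;
end architecture;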

It can easily be noticed that the last 3 fractions in the exponential Taylor series are represented by the same number in the 16-bit fixed point representation of the current architecture. The first attempt to perform exponential operations with the Taylor unit showed better results than the previous LUT logic. Measurements showed that this method reduced the absolute value of the error below 0.01. However, the precision of the Taylor unit had to be increased so that it could handle these fractions and limit the induced error, even though the result is then truncated back to the 16-bit fixed point representation used in the rest of the design. To increase precision, some of the operations in the Taylor unit are performed using a 32-bit fixed point representation with 5 integer bits and 27 fractional bits. This allows the final 3 fractions present in the exponential Taylor series to be represented more accurately. The next paragraph provides a more detailed description of the structure of the Taylor unit.

In order to perform all operations, the final version includes 3 16-bit multipliers to perform the x^2 operations, 9 32-bit multipliers to multiply with the operand in each stage and, finally, 9 32-bit adders. This structure was tested and is functional, but it has a flaw. Two of the three 16-bit multipliers could be removed by simply using 32-bit registers to store x^2 for the second and third pipeline stages of the hyperbolic tangent and sigmoid computation. Obviously this would cost less in area and probably consume less power as well. However, this was noticed too late in the development process, so all the following procedures are performed with 3 16-bit multipliers instead of one.

The results produced by the Taylor series implementation reduced the error in the region where the output changes fastest, close to 0, to an absolute value of 0.001. However, the terms above can only handle inputs that range from −0.6 to 0.6 for the hyperbolic tangent, −1 to 1 for the sigmoid and −1.25 to 1.25 for the exponential. Any input outside these ranges results in a high error. Furthermore, adding more terms to reach the maximum range of the architecture, −16 to 15.9995, would cost a lot in terms of area, delay and energy consumption.

LUT-Taylor Combination

As each of the solutions described above could not handle the problem on its own, it was decided to use both. The output is calculated by a Taylor Unit when the input is inside the range that the Taylor expansion can handle, and the rest of the inputs are handled by the LUT Unit. Any input that would result in an output lower than −16 or higher than 15.9995 is not included in the LUT, and the output saturates to the minimum/maximum value that can be represented. Figures 3.6, 3.7 and 3.8 show the ranges matched with the corresponding computation unit, Taylor or LUT.

Figure 3.6: Ranges with the corresponding method of computation for the hyperbolic tangent function.

Figure 3.7: Ranges with the corresponding method of computation for the sigmoid function.

Figure 3.8: Ranges with the corresponding method of computation for the exponential function.

The combination of the LUT and Taylor methods reduced the average error of the final result to under 1%. Additionally, the LUT could now handle higher precision, as it accounts for far fewer inputs. The precision was increased to a 13-bit fixed point representation with 1 sign bit, 4 integer bits and 8 fractional bits. Taking advantage of symmetry in the sigmoid and hyperbolic tangent functions reduces the size of these two LUTs by simply adding some control logic to handle negative inputs. The final LUTs include 119 pairs for the hyperbolic tangent, 69 pairs for the sigmoid function and 453 pairs for the exponential function. The Taylor implementation has 4 pipeline stages for the hyperbolic tangent function, 3 pipeline stages for the sigmoid function and 9 pipeline stages for the exponential function.

The Taylor implementation and LUT blocks are included in the Squash Unit block, which decides which implementation must be used depending on the input and the desired operation (function). The previous paragraphs described a design that takes a fixed point number and returns another fixed point number. The problem is that the rest of the architecture works with vectors. In order to adjust to that, the Squash Unit is instantiated based on another generic integer SQnum, which is the total number of Squash Units in the Mathematical Operation Unit. The reason for not instantiating OUsize Squash Units is that this block has to be independent from the rest of the design, in the sense that the user must be able to limit its size. If OUsize is very big, instantiating OUsize Squash Units could make the design very cost inefficient in terms of area and energy consumption. Therefore, a separate generic is used. Also, the Mathematical Operation Unit now has to be able to adjust to the number of Squash Units within it. To achieve that, the Squash Unit operates in parallel to the rest of the blocks. More details about this part are discussed at the end of this chapter, where the entire control flow of the design is examined.


3.2.7 Accumulator Unit and Division Unit

Almost all necessary computation blocks have been described in the previous sections. Up to this point all gate layer equations can be executed, with the exception of the Softmax gate layer equation. The Softmax equation requires two more features: an accumulator unit to produce the sum

∑ e^yi for i = 1, ..., ySize

,where yi is an element of the ySize-sized Y vector resulting from the Fully Connected layer, and a division unit to divide each e^yi by the sum produced by the accumulator.

Starting with the accumulator, this unit instantiates a number of adders based on another generic integer ACsize. The adders are connected so that the result of the first adder is fed as an input to the second one, and so on. The result is the sum of all input signals together with sumIn, which can hold the previous sum in case multiple iterations are required to get the final sum of e^y for all elements in the vector. The block diagram of the Accumulator Unit can be seen in figure 3.9.

Figure 3.9: Block diagram of Accumulator Unit. A is equal to Asize.

It is worth mentioning that the sumIn and sumOut ports are 32 bits wide. This is one of the exceptions where more precision is required. After running simulations with a sum represented by a 16-bit fixed point number, the following error occurred: in some cases the sum exceeded the maximum representable number of 15.9995 and the result was saturated to that value. This caused an uncontrollable error in these cases. In order to deal with it, a 32-bit accumulator register was added to the Mathematical Operation Unit. This accumulator register holds a 32-bit fixed point number with 1 sign bit, 20 integer bits and 11 fractional bits. Instead of using the registers in the register file, the Mathematical Operation Unit uses this accumulator register when it comes to computing the sum of all e^yi.
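A minimal VHDL sketch of such a chained-adder accumulator is given below for illustration. The entity, generic and signal names are not from the thesis code; the widths follow the formats described above, with 16-bit elements and a 32-bit running sum sharing the same 11 fractional bits so that sign extension aligns them correctly.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity accumulator_unit_sketch is
  generic (
    ACC_SIZE : positive := 4
  );
  port (
    elems   : in  signed(ACC_SIZE*16 - 1 downto 0);  -- ACC_SIZE packed 16-bit elements
    sum_in  : in  signed(31 downto 0);               -- previous partial sum
    sum_out : out signed(31 downto 0)                -- previous sum plus all elements
  );
end entity;

architecture rtl of accumulator_unit_sketch is
  type sum_array_t is array (0 to ACC_SIZE) of signed(31 downto 0);
  signal partial : sum_array_t;
begin
  partial(0) <= sum_in;

  gen_adders : for i in 0 to ACC_SIZE - 1 generate
    -- each stage adds one sign-extended element to the running sum
    partial(i + 1) <= partial(i)
                      + resize(elems((i + 1)*16 - 1 downto i*16), 32);
  end generate;

  sum_out <= partial(ACC_SIZE);
end architecture;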

After the sum has been accumulated, the next operation in the sequence is the division of e^Y by the sum. The operation can be seen below.

(prob1, prob2, ..., probj)^T = (e^y1/S, e^y2/S, ..., e^yj/S)^T

,where S = ∑ e^yi for i = 1, ..., j.

The Division Unit instantiates a series of single cycle dividers working in parallel, based on a generic integer Dsize. Again, it is necessary for the size of the Division Unit to be independent of the other Units, so that it can be changed according to the needs of the final implementation. The Division Unit block diagram can be seen in figure 3.10.

Figure 3.10: Block diagram of Division Unit. D is equal to Dsize.

An alternative way of producing the same result would be to compute 1/S, the reciprocal of the accumulated sum, once and then use the Multiplication Unit again. This would reduce the number of dividers to one and simply add one extra cycle of execution. This method was not examined more carefully due to limited development time, but it is assumed that it could potentially reduce energy consumption, even though it requires the instantiation of a second Multiplication Unit whose size is independent of the rest of the design.

3.2.8 Mathematical Operation Unit Flow Control

Now that all internal blocks of the Mathematical Operation Unit have been covered and their functionality has been explained, it is time to go into how they are all combined to execute the equations of the LSTM, the Fully Connected and the Softmax layers. These equations can be seen below.

LSTM Layer Equations:

ft = σ(Wfx × xt +Wfh × ht−1 + bf )

it = σ(Wix × xt +Wih × ht−1 + bi)

ct = tanh(Wcx × xt +Wch × ht−1 + bc)

Ct = ft ◦ Ct−1 + it ◦ ct

ot = σ(Wox × xt +Woh × ht−1 + bo)

ht = ot ◦ tanh(Ct)

Fully Connected Layer Equation:

yt = Wy × ht + by

Softmax Layer Equation:

probi = e^xi / (e^x1 + e^x2 + · · · + e^xn)

The Mathematical Operation Unit has 3 input ports: xSize, hSize and ySize. These ports indicate the size of the input vector x, the size of the Hidden Layer vector h and the size of the prob and y vectors, which is also the number of classes. These sizes must be known for the operations to be completed correctly. The instr input is a record that indicates which operation must be completed and which registers to use.

The general control flow of the Mathematical Operation Unit works by examining operation requests and delivering them. In the beginning, the unit remains idle until validIn becomes '1'. On that event the instr is checked, and if the operation field matches one of the preset operations, the corresponding FSM is activated. After this step, validIn only indicates that an input is valid until the operation is fully completed. There are 5 preset operations: WxXHpB, AoBpCoD, tanhAoB, FC and softmax. Depending on the instr operation field op, the corresponding FSM is activated and the execution procedure starts.

The WxXHpB operation performs matrix-to-vector multiplication between a matrix of size hSize × (xSize + hSize) and a vector of size xSize + hSize. It also adds the bias vector after the multiplication is complete, as this is the expected behavior in all equations. This fits the needs of the four gates in the LSTM algorithm: ft, it, ct and ot. Additionally, there is an sq field in the instr record that indicates whether squashing of the result should be applied. In all 4 cases this is true, and therefore the Squash Unit applies the sigmoid or the hyperbolic tangent function to the result of the matrix-to-vector multiplication. As there is no need for the Multiplication and the Addition Units to remain idle during operation of the Squash Units, the next OUsize lines of the matrix start being multiplied as soon as the previous lines have finished. By the time the next part of the result vector has been calculated, the Squash Units have also finished their own operations and the result is returned to memory with a positive validOut indicating data validity. The FSM functionality of this operation is described in the next paragraph.

The procedure starts by feeding the first OUsize-sized part of the concatenated xh vector to one of the registers. On the next cycle, the first OUsize-sized part of the first column of Wxh is moved to another register. On the next cycle, a matrix-to-vector multiplication is issued on the Multiplication Unit. At the same time, the first OUsize-sized part of the next column of Wxh is saved in the same register as the previous column. Multiplication Unit results are accumulated using the Addition Unit and are saved in a separate register. The procedure continues like this until the next OUsize-sized part of the xh vector must be loaded. When the last part of the xh vector is being multiplied with the first OUsize-sized part of the last column of Wxh, the first OUsize-sized part of the bias vector is loaded. A dummy cycle follows, to allow the necessary addition of the last multiplication result vector with the previous vector. In that cycle there is no loading from memory, so if any previous results exist in the next steps of the procedure, the dummy cycle offers an ideal window to store them back to memory. On the next cycle, the first OUsize-sized bias vector part is added to the result of the matrix-to-vector multiplication. If the sq field in the instr record is sigm or tanh, the Squash Unit is commanded to start applying the corresponding function to the result. The number of Squash Units in the design is equal to SqNumber. This means only SqNumber elements of a vector can be inserted into these Squash Units per cycle, and the delay for receiving a response from each one can vary. That is why the Squash Unit operations are completed in parallel to other operations, to save cycles. In the same cycle in which the previous result is being converted in the Squash Units, the first part of xh is fetched from memory and the same procedure is repeated for the second OUsize lines of Wxh. By the time this multiplication has finished, the previous result will have been processed by the Squash Unit and can be returned during the dummy cycle window. The procedure is repeated until the operation is completed for all hSize lines of the Wxh matrix.

This operation is used to compute results for the four gates in the LSTM algorithm: ft, it, ct and ot. The FC operation has the same exact FSM, with the only difference being that the result is not squashed by the Squash Unit.

The AoBpCoD operation is used to calculate Ct. The procedure starts by loading the first OUsize-sized part of the ft vector into a register. On the next cycle, the first OUsize-sized part of the Ct−1 vector is loaded into another register. Next, an elementwise multiplication is issued on the Multiplication Unit while the first OUsize-sized part of the it vector is loaded into the same register as ft before. On the next cycle, the first OUsize-sized part of ct is loaded into the same register as Ct−1 before. Another elementwise multiplication takes place on the next cycle, and the two results are added on the cycle after that. Finally, the result is returned to memory before another iteration of the same procedure starts. The procedure is repeated until all parts of the vectors have been fed to the operation Unit.

The tanhAoB operation is simpler, as it only requires squashing the previous result with the hyperbolic tangent function and performing an elementwise vector multiplication. This operation is applied to get ht. The first step is to load the first OUsize-sized part of the Ct vector into a register and immediately issue a hyperbolic tangent operation on the Squash Unit. While the Squash Unit operates, the first OUsize-sized part of the ot vector is loaded into another register. When tanh(Ct) has been computed, the elementwise vector multiplication is issued on the Multiplication Unit. On the next cycle the result is returned to memory before another iteration starts. Again, the procedure is finished when all parts of the vectors have completed computation.

The softmax operation is one of the more complex ones, as it requires 2 separate stages of computation. This operation is performed to apply classification to the input sequence and results in a vector containing the probabilities of belonging to each class. The procedure starts when the first OUsize-sized part of yt is fed to the input, accompanied by a positive validIn and the op field of instr set to softmax. The first stage involves getting the sum of all elements of the vector E = e^yt. First, the vector is fed to the Squash Unit(s). After the Squash Unit(s) finish applying the exponential function, the Accumulator Unit starts accumulating the elements of the result. In the same cycle in which this procedure starts, the result of the exponential function is also sent back to memory, as it is needed for the second stage of the operation. When this procedure has finished, the next OUsize-sized part of the yt vector is fetched and the same cycle is repeated until the entire yt vector has gone through. Then the second stage begins by loading the first OUsize-sized part of the result of the exponential function from memory. The next step is to use the Division Unit to produce e^yt/S, where S is the sum in the accumulator register. It takes multiple iterations to finish the procedure for each part of the vector. When all elements of the vector part have been divided, the result is returned to memory and the next OUsize-sized part of the vector is fetched. This is repeated until all probabilities have been calculated.

The procedures described above are not the only possible alternatives. The reason these specific schedules of operations were used is that they take the minimum number of cycles to compute in the proposed architecture. Also keep in mind that changing the Mathematical Operation Unit will also require changes in the operation FSMs. Additionally, as mentioned before, actions that would increase energy consumption are avoided as much as possible. The goal is to achieve a trade-off between delay and energy consumption where both are at the very minimum, but with greater focus on conserving power. This also concludes the structure and functionality of the Mathematical Operation Unit. It is now time to move on to the rest of the design.


3.3 Memory Units

Memory hierarchy is a very extensive chapter in computer architecture, and it is very important in any hardware structure. Within the scope of this thesis, memory was used only in simulation, to account for some delay cycles and to design the control units based on a memory structure. Two memory levels were used to support this design, a Top Memory Unit and a Bottom Memory Unit. The Top Memory Unit is by far larger in size than the Bottom Memory Unit, and it holds all weights and biases as well as all data required or produced by the current operation. The Bottom Memory Unit consists of two parts. Each part contains 4096 16-bit words, resulting in 8KB memory units. The reason for having two Bottom Memory Unit parts is to load one part with data from the Top Memory Unit while the other part is used by the Mathematical Operation Unit to perform the operations. This type of memory architecture is known as a "Ping-Pong" buffer. Figure 3.11 shows the structure of the simulated memory units. The Bottom Memory Unit has two instantiations of this structure.

Figure 3.11: Structure of the simulated Memory Units. M is equal to OUsize.

The memory is 2-byte addressable, meaning the address refers to a 16-bit word. The address port remains 16 bits wide in every case for simplicity, even though not that many bits are required to address every 16-bit word in the memory block. DataIn and DataOut are M × 16 bits, where M is equal to OUsize. This again is done for simplicity: as the entire architecture is instantiated based on OUsize, the same number is used to determine how many 16-bit words are sent to and received from memory. This way, by addressing the first element of the vector that is needed, the memory sends the first OUsize elements of that vector in the same cycle. Writing to memory works the same way, as an entire OUsize-sized vector is put on the dataIn port. The we signal indicates which of the received elements should be written to memory and which should not. For example, if OUsize = 4 and at some point address = d134 and we = b0011, the memory data at addresses d134 and d135 will be replaced by dataIn(0) and dataIn(1) correspondingly, while the memory data at d136 and d137 will remain the same.
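The following behavioral VHDL sketch illustrates this vector-wide access with a per-element write enable. It is a simplified simulation model written for this report, not the thesis memory model; the names are illustrative and the 4096-word depth follows the Bottom Memory Unit description above.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bmu_model_sketch is
  generic (
    OU_SIZE : positive := 4;
    DEPTH   : positive := 4096            -- 4096 16-bit words = 8KB
  );
  port (
    clk      : in  std_logic;
    address  : in  unsigned(15 downto 0); -- address of the first 16-bit word
    we       : in  std_logic_vector(OU_SIZE - 1 downto 0);
    data_in  : in  std_logic_vector(OU_SIZE*16 - 1 downto 0);
    data_out : out std_logic_vector(OU_SIZE*16 - 1 downto 0)
  );
end entity;

architecture sim of bmu_model_sketch is
  type mem_t is array (0 to DEPTH - 1) of std_logic_vector(15 downto 0);
  signal mem : mem_t := (others => (others => '0'));
begin
  -- per-element write: we(i) guards the word at address + i
  write_proc : process (clk)
  begin
    if rising_edge(clk) then
      for i in 0 to OU_SIZE - 1 loop
        if we(i) = '1' then
          mem(to_integer(address) + i) <= data_in((i + 1)*16 - 1 downto i*16);
        end if;
      end loop;
    end if;
  end process;

  -- vector-wide read: OU_SIZE consecutive words starting at address
  -- (no bounds checking; addresses near the end of the array are not handled)
  gen_read : for i in 0 to OU_SIZE - 1 generate
    data_out((i + 1)*16 - 1 downto i*16) <= mem(to_integer(address) + i);
  end generate;
end architecture;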

In the scope of an actual implementation, this simulated behavior should be exchanged for actual memory units. It would be ideal to use two SRAM units for the Bottom Memory Unit and a DRAM unit for the Top Memory Unit, in order to account for the necessary size and speed in its case. The rest of the design is structured in a way that can handle further delays from memory units. Possibly only slight changes would need to be made for the entire architecture to be fully functional with an actual memory implementation instead of the simulated behavior currently used.

3.4 Top Level Architecture-Complete Design

The functionality and structure of the Memory Units and the Mathematical Operation Unit have been examined. In order to guide the entire LSTM algorithmic procedure, control units are required. Figure 3.12 shows the complete structure of the entire architecture as it is currently implemented. The following paragraphs will examine this architecture and describe its functionality. All aspects concerning the design will be explained.

The Top Controller is responsible for forwarding the correct data to the Bottom Memory Units and commanding the Bottom Controller. The Bottom Controller is responsible for commanding the Mathematical Operation Unit, feeding the correct data into it and, finally, communicating with the Top Controller to let it know whether an operation is complete. The communication between the Top and Bottom Controller is limited to 3 signals: ready, start and func. The ready signal is generated by the Bottom Controller to let the Top Controller know it is available to perform operations. At this point the Bottom Controller FSM is in an IDLE state. Start is sent from the Top Controller to get the Bottom Controller to start the operation indicated by the func signal, which can be either ft, it, ct, ot, Ct, ht, yt or SM. As soon as the Bottom Controller receives a positive start signal, it starts the func operation and lowers the ready signal. There is no communication between the two controllers until the Bottom Controller finishes the operation and sets the ready signal back to 1.

3.4.1 Bottom Memory Units Control-SRAMs

It is important to examine the two instantiated Bottom Memory Units, which are also going to be referred to as SRAMs in the next paragraphs. The first question is why there are two of them. The Top Controller could simply load data from the Top Memory Unit (DRAM) into one Bottom Memory Unit (SRAM) and then initiate the Bottom Controller to start operations. This method, however, limits the parallelism of the design, meaning only one part of the architecture can work at a time. If the Top Controller loads data onto the SRAM, then the SRAM cannot be used at the same time by the Bottom Controller to feed the Mathematical Operation Unit. One alternative that partially solves this is to use a dual port SRAM.


Figure 3.12: The top level architecture of the design. M is equal to OUsize.


This way the Top Controller and the Bottom Controller could work at the same time in the following sense. The Top Controller starts to load data onto the SRAM and commands the Bottom Controller to start operating at the same time. While the Top Controller loads the second chunk of data of the operation onto the SRAM, the Bottom Controller utilizes the second port to read through the first chunk of the data and perform the necessary calculations. This way both controllers work at the same time on the same operation. Data for the next operation cannot start loading before the first operation has finished, though, as it is assumed that the size of the SRAM is limited to fit just enough data for one operation at a time. The problem with this approach is that a dual port SRAM has high leakage power and therefore increased power consumption.

So instead of settling for one dual-port SRAM, two single-port SRAMs are used. This way the Top Controller can load the data of one operation onto the first SRAM while the Bottom Controller works on another operation with data already loaded onto the second SRAM. This allows complete parallelism.

The control of two SRAMs is a bit more complex. Each SRAM is accompanied by a semaphore register, sem0 and sem1. These registers can hold either 0, indicating a loading state, or 1, indicating an executing state. Only one SRAM can be in the executing state, and both start in the loading state on reset. When a semaphore register is set to loading, the associated SRAM is controlled only by the Top Controller. Therefore, only the Top Controller can adjust the address#, dataIn# and we# signals of the corresponding SRAM. When the semaphore register is set to executing, only the Bottom Controller has access to its input ports. This way both controllers can change the data of each SRAM at separate stages, so that correct handling of SRAM privileges is ensured. In order to change privileges on the SRAM, the current owner must issue a positive changeSem# signal. For example, if Bottom Memory Unit 1 is in a loading state and the Top Controller has just finished loading the necessary data onto it, the next step is to command the Bottom Controller to start working on an operation with that data. The Bottom Controller cannot start operating unless it assumes control of Bottom Memory Unit 1. For this to happen, the Top Controller sets TopChangeSem1 to 1 for one cycle to command the semaphore register sem1 to change to the executing state. On the next cycle, the Bottom Controller will have control of Bottom Memory Unit 1. The Top Controller has already safely issued an operation command to the Bottom Controller at the same time it changed the BMU1 privileges. Finally, the fixed arbitration block is placed to ensure that if both BMUs are in a loading state the Bottom Controller reads from BMU0. This is just to drive the input to one of the two BMU outputs.
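As an illustration of this ownership scheme, the sketch below shows how the input ports of one SRAM could be multiplexed between the two controllers based on its semaphore register. The entity and signal names are illustrative and not taken from the thesis code.

library ieee;
use ieee.std_logic_1164.all;

entity bmu_port_mux_sketch is
  generic (
    OU_SIZE : positive := 4
  );
  port (
    sem        : in  std_logic;                                  -- '0': loading, '1': executing
    -- Top Controller side (owns the SRAM while loading)
    tc_address : in  std_logic_vector(15 downto 0);
    tc_data_in : in  std_logic_vector(OU_SIZE*16 - 1 downto 0);
    tc_we      : in  std_logic_vector(OU_SIZE - 1 downto 0);
    -- Bottom Controller side (owns the SRAM while executing)
    bc_address : in  std_logic_vector(15 downto 0);
    bc_data_in : in  std_logic_vector(OU_SIZE*16 - 1 downto 0);
    bc_we      : in  std_logic_vector(OU_SIZE - 1 downto 0);
    -- ports driven towards the SRAM
    address    : out std_logic_vector(15 downto 0);
    data_in    : out std_logic_vector(OU_SIZE*16 - 1 downto 0);
    we         : out std_logic_vector(OU_SIZE - 1 downto 0)
  );
end entity;

architecture rtl of bmu_port_mux_sketch is
begin
  address <= bc_address when sem = '1' else tc_address;
  data_in <= bc_data_in when sem = '1' else tc_data_in;
  we      <= bc_we      when sem = '1' else tc_we;
end architecture;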

One last thing that is important not only for the Bottom Memory Units but also for the Top Memory Unit, which is also referred to as DRAM, is that the addresses corresponding to the expected data must be known by both the Top Controller and the Bottom Controller. The architecture is parametric to the xSize, hSize and ySize ports, meaning that the size of the input sequence, the hidden layer size and the number of classes can change. This is done so that the design can be reused for more than one problem. The internal structure of the Mathematical Operation Unit already supports these parameters, as explained in the previous sections. The logic of the two controllers is also adaptive to changes of those input ports. When it comes to memory addresses, though, data has to be placed in specific slots in memory in a way that occupies as little space as possible. To achieve that, both controllers use the same formulas to figure out the address of each data piece based on the xSize, hSize and ySize inputs. For example, assume that the weight matrix W is placed in the SRAM right after the input sequence x. When either the Top Controller or the Bottom Controller has to access the element in the 2nd column and 3rd row of the weight matrix in the SRAM, they simply have to index the SRAM by 2 × hSize + 3 and add SRAMaddressW. It is important that all matrices are stored column-by-column in all memory units for simplicity. The following formulas show how this address is calculated.

SRAMaddressX = 0
SRAMaddressW = SRAMaddressX + xSize
W(3, 2) = SRAMaddressW + 2 × hSize + 3

In order to simplify the logic, the addresses of the first element of each data piece are calculated both in the Top Controller and in the Bottom Controller, and they are an exact match. This way the addresses agree and the controllers do not need any further communication to indicate where the necessary data is stored. If the Top Controller commands the Bottom Controller to perform an ft operation on BMU0, then the Bottom Controller knows exactly where the required data is placed in BMU0. Additionally, if one of xSize, hSize or ySize changes, both controllers still agree, as the addresses are calculated based on these signals. Of course, such a change must be made at the beginning of the LSTM algorithmic procedure, and the DRAM must already have been loaded with the correct data at the correct addresses, but this is done outside the scope of this design.
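A small VHDL helper of the kind both controllers could share is sketched below; the package, function and parameter names are illustrative and not from the thesis code. It reproduces the example formula above for a column-by-column stored W placed directly after the xSize-long input sequence x.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package addr_calc_sketch_pkg is
  -- address of element W(row, col) when W is stored column-by-column
  -- right after the xSize-long input sequence x (SRAMaddressX = 0)
  function w_element_address (
    x_size : unsigned(15 downto 0);
    h_size : unsigned(15 downto 0);
    row    : natural;
    col    : natural
  ) return unsigned;
end package;

package body addr_calc_sketch_pkg is
  function w_element_address (
    x_size : unsigned(15 downto 0);
    h_size : unsigned(15 downto 0);
    row    : natural;
    col    : natural
  ) return unsigned is
    variable addr : unsigned(31 downto 0);
  begin
    -- SRAMaddressW = SRAMaddressX + xSize, then index column-by-column
    addr := resize(x_size, 32) + to_unsigned(col, 16) * h_size + to_unsigned(row, 32);
    return addr(15 downto 0);
  end function;
end package body;

With row = 3 and col = 2, the function evaluates to xSize + 2 × hSize + 3, matching the W(3, 2) address given above.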

This method causes additional adders and multipliers to be instantiated outside the Mathematical Operation Unit and leads to higher area and energy consumption in the logic. However, the two alternatives would be either to remove these input ports and use generic integers instead, making the design usable only when the problem parameters match the generic integers used for instantiation, or to use larger memory units to be able to fit data of various sizes. The first alternative makes the design non-dynamic, and the second alternative would cost much more than the current implementation in terms of area, power consumption and delay. Finally, to optimize the design even further, the addresses for both the DRAM and the SRAM could be calculated only once at the Top level. Then, the SRAM addresses could simply be passed to both controllers as inputs, and the DRAM addresses could be passed to the Top Controller and to the unit responsible for loading the initial data into the DRAM (if such functionality exists) outside this design. This way the hardware used for address calculation would be instantiated only once instead of twice. Currently this method is not implemented, to keep the interfaces simple.


3.4.2 Top Controller

The Top Controller is basically an FSM running multiple iterations of the LSTM algorithm to produce the final result. A procedure starting from the very beginning will be described to give a complete understanding of the Top Controller functionality. When a valid input sequence is presented, it is stored both in the DRAM and in BMU0. When the start signal becomes 1, the Top Controller starts loading Wf, bf and ht−1 (which is a vector of zeros in the first iteration) from the TMU (DRAM) onto BMU0 (SRAM0). When loading is complete, it changes sem0 to the executing state and, if BCready is 1, it puts 1 on BCstart and ft on func. The Bottom Controller immediately starts performing the ft operation using the data existing in BMU0 (SRAM0). The Top Controller starts loading data for the next operation, it, onto BMU1 (SRAM1). When Wi, bi and ht−1 have been loaded onto BMU1 (SRAM1), the Top Controller goes into a storing state where it waits for the Bottom Controller to finish computing the ft equation. When the Bottom Controller has finished, it changes the BMU0 (SRAM0) privileges to loading. The Top Controller is alerted by that, understands that the ft operation is complete, and starts storing the results of ft into the TMU (DRAM) for future use. Also, the it operation command is sent to the Bottom Controller and the privileges on BMU1 (SRAM1) are changed to executing. This way the Bottom Controller starts operating on it as soon as the Top Controller starts storing the results of ft. When storing is complete, the Top Controller moves to another state where it loads the data for the ct equation onto BMU0 (SRAM0). This loading/storing procedure, jumping from one Bottom Memory Unit to the other and commanding the Bottom Controller to perform the computations, is repeated until all equations in the LSTM algorithm have been completed. The hidden layer vector is then returned (through the h port) with a positive validOut signal and is also saved in the TMU (DRAM) for future use. The Fully Connected layer computation then starts right away. The data required for calculating yt has already been loaded onto BMU0 while ht was being calculated. While the Bottom Controller works on the yt computation, the data for the next iteration's ft+1 (Wf, bf and ht) is loaded onto BMU1 (SRAM1). When the Top Controller has finished loading data and the Bottom Controller has finished the yt calculation, the yt vector is returned (through the y port) with a positive validOut and is also saved in the TMU (DRAM) for further use. Finally, the Softmax operation begins by simply issuing the SM command to the Bottom Controller and changing the privileges on BMU0 (SRAM0) to executing again, as the yt vector, which is the only data required for this operation, is already in BMU0 (SRAM0). When the Bottom Controller finishes the computation of the Softmax layer, the privileges of BMU0 (SRAM0) are changed to loading, and the probability result vector prob is returned (through the prob port) along with a positive validOut and also saved in the TMU (DRAM). The next iteration can then start as soon as the next input sequence x is received by the Top Controller. The same procedure is repeated for every iteration of the algorithm, with only the initial BMU (SRAM) changing, as there is an odd number of jumps from one BMU (SRAM) to the other.


Figure 3.13: A simplified version of the Top Controller FSM.


The procedure described above concludes the functionality of the Top Controller. Figure 3.13 shows the simplified FSM, to give a better understanding of how the Top Controller moves from one step to the next throughout the entire procedure. Some states of the FSM, as well as some output logic, are not present in the figure for simplicity, as the actual functionality is very complex.

3.4.3 Bottom Controller

The Bottom Controller is responsible for receiving operation requests from the Top Controller and delivering the result. In order to carry out these requests, it must be granted access to one of the Bottom Memory Units (SRAMs). As described in the previous section, when the request comes through, the Top Controller has already loaded a BMU (SRAM) with the data required for the requested operation and changed the corresponding semaphore register to executing, to grant the Bottom Controller access privileges to it.

The Bottom Controller utilizes an FSM that consists of smaller FSMs matching the various operations. The top FSM has 3 states: IDLE, ONGOING and RETURN. During the IDLE state, counters are reset and the Bottom Controller awaits operation requests. If start becomes 1, the procedure moves to the ONGOING state and the requested operation is saved in a register. The ONGOING state includes a different action sequence for every equation in the LSTM algorithm, the Fully Connected layer and the Softmax layer. Each of the action sequences guides the Mathematical Operation Unit by issuing the correct operation requests, forwarding the correct data and storing the results back to the BMU (SRAM). If the requested operation is ft, it, ct or ot, the Bottom Controller uses the same FSM, as these equations are very similar and can be performed in almost the same way. The first step is to issue a request for a WxXHpB operation on the Mathematical Operation Unit, setting the sq field to tanh if the requested equation is ct, or to sigm if the requested equation is ft, it or ot. If the equation is Ct, the Bottom Controller requests an AoBpCoD operation by setting the op field of the instr record signal accordingly. In the case of ht, a tanhAoB operation is issued. For the Fully Connected layer, again a WxXHpB operation is requested, but this time the sq field is set to none. Finally, in the case of the softmax equation, an SM operation request is made.

The next step is to load and store data following the order the Mathematical Operation Unit expects. The ONGOING state has a different action sequence for each equation, as the loading and storing order differs in every case. The communication between the Mathematical Operation Unit and the Bottom Controller is limited to the validIn, validOut and instr signals. The order in which the data must be given to the Mathematical Operation Unit is known beforehand and, as xSize, hSize and ySize are the same for both Units, they are used to adjust the steps in both cases. This way, the Bottom Controller knows exactly when the Mathematical Operation Unit expects a chunk of data (a part of a vector or a matrix column) and which chunk that is. When the Mathematical Operation Unit returns valid data, the Bottom Controller already knows what result vector part to expect. This allows the valid output to be stored to the correct place in the BMU (SRAM). Every action sequence is basically an FSM going through the separate steps of performing the specific operation, as described in the Mathematical Operation Unit section. The Mathematical Operation Unit is focused on performing the requested operation steps correctly, assuming that all the inputs it receives are the ones it expects. The Bottom Controller makes sure that the input data is the same as the data the Mathematical Operation Unit expects at every step. Additionally, the Bottom Controller receives the Mathematical Operation Unit outputs and knows exactly what they correspond to. Figure 3.14 shows a simplified FSM of the Bottom Controller that briefly describes its functionality.

Figure 3.14: A simplified version of the Bottom Controller FSM.


This concludes the functionality of the Bottom Controller. This unit and the Mathematical Operation Unit are highly interconnected. If any changes are made to one of the units, the other one must be adjusted as well. This is because there is no detailed communication protocol between them, since one is not necessary for them to operate. All operations performed depend on the sizes of the vectors and matrices given by xSize, hSize and ySize. All steps of the operations require a number of cycles that depends only on those inputs. This makes it very easy to guide both units separately based on these inputs and keep them synchronized at the same time. The only exception is the Squash Unit. This is because it can either use the LUT logic (single cycle delay) or the Taylor implementation (multiple cycle delay) to produce an output. This delay cannot be calculated without extra steps and communication signals. But there is no need to go to this effort, as the Squash Unit operates in parallel to the other Units. The result is also saved temporarily in the Mathematical Operation Unit and is forwarded during a dummy cycle, as mentioned in a previous section. The Bottom Controller knows when this dummy cycle takes place and simply fetches the data. This way, the uncertainty of the Squash Unit delay is handled.

Another alternative would be to use a detailed protocol that would make the Bottom Controller independent of changes in the FSMs of the Mathematical Operation Unit. The logic required for this would not differ much in terms of area from what is already implemented. However, creating a protocol is not necessary for correct functionality and was therefore avoided to keep the design as simple as possible. Additionally, a protocol would also have to be updated whenever one of the units was changed in order to reach the full potential of the design, so it would not require less time to design and evaluate.

3.4.4 Top Memory Unit-DRAM

The final part that remains to be examined is the Top Memory Unit. This memory unit has to fit all the weight matrices, bias vectors, the previous cell state and hidden layer vectors, as well as all the resulting vectors of the LSTM, Fully Connected layer and Softmax layer equations. As SRAM leakage is very high and depends on memory size, a DRAM would be much more appropriate for this role. For the needs of this thesis work, though, memory units were only simulated. The behavior of the TMU is exactly the same as that of the BMU and its structure is identical, only on a larger scale.

There are some requirements for the design to function properly with the TMU. The addresses of each matrix and vector must already be known by the Top Controller Unit, so that the Top Controller knows where the requested data is. This was also discussed in the BMU section, where the same principle is used. Matrices must also be stored column by column to match the way matrices are stored in the BMUs. In the current implementation the TMU is preloaded with all the necessary data before the design is started (during the reset stage). However, an outside unit could load the TMU with new data while the rest of the structure is idle, waiting for start to initiate the procedure. This functionality can easily be added.
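As a sketch of what column-by-column storage implies for addressing (the function name and the one-element-per-word assumption are illustrative, not taken from the actual design), an element of a matrix stored in the TMU could be located as follows, with base being one of the addresses the Top Controller is assumed to know in advance:

// Hypothetical column-major address calculation, one matrix element per memory word.
function automatic int unsigned elem_addr(int unsigned base, int unsigned rows,
                                          int unsigned row, int unsigned col);
  return base + col * rows + row;  // each column occupies 'rows' consecutive words
endfunction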

This concludes the proposed design architecture. Each separate block's functionality was examined and combined with the other blocks to create a complete structure that can perform the LSTM algorithm together with a Fully Connected layer and a Softmax layer. The final architecture can iterate multiple times and perform the necessary calculations. The next step is to evaluate whether the design performs as expected and whether the results are meaningful. To do that, SystemVerilog was used to build a verification testbench that checks that the design behaves as expected and additionally gathers and compares result data to measure the error. The next chapter goes through the basics of the testbench design and presents the results concerning the accuracy of the design.


Chapter 4

Performance Results

This chapter goes through the verification procedure of the design and the evaluation of its results. The first section describes the basic structure of the testbench. The next section discusses how the output received from the design differed from the expected outcome and how this difference was evaluated.

4.1 Verification Testbench Structure

The verification testbench was constructed using SystemVerilog, which provides more verification features than a VHDL testbench. All of the blocks were of course tested, separately and in combination, with smaller VHDL testbenches during development, but the final testbench was designed in SystemVerilog and was of a much larger scale than the previous ones. Figure 4.1 shows the general structure of the testbench.

Figure 4.1: The structure of the verification testbench.


The Top entity is a module that instantiates an interface used to pass signals to and from the DUT. It also generates the reset and clock signals, which are handled separately. Additionally, it instantiates the DUT, whose ports are mapped to the interface. Finally, it instantiates Test, a program responsible for instantiating an Environment object. Test also sets some variables inside the Environment object: these variables define xSize, hSize and ySize. Furthermore, Test passes the generic integer values that were used for the DUT instantiation to the Environment, as they are needed for the driver control protocol. The number of iterations that the testbench is going to run is also defined in Test.

The Environment class is responsible for instantiating a Generator object and a Driver object and for running the generator and driver tasks in the correct order, ensuring the functionality of the testbench. The generator task must run first and must not advance simulation time. The Generator generates stimuli for the DUT. It first instantiates a DRAM object that is randomized based on constraints and contains all the necessary weight matrices, bias vectors and all other vectors required by the DUT. The DRAM object is placed first in a mailbox created in the Environment for communication between the Generator and the Driver. Next, multiple Transaction objects are generated, randomized and also placed in the mailbox. Transactions contain the input sequence vector x, and one Transaction object is generated for every iteration defined in Test. Both the input sequence and all other data in the DRAM contain randomized elements within the range of −1 to 1.

Finally, when the Generator task is finished the Driver tasks begin. The Driver has separate tasks for sending data from the DRAM, storing data to the DRAM and driving the interface connected to the DUT. The Driver gets a transaction from the mailbox and starts by performing the LSTM algorithm, the Fully Connected layer and the Softmax function using a software implementation; this is done to compare the result with the DUT outputs. While the DRAM tasks simply serve the interface ports that correspond to DRAM accesses, the other Driver task drives the rest of the interface ports and evaluates the results of the DUT by comparing them with the expected results calculated at the start of the iteration. When all iterations have been completed and all transactions have passed through, the simulation ends.
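The sketch below illustrates this class-based structure in SystemVerilog. It is a minimal, simplified version: class names, members and the handling of the DRAM object and the Driver are abbreviated and do not reproduce the exact testbench code.

// Minimal sketch of the Transaction/Generator/Environment structure described above.
class Transaction;
  int n;                                    // number of elements in the input vector
  rand logic signed [15:0] x[];             // input sequence vector in Q5.11 fixed point
  constraint c_size  { x.size() == n; }
  constraint c_range { foreach (x[i]) x[i] inside {[-2048 : 2048]}; }  // [-1.0, 1.0]
  function new(int xSize);
    n = xSize;
  endfunction
endclass

class Generator;
  mailbox #(Transaction) mbx;
  int xSize, iterations;
  function new(mailbox #(Transaction) mbx, int xSize, int iterations);
    this.mbx = mbx;
    this.xSize = xSize;
    this.iterations = iterations;
  endfunction
  // Generates all stimuli up front; does not advance simulation time.
  task run();
    for (int i = 0; i < iterations; i++) begin
      Transaction tr = new(xSize);
      if (!tr.randomize()) $error("transaction randomization failed");
      mbx.put(tr);                          // unbounded mailbox, so put() completes immediately
    end
  endtask
endclass

class Environment;
  mailbox #(Transaction) mbx = new();       // Generator-to-Driver communication
  Generator gen;
  // A Driver object would also be constructed here; it pops transactions from mbx,
  // computes a 64-bit floating point reference result and drives/checks the DUT interface.
  function new(int xSize, int iterations);
    gen = new(mbx, xSize, iterations);
  endfunction
endclass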

Assertion properties were added to Test to check for unexpected behavior of the interface. This helped uncover some incorrect behavior in the DUT protocol for external communication. However, it was more important to ensure the correct internal functionality of the DUT and that the results received from the DUT were valid. To achieve that, the DRAM input task in the Driver, responsible for storing data from the interface, was enhanced to also update some variables with the received data. These variables were cross-checked with the expected results every iteration, and the error was printed in the log or saved to a text file for further evaluation. The same was done for all the results.

This was a brief description of the basic aspects of the verification testbench. It is important to go into more detail about how the DUT performance is evaluated in the testbench.


4.2 Error and Performance

In order to evaluate DUT performance, the first step is to produce a reference result to which the DUT output can be compared. It must be noted that an already trained weight and bias set was not used in the scope of this performance evaluation. The evaluation only checks whether the DUT can perform sufficiently accurate operations and produce results that match the equations of the LSTM algorithm, the Fully Connected layer and the Softmax function, and measures the accuracy of the produced results. The ability of the design to perform speech recognition or other similar tasks that LSTM neural networks are theoretically capable of was not evaluated in this part.

In order to compare the results of the operations performed in the DUT, the same operations must be performed with a much higher precision representation. To achieve that, 64-bit floating point variables of the SystemVerilog real type were used throughout the software computation. The DRAM, however, contained 16-bit fixed point values because they were fed straight to the interface of the DUT. One alternative would be to make the DRAM contain SystemVerilog real numbers instead. The problem with this approach is that in most cases these real numbers cannot be accurately represented in the 16-bit fixed point format used in the DUT. This would mean that, even before any computation, the number driven into the DUT would differ from the one used in the software calculation. This is undesired, as the aim at this point is to isolate the error produced by the design. The numbers must therefore be exactly representable both in the 16-bit fixed point format and in the 64-bit floating point format. The same approach is used for the input sequence generated in the Transaction class.

The 16-bit fixed point values were therefore converted to 64-bit floating point values in the software calculation. The calculation performed all operations of the LSTM layer, Fully Connected layer and Softmax layer equations while maintaining this precision, and all results were saved in separate variables. When the software calculation was completed, the simulation continued so that the DUT could produce its own results. The error was then calculated for every equation result as the absolute value of the difference between the software result and the DUT result. This error can be printed in the log or written to a text file.
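A sketch of this comparison is shown below; the function names are illustrative. A Q5.11 value driven into or read from the DUT is converted to a double-precision real by dividing by 2^11, and the reported error is the absolute difference between the software reference and the converted DUT output.

// Convert a 16-bit Q5.11 fixed point value (5 integer, 11 fractional bits) to a real.
function automatic real q5_11_to_real(logic signed [15:0] v);
  return real'(v) / 2048.0;                 // 2048 = 2^11 fractional scaling
endfunction

// Absolute error between the 64-bit floating point reference and the DUT output.
function automatic real abs_error(real sw_result, logic signed [15:0] dut_result);
  real diff = sw_result - q5_11_to_real(dut_result);
  return (diff < 0.0) ? -diff : diff;
endfunction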

In order to evaluate whether the DUT performs well enough, a series of 1000 iterations with xSize = 30, hSize = 30 and ySize = 30 was simulated. Multiple such simulations were performed for different instantiations to check whether the instantiations based on different generic values caused inconsistencies, but the results were always identical. The first results were obtained using a Squash Unit that implemented only the LUT functionality described in a previous chapter. As mentioned there, this architecture caused a high average error, above 10% (calculated by relating the error to the range of −1 to 1), in the probability vector, and the error grew worse as the iterations proceeded. These first results were very inaccurate, and that led to the implementation of the combined Taylor and LUT architecture that was presented in the previous chapter.


The final design produced much more promising results, showing an average error of under 1% in the probability vector. Figures 4.2a, 4.2b, 4.3a, 4.3b, 4.4a and 4.4b show the error in absolute values.

(a) Maximum vs Average Error.

(b) Average Error.

Figure 4.2: Error graphs for the error in the hidden layer vector.


(a) Maximum vs Average Error.

(b) Average Error.

Figure 4.3: Error graphs for the error in the Fully Connected layer vector.


(a) Maximum vs Average Error.

(b) Average Error.

Figure 4.4: Error graphs for the error in the Softmax layer vector.

As can be observed from the figures above, the average error is highest in the Fully Connected layer. This happens because the result of this layer is not squashed into the range (−1, 1), as happens with h_t. The probability vector elements are also divided by the sum of all elements of e^{y_t}, which further reduces the average error in the probability vector. However, the probability vector produces some very high maximum error values at certain points. After careful examination, these were traced to one specific type of error. In a previous chapter it was discussed that an Accumulator Register was required to represent numbers outside the range [−16, 15.9995] provided by the 16-bit fixed point representation used throughout the design. The Accumulator Register holds the sum of all elements of e^{y_t} in the 32-bit fixed point format described before, so that it can be used as the divisor in the Softmax equation. The resulting vector e^{y_t}, however, was then saved in the BMU for further use and its values were truncated to the original 16-bit fixed point format used in the rest of the design. The extreme maximum error cases are elements of this vector whose values exceeded the [−16, 15.9995] range and were saturated to fit in it. Even though the Accumulator Register allows the divisor to lie outside this range, the dividend is still limited by it. Additionally, the average error rises above 0.0005 only when these extreme cases occur. This means that, excluding these cases, the rest of the design limits the error in the probability vector to approximately 0.00015.

These extreme error cases could be avoided by enhancing the Squashing Unit to produce a 32-bit fixed point result and by performing another exponential computation in the second stage of the Softmax layer algorithm presented in Chapter 3. Instead of saving the result produced in the first stage, thus truncating it to 16 bits, and reusing it in the second stage, an additional computation could be performed and the result saved in 32-bit registers. This method would remove the extreme error cases, but it requires additional control logic, more 32-bit registers and more cycles to execute the Softmax layer equations. Due to limited time this configuration was not implemented, even though it would almost certainly improve the accuracy of the design.
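The effect can be illustrated with a small hypothetical saturation function. Assuming a wider intermediate value that keeps the same 11 fractional bits but has more integer bits, writing it back into the 16-bit Q5.11 format clips everything outside [−16, 15.9995], which is exactly what produces the isolated large errors:

// Hypothetical write-back of a wider intermediate (same 11 fractional bits assumed)
// into the 16-bit Q5.11 format; out-of-range values saturate to the format limits.
function automatic logic signed [15:0] sat_to_q5_11(logic signed [31:0] wide);
  if (wide > 32'sd32767)       return 16'sh7FFF;   //  +15.99951171875
  else if (wide < -32'sd32768) return 16'sh8000;   //  -16.0
  else                         return wide[15:0];
endfunction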

The error can be limited further by using a higher precision representation throughout the design. A 32-bit fixed point representation with 27 instead of 11 fractional bits can potentially increase the accuracy of the final results. It would also likely remove the need for the additional logic described in the previous paragraph to eliminate the extreme error cases in the probability vector.
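To put the two formats in perspective: the current format with 11 fractional bits resolves steps of 2^-11 ≈ 4.9 × 10^-4, while a 32-bit format with 27 fractional bits (keeping 5 integer bits and thus the same [−16, 16) range) would resolve 2^-27 ≈ 7.5 × 10^-9, several orders of magnitude finer than the average errors reported above.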

This concludes the performance evaluation of the design with respect to the produced results of the LSTM algorithm. The next step is to move on to synthesis in order to measure the area, delay and energy consumption of the proposed design.


Chapter 5

Synthesis Results

In order to perform logic and physical synthesis, the design had to go through some minor changes, which are described in the next section. The section after that discusses the results of the logic and physical synthesis and their meaning. Finally, the last section goes through possible improvements and future work concerning this design and the thesis subject in general.

5.1 Synthesis Preparation

The Synopsys Design Vision tool was used to perform synthesis. Any packages that were incompatible with Synopsys were removed and the logic was redesigned accordingly to behave the same as before. Additionally, all adders, multipliers and dividers in the design were replaced with the equivalent DesignWare components for optimal performance. Finally, it was necessary to completely remove the two SRAM units, as they were not part of the synthesis process. The SystemVerilog testbench was adjusted to the final structure and prepared to handle the synthesis-generated netlist.

5.2 Results

Logic and physical synthesis were performed for 4 different configurations of the proposed design in a 40 nm technology. Each configuration is described by OUsize, SQnum, ACsize and Dsize, the variables mentioned in Chapter 3 that are used to parametrically instantiate blocks. The clock frequency for all simulations and measurements was set to 200 MHz (a period of 5 ns) and synthesis was successful for all 4 configurations. The tables below show each configuration along with the area and power reports extracted from the Cadence Innovus physical synthesis tool [14].

The configurations that were synthesized and tested can be seen in the following table:


Configuration   1    2    3    4
OUsize          8    16   32   64
SQnum           1    4    4    8
ACsize          1    4    8    8
Dsize           1    1    2    4

Table 5.1: Configuration parameters

Configuration                   1        2        3        4
Average Power (mW)              9.038    16.84    33.64    59.83
Total Energy/iteration (nJ)     88.663   81.337   87.464   132.523
Area (µm²)                      121207   233730   371709   737712

Table 5.2: Energy and area measurements

The OUsize parameter describes the size of the vectors that the architecture can handle, as described in Chapter 3. As a result, the register file contains 5 × OUsize × 16 flip-flops for storing these registers, plus another 5 registers to store a valid signal for the next cycle. Additionally, the Addition and Multiplication units contain OUsize adders and multipliers respectively. SQnum defines the number of Squashing Units in the design, ACsize defines the number of serial adders in the Accumulation Unit and, finally, the Dsize parameter describes the number of parallel 4-stage pipelined Division Units. These configurations were successfully synthesized down to the physical level, and the area, timing and power analysis was performed on the post-layout, placed-and-routed design. The results can be seen in Table 5.2.
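As a concrete example of these counts: configuration 1 (OUsize = 8) implies 5 × 8 × 16 = 640 data flip-flops in the register file, while configuration 4 (OUsize = 64) implies 5 × 64 × 16 = 5120, which helps explain why the area grows strongly with OUsize in Table 5.2.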

The synthesized configurations were also simulated to provide data about throughput and SRAM accesses. All configurations were tested with the same simulation-specific parameters for ease of comparison. The simulation parameters were set to xSize = 30, hSize = 30 and ySize = 30, which correspond to the size of the input vector, the size of the hidden layer and the number of classes. The simulated SRAMs are instantiated with ports sized as described in Chapter 3. Throughput was measured during simulation by counting how many cycles it takes for the design to deliver the probabilities for each class, meaning it has to go through the LSTM layer, the Fully Connected layer and the Softmax layer. Table 5.3 shows the results of the simulations for each configuration.

The average power roughly doubles from one configuration to the next. That is because the most drastic change between these configurations is the vector size (the OUsize parameter), which affects the entire architecture of the design, so this behavior is expected. It can also be noticed that the total energy per iteration increases considerably in the last configuration. That happens because the throughput data from the performed simulations was used to obtain the time required for a single iteration, which was then multiplied by the average power to get the energy per iteration.


Configuration                    1        2        3        4
Cycles/Iteration                 1962     966      520      443
Throughput (Iterations/Cycle)    0.0005   0.001    0.0019   0.00225
Throughput (Iterations/sec)      101936   207039   384614   451466
SRAM read accesses               1339     643      313      307
SRAM write accesses              1342     674      340      186

Table 5.3: Throughput and memory access measurements
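As a sanity check of these figures, the throughput in iterations per second follows directly from the 200 MHz clock and the cycle count: for configuration 1, 200 × 10^6 / 1962 ≈ 101 936 iterations per second, and for configuration 4, 200 × 10^6 / 443 ≈ 451 467, which matches the table up to rounding.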

The simulation was performed with xSize = 30, hSize = 30 and ySize = 30. These problem parameters, however, are not very demanding for the 4th configuration. The 3rd configuration can operate on vectors of up to 32 elements and can therefore go through an entire matrix-to-vector multiplication without needing to split it horizontally. The same holds for the 4th configuration, although it can operate on vectors of 64 elements. This means that in every cycle only 30 out of 64 parallel adders and multipliers are used; the rest remain idle during the matrix-to-vector multiplication. The reason for the high energy per iteration compared to the other configurations is therefore that the problem parameters are not equally demanding for all of them. If the problem parameters were xSize = 30, hSize = 64 and ySize = 30, the total energy per iteration of the last configuration would be similar to that of the other configurations. This shows that, even though the design can operate under various combinations of xSize, hSize and ySize, the throughput and energy consumption can differ from one configuration to another. The behavior of the design can be changed to match the expected problem parameters and fit the needs in terms of delay, power consumption and area.
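As a concrete check of this calculation using Tables 5.2 and 5.3: for configuration 1, 1962 cycles × 5 ns = 9.81 µs per iteration, and 9.81 µs × 9.038 mW ≈ 88.7 nJ, which matches the reported 88.663 nJ; for configuration 4, 443 × 5 ns = 2.215 µs and 2.215 µs × 59.83 mW ≈ 132.5 nJ.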

The throughput and cycles/iteration in the table correspond to the initial iteration (right after reset). The initial iteration is the worst-case iteration, as it takes more cycles than every following iteration; this is because the weight and bias data for the next iteration start being loaded into the SRAM during the Softmax layer calculation. Additionally, the Squashing Unit block that performs the sigmoid, the hyperbolic tangent and the exponential function does not return a result after a fixed number of cycles. If the input is in the LUT range, the result is returned on the next cycle. However, if the input is within the Taylor range, the result is returned after a number of cycles that depends on which function is performed. In the exponential case in particular, the delay can vary between a single cycle and 9 cycles before the result is returned to the Operation Unit. The cycles/iteration and throughput therefore correspond to the worst-case scenario, in which all input values require a result from the Taylor implementation unit (9 cycles of delay). To obtain these results, a separate simulation was performed with the same parameters but with data forced to lead to this behavior.


5.3 Conclusions

The LSTM design discussed in this work shows promising results both in terms of accuracy and from a synthesis perspective. The results show that synthesizing such a design is possible. Furthermore, the design can be adjusted to fit various needs by choosing appropriate instantiation parameters (generics). The current design can handle any combination of input vector size, number of classes and hidden layer size, as the instantiation is independent of them. In order to achieve better results in terms of delay, area or power consumption, the instantiation is based on separate generics that allow the design to be changed according to the corresponding preferences. This level of adaptability makes it possible to create various combinations and review the behavior of each one.

Even though this thesis work resulted in an adaptable design and a verification hierarchy that provides the basis for a complete design evaluation, it is still limited by some factors that require further investigation. One important limitation is the absence of evaluation with a trained dataset. The tests performed evaluated the mathematical accuracy of the equations computed by the design (both before and after synthesis), but there was no evaluation with actual trained data for the LSTM. This is due to limited time, which did not allow further investigation of how an evaluation of that scale could be performed. Another important point is that this thesis work does not include an extended examination of possible memory hierarchy solutions; the memory units were not synthesized and their behavior is simulated in all evaluation procedures.

5.4 Future Work

The work of this thesis provides a starting point for further investigation of some aspects of the specific design and of LSTM ASIC implementation in general. First, the design can be adjusted to apply the same logic using a 32-bit or even wider fixed point format, which would most likely deliver much better accuracy and also reduce the possibility of the extreme error cases discussed earlier. Second, this design implements a unidirectional LSTM algorithm; it can be improved and optimized for bidirectional versions of the algorithm. Another aspect of the design that can be examined further is structural-level parallelism. Currently the design offers various options for arithmetic-level parallelism, but only implements structural-level parallelism within some of its blocks. The design could be extended with additional solutions that offer parametric pipelining and parallelism within its blocks. Moreover, the design can be adjusted to operate in parallel with more instances of itself to perform faster computations and require less memory space and power. This principle can be applied by splitting the design into separate instances, each capable of performing only one of the required equations, or by creating multiple instances that can perform all operations and feeding them sequences of input vectors instead of a single vector per calculation.


References

[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.

[2] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM", Neural Computation, vol. 12, pp. 2451–2471, 1999.

[3] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, "ESE: Efficient speech recognition engine with compressed LSTM on FPGA", CoRR, vol. abs/1612.00694, 2016. arXiv: 1612.00694. [Online]. Available: http://arxiv.org/abs/1612.00694.

[4] A. Graves, A. Mohamed, and G. E. Hinton, "Speech recognition with deep recurrent neural networks", CoRR, vol. abs/1303.5778, 2013. arXiv: 1303.5778. [Online]. Available: http://arxiv.org/abs/1303.5778.

[5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, May 2009, issn: 0162-8828. doi: 10.1109/TPAMI.2008.137.

[6] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, Recurrent neural network based language model, Jan. 2010, vol. 2, pp. 1045–1048.

[7] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks", in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06, Pittsburgh, Pennsylvania, USA: ACM, 2006, pp. 369–376, isbn: 1-59593-383-2. doi: 10.1145/1143844.1143891. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143891.


[8] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr, "Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC", in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Aug. 2016, pp. 1–4. doi: 10.1109/FPL.2016.7577314.

[9] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey", CoRR, vol. abs/1503.04069, 2015. arXiv: 1503.04069. [Online]. Available: http://arxiv.org/abs/1503.04069.

[10] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count", in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, 2000, pp. 189–194. doi: 10.1109/IJCNN.2000.861302.

[11] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures", Neural Networks, vol. 18, no. 5, pp. 602–610, 2005, IJCNN 2005, issn: 0893-6080. doi: https://doi.org/10.1016/j.neunet.2005.06.042. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608005001206.

[12] F. A. Gers and E. Schmidhuber, "LSTM recurrent networks learn simple context-free and context-sensitive languages", IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1333–1340, Nov. 2001, issn: 1045-9227. doi: 10.1109/72.963769.

[13] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", ArXiv e-prints, Jun. 2014. arXiv: 1406.1078 [cs.CL].

[14] Cadence Innovus implementation system, https://www.cadence.com/content/cadence-www/global/en_US/home/tools/digital-design-and-signoff/hierarchical-design-and-floorplanning/innovus-implementation-system.html, Accessed: 2018-05-23.


TRITA-EECS-EX-2018:153

www.kth.se