DIGIT SERIAL PROCESSING PIPELINING

7/31/2019 DIGIT SERIAL PROCESSING PIPELINING


SEMINAR REPORT: PIPELINING OF DIGIT-SERIAL PROCESSING ELEMENTS IN RECURSIVE DIGITAL FILTERS

Dept. of ECE, Sree Buddha College of Engineering

CHAPTER-2

DIGITAL FILTERS

There are several reasons why digital filters have become more common in electronic

systems over the years. Like many digital systems today, digital filters are often implemented

in a computer using a high-level programming language. This results in a short development

time and makes them flexible and highly adaptable, since changing the filter characteristics

simply implies changing some variables in the code. Analog filters on the other hand are

implemented using analog components, such as inductors and capacitors, which must be

carefully tuned. This makes analog filters harder to develop and modify. Another advantage

with digital design is that the characteristics of digital components do not change over time.

Digital systems are also unaffected by temperature variations. Advances in CMOS processes have resulted in higher packing density and lower threshold voltages, leading to a

considerable decrease in power consumption, which further explains the increased interest in

digital filters.

Today, frequency-selective digital filters are important and common components in modern

communication systems. Like their analog counterparts, digital filters are used to suppress

unwanted frequency components. A linear, time-invariant and causal filter can be described
by a difference equation (1)

Applying the Z-transform we get

(2)

The function described by Eq. (1) is recursive, since the computation requires the values of
former output samples. Since the impulse response of the filter described by Eq. (2) is
infinite, these filters are known as infinite impulse response (IIR) filters. In the case where
a_k = 0 for k = 1 to N, the function described by Eq. (2) is a finite impulse response (FIR)
filter. FIR filter structures are non-recursive.
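The recursive and non-recursive cases can be illustrated with a minimal sketch. The equations themselves are not legible in this copy, so the sketch assumes one common convention: feedforward coefficients b_k act on the input and feedback coefficients a_k (k = 1 to N) act on former outputs with a minus sign. The function name and the coefficient values are illustrative only.

```python
def iir_filter(b, a, x):
    """Direct evaluation of an assumed difference equation
    y(n) = sum_{k=0..M} b[k]*x(n-k) - sum_{k=1..N} a[k]*y(n-k).
    a[0] is a placeholder and is not used."""
    y = []
    for n in range(len(x)):
        # Non-recursive part: weighted sum of present and former inputs.
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Recursive part: weighted sum of former outputs.
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


impulse = [1.0] + [0.0] * 7

# All a_k = 0 for k >= 1: FIR, the impulse response dies out after len(b) samples.
fir_response = iir_filter([0.5, 0.5], [1.0], impulse)

# One feedback coefficient: IIR, the impulse response never becomes exactly zero.
iir_response = iir_filter([1.0], [1.0, -0.5], impulse)
```

With these example values the FIR response is zero after two samples, while the IIR response halves at every sample and stays non-zero.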

The recursive nature of the IIR filter can cause these filters to become unstable. It is therefore

necessary to perform stability analysis when designing IIR digital filters, especially under finite

wordlength conditions. This is not the case for FIR filters, which cannot become unstable. FIR





filters can also be designed with exact linear phase. The main drawback of FIR filters is that

they require higher filter orders than IIR filters to achieve a certain filter specification. The

higher filter order makes FIR filters larger to implement in hardware than the corresponding

IIR filters.

Recursive algorithms require extra attention due to potential problems with stability, finite
word length effects, etc. A class of realizations of IIR filters that fulfills these requirements is
Wave Digital Filters (WDFs), which are derived from low-sensitivity analog structures and
inherit their low sensitivity during the transformation to digital structures. A special type
of WDF is the Lattice Wave Digital Filter (LWDF), which consists of two allpass filters in
parallel. This type of realization of IIR filters is superior in many ways: low passband
sensitivity, large dynamic range, robustness, high modularity, and suitability for high-speed and
low-power operation. The stopband sensitivity is high, but this is not a problem
since the coefficients are fixed.

The allpass filters can be realized in many ways; a structure of interest is a
cascade of first- and second-order Richards allpass sections connected with circulator
structures. The first- and second-order Richards allpass sections using the symmetric two-port
adaptor, with the corresponding signal-flow graphs, are shown in Fig. 2.1, and the whole filter is
shown in Fig. 2.2.

Fig. 2.1 First- and second-order Richards allpass sections.





2.1 LATTICE WAVE DIGITAL FILTERS

Lattice wave digital filters have a regular structure. An example of a lattice wave digital filter
is shown in Fig. 2.3. The filter consists of two parallel allpass branches whose outputs are added or
subtracted at the output of the filter. The allpass branches are often realized by cascading
first- and second-order sections. This structure has good properties regarding dynamic range,
stability, and coefficient sensitivity in the passband, but has very high sensitivity in the
stopband. In practice, the coefficients can be truncated to a small number of bits, which is an
important property, since it results in less complex processing elements.

Fig 2.3 Lattice Wave Digital Filter

The structure is regular, with a small number of adaptor operations. The regular structure

makes it easy to implement large filters by reuse of the processing elements. Further, the

recursive loops are contained within each first- and second-order section of the filter, making

pipelining and interleaving of operations straightforward.

Bireciprocal lattice wave digital filters are a subset of half-band filters with a certain symmetry in the attenuation function around ωT = π/2. A distinguishing feature is that half of the filter





coefficients are zero. Hence, half of the adaptor operations are removed with a reduction of

the workload by 50%, and moreover, the maximal sample frequency is increased by a factor

of two since the critical loops now contain two delay elements. This subset of lattice wave

digital filter is therefore particularly attractive from an implementation point of view. An

example of a bireciprocal lattice wave digital filter is shown in Fig. 2.4.

Fig 2.4 Bireciprocal Lattice Wave Digital Filter





CHAPTER-3

PROCESSING ELEMENTS

The digit-serial approach is interesting for performing a trade-off between area, throughput,
and power consumption. In recursive algorithms the critical loop limits the maximum
throughput. Binary arithmetic can be classified into three groups based on the number of bits
processed at a time. In the digit-serial approach a number of bits, the digit-size D, is
processed concurrently. If D is unity the arithmetic reduces to bit-serial arithmetic, while for
D = Wd, where Wd is the data word length, it reduces to bit-parallel arithmetic. Hence, all
arithmetic can be regarded as digit-serial, with bit-parallel and bit-serial as two special
cases.

Timing of the operations is conveniently defined in terms of clock cycles. The execution time
in terms of clock cycles for a digit-serial Processing Element (PE) is defined as

$E_{PE} = W_d / D$    (3)

where Wd is required to be an integer multiple of the digit-size.

The throughput or sample rate of a system is defined as the reciprocal of the time between
two consecutive sample outputs. Introducing the clock period Tclk,

$f_{sample} = 1 / (E_{PE} \, T_{clk})$    (4)

The minimum clock period is determined by the delay in the critical path, where the critical
path is defined as the path with the longest delay between two registers.

The latency of a system is defined as the time needed to produce an output value from the
corresponding input value. For digit-serial arithmetic it is defined as the time needed to
produce an output digit from an input digit of the same significance. The actual latency is
conveniently divided into algorithmic latency, L, and clock period as in Eq. (5). The
algorithmic latency is determined by the operation and pipeline level, which will be discussed
in the following section.

$Latency = L \, T_{clk}$    (5)
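The timing definitions above can be tried out numerically. The following is a minimal sketch, assuming that the execution time is Wd/D clock cycles and that one sample is produced per execution time; the function names and the clock period are hypothetical and not taken from the report.

```python
def execution_time_cycles(wd: int, d: int) -> int:
    """Execution time of a digit-serial PE in clock cycles (Wd / D)."""
    if wd % d != 0:
        raise ValueError("Wd must be an integer multiple of the digit-size D")
    return wd // d


def sample_rate(wd: int, d: int, t_clk: float) -> float:
    """Reciprocal of the time between two consecutive sample outputs."""
    return 1.0 / (execution_time_cycles(wd, d) * t_clk)


def latency_seconds(algorithmic_latency_cycles: int, t_clk: float) -> float:
    """Actual latency = algorithmic latency (clock cycles) times clock period."""
    return algorithmic_latency_cycles * t_clk


WD = 15          # internal word length used in the filter example of this chapter
T_CLK = 5e-9     # hypothetical clock period in seconds

# D = 1 is bit-serial, D = Wd is bit-parallel, everything between is digit-serial.
cycles_per_sample = {d: execution_time_cycles(WD, d) for d in (1, 3, 5, 15)}
```

In a real design the clock period also depends on D, since a larger digit-size lengthens the carry path, so the clock period would not be a constant as it is here.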

As an example of a recursive algorithm, a bireciprocal third-order Lattice Wave Digital
Filter (LWDF) is used. The filter shown in Fig. 3.1 has earlier been implemented using parallel
carry-save arithmetic and bit-serial arithmetic. The filter coefficient is 0.375. Hence, the
number of fractional bits is Wf = 3. In digit-serial arithmetic, a sign-extension circuit is





required in front of the multiplier to produce the most significant bits. The multiplication

increases the word length by 3 bits, which are removed by the quantization block in order

to keep the word length constant in the loop. By allowing the output y to be quantized and

sign-extended, these operations can be located as shown in Fig.3.1. The input word length is

assumed to be 12 bits, which is sign-extended with Wf bits at the input, yielding an internal

word length of 15 bits, in order to equalize the execution time for all operations. The Wf

extra bits are also sufficient to prevent overflow in all nodes. Thus, no over/underflow

saturation circuits are required, which otherwise would have reduced the throughput. The

extra bits do not decrease the throughput, since it is independent of Wd. The maximum

throughput for this slightly modified filter algorithm is

(6)

where the terms are the latencies of the adder, the multiplier, and the combined quantization/sign-extension circuit, respectively.

Fig. 3.1 A third-order bireciprocal LWDF

3.1 PIPELINING AND INTERLEAVING

Pipelining is a transformation that is used for increasing the throughput by increasing the

parallelism and decreasing the critical path. It can be applied at all abstraction levels during

the design process. In the pipelining at the algorithm level, additional delay elements are

introduced at the input or output and propagated into the non-recursive parts of the algorithm

using retiming. Retiming is allowed in shift-invariant algorithms, i.e., a fixed amount of delay

can be moved from the inputs to the outputs of an operation without changing the behavior

of the algorithm. Ideally the critical path should be broken into parts of equal length. Hence,

operations belonging to several sample periods are processed concurrently. The latency

remains the same if the parts have equal length, otherwise it will increase. Pipelining always

changes the properties of the algorithm, e.g., algorithm pipelining increases the group delay

and increases the parallelism.





Another approach to increasing the throughput of sequential algorithms without increasing the
latency is interleaving. Different samples are distributed onto different processing elements
working in parallel. This relaxes the throughput requirements on each processing element,
and the clock frequency can be reduced according to Eq. (7).

(7)

3.2 LATENCY MODELS

Pipelining at the arithmetic or logic level increases the throughput by shortening the
critical path, which enables a higher clock frequency. By inserting registers or D flip-flops, the
critical path is split, ideally into parts of equal length. This decreases the
minimum clock period, but at the same time increases the number of clock cycles before the
result is available, i.e., the algorithmic latency increases. Since the pipeline registers are not
ideal, i.e., their propagation time is non-zero, pipelining the operations beyond a certain limit
will not increase the throughput but only increase the latency. The level of pipelining of
processing elements is referred to as the Latency Model (LM) order. It was introduced for
modelling the latency of different logic styles implementing bit-serial arithmetic. LM 0 is
suitable for implementations in static CMOS using standard cells without pipelining; LM 1
corresponds to an implementation with one pipeline register, or to dynamic logic styles with
merged logic and latches; and LM 2 corresponds to pipelining internally in the
adder, suitable for standard-cell implementation. The concept was generalized to digit-serial
arithmetic, with pipeline registers introduced after a number of additions. We use the reciprocal,
i.e., an extension of the LM concept to include fractional latency model orders, defined by Eq. (8)

$LM = 1 / n_{Add}$    (8)

where $n_{Add}$ is the number of adders between each pipeline register. By this we keep the
relationship between logic style and LM order, and in addition we gain a tuning variable for
the pipeline level.

Hence, LM 0 corresponds to a non-pipelined adder, shown in Fig. 3.2. The algorithmic latency
in terms of clock cycles equals zero, $L_{add0} = 0$, while the clock period, Tclk0, is determined
by the critical path, denoted with dashed arrows in Fig. 3.2. A cascade of n LM 0 adders yields an
increased critical path and thus an increased Tclk0, and the total algorithmic latency becomes

$L_{tot} = n \, L_{add0} = 0$    (9)





LM 1 corresponds to a pipelined adder, according to Fig. 3.2, with algorithmic latency
$L_{add1} = 1$, while the clock period, Tclk1, is determined by the delay of one adder. A cascade
of n LM 1 adders yields an unchanged Tclk1, and the algorithmic latency of the cascade
becomes

$L_{tot} = n \, L_{add1} = n$    (10)

A fractional LM order is obtained by a cascade of n - 1 LM 0 adders followed by one LM 1 adder at the
end, shown in Fig. 3.2. The algorithmic latency becomes

$L_{tot} = (n - 1) \, L_{add0} + L_{add1} = 1$    (11)

and the clock period is determined by the critical path, given by

$T_{clk} = (n - 1) \, t_{add0} + t_{add1}$    (12)

Fig 3.2 Adders with different LM orders, and adders in cascade with their corresponding critical path.
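The trade-off between latency and clock period for fractional LM orders can be sketched as follows. This is an illustration only: it assumes a group of n_add adders in which the last one carries the pipeline register, and the adder delays (in picoseconds) are hypothetical values, not measurements from the report.

```python
from fractions import Fraction


def lm_order(n_add: int) -> Fraction:
    """Fractional latency model order: reciprocal of adders per pipeline register."""
    return Fraction(1, n_add)


def cascade_timing(n_add: int, t_add0_ps: int, t_add1_ps: int):
    """Timing of (n_add - 1) non-pipelined adders followed by one pipelined adder.

    Returns (algorithmic latency in clock cycles, minimum clock period in ps).
    """
    latency_cycles = 1  # only the final, registered adder contributes a cycle
    t_clk_ps = (n_add - 1) * t_add0_ps + t_add1_ps
    return latency_cycles, t_clk_ps


T_ADD0_PS = 1000  # hypothetical delay through a non-pipelined adder
T_ADD1_PS = 1200  # hypothetical delay through an adder plus its register

# n_add = 1 is LM 1; larger n_add moves towards LM 0 (fewer registers, longer clock).
timing_table = {lm_order(n): cascade_timing(n, T_ADD0_PS, T_ADD1_PS) for n in (1, 2, 3)}
```

The table shows the tuning effect of the LM order: the latency per group stays at one clock cycle while the clock period grows with the number of adders between registers.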

3.2.1 Multiplication Latency

In fixed-function DSP algorithms, the multiplications are usually performed by applying the data
sequentially and the constants in parallel, using serial/parallel multipliers. The result of the
multiplication has an increased word length, i.e., the sum of the data word length, Wd, and the
number of fractional bits of the constant, Wf, yielding Wd + Wf. To maintain the word length, a
truncation or rounding must be performed, discarding the Wf least significant bits. A digit-serial





multiplier produces D bits each clock cycle. Hence, the result during the first Wf/D clock cycles is
discarded and the algorithmic latency of the multiplication with LM 0 becomes

$L_{mult0} = W_f / D$    (13)

It is not required that the number of fractional bits of the coefficient is an integer multiple of
the digit-size. However, an alignment stage then has to be used in cascade for alignment of the
digits. The amount of hardware is reduced but the algorithmic latency increases to the nearest
integer clock cycle, $\lceil W_f / D \rceil$. The execution time for the multiplier becomes

$E_{mult} = W_d / D + \lceil W_f / D \rceil$    (14)

Introducing a pipeline register at the output yields an LM 1 multiplier, and the algorithmic
latency increases by one clock cycle. However, the execution time is unchanged.

$L_{mult1} = \lceil W_f / D \rceil + 1$    (15)

3.3 MAXIMAL SAMPLE FREQUENCY

Non-recursive algorithms have no fundamental upper limit on the throughput, in contrast to
recursive algorithms, which under hardware speed constraints have a maximum throughput fmax
limited by the total latency:

$f_{max} = \min_i \{ N_i / T_{Lat,i} \}$    (16)

where $T_{Lat,i}$ is the total latency and $N_i$ is the number of delay elements in recursive loop i.
The reciprocal, Tmin, is referred to as the minimum iteration period bound. It is convenient to
divide Tmin into Lmin and Tclk as shown in Eq. (17)

$T_{min} = L_{min} \, T_{clk}, \quad L_{min} = \max_i \{ \sum_k L_{k,i} / N_i \}$    (17)

where $L_{k,i}$ is the algorithmic latency for operation k in recursive loop i. Lmin is
determined by the algorithm and the LM order, and can be known before the actual minimal
iteration period bound. The loop that yields Lmin is called the critical loop.

To get a clear idea the bireciprocal lattice wave digital filter is shown with its critical loop

marked with dashed lines in Fig 3.3.
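The loop bound can be computed mechanically once the operation latencies in each recursive loop are known. The sketch below assumes that each loop is described by the algorithmic latencies of its operations (in clock cycles) and its number of delay elements; the loop contents shown are hypothetical and are not read from Fig. 3.3.

```python
from fractions import Fraction


def min_latency(loops):
    """L_min: the largest ratio of summed operation latency to delay elements.

    loops is a list of (operation_latencies_in_cycles, number_of_delay_elements).
    """
    return max(Fraction(sum(latencies), n_delays) for latencies, n_delays in loops)


def critical_loop_index(loops):
    """Index of the loop that determines L_min."""
    ratios = [Fraction(sum(latencies), n_delays) for latencies, n_delays in loops]
    return ratios.index(max(ratios))


def max_sample_rate(loops, t_clk):
    """f_max = 1 / (L_min * T_clk)."""
    return 1.0 / (float(min_latency(loops)) * t_clk)


# Hypothetical example: loop 0 holds two adders (1 cycle each) and a multiplier
# (1 cycle) around two delay elements; loop 1 holds one adder around one delay.
example_loops = [([1, 1, 1], 2), ([1], 1)]
l_min = min_latency(example_loops)  # 3/2 clock cycles per sample
```

A non-integer Lmin such as 3/2 is legitimate; it is the case in which the cyclic schedule needs non-uniform delays between successive sample intervals.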





Fig 3.3 Example of critical loop in IIR Filter

(18)

3.4 ALGORITHM TRANSFORMATIONS

Algorithm transformations are used for increasing the throughput of algorithms or

decreasing the power consumption. A tremendous amount of work has been done in the past

to increase the throughput. However these results can be traded for low power using the

excess speed to lower the power supply voltage, i.e., voltage scaling.

Pipelining, described earlier, is based on retiming. In non-recursive algorithms, which are
limited by the critical path, pipelining at different design levels is easily applied to split the
critical path and thereby increase the throughput. In recursive algorithms, which are limited by
the critical loop, pipelining at the algorithm level is impossible. Methods called Scattered
Look-Ahead pipelining or Frequency Masking Techniques can be applied to introduce more delay
elements in the loop and thus increase fmax. The former method introduces extra poles, which
have to be cancelled by zeros in order to obtain an unaltered transfer function; this may cause
problems under finite word length conditions. Hence, Frequency Masking Techniques are
preferable. However, pipelining at the logic level, splitting the critical path by increasing
the LM order, is easily applied.

Another possibility is to use numerical equivalence transformations in order to reduce the
number of operations in the critical loop. These transformations are based on the
commutative, distributive, and associative properties of shift-invariant arithmetic operations.
Applying these transformations often yields a critical loop containing only one addition, one
multiplication, and a quantization. By rewriting the multiplication as a sum of power-of-two
coefficients, the latency can be reduced further, yielding, for bit-serial arithmetic, L = Wf,
independent of the LM order. The multiplication is rewritten as a sum of shifted inputs,





introducing several loops. The shifting is implemented using D flip-flops in bit-serial
arithmetic. The power-of-two coefficient that requires the most shifting, i.e., has the largest
latency, is placed in the inner loop and the other coefficients are placed in outer loops in order
of decreasing amount of shifting. This technique can be extended to digit-sizes larger than
one.
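Rewriting a constant multiplication as a sum of shifted inputs can be illustrated with the coefficient 0.375 used in the filter example, which equals 2^-2 + 2^-3. The sketch below works on integers and is an arithmetic illustration only; it does not model the loop structure or the flip-flop based shifting, and the function names are hypothetical.

```python
def power_of_two_shifts(coefficient: float, wf: int):
    """Right-shift amounts s such that coefficient = sum of 2**-s, using wf fractional bits."""
    scaled = round(coefficient * (1 << wf))
    if scaled / (1 << wf) != coefficient:
        raise ValueError("coefficient is not representable with wf fractional bits")
    return [s for s in range(1, wf + 1) if (scaled >> (wf - s)) & 1]


def shift_add_multiply(x: int, shifts, wf: int) -> int:
    """Multiply x by the coefficient using shifts and adds only.

    The sum is formed at the extended word length (wf extra bits) and the wf
    least significant bits are then discarded, i.e. the result is truncated.
    """
    extended = sum(x << (wf - s) for s in shifts)
    return extended >> wf


WF = 3
shifts = power_of_two_shifts(0.375, WF)   # 0.375 = 0.011 in binary
product = shift_add_multiply(40, shifts, WF)
```

For this coefficient the largest shift is 3, equal to Wf, which matches the statement that the term requiring the most shifting determines the latency.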

3.5 IMPLEMENTATION OF DSP ALGORITHMS

The implementation of the DSP algorithms can be performed by a sequence of descriptions

introducing more information at each step of the design.

3.5.1 Precedence Graph

The precedence graph shows the executable order of the operations: which operations
have to be computed in sequence, and which operations can be computed in parallel. It
can also serve as a basis for writing executable code for implementation on signal
processors or on a personal computer. As an example, the IIR filter in Fig. 3.3 yields
the precedence graph shown in Fig. 3.4. The critical loop is denoted with a dashed line.

Fig 3.4 Precedence Graph for IIR Filter

3.5.2 Computation Graph

By adding timing information for the operations to the precedence graph, a computation

graph is obtained. This graph can serve as a base for the scheduling of the operations. The

shaded area indicates the execution time, while the darker area indicates latency of the

operations. The unit on the time axis is clock cycles. The timing information is then later used

for the design of the control unit.

At this level of the design the LM order has to be chosen. Here we use three different LM

orders 0, 1/3, and 1. As an example, the computation graphs over a single sample interval





for the IIR filter are shown in Fig. 3.5. The As Soon As Possible (ASAP) scheduling approach

has been used for the scheduling of the IIR filter.

Fig 3.5 Computation graphs for LM 0, 1/3 and 1 implementations of the IIR filter using D=3
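ASAP scheduling on a precedence graph can be sketched in a few lines: every operation starts as soon as all of its predecessors have produced their results. The graph and the latencies below are hypothetical and only loosely resemble one loop of the filter; they are not taken from Fig. 3.4 or Fig. 3.5.

```python
def asap_schedule(latency, predecessors):
    """Earliest start time (in clock cycles) of each operation.

    latency maps operation name -> algorithmic latency in clock cycles.
    predecessors maps operation name -> operations whose results it needs.
    """
    start = {}

    def visit(op):
        if op not in start:
            # An operation may start when its slowest predecessor has finished.
            start[op] = max(
                (visit(p) + latency[p] for p in predecessors.get(op, [])),
                default=0,
            )
        return start[op]

    for op in latency:
        visit(op)
    return start


# Hypothetical precedence graph: add1 -> mult -> add2 -> quant, with add1 also
# feeding add2 directly.
op_latency = {"add1": 1, "mult": 2, "add2": 1, "quant": 0}
op_predecessors = {"mult": ["add1"], "add2": ["add1", "mult"], "quant": ["add2"]}

start_times = asap_schedule(op_latency, op_predecessors)
```

The difference between the time a result becomes available and the time its consumer starts (here, add1 feeding add2) is the slack that later has to be implemented as shimming delay.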

3.6 OPERATION SCHEDULING

Operation scheduling is generally a combinatorial optimization problem. The operations in
data-independent DSP algorithms are known in advance, and a static schedule, which is
optimal in some sense, can be found before the next design level. The contrary case is called
dynamic scheduling, which is performed at execution time. For the recursive algorithm we are
interested in obtaining maximally fast and resource-minimal schedules, i.e., schedules that
reach the minimum iteration period bound with a minimum number of processing
elements. For the non-recursive algorithm the aim is to obtain a schedule that can be unfolded
to an arbitrary degree.

3.7 UNFOLDING AND CYCLIC SCHEDULING OF RECURSIVE ALGORITHMS

To attain the minimum iteration period bound it is necessary to perform loop unfolding of the
algorithm and cyclic scheduling of the operations belonging to several successive sample
intervals if

(a) the execution time for the processing elements is longer than Lmin, or
(b) the critical loop(s) contain more than one delay element.

For digit-serial arithmetic a lower bound on m, the number of concurrently processed samples,
is derived in the following. The execution time for a digit-serial PE is $E_{PE}$. The ratio
between $E_{PE}$ and $L_{min}$ determines the number of digits that need to be processed
concurrently, i.e., it equals the number of concurrent samples:

$m \ge E_{PE} / L_{min} = W_d / (D \, L_{min})$    (19)





scheduling for several sample periods just by taking m sets of operations and delaying them in time. In the case with a non-integer Lmin, a non-uniform delay alternating between the two nearest integer numbers of clock cycles between the sets can be used. This aligns all samples with the same clock phase. An example of a maximally fast schedule using D = 3 and LM 1 is shown in Fig. 3.6.

Fig 3.6 Maximally Fast scheduling
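The number of sample intervals that have to be scheduled concurrently can be estimated from the ratio between the PE execution time and Lmin. The sketch below assumes that m is the smallest integer not less than that ratio; the function name and the Lmin values are hypothetical.

```python
import math
from fractions import Fraction


def concurrent_samples(wd: int, d: int, l_min: Fraction) -> int:
    """Smallest integer m with m >= E_PE / L_min, where E_PE = Wd / D."""
    if wd % d != 0:
        raise ValueError("Wd must be an integer multiple of the digit-size D")
    e_pe = Fraction(wd, d)
    return math.ceil(e_pe / l_min)


# With Wd = 15 and D = 3 the execution time is 5 clock cycles.
m_fractional = concurrent_samples(15, 3, Fraction(3, 2))  # hypothetical L_min = 3/2
m_no_unfold = concurrent_samples(15, 3, Fraction(5, 1))   # hypothetical L_min = 5
```

When Lmin is at least as long as the execution time, m equals 1 and no unfolding is needed for this reason.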

3.8 MAPPING TO HARDWARE

Fig 3.7 Hardware Structure for Unfolded Filter

An isomorphic mapping of the operations in the cyclic schedule to hardware yields a

maximally fast and resource minimal implementation. The branches in the computation graph

with different start and stop time are mapped to shift-registers, i.e., the time difference is





referred to as slack or Shimming Delay (SD). The shimming delays are registers implemented
using D flip-flops. A property of a maximally fast implementation is that the critical loop has
no SD, i.e., the delay elements in the critical loop are totally absorbed by the latencies of the
operations in the loop. Isomorphic mapping to hardware also yields low power consumption,
since the amount of dataflow without processing of data is low. The processing elements have
a high degree of utilization; hence, power-down or clock gating would only increase the
complexity without reducing the power consumption. Fig. 3.7 shows the IIR filter structure
after unfolding.

3.3 THE SUM PRODUCT ALGORITHM

The sum-product algorithm is a generic algorithm that operates on a factor graph through a
sequence of local computations at every factor-graph node. The computation rules consist
only of multiplications and additions, hence the name sum-product algorithm. The local
results are passed as messages along the edges of the factor graph. The algorithm can be used
to compute the exact function summary in a factor graph that forms a tree, that is, has no
loops. The sum-product algorithm can also be applied to factor graphs with cycles, where
it results in an iterative algorithm without a natural termination. This makes the function
summary non-exact. However, the decoding of turbo codes and low-density parity-check codes,
some of the most exciting applications, reflects precisely this situation with a factor graph
having cycles, and with some precautions the algorithm performs very well.

A mathematical representation of the sum-product algorithm can be observed in the following
example. Let us consider the case with the real-valued global function as in equation (1) that
may represent a conditional joint probability mass function of a collection of discrete random
variables, given some observation y. We are then interested in the function summary

$p(x_1 \mid y) = \sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} g(x_1, x_2, x_3, x_4, x_5) = g_1(x_1)$    (4)

$p(x_1 \mid y) = \sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} f_A(x_1) \, f_B(x_2) \, f_C(x_1, x_2, x_3) \, f_D(x_3, x_4) \, f_E(x_3, x_5)$    (5)





Figure 3.3 Gathering separate product terms in factor graph to compute g1(x1)

We observe immediately that g1(x1) can be calculated by only knowing fA and fBCDE. The
latter can be computed by just knowing fB, fC and fDE. Finally, fDE can be calculated by just
knowing fD and fE. The products can be assembled in the factor graph as shown in Fig. 3.3.
With each node in the factor graph we can now imagine an associated processor which is
capable of doing local products and local function summaries. The processors communicate
by sending messages to and receiving messages from neighbouring nodes. The messages are
whole distributions, that is, the outcomes of the function nodes, which are passed from one
factor-graph node to another connected by an edge. In general, they represent discrete
probability mass functions, but continuous probability distributions are also included in the
framework. Through the message-passing behaviour, all information needed to calculate
g1(x1) becomes available at x1. Hence, the information is distributed fully bi-directionally on all
branches of the network if we calculate the function summary for all variables.
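The saving obtained by gathering the product terms can be checked numerically. The sketch below assumes binary variables and arbitrary, hypothetical factor tables; it compares the brute-force summary over all of x2 to x5 with the nested computation that first forms fDE, then fBCDE, and finally multiplies by fA.

```python
from itertools import product

# Hypothetical factor tables over binary variables (values chosen arbitrarily).
def f_a(x1):
    return (0.6, 0.4)[x1]

def f_b(x2):
    return (0.3, 0.7)[x2]

def f_c(x1, x2, x3):
    return 0.9 if x3 == (x1 ^ x2) else 0.1

def f_d(x3, x4):
    return ((0.8, 0.2), (0.5, 0.6))[x3][x4]

def f_e(x3, x5):
    return ((0.5, 0.1), (0.2, 0.9))[x3][x5]


def g1_brute_force(x1):
    """Sum the full product over every combination of x2, x3, x4, x5."""
    return sum(
        f_a(x1) * f_b(x2) * f_c(x1, x2, x3) * f_d(x3, x4) * f_e(x3, x5)
        for x2, x3, x4, x5 in product((0, 1), repeat=4)
    )


def g1_nested(x1):
    """Same summary, assembled from local products and local summaries."""
    def f_de(x3):
        # Message towards x3: summaries of f_D over x4 and of f_E over x5.
        return sum(f_d(x3, x4) for x4 in (0, 1)) * sum(f_e(x3, x5) for x5 in (0, 1))

    f_bcde = sum(
        f_b(x2) * f_c(x1, x2, x3) * f_de(x3) for x2 in (0, 1) for x3 in (0, 1)
    )
    return f_a(x1) * f_bcde


marginals = [(g1_brute_force(x1), g1_nested(x1)) for x1 in (0, 1)]
```

Both functions return the same values up to rounding, but the nested version evaluates far fewer products, which is the point of passing messages along the graph.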





CHAPTER-4

PROBABILITY CALCULUS MODULES

4.1 BUILDING BLOCK OF PROBABILITY PROPAGATION NETWORK

In the previous sections we have introduced the basics of factor graphs and the sum-product
algorithm, which can be run on these graphs. The messages passed from one node to another
often have the meaning of probabilities or probability density functions. To construct
probability propagation networks, we consider the following building block.

Figure 4.1 Building Block Of Probability Propagation Network

The building block computes a discrete probability mass function pz from the discrete
probability mass functions px and py as follows:

Let X, Y, and Z be finite sets. Let px and py be the input probability mass functions defined on
the sets X and Y, respectively. Let pz be the output probability mass function on Z defined by

$p_z(z) = \gamma \sum_{x \in X} \sum_{y \in Y} p_x(x) \, p_y(y) \, f(x, y, z), \quad \forall z \in Z$    (6)

where f is a {0,1}-valued function on X × Y × Z. Using this equation we can calculate the output
probability mass function; the scaling factor γ is added in order to adjust the sum of the
probabilities to 1. The {0,1}-valued functions f can be illustrated by trellis modules. Such a
trellis module is a bipartite graph with labelled edges. The set of left-hand vertices is X, the
set of right-hand vertices is Z, and an edge between x ∈ X and z ∈ Z with label y ∈ Y exists if
and only if f(x, y, z) = 1. Conversely, the trellis module uniquely defines f. In the context of
coding theory, the binary indicator functions f





are known as local indicator functions of the factorized global code membership indicator

functions.

4.2 SOFT LOGIC GATES AND TRELLIS REPRESENTATIONS

EQUAL GATE

The function f(x, y, z) for this particular case is equal to 1 if and only if x = y = z, and
f(x, y, z) = 0 otherwise. The corresponding trellis is shown below.

Figure 4.2 EQUAL Gate Trellis

The probability formulation of the output distribution is given by

$p_z(0) = \gamma \, p_x(0) \, p_y(0), \quad p_z(1) = \gamma \, p_x(1) \, p_y(1)$

The equal gate is simply a local function node that multiplies two probabilities
together. The scaling factor γ normalizes the result so that the output probabilities sum to 1.

SOFT XOR GATE

Another common function f is defined as f(x, y, z) = 1 if and only if z = x XOR y, and
f(x, y, z) = 0 otherwise. The corresponding trellis is shown in Fig. 4.3.





Figure 4.3 XOR Gate Trellis

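The probability formulation of the output distribution is given by

$p_z(0) = p_x(0) \, p_y(0) + p_x(1) \, p_y(1), \quad p_z(1) = p_x(0) \, p_y(1) + p_x(1) \, p_y(0)$

Here, the soft XOR both multiplies and adds probability distributions. Each output value
requires two multiply operations.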
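Both soft gates are instances of the general building block with a different indicator function f. The sketch below evaluates the building block directly from its definition for binary variables; the function names and the input distributions are hypothetical, and the scaling factor is applied by dividing by the sum of the unscaled outputs.

```python
from itertools import product


def building_block(px, py, f, z_values=(0, 1)):
    """Output distribution p_z for input distributions px, py and indicator f.

    px and py map each value to its probability; f(x, y, z) returns 0 or 1.
    """
    unscaled = {
        z: sum(px[x] * py[y] * f(x, y, z) for x, y in product(px, py))
        for z in z_values
    }
    total = sum(unscaled.values())  # reciprocal of the scaling factor
    return {z: value / total for z, value in unscaled.items()}


def equal_indicator(x, y, z):
    return int(x == y == z)


def xor_indicator(x, y, z):
    return int(z == (x ^ y))


# Hypothetical input distributions over {0, 1}.
p_x = {0: 0.8, 1: 0.2}
p_y = {0: 0.6, 1: 0.4}

p_equal = building_block(p_x, p_y, equal_indicator)
p_xor = building_block(p_x, p_y, xor_indicator)
```

For the soft XOR the unscaled outputs already sum to 1, so scaling changes nothing; for the equal gate the scaling is essential, because the products of the input probabilities alone do not sum to 1.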





CHAPTER-5

CIRCUIT IMPLEMENTATION

This section is concerned with the hardware implementation of the soft logic gates discussed
in the previous section. For local functions f it was shown that the sum-product algorithm
yields operations on probability distributions that behave similarly to Boolean logic gates. This
section documents the building blocks and circuits that accomplish these probabilistic
additions and multiplications.

It is not obvious how these circuits will operate. Will they be voltage-mode or current-mode
implementations? How can adds and multiplies be constructed? Adds are simple in analog
circuitry; essentially they are available for free. Recall that Kirchhoff's Current Law states
that the sum of all currents into a node is zero. By connecting one wire to another,
the addition of currents is realized.

Multiplication is not straightforward in analog CMOS. In 1975, Gilbert's current multiplier
used both the exponential and logarithmic characteristics of BJTs to multiply currents.
Performing multiplication in analog CMOS requires restricting the operating
range of the MOS transistor. This operating range is known as the subthreshold region, and it
is closely associated with Carver Mead. Mead's work focused on exploiting the nonlinearities of
MOS transistors operating in the subthreshold region. The goal was to create circuits that
used little power and behaved like biological functions. By operating in the subthreshold
region, Mead found that CMOS technology behaved like bipolar junction transistors, that is,
the CMOS implementations modeled the characteristics of BJTs that are necessary for current
multiplication.

5.1 SIGNAL SUMMATION

Summing signals is easily accomplished in the current domain, that is, when signals are

represented by currents. This is due to Kirchhoff's current law, which states that

the sum of all currents along the incoming branches to a given node is equal to the sum of all

currents of the outgoing branches. If only one outgoing branch exists, it automatically carries

the sum current of all incoming branches. This means that current summation is simply done

by connecting wires.





where UT is the thermal voltage, kT/q. As expected, the current response is an exponential
function of Vgs. The deviation from the exponential behavior occurs when the flow of Isat
becomes restricted.

Figure 5.1: Saturation current of an NMOS transistor as a function of gate voltage.

We have just defined a critical operating parameter in our probability circuits. From the graph,

P1 (z = 1) = Isat = 10 nA and P1 < 10 nA when P1 (z ≠ 1)    (9)

P0 (z = 1) = Isat = 10 nA and P0 < 10 nA when P0 (z ≠ 1)    (10)

under the constraint that

P1 + P0 = 1    (11)

so that the currents are complementary.

5.3 A SUBTHRESHOLD CMOS MULTIPLIER

In order to mimic the functions described by the soft logic gates, a circuit must be

created to allow for the multiplication. The multiplier circuit is nothing more than a

collection of several current mirrors, and a differential pair - a simple, but extremely

elegant solution. We begin by discussing several critical operation parameters of a

diode connected transistor, show how it creates a current mirror, define the transfer





characteristics of a differential pair, and then show how these circuits create a current

multiplier.

CURRENT MIRROR

Figure 5.2 Current Mirror

Current mirrors are a key element in analog integrated circuits. They are used to duplicate
currents or to fold or cascade parts of the circuit in order to reduce the supply-voltage
requirements. The simplest structure of an NMOS current mirror is shown in Fig. 5.2. Both
transistors have to be in saturation, and we rely on perfect matching for ideal operation. A
brief analysis shows that the copy errors due to the finite output resistance of the mirror
transistors are relatively large. The output resistance is improved by making the transistors
longer. Note that the current mirror can be operated in both strong inversion and weak
inversion. However, current mirrors operated in strong inversion match better than those
operated in weak inversion. For a given WL, where W is the width and L the length of the
transistor, the matching of the transistors is best if the current mirror is designed to operate
with a large VGS, that is, to force the transistor into deep strong inversion. Matching can be
improved by increasing the active transistor area WL. This reduces the relative effect of random
fabrication errors.

The design of a current mirror generally starts by imposing a certain voltage swing on the
current mirror at the nominal current. This leads to a W/L ratio in a given semiconductor
technology. The minimum voltage between drain and source that still allows operation in

the saturated region can be derived from the given gate-source voltage. Finally, the active

transistor area WL is adapted until the desired level of matching is achieved. Note, however,

that the parasitic gate capacitance is augmented by the same amount. Hence, the increase of

the parasitic capacitance will reduce the maximum operation speed. Generally, several

parameters have to be traded off during the design process.





In an NMOS device, the drain current Isat is an exponential function of Vs and Vd.

$$I = I_0\, e^{(V_g - V_s)/U_T} \tag{12}$$

If this equation is solved in terms of Vg the result is a logarithmic current to voltage converter

$$V_g = V_s + U_T \log\frac{I}{I_0} \tag{13}$$

In a current mirror, a diode-connected transistor shares its gate node with another transistor of

the same type. If the source potentials are fixed and the devices are in saturation, they act as

current sources. Similarly, if both devices have the same geometry and share the same source

potential, then they source the same current.

This is illustrated by the current mirror shown above. Current IIN sets the gate voltage Vg for both

transistors. Consequently, Vg sets the current IOUT. In this example, Vs,T1 = Vs,T2 and Vd,T1 =

Vg,T1 = Vg,T2. Using the above equations, it is easily verified that IOUT = IIN because of the

logarithmic-exponential characteristics.
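As a short worked check of this claim, the steps can be written out as follows. This is only a sketch under the assumption that T1 and T2 are perfectly matched (same I_0 and U_T) and both obey equations (12) and (13); the subscripts T1 and T2 follow the text above.

```latex
\begin{align*}
V_g &= V_{s,T1} + U_T \log\frac{I_{IN}}{I_0}
      && \text{(diode-connected T1, equation (13))} \\
I_{OUT} &= I_0\, e^{(V_g - V_{s,T2})/U_T}
      && \text{(T2, equation (12))} \\
        &= I_0\, e^{(V_{s,T1} - V_{s,T2})/U_T}\, e^{\log(I_{IN}/I_0)}
         = I_{IN}\, e^{(V_{s,T1} - V_{s,T2})/U_T}
\end{align*}
```

With Vs,T1 = Vs,T2 the exponential factor equals one, so IOUT = IIN: the logarithm taken by T1 is undone by the exponential of T2.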

Current mirrors allow the current output to be scaled by either using different source voltage

potentials in which

$$I_{OUT} = e^{(V_{s1} - V_{s2})/U_T}\, I_{IN} \tag{14}$$

however we can also scale the current output by varying the ratio of transistor widths given as

$$I_{OUT} = \frac{W_2/L_2}{W_1/L_1}\, I_{IN} \tag{15}$$

The implementation of efficient and noise tolerant current scaling is important when a

decoding application is considered. Due to the multiplicative nature of the sum-product

algorithm and the fact that the multiplied probabilities will not always be unity, the current

level will tend to 0 if it is not scaled from time to time.
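The two scaling mechanisms of the current mirror can be tried out numerically. The following is a minimal sketch of an ideal weak-inversion mirror built from equations (12) and (13). The function names, the value of the thermal voltage, the value of I0 and the idea of folding the W/L ratio into the I0 of the output device are assumptions made for illustration, not part of the original design.

```python
import math

U_T = 0.0258  # thermal voltage in volts near room temperature (assumed value)


def gate_voltage(i, i0, vs=0.0, u_t=U_T):
    """Equation (13): gate voltage at which the device carries current i."""
    return vs + u_t * math.log(i / i0)


def drain_current(vg, i0, vs=0.0, u_t=U_T):
    """Equation (12): weak-inversion saturation current for gate voltage vg."""
    return i0 * math.exp((vg - vs) / u_t)


def mirror_output(i_in, i0=1e-15, vs1=0.0, vs2=0.0, wl_ratio=1.0, u_t=U_T):
    """Ideal mirror: T1 turns i_in into Vg, T2 turns Vg back into a current.

    wl_ratio = (W2/L2)/(W1/L1) is modelled as a scaling of the output
    device's I0 (an assumption of this sketch).
    """
    vg = gate_voltage(i_in, i0, vs1, u_t)
    return drain_current(vg, wl_ratio * i0, vs2, u_t)


i_in = 1e-9  # 1 nA input current (illustrative value)
same = mirror_output(i_in)                                      # plain copy
doubled_by_width = mirror_output(i_in, wl_ratio=2.0)            # W/L scaling
doubled_by_source = mirror_output(i_in, vs1=U_T * math.log(2))  # source-voltage scaling
```

In this idealized model a source-voltage difference of U_T log 2 (about 18 mV for the assumed U_T) has the same effect as doubling the W/L ratio of the output transistor.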

THE DIFFERENTIAL PAIR

As the final step towards building the Analog CMOS multiplier, the differential pair is

introduced.



Figure 5.3 DIFFERENTIAL PAIR

Consider the circuit shown above, comprising three NMOS transistors. The sources of M1

and M2, which are called the differential-pair transistors, are connected together at node V.

The third transistor, Mb, which is called the bias transistor, is supposed to sink a constant bias

current, Ib, from node V. The gate voltages of M1 and M2 are the inputs to this circuit and the

currents I1 and I2 are its outputs. For a circuit like the differential pair, we often express each

of the two input voltages, V1 and V2, in terms of a common-mode input voltage and a

differential-mode input voltage. Analogously, we also often express the two output currents,

I1 and I2, in terms of a common-mode output current and a differential-mode output current.

This circuit is quite powerful: by careful consideration of what voltage is applied at V1 and

V2, the current at the sources of the differential pair can be steered to appear at either output.

$$I_1 = I_b\, \frac{I_{IN1}}{I_{IN1} + I_{IN2}} \tag{16}$$

and

$$I_2 = I_b\, \frac{I_{IN2}}{I_{IN1} + I_{IN2}} \tag{17}$$
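The current steering described above can be illustrated with a small numerical sketch. It assumes that M1 and M2 are matched, operate in weak inversion according to equation (12), share the source node V, and that Mb is an ideal current source. Under these assumptions the ratio form of the two equations above becomes a ratio of exponentials of the gate voltages. The function name and the numerical values are hypothetical.

```python
import math

U_T = 0.0258  # thermal voltage in volts near room temperature (assumed value)


def differential_pair(v1, v2, i_b, u_t=U_T):
    """Split the bias current i_b between two matched weak-inversion devices.

    Each branch current is proportional to exp(V_k / U_T); the common source
    node adjusts itself so that the two currents add up to i_b.
    """
    m = max(v1, v2)  # subtract the larger voltage to avoid overflow in exp
    e1 = math.exp((v1 - m) / u_t)
    e2 = math.exp((v2 - m) / u_t)
    i1 = i_b * e1 / (e1 + e2)
    i2 = i_b * e2 / (e1 + e2)
    return i1, i2


i_b = 1e-6  # 1 uA bias current (illustrative value)
balanced = differential_pair(0.5, 0.5, i_b)  # equal inputs share i_b equally
steered = differential_pair(0.6, 0.4, i_b)   # 200 mV difference steers i_b to I1
```

A difference of a few U_T between V1 and V2 is already enough to steer almost the entire bias current into one branch, which is what makes the circuit usable as a current-mode multiplier core.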



Core Circuit For Matrix Multiplications

The fundamental circuit for matrix multiplication is as shown in Fig 5.4.

FIGURE 5.4 CORE CIRCUIT FOR MATRIX MULTIPLICATION

Its inputs are the currents Ix,i, i = 1, 2, ..., m, and the currents Iy,j, j = 1, 2, ..., n. Its outputs are

the currents Ii,j. All transistors are assumed to be weakly inverted MOS transistors.

The function of this circuit is given by

$$I_{i,j} = I_z \cdot \frac{I_{x,i}}{I_x} \cdot \frac{I_{y,j}}{I_y} \tag{18}$$

where

$$I_x = \sum_{i=1}^{m} I_{x,i}, \qquad I_y = \sum_{j=1}^{n} I_{y,j} \tag{19}$$

Now recall our equation,

$$p_Z(z) = \sum_{x \in X} \sum_{y \in Y} p_X(x)\, p_Y(y)\, f(x, y, z), \qquad \forall z \in Z \tag{20}$$

The application of the circuit of Figure 5.4 to the computation of equation (20) is now

straightforward. Let X = {x1, x2, ..., xm} and Y = {y1, y2, ..., yn}. The input terminals of the circuit are

fed with the currents Ix,i = Ix pX(xi) and Iy,j = Iy pY(yj), respectively. The output currents are then

equal to

$$I_{i,j} = I_z\, p_X(x_i)\, p_Y(y_j) \tag{21}$$



The computation is completed by summing, for each z in Z, the currents Ii,j for which

f(xi, yj, z) = 1.
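The two steps described above (pairwise products in the core circuit, followed by a selective summation) can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, the alphabets are taken to be {0, 1} so that indices and symbol values coincide, and the XOR indicator function is used merely as an example of f.

```python
def core_currents(px, py, i_z=1.0):
    """Output currents of the core circuit: I[i][j] = i_z * px[i] * py[j]."""
    return [[i_z * a * b for b in py] for a in px]


def sum_product(px, py, f, z_values, i_z=1.0):
    """For each z, sum the currents I[i][j] for which f(i, j, z) == 1."""
    currents = core_currents(px, py, i_z)
    return {
        z: sum(
            currents[i][j]
            for i in range(len(px))
            for j in range(len(py))
            if f(i, j, z) == 1
        )
        for z in z_values
    }


def xor_indicator(x, y, z):
    """Binary indicator function of the XOR constraint z = x XOR y."""
    return 1 if (x ^ y) == z else 0


# Example input distributions (illustrative values)
pz = sum_product([0.9, 0.1], [0.8, 0.2], xor_indicator, [0, 1])
```

For these inputs the four product currents are 0.72, 0.18, 0.08 and 0.02 (in units of Iz). The XOR indicator routes 0.72 and 0.02 to z = 0 and the other two to z = 1.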

CMOS Multiplier

There is not much work left to create this multiplier. In fact, all the necessary components

have been discussed. All that remains is to add a diode-connected transistor to each of the

inputs of the differential pair. Recall that the current Ib was steered by the dominant input voltage. By

adding diode-connected transistors to inputs V1 and V2, creating current mirrors at the inputs

of the differential pair, Ib is now controlled by currents instead of voltages.

Formally that is given by the following equations

$$I_1 = I_b\, \frac{I_{IN1}}{I_{IN1} + I_{IN2}}$$

and

$$I_2 = I_b\, \frac{I_{IN2}}{I_{IN1} + I_{IN2}}$$
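The step from voltage steering to current ratios can be sketched as a short derivation. It assumes matched weak-inversion devices obeying equations (12) and (13), an ideal bias current source, and diode-connected input transistors whose sources sit at a common potential, here called V_{s0} (a symbol introduced only for this sketch). V denotes the common source node of the differential pair, and k = 1, 2.

```latex
\begin{align*}
V_k &= V_{s0} + U_T \log\frac{I_{INk}}{I_0}
     && \text{(diode-connected input, equation (13))} \\
I_k &= I_0\, e^{(V_k - V)/U_T} = I_{INk}\, e^{(V_{s0} - V)/U_T}
     && \text{(pair transistor, equation (12))} \\
I_1 + I_2 = I_b \;&\Rightarrow\; e^{(V_{s0} - V)/U_T} = \frac{I_b}{I_{IN1} + I_{IN2}} \\
I_k &= I_b\, \frac{I_{INk}}{I_{IN1} + I_{IN2}}
\end{align*}
```

The unknown node voltage V drops out, so the outputs depend only on the ratio of the input currents and on Ib.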

5.4 IMPLEMENTATION OF SOFT XOR GATE

Let us start with the simplest module, the soft-XOR gate. It can be drawn directly using

its butterfly trellis section that has been derived from its binary indicator function f. Like all

modules with binary input distributions, it consists of six core transistors forming the

multiplication matrix and its characteristic connection pattern of the trellis. In fact the trellis

pattern will be directly visible on silicon if the devices are properly arranged. This fact may

be helpful in order to create automatic tools for generating such building blocks in a chip-

design environment. All product terms are used to build the output probability distribution pz.

The output terms are mirrored by the current mirrors on top of the kernel circuit. The input

currents Ix,i are also passed through an input current mirror. By doing this, the module gets

freely cascadable by simply connecting the output current vector Iz to one of the input current

vectors Ix and Iy, respectively, of the next circuit section. This method marks the simplest

way of interconnection of several building blocks. Note also that all the current mirrors may

equally well operate in the strong inversion region of MOS transistors, thereby having a

standard quadratic behaviour.



Figure 5.5 Soft XOR gate

The six NMOS transistors in the middle section perform the multiplication operation. The

PMOS transistors at the top two corners are the output transistors. They form current mirrors that

source current rather than sink it, which makes it possible to cascade this block with other

blocks to build a larger circuit. The current mirrors are also necessary for scaling the current. The input

probability currents are applied through diode-connected NMOS transistors. These diode-connected (gate

tied to drain) NMOS transistors at each input have a low input impedance, and the logic gates obtained are

current-mode circuits. Transforming the blocks to voltage-mode operation is easily achieved

by shifting the diode-connected NMOS transistors from the input to the output side. From the trellis

diagram and equations discussed earlier it is clear that the soft XOR gate must also perform two

additions. These are made possible by shorting the sources of T5 and T7, and of T6 and T8, each pair

giving the following realizations of PZ(0) and PZ(1), respectively.

$$p_Z(0) = p_X(0)\, p_Y(0) + p_X(1)\, p_Y(1), \qquad p_Z(1) = p_X(0)\, p_Y(1) + p_X(1)\, p_Y(0)$$

The circuit operation is straightforward: the currents Ix P(x=0) and Ix P(x=1) bias Vg of T11 and

T12, and of T13 and T14, respectively. T12 and T13 mirror the current. Transistors T9 and T10



The diode-connected transistors at each input have a low input impedance, and the soft logic

gates obtained are current-mode circuits. Transforming the blocks to voltage-mode operation

is easily achieved by shifting the diode connected NMOS from the input to the output side.

The current mirrors at the output make possible easy cascading of several blocks of soft

EQUAL gate.

The operation is simple, and the circuit differs only slightly from that of the soft XOR gate.

Here the transistors in the middle are not shorted to the gate terminals of the top transistors,

because the EQUAL operation requires no addition, only products.

Instead of being shorted, the drains of T6 and T7 are connected to VDD.

The voltage VBIAS is applied in order to normalize the probability currents, as in the equation.
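The difference between the two gates can be made concrete with a small behavioural sketch. It models only the probability arithmetic, not the transistor circuit. The function names are hypothetical, and the explicit normalization in the EQUAL gate is an assumption standing in for the normalization that the text attributes to the biasing.

```python
def soft_xor(px, py):
    """Soft XOR: each output term is a sum of two products."""
    return (
        px[0] * py[0] + px[1] * py[1],  # probability that x XOR y = 0
        px[0] * py[1] + px[1] * py[0],  # probability that x XOR y = 1
    )


def soft_equal(px, py):
    """Soft EQUAL: products only, followed by normalization to unit sum."""
    p0 = px[0] * py[0]
    p1 = px[1] * py[1]
    total = p0 + p1
    if total == 0.0:
        raise ValueError("inputs are contradictory: both products are zero")
    return (p0 / total, p1 / total)


# Example binary input distributions (illustrative values)
px = (0.9, 0.1)
py = (0.8, 0.2)
xor_out = soft_xor(px, py)
equal_out = soft_equal(px, py)
```

For the example inputs the EQUAL gate sharpens the belief in a zero (0.72/0.74, about 0.973), whereas the XOR gate output (0.74, 0.26) is less certain than either input. Without the normalization step the EQUAL outputs would sum to only 0.74, which is the loss of current level that the scaling has to compensate.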



CHAPTER-6

DESIGN CONSIDERATIONS

Several questions come to mind when presented with the above schematics. In specifying

parameters such as voltage and transistor sizes it is important to understand the context in

which the circuit will be performing. As mentioned previously, these circuits are well suited

for decoding applications. Why? First, they can indicate, at very little computational

cost, the probability that a bit is a one or a zero. Second, the biggest criticism of analog

circuits, the lack of accuracy, is not an issue in this application. This statement deserves

careful consideration.

Suppose that, after passing through several gates, we operate at 50% of the maximum current

value. We then leave ourselves vulnerable to noise contamination and may lose the meaning of

what we wanted to compute. Now suppose that, after several probability computations, we

scale our probability current using the W/L ratio or the source voltage. Because our circuits give

indications of a 1 or 0, worrying about whether or not we can accurately bring our current

back to its maximum operating value is irrelevant. Instead, we can preserve our computations

against the circuit's inherent noise by periodically "raising" the current level; this is

analogous to the natural error correction of digital logic, where an input can be pulled high or

set low after each stage of logic.
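The effect of periodic rescaling can be illustrated with a toy model. This is a sketch under strong simplifying assumptions: every stage is modelled as a uniform attenuation of both currents of a binary probability pair, and the rescaling is an ideal common gain. The function names and numbers are hypothetical.

```python
def rescale(currents, i_max):
    """Scale all currents by one common factor so that they sum to i_max."""
    total = sum(currents)
    return [c * i_max / total for c in currents]


def run_chain(currents, attenuations, i_max, rescale_every=None):
    """Pass a current pair through attenuating stages, optionally rescaling."""
    out = list(currents)
    for stage, a in enumerate(attenuations, start=1):
        out = [c * a for c in out]  # each stage loses part of the current level
        if rescale_every is not None and stage % rescale_every == 0:
            out = rescale(out, i_max)  # "raise" the level back to i_max
    return out


i_max = 1e-6                         # maximum operating current (illustrative)
start = [0.7 * i_max, 0.3 * i_max]   # currents encoding P(0) = 0.7, P(1) = 0.3
stages = [0.5] * 6                   # six stages, each halving the level

decayed = run_chain(start, stages, i_max)
restored = run_chain(start, stages, i_max, rescale_every=2)
```

Without rescaling, the total current drops to 1/64 of its starting value after six stages. With rescaling every second stage it returns to i_max, and in both cases the ratio of the two currents, which carries the information, is unchanged.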

6.1 DC OPERATING VOLTAGE

The supply voltage VDD is a trade-off between power consumption and signal noise. For

example, lowering VDD results in lower current draw from the supply and hence in

power savings. However, this comes at the expense of degraded transistor operation: the output

currents may not rise to the maximum operating level. Because some error is

tolerated, a large supply voltage is not necessary either. Instead, the voltage VDD should be

chosen to ensure maximum power savings while tolerating some percentage of

error.

In addition to VDD the circuit diagrams of the soft logic gates use a bias voltage called

VBIAS to assist in the probability calculation.

VBIAS = 25%(VDD). (22)



Also, the output of a current mirror, IOUT, can be scaled according to the difference in Vs.

Voltage VBIAS attempts to reduce this scaling effect and allow this second stage of current

mirrors to have a more ideal operation, that is,

IOUT = IIN

Because we can never be sure of the voltage Vs at the transistors in the differential pair, a bias

voltage is selected to reduce the scaling of IOUT.

6.2 TRANSISTOR SIZING

CMOS implementations have the benefit of occupying less die area than their BJT

counterparts in layout. Therefore, the circuit must operate correctly, but should occupy an

area smaller than a BJT implementation. With this constraint in mind, sizing in these circuits

employs, where possible, a minimum W/L ratio. In digital design, the sizing of the transistors

(in particular, the transistor width) allows the designer to favour different types of input

conditions. The designer does this to decrease the delay of that particular gate. In the case of

these circuits, widening the transistors does not decrease the rise time. In fact, widening the

transistors will simply increase power consumption: current levels will need to be raised, and

a larger supply voltage is needed to calculate the circuit's outputs. Therefore, digital

techniques to improve gate delay do not apply. Our only other recourse is to bias the transistors at a

level just below Vt, so that we operate at the maximum current density in the subthreshold

region.

Transistor length is also important. Short-channel effects, such as the Early effect, can

modulate a transistor's output current. By increasing the transistor length we can protect

against short-channel effects. As a rule of thumb, in all of the circuits presented in this

paper the length of the current mirror transistors was set to

Lcurrent_mirror = 6 Lmin

This sizing rule is needed for proper operation.

6.3 IMPLEMENTATION ISSUES

Topology-Induced Problems

Topology-induced problems arise, for example, when biasing a large probability

propagation network: severe problems may occur since no local matching can be guaranteed

for distributed biasing networks (e.g. distributed current mirrors for the current sources

needed in each cell). Since the geometrical dimensions may get very large, additional effects

such as non-zero resistance of long metal wires show up. This may affect signal tracks as

well as power-supply lines. It must be kept in mind during the design phase that a distributed

bias-network implemented with BJTs draws a considerable amount of base current which

causes large voltage drops on long metal tracks. In the extreme, these voltage drops may

prevent the whole network from working correctly. Comparable problems arise in digital

circuits for the clock distribution. There the solution is to balance the load in different

branches of a clock distribution tree instead of having one large single track. Adapted to our

networks, this would mean using local repeaters for the biasing circuits. Errors introduced by

these circuits are not critical, since all calculations rely on relative signal strength.

Construction of Large Analog Networks

A second, more general implementation issue is how to construct large analog computational

networks. Up to a few hundred transistors, an analog system may be drawn very easily if a

hierarchic design approach is chosen. But imagine a large factor graph of several hundreds or

thousands of individual nodes. Drawing the interconnection lines between the individual

building blocks by hand would be complex and error-prone. It is certainly a good idea not to

rely on your own drawing capability if a schematic can be generated by a computer program.

In general, it will not be possible to design a large analog network first-time-right without

computer aided design. By this we subsume not only computer-aided drawing (CAD), but

also computer-aided engineering (CAE), which includes much more than only the sketch of

schematics.

Testability

Another big issue of such large networks is testability. How can we guarantee that a circuit

leaving the wafer fab works as expected? Rudimentary tests such as checking the supply

current or verifying individual test blocks are generally not sufficient to guarantee the overall

functionality. Testing large digital circuits is much easier than doing the same thing for

analog networks. Boundary scan and JTAG test access ports are commonly used today for

looking inside the working digital circuit. They mostly rely on digital registers that can be

addressed and read out serially on certain circuit chip pins. Testing large analog systems is

much more difficult. The measuring circuitry should not modify (by creating additional loads

on the interesting nodes) the overall behaviour. Additionally, the resolution of measured

values should be better than the resolution of the actual circuit under test. This means that the

circuits for measuring have to be more precise, and are thus in general also more complex

and space consuming, than the circuit to verify. Even if a measurement circuit can be shared

among many circuit nodes to test, it may add a considerable overhead to the overall network.

It would therefore be desirable for the circuit's functionality to be guaranteed by design. One

possibility may lie in an information-theoretic approach that tries to quantify the impact

of individual error sources on an overall probability propagation network. Unfortunately, we

have not yet had the time to investigate such an approach, but it will be the subject of future

research.



CHAPTER-7

ISSUES AND FUTURE WORK

The key question is how to apply these circuits in decoding applications. We know that any

codeword or sequence of data can be represented by a corresponding trellis diagram. Each

individual trellis section corresponds to a function variable on a factor graph. The output

characteristics of that trellis depend on the local function f(x, y, z) that the variable is passed

into. Therefore, we can create a factor graph that is a visual depiction of this trellis diagram.

Furthermore, we can map our equal and XOR probability modules to this trellis, and observe

any given variable on that factor graph.

What happens if the trellis diagram does not have a specific entry point and exit point? In this

case the factor graph has no clear entry and exit point and becomes cyclical. This model

accurately describes any codes that are "tail-biting." Like a trellis of arbitrary states, the

possible paths of this project branch in every direction. In order to fully study the circuits

presented in this paper, the next step would be to lay out these schematics and compare the

performance of the ideal circuits presented with those simulated with interconnect delay. In

addition, taping out these schematics for actual test may show whether the MOSIS process

allows for proper analog behaviour. A really aggressive schedule would seek to fully

understand turbo codes, a high-performing channel encoding scheme, and move on to creating a

tool to automate schematic design and layout.

Besides the decoding applications, the probability propagation networks may be applied in

various other related domains such as the tracking of hidden-Markov models, widely used for

many pattern recognition tasks, and the inference on Bayesian networks, which appear in the

context of artificial intelligence problems.

Application of the probability-propagation calculus to other problems

Many problems can be described by factor graphs which in turn can be directly converted to

an analog probability-propagation network. It would be very interesting to apply the design

technique to other application fields such as artificial-intelligence problems that might appear

in on-line fault-detection circuits of complex systems.

Adaptive filters

By changing the signal representation from a probability-based interpretation to a real-valued

interpretation, the well-known equal gate and soft-XOR gate may be operated as real-value

adders and real-value multipliers, respectively. Hence, they represent the basic operations of

discrete-time filters. By making the filter taps adaptive, we could easily build adaptive FIR

and IIR filters. Adaptive FIR filters are commonly used for equalizing wire-line channels.

Joint channel-equalizer.

In the communications community, there exist several concepts of jointly equalizing a given

channel and decoding the transmission code. But all of them work in the digital domain and it

thus may make no sense to do decoding in the analog way, whereas the remaining part of the

receiver works digitally. So why not build most of the receiver front-end using our analog

probability networks? For example, the decision-feed-back equalizer (DFE) is a good

candidate for an analog network implementation, since all the basic operations can be

implemented using our generic building blocks. By doing so, we get one step closer to the

antenna or the line interface of a data communication system without flipping too much back

and forth between analog and digital.

All-analog receiver system

Our experience so far is that many individual blocks of a receiver system can be

implemented in analog electronics. Despite the fact that many renowned researchers postulate

software radio, that is, a system that consists merely of an A/D converter as close as possible

to the antenna and digital processors for signal processing, we think that for certain

demanding applications analog signal processing in an intelligent manner is the way to go.

Our long-term aim is an all-analog receiver system, that is, to have no digital signals before

the decoder block, since the analog decoder can perform an inherent A/D conversion. This

would potentially provide very efficient, very high-speed and ultra-low-power communication

systems as needed by today's e-society.



CONCLUSIONS

A technique for efficiently implementing the sum-product algorithm (or probability

propagation algorithm) in analog VLSI technology has been discussed. The described new

type of analog computing networks exhibits a natural match between probability theory and

transistor physics. The elementary modules of which these networks are composed include

probabilistic versions of standard logic gates. The obvious application of such networks is the

decoding of error-correcting codes. However, any factor graph in which all function nodes are of

degree larger than one can be mapped onto such analog networks.

The transistor-level implementations of the building blocks are very simple current-mode

vector multipliers and current-mode selective adders that process discrete probability

distributions. Basically one transistor is needed to build the pair-wise product of two elements

of discrete probability distributions.

The presented networks follow a bio-inspired approach and therefore avoid many plagues of

traditional analog circuit design such as data-representation overflow, temperature

dependence, linear approximations of non-linearities, component variations, and a tedious

manual design flow. The circuits exploit rather than fight the inherent non-linearities of the

exponential characteristic of both bipolar junction transistors and weakly inverted MOS

transistors. By building large, highly connected networks out of very simple and low-

precision computation nodes, high precision and high processing throughput are reached at

the system level. Due to their simplicity and computational efficiency, analog networks

exhibit a distinct advantage in the speed-power ratio compared to their digital

counterparts. According to experience (still limited), this advantage amounts to at least two

orders of magnitude.



REFERENCES

[1] C. A. Mead, "Neuromorphic electronic systems," in Proc. IEEE, vol. 78, pp. 1629-1636, Oct. 1990.

[2] B. Gilbert, "A precise four-quadrant multiplier with subnanosecond response," IEEE J. Solid-State Circuits, vol. 3, no. 4, pp. 365-373, Dec. 1968.

[3] H. A. Loeliger, M. Helfenstein, F. Lustenberger, and F. Tarköy, "Probability propagation and decoding in analog VLSI," in Proc. IEEE Int. Symp. on Information Theory, Aug. 1998, p. 146.

[4] N. Wiberg, H. A. Loeliger, and R. Kötter, "Codes and iterative decoding on general graphs," in Proc. IEEE Int. Symp. on Information Theory, 17-22 Sep. 1995, p. 468.

[5] M. Frey, H. A. Loeliger, F. Lustenberger, P. Merkli, and P. Strebel, "Analog decoder experiments with subthreshold CMOS soft-gates," in Proc. IEEE Int. Symp. on Circuits and Systems, June 2003, pp. 85-88.

[6] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. on Information Theory, vol. 47, no. 2, Feb. 2001.

[7] J. Hagenauer and M. Winkelhofer, "The analog decoder," in Proc. IEEE Int. Symp. on Information Theory, vol. 4, Aug. 1998, p. 145.

[8] Xiao-An Wang and S. B. Wicker, "An artificial neural net Viterbi decoder," IEEE Trans. on Communications, vol. 44, no. 2, pp. 165-171, Feb. 1996.

[9] H. A. Loeliger, F. Lustenberger, F. Tarköy, and M. Helfenstein, "Decoding in analog VLSI," IEEE Communications Magazine, vol. 37, no. 4, pp. 99-101, April 1999.

[10] G. D. Forney, "Codes on graphs: Normal realizations," in Proc. IEEE Int. Symp. on Information Theory, June 2000, pp. 9-19.

[11] Gunhee Han and Edgar Sanchez-Sinencio, "CMOS transconductance multipliers: A tutorial," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 12, Dec. 1998.

[12] F. Lustenberger, "On the design of analog iterative VLSI decoders," Ph.D. dissertation, ETH, Zurich, Switzerland, 2000.

[13] F. Lustenberger and H. A. Loeliger, "On mismatch errors in analog-VLSI error correcting decoders," in Proc. IEEE Int. Symp. on Circuits and Systems, July 2001, pp. 198-201.
