fpga_notes_april23

Introduction to FPGA DSPedia Notes 1

Transcript of fpga_notes_april23

Page 1: fpga_notes_april23

Introduction to FPGA

DSPedia Notes 1

Page 2: fpga_notes_april23

THIS SLIDE IS BLANK

Page 3: fpga_notes_april23

Introduction: DSP and FPGAs (Slide 1)

• In the last 20 years the majority of DSP applications have been enabled by DSP processors:
  • Texas Instruments, Motorola, Analog Devices
• A number of DSP cores have been available:
  • Oak Core, LSI Logic ZSP, 3DSP
• ASICs (application specific integrated circuits) have been widely used for specific (high volume) DSP applications.
• But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA).

This course is all about the why and how of DSP with FPGAs!

R. Stewart, Dept EEE, University of Strathclyde, 2010

Page 4: fpga_notes_april23

Notes:

DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP).

Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one uses fewer MACs than the other, then clearly the “cheaper” one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in the traditional DSP processor based situation we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.

[Figure: DSP processor circuit board. Voltage Input, Amplifiers/Filters, ADC, DSP Processor (DSP56307), DAC, Amplifiers/Filters, Voltage Output, with a General Purpose Input/Output Bus]

Ver 10.423 R. Stewart, Dept EEE, University of Strathclyde, 2010

Page 5: fpga_notes_april23

The FPGA DSP Evolution (Slide 2)

• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore’s law.
• Late 1990s: FPGAs allow multipliers to be implemented in FPGA logic fabric. A few multipliers per device are possible.
• Early 2000s: FPGA vendors place hardwired multipliers onto the device with clocking speeds of > 100 MHz. Number of multipliers from 4 to > 500.
• Mid 2000s: FPGA vendors place DSP algorithm signal flow graphs (SFGs) onto devices. Full (pipelined) FIR filter SFGs, for example, are available.
• Late 2000s to early 2010s - who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), FFTs, more floating point support. Rest assured more DSP is coming....

Page 6: fpga_notes_april23

Notes:

Technology just keeps moving.

Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax and a faster processor. Of course, wait another quarter and in 6 months it will be improved again - also, the new faster, better, bigger machine is likely to be cheaper too! Such is technology.

DSP for FPGAs is just the same. If you wait another year it’s likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on.

So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?

Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies, you still need to know how it works if you really want to use it.

Page 7: fpga_notes_april23

FPGAs: A “Box” of DSP blocks (Slide 3)

• We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together.
• Of course the resource is finite and the connections available are finite.
• In the days of circuit boards one had to be careful about running busses close together, lengths of wires etc. Similar considerations are required for FPGAs (albeit out of your direct control).
• However, the high level concept is: take the blocks, and build it:

[Figure: a “box” of blocks (“Connectors”, Logic, Arithmetic, Registers and Memory, Clocks, Input/Output) feeding a design flow of Design, Verify, Place and Route]

Page 8: fpga_notes_april23

Notes:

This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place.

Do we actually need an FPGA/IC engineer then?

Do we actually need a DSP engineer?

Yes in both cases, but modern toolsets and design flows are such that it might be the same person.

There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturation etc.)? Do the latency or delays used allow the integrity to be maintained?

For the FPGA, can we clock at a high enough rate? Does the device place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors’ design flows will give different results, some better than others.)

As vendors provide higher level components (like the DSP48 slice from Xilinx, which allows a complete FIR to be implemented), issues such as overflow, numerical integrity and so on are taken care of.

Page 9: fpga_notes_april23

Binary Addition and Multiply (Slide 4)

• The bottom line for DSP is multiplies and adds - and lots of them!
• Adding two N bit numbers will produce up to an N+1 bit number:

    N bits + N bits = N+1 bits

• Multiplying two N bit numbers can produce up to a 2N bit number:

    N bits x N bits = 2N bits

• So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with a 2N+1 bit wordlength.
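The wordlength rules above can be checked numerically. The following is a minimal sketch, assuming unsigned operands and N = 8; the helper name `unsigned_bits` is invented for illustration and is not part of the course material.

```python
def unsigned_bits(value: int) -> int:
    """Number of bits needed to hold a non-negative integer."""
    return max(1, value.bit_length())

N = 8
largest = 2**N - 1                      # largest unsigned N-bit value (255)

# Worst cases: both operands take their largest value.
sum_bits = unsigned_bits(largest + largest)          # N + 1 bits
product_bits = unsigned_bits(largest * largest)      # 2N bits
# Accumulating two worst-case products gives one further bit of growth.
mac_bits = unsigned_bits(largest * largest + largest * largest)  # 2N + 1 bits

print(sum_bits, product_bits, mac_bits)
```

Changing `N` shows the same N+1, 2N and 2N+1 pattern for other wordlengths.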

Page 10: fpga_notes_april23

Notes:

If the wordlength grows beyond the maximum value you can store, we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable.

Within traditional DSP processors this wordlength growth is well known and catered for.

For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any “addition” operation can have 56 bits. For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply by an array of another 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values), then the final result may have a word growth of quite a few bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
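To make the accumulator argument concrete, the sketch below counts how many worst-case 48 bit products fit into a 56 bit accumulator. It is an illustration under stated assumptions (2’s complement operands, every product at its largest possible magnitude); it does not claim to reproduce Motorola’s actual design reasoning, and the function names are hypothetical.

```python
def signed_bits(value: int) -> int:
    """Minimum 2's complement wordlength that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1       # one extra bit for the sign
    return (-value - 1).bit_length() + 1

# Largest-magnitude product of two 24-bit 2's complement numbers.
WORST_PRODUCT = (-2**23) * (-2**23)         # = 2**46

def accumulator_bits(num_products: int) -> int:
    """Bits needed to accumulate `num_products` worst-case products."""
    return signed_bits(num_products * WORST_PRODUCT)

print(signed_bits(WORST_PRODUCT))   # one product
print(accumulator_bits(2))          # sum of two products
print(accumulator_bits(256))        # 8 bits of growth
print(accumulator_bits(512))        # one bit too many for 56
```

Under these assumptions the 8 extra bits allow a few hundred worst-case accumulations before trapping would be needed; real signals rarely sit at the worst case, so the practical margin is larger.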

Page 11: fpga_notes_april23

The “Cost” of Addition (Slide 5)

• A 4 bit addition can be performed using a simple ripple adder:

        A3 A2 A1 A0
    +   B3 B2 B1 B0
        C3 C2 C1 C0     (carries)
    ---------------
     S4 S3 S2 S1 S0

[Figure: four full adder (Σ) cells in a chain from LSB to MSB. Cell i takes inputs Ai and Bi and produces sum Si; the carry in to the LSB cell is ‘0’, carries C0 to C2 ripple to the next cell, and the carry out C3 of the MSB cell forms S4]

• Therefore an N bit addition could be performed in parallel at a cost of N full adders.

Page 12: fpga_notes_april23

Notes:

The simple Full Adder (FA):

Adds two bits + one carry in bit, to produce sum and carry out.

      1011    +11
    + 1101    +13
    ------
     11000    +24

    A  B  Cin | Cout  Sout
    0  0  0   |  0     0
    0  0  1   |  0     1
    0  1  0   |  0     1
    0  1  1   |  1     0
    1  0  0   |  0     1
    1  0  1   |  1     0
    1  1  0   |  1     0
    1  1  1   |  1     1

With C denoting the carry in, Cin:

$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$

$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC$

[Figure: full adder symbol (Σ) with inputs A, B, Cin and outputs Sout, Cout]
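The full adder equations and the ripple adder structure can be sketched in software as follows. This is a behavioural illustration only (bit lists are LSB first, and the helper names are invented here), not a hardware description.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One full adder cell: returns (sum, carry out)."""
    sout = a ^ b ^ cin                        # Sout = A xor B xor C
    cout = (a & b) | (a & cin) | (b & cin)    # Cout = AB + AC + BC
    return sout, cout

def to_bits(value: int, n: int) -> list[int]:
    """LSB-first list of the n low-order bits of value."""
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """N-bit ripple adder built from N full adders; returns N+1 bits."""
    carry = 0                                 # carry in to the LSB cell is '0'
    result = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)    # carry ripples to the next cell
        result.append(s)
    result.append(carry)                      # final carry out is the extra bit
    return result

total = from_bits(ripple_add(to_bits(11, 4), to_bits(13, 4)))
print(total)
```

The example reuses the 1011 + 1101 sum from the notes; the loop makes the cost of N full adders for an N bit addition explicit.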

Page 13: fpga_notes_april23

The “Cost” of Multiply (Slide 6)

• A 4 bit multiply operation requires an array of 16 multiply/add cells:

                   a3 a2 a1 a0
    x              b3 b2 b1 b0
    --------------------------
                   c3 c2 c1 c0
                d3 d2 d1 d0
             e3 e2 e1 e0
    +     f3 f2 f1 f0
    --------------------------
       p7 p6 p5 p4 p3 p2 p1 p0

[Figure: 4 x 4 array of multiply/add cells with inputs a3..a0 and b3..b0, ‘0’ inputs along the array edges, and product outputs p7..p0]

• Therefore an N by N multiply requires $N^2$ cells......

......so for example a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.

Page 14: fpga_notes_april23

Notes:

Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:

[Figure: multiply/add cell with inputs a, b, s, c and outputs aout, bout, sout, cout]

$z = a \cdot b$

$a_{out} = a$

$b_{out} = b$

$s_{out} = (s \oplus z) \oplus c$

$c_{out} = \bar{s}zc + s\bar{z}c + sz\bar{c} + szc$

An 8 bit by 8 bit multiplier would require 8 x 8 = 64 cells.

Partial product example:

            1 0 1 1      11
    x       1 0 0 1     x 9
    ---------------
            1 0 1 1
          0 0 0 0
        0 0 0 0
    + 1 0 1 1
    ---------------
      1 1 0 0 0 1 1      99
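The partial product scheme and the cell count can be sketched as below. This is a simplified behavioural model, assuming unsigned operands: the AND gate of each cell is modelled bit by bit, but the adder chain inside the cells is replaced by ordinary integer addition, so it illustrates the cost and the arithmetic rather than the exact cell wiring. The function name is hypothetical.

```python
def array_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Multiply two unsigned n-bit numbers; returns (product, cells used)."""
    cells = 0
    product = 0
    for i in range(n):                  # one row of cells per bit of b
        b_bit = (b >> i) & 1
        partial = 0
        for j in range(n):              # one cell per bit of a
            a_bit = (a >> j) & 1
            partial |= (a_bit & b_bit) << j   # AND gate: z = a.b
            cells += 1
        product += partial << i         # add the shifted partial product
    return product, cells

product, cells = array_multiply(0b1011, 0b1001, 4)
print(product, cells)
```

Calling `array_multiply` with `n = 8` or `n = 16` shows the cell count growing as the square of the wordlength.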

Page 15: fpga_notes_april23

The Gate Array (GA) (Slide 7)

• Early gate-arrays were simply arrays of NAND gates:

[Figure: array of NAND gates]

• Designs were produced by interconnecting the gates to form combinational and sequential functions.

Page 16: fpga_notes_april23

Notes:

The NAND gate is often called the Universal gate, meaning that it can be used to produce any Boolean logic function.

[Figure: NAND gate implementation of $Z = AB + CD$, with inputs A, B, C, D and output Z]

• Early gate array design flow would be design, simulate/verify, device production and test. Metal layers make simple connections.

Early simulators and netlisters such as HILO (from GenRad) were used.

From GA to FPGA

However, simple gate arrays, although very generic, were used by many different users for similar systems.....

.... for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions.

For a GA, once a layer (or layers) of metal had been laid on a device - that’s it! No changes, no updates, no fixes.

So then we move to field programmable gate arrays. Two key differences between these and gate arrays:

• They can be reprogrammed in the “field”, i.e. the logic specified is changeable.

• They are no longer just composed of NAND gates. Instead they contain a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
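As a small check of the “universal gate” claim, the sketch below builds the two level function Z = AB + CD from three NAND gates and compares it against the AND/OR form for all 16 input combinations. The three-gate structure follows from De Morgan’s theorem; the function names are illustrative.

```python
from itertools import product

def nand(x: int, y: int) -> int:
    """2-input NAND gate on single bits."""
    return 1 - (x & y)

def z_from_nands(a: int, b: int, c: int, d: int) -> int:
    """Z = AB + CD using only NAND gates."""
    return nand(nand(a, b), nand(c, d))

def z_reference(a: int, b: int, c: int, d: int) -> int:
    """The same function written with AND and OR."""
    return (a & b) | (c & d)

matches = all(
    z_from_nands(*inputs) == z_reference(*inputs)
    for inputs in product((0, 1), repeat=4)
)
print(matches)
```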

Page 17: fpga_notes_april23

Generic FPGA Architecture (Logic Fabric) (Slide 8)

• Arrays of gates and higher level logic blocks might be referred to as the logic fabric...

[Figure: grid of logic blocks connected by row interconnects and column interconnects, surrounded on all four sides by I/O blocks]

Page 18: fpga_notes_april23

Notes:

The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc).

A simple logic block might contain the following:

[Figure: logic block containing two logic elements beside the interconnects; each logic element comprises a LUT, carry logic, select/MUX logic, cascade logic and a flip-flop]

Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.

Page 19: fpga_notes_april23

FPGA Architecture (Xilinx DSP Devices) (Slide 9)

• Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!

[Figure: device layout in rows and columns, showing Block RAM, Arithmetic Blocks, Logic Fabric and Input/Output Blocks]

Page 20: fpga_notes_april23

Notes:

One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering.

Block RAMs are also used extensively in DSP. Example uses are for storing filter coefficients, encoding and decoding, and other tasks.

Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look at the CLBs which comprise the logic fabric, and how they are connected together, in further detail.

[Figure: FPGA floorplan. Diagram key: Input / Output Block (IOB); Block RAM; DSP48 / DSP48A / DSP48E; Configurable Logic Block (CLB)]

Page 21: fpga_notes_april23

Example: Xilinx Logic Blocks and Routing (Slide 10)

• Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB).
• Signals travel between CLBs via routing resources.
• Each CLB has an adjacent switch matrix for (most) routing.

[Figure: a CLB containing slices, its adjacent switch matrix, and routing to other CLBs]

NOTE: Only a subset of routing resources is depicted above.

Page 22: fpga_notes_april23

Notes:

The example in the main slide features a typical Xilinx FPGA architecture; the Altera and Lattice architectures are different. Logic units differ in size, composition and name! However, in all cases, their Logic Blocks include both combinational logic and registers, and routing resources are required for connecting blocks together.

Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes:

• To implement a combinatorial logic function
• As Read Only Memory (ROM)
• As Random Access Memory (RAM)
• As shift registers

The register can be used as:

• A flip-flop
• A latch

Over the next few slides, the functionality of LUTs and registers in the above modes will be described.

Page 23: fpga_notes_april23

The Lookup Table (Slide 11)

• When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.

Example logic function:

    Z = B C D + A B C D

[Figure: Karnaugh map of Z over AB and CD, and a Lookup Table block with inputs A, B, C, D and output Z. The complement bars of the expression above were lost in transcription.]

Page 24: fpga_notes_april23

Notes:

The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address.

In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.

    A  B  C  D | Z
    0  0  0  0 | 1
    0  0  0  1 | 1
    0  0  1  0 | 0
    0  0  1  1 | 1
    0  1  0  0 | 1
    0  1  0  1 | 0
    0  1  1  0 | 0
    0  1  1  1 | 0
    1  0  0  0 | 0
    1  0  0  1 | 0
    1  0  1  0 | 0
    1  0  1  1 | 0
    1  1  0  0 | 0
    1  1  0  1 | 0
    1  1  1  0 | 1
    1  1  1  1 | 1
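The ROM view of a LUT can be modelled as a 16-entry table indexed by a 4-bit address. The sketch below uses the sixteen Z values as read from the truth table in these notes (A as MSB, D as LSB); the names `ROM_CONTENTS` and `lut_read` are invented for illustration.

```python
# Sixteen 1-bit values, stored at addresses 0 (ABCD = 0000) to 15 (ABCD = 1111).
ROM_CONTENTS = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

def lut_read(a: int, b: int, c: int, d: int) -> int:
    """Form the 4-bit address from A (MSB) to D (LSB) and look up Z."""
    address = (a << 3) | (b << 2) | (c << 1) | d
    return ROM_CONTENTS[address]

print(lut_read(0, 0, 1, 1))   # address 3
print(lut_read(1, 1, 1, 0))   # address 14
```

Implementing a different 4-input logic function only requires a different `ROM_CONTENTS` list; the lookup mechanism itself does not change.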

Page 25: fpga_notes_april23

LUTs as Distributed RAM (Slide 12)

• LUTs can also be configured as distributed RAM, in either single port or dual port modes.
  • Single port: 1 address for both synchronous write operations and asynchronous read operations.
  • Dual port: 1 address for both synchronous write operations and asynchronous read operations, and 1 address for asynchronous reads only.
• Larger RAMs can be constructed by connecting two or more LUTs together.
• Dual port RAM requires more resources than single port RAM.
  • For example, a 32x1 Single Port RAM (32 addresses, 1 bit data) requires two 4-input LUTs, whereas an equivalent Dual Port RAM requires four 4-input LUTs.

Page 26: fpga_notes_april23

Notes:

The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex-II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.

[Figures: Single Port; Dual Port]

Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.

Page 27: fpga_notes_april23

LUTs as Shift Registers (SRL16s) (Slide 13)

• A final alternative is to use the LUT as a Shift Register of up to 16 bits.
• Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read.
• For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.
• The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.

[Figure: chain of 16 D-type registers, numbered 0 to 15, clocked by CLK, with SHIFT IN at register 0 and SHIFT OUT at register 15; the LUT INPUT (e.g. 1001) selects which register drives D OUT]
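A behavioural sketch of this shift register mode is given below. It is a software model under simplifying assumptions (one bit shifted per call, registers numbered 0 to 15 as in the figure); the class name `SRL16` and its methods are illustrative and are not a vendor API.

```python
class SRL16:
    """Behavioural model of a 16-stage, addressable shift register."""

    def __init__(self) -> None:
        self.regs = [0] * 16             # register 0 is next to SHIFT IN

    def clock(self, shift_in: int) -> int:
        """One clock edge: shift by one place and return the bit that
        leaves register 15 on this edge."""
        shifted_out = self.regs[-1]
        self.regs = [shift_in] + self.regs[:-1]
        return shifted_out

    def read(self, address: int) -> int:
        """Asynchronous read of the register selected by the 4-bit address."""
        return self.regs[address & 0xF]

srl = SRL16()
srl.clock(1)                 # shift a single 1 into register 0
for _ in range(9):
    srl.clock(0)             # nine more clocks move it to register 9

print(srl.read(0b1001))      # address 1001 selects the 10th register
```

Reading with a fixed address therefore gives a programmable delay of between 1 and 16 clock cycles from a single LUT.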

Page 28: fpga_notes_april23

Notes:

As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.

[Figure: Slice 1 and Slice 2, each containing two SRL16s (ports DI, D, MSB) and two flip-flops (FF), chained between Cascadable In and Cascadable Out]

Page 29: fpga_notes_april23

Registers (Slide 14)

• The sequential logic element which follows the lookup table can be configured as either:
  • An edge-triggered D-type flip flop; or
  • A level-sensitive latch
• The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).

[Figure: LUT / Register Pair. LUT Inputs feed the LUT and Carry Logic; either the LUT output or the Bypass Input drives the register (REG), which drives the Output]

Page 30: fpga_notes_april23

Notes:

A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later). A clock signal and control inputs (set, reset, etc.) are also provided.

    D(t) | Q(t+1)
     0   |   0
     1   |   1

When configured as a latch, the control inputs define when data on the D input is “captured” and stored within the register. The Q output thereafter remains unchanged until new data is captured.

Flip flops and registers are discussed in the Digital Logic Review notes chapter.

Page 31: fpga_notes_april23

Resets: Registers and SRL16s (Slide 15)

• Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation.
• For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.

[Figure: Resettable Implementation. Eight D-type flip-flops with reset (R), two per slice across Slice 1 to Slice 4, with INPUT, CLOCK, RESET and OUTPUT signals]

[Figure: Non Resettable Implementation. A single SRL16 in Slice 1, with INPUT, CLOCK and OUTPUT signals]

Page 32: fpga_notes_april23

Notes:

We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.

[Figure: a resettable D-type flip-flop in Slice 1 (INPUT, CLOCK, RESET) feeding an SRL16 in Slice 1/2, which drives OUTPUT. Hold the RESET signal high for 8 clock cycles to reset the shift register...]
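The “flush” style reset described above can be simulated with a small model. The sketch assumes a synchronous reset on the first (flip-flop) stage and a 7-stage non-resettable SRL behind it; the class and method names are hypothetical.

```python
class FlushResetShiftRegister:
    """8-bit shift register: one resettable flip-flop followed by an
    SRL-style chain that has no reset input of its own."""

    def __init__(self, length: int = 8) -> None:
        self.ff = 0                       # resettable slice register
        self.srl = [0] * (length - 1)     # non-resettable stages

    def clock(self, data_in: int, reset: int = 0) -> None:
        # The SRL captures the flip-flop's output from before this edge.
        self.srl = [self.ff] + self.srl[:-1]
        # Synchronous reset (assumed): the flip-flop loads 0 while reset is high.
        self.ff = 0 if reset else data_in

    def state(self) -> list[int]:
        return [self.ff] + self.srl

reg = FlushResetShiftRegister()
for _ in range(8):
    reg.clock(1)                          # fill all eight stages with 1s

for _ in range(7):
    reg.clock(1, reset=1)                 # seven reset cycles are not enough
after_seven = reg.state()

reg.clock(1, reset=1)                     # the eighth cycle clears the last stage
after_eight = reg.state()
print(after_seven, after_eight)
```

The trade-off is time for area: the reset takes 8 clock cycles to complete, but only one stage needs a dedicated resettable register.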

Page 33: fpga_notes_april23

FPGA Arithmetic Implementation (Slide 16.1)

• The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design.
• The following key issues are presented in this chapter:
  • Number representations: binary word formats for signed and unsigned integers, 2’s complement, fixed point and floating point;
  • Binary arithmetic, including: overflow and underflow, saturation, truncation and rounding;
  • Structures for arithmetic operations: Addition/Subtraction, Multiplication, Division and Square Root;
  • Complex arithmetic operations;
  • Mapping to Xilinx FPGA hardware... including special resources for high speed arithmetic.

Page 34: fpga_notes_april23

Notes:

This section of the course will introduce the following concepts:

• Integer number representations - unsigned, one’s complement, two’s complement.
• Non-integer number representations - fixed point, floating point.
• Quantisation of signals, truncation, rounding, overflow, underflow and saturation.
• Addition - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition.
• Multiplication - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication.
• Division.
• Square root.
• Complex addition.
• Complex multiplication.

Page 35: fpga_notes_april23

Number Representations (Slide 17)

• DSP, by its very nature, requires quantities to be represented digitally - using a number representation with finite precision.
• This representation must be specified to handle the “real-world” inputs and outputs of the DSP system.
  • Sufficient resolution
  • Large enough dynamic range
• The number representation specified must also be efficient in terms of its implementation in hardware.
  • The hardware implementation cost of an arithmetic operator increases with wordlength.
  • The relationship is not always linear!
• There is a trade-off between cost, and numeric precision and range.

Page 36: fpga_notes_april23

Notes:

The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction by reducing the number of bits for representation is likely to be of significant value.

For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 “cells”. The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many bits, and 15 was not enough - or did they? Probably not! It’s likely that we are using 16 bits because... well, that’s what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 cells = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic.

Therefore it’s important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
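The cost comparison in these notes can be reproduced with a few lines. This is only the first-order approximation stated above (cost taken as the number of cells in an N by N array multiplier); the function name is invented for illustration.

```python
def multiplier_cells(bits: int) -> int:
    """Approximate cost of a parallel N x N multiplier, in cells."""
    return bits * bits

cost_16 = multiplier_cells(16)
cost_9 = multiplier_cells(9)
ratio = cost_9 / cost_16          # relative cost of 9-bit versus 16-bit

print(cost_16, cost_9, round(100 * ratio, 1))
```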

Page 37: fpga_notes_april23

Unsigned Integer Numbers (Slide 18)

• Unsigned integers are used to represent non-negative numbers.
• The weightings of individual bits within a generic n-bit word are:

    bit index:      n-1      n-2      n-3      ...  2    1    0
    bit weighting:  2^(n-1)  2^(n-2)  2^(n-3)  ...  2^2  2^1  2^0
                    (MSB)                                     (LSB)

• The decimal number 82 is “01010010” in unsigned format as shown:

    bit index:              7    6    5    4    3    2    1    0
    bit weighting:          2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
    example binary number:  0    1    0    1    0    0    1    0

    decimal representation:
    0x2^7 + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

• The numerical range of an n-bit unsigned number is: 0 to $2^n - 1$. For example, an 8-bit word can represent all integers from 0 to 255.

Page 38: fpga_notes_april23

Notes:

Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:

    Integer Value   Binary Representation
    0               00000000
    1               00000001
    2               00000010
    3               00000011
    4               00000100
    ...
    64              01000000
    65              01000001
    ...
    131             10000011
    ...
    255             11111111

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two between 0 and 8 - 1 = 7, where 8 is the number of bits in the binary word:

i.e. $2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1$
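The bit weighting rule for unsigned words translates directly into code. The sketch below is illustrative only (the function names are invented here); it evaluates a binary string by summing each bit times its power-of-two weight.

```python
def unsigned_value(bits: str) -> int:
    """Value of an unsigned binary word given MSB first, e.g. '01010010'."""
    n = len(bits)
    # The bit at string position i has index n-1-i and weight 2**(n-1-i).
    return sum(int(bit) * 2 ** (n - 1 - i) for i, bit in enumerate(bits))

def unsigned_range(n: int) -> tuple[int, int]:
    """Smallest and largest values of an n-bit unsigned word."""
    return 0, 2 ** n - 1

print(unsigned_value("01010010"))
print(unsigned_value("10000011"))
print(unsigned_range(8))
```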

Page 39: fpga_notes_april23

2’s Complement (Signed Integers) (Slide 19)

• 2’s Complement caters for positive and negative numbers in the range $-2^{n-1}$ to $+2^{n-1} - 1$, and has only one representation of 0 (zero).
• In 2’s complement, the MSB has a negative weighting:

    bit index:      n-1       n-2      n-3      ...  2    1    0
    bit weighting:  -2^(n-1)  2^(n-2)  2^(n-3)  ...  2^2  2^1  2^0
                    (MSB)                                      (LSB)

• The most negative and most positive numbers are represented by:

    bit index:      n-1  n-2  n-3  ...  2  1  0
    most negative:  1    0    0    ...  0  0  0
    most positive:  0    1    1    ...  1  1  1

Page 40: fpga_notes_april23

Notes:

As examples, we can convert the following two 8-bit 2’s complement words to decimal.

As for the unsigned representation, the decimal number 82 is “01010010” in 2’s complement signed format:

    bit index:              7     6    5    4    3    2    1    0
    bit weighting:          -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
    example binary number:  0     1    0    1    0    0    1    0

    decimal representation:
    0x(-2^7) + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

Meanwhile the decimal number -82 is “10101110” in 2’s complement signed format:

    bit index:              7     6    5    4    3    2    1    0
    bit weighting:          -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
    example binary number:  1     0    1    0    1    1    1    0

    decimal representation:
    1x(-2^7) + 0x2^6 + 1x2^5 + 0x2^4 + 1x2^3 + 1x2^2 + 1x2^1 + 0x2^0 = -128 + 32 + 8 + 4 + 2 = -82
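The negative MSB weighting can be expressed in a few lines. This is a sketch for illustration (the function name is hypothetical); it decodes a 2’s complement word by giving the MSB the weight -2^(n-1) and every other bit its usual positive weight.

```python
def twos_complement_value(bits: str) -> int:
    """Value of a 2's complement word given MSB first, e.g. '10101110'."""
    n = len(bits)
    msb_term = -int(bits[0]) * 2 ** (n - 1)          # negative MSB weighting
    rest = sum(int(bit) * 2 ** (n - 2 - i) for i, bit in enumerate(bits[1:]))
    return msb_term + rest

print(twos_complement_value("01010010"))
print(twos_complement_value("10101110"))
print(twos_complement_value("10000000"), twos_complement_value("01111111"))
```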

Page 41: fpga_notes_april23

2’s Complement Conversion (Slide 20)

• For 2’s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1.
• For example, we have just considered 2’s complement representations of the decimal numbers -82 and +82. They are converted as shown:

    +82 to -82:

        0 1 0 1 0 0 1 0     +82
        1 0 1 0 1 1 0 1     invert all bits
      + 0 0 0 0 0 0 0 1     add 1
        ---------------
        1 0 1 0 1 1 1 0     -82

    -82 to +82:

        1 0 1 0 1 1 1 0     -82
        0 1 0 1 0 0 0 1     invert all bits
      + 0 0 0 0 0 0 0 1     add 1
        ---------------
        0 1 0 1 0 0 1 0     +82

Page 42: fpga_notes_april23

Notes:

Note that when negating positive values, a ninth bit is required to represent negative zero. However, if we simply ignore this ninth bit, the representation for the negative zero becomes identical to the representation for positive zero.

    Positive Numbers            Negative Numbers
    Integer   Binary            Integer   Binary
    0         00000000          0         100000000
    1         00000001          -1        11111111
    2         00000010          -2        11111110
    3         00000011          -3        11111101
    ...                         ...
    125       01111101          -125      10000011
    126       01111110          -126      10000010
    127       01111111          -127      10000001
                                -128      10000000

    (Positive to negative: invert all bits and ADD 1)

Notice from the above that -128 can be represented but +128 cannot.
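The invert-and-add-1 rule, including the two special cases noted above (zero and the most negative value), can be sketched as follows. The function name is invented for illustration; the mask models discarding the extra carry bit of a fixed-width word.

```python
def negate(bits: str) -> str:
    """Negate a 2's complement word: invert all bits, add 1, keep n bits."""
    n = len(bits)
    mask = 2 ** n - 1
    inverted = int(bits, 2) ^ mask          # invert all bits
    result = (inverted + 1) & mask          # add 1, ignoring any carry out
    return format(result, f"0{n}b")

print(negate("01010010"))    # +82 to -82
print(negate("10101110"))    # -82 to +82
print(negate("00000000"))    # zero: the ninth bit is discarded
print(negate("10000000"))    # -128 has no positive counterpart in 8 bits
```

Note that negating -128 returns the same bit pattern, which is the overflow case a hardware negator has to detect or saturate.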

Page 43: fpga_notes_april23

Fixed-point Binary Numbers (Slide 21)

• We can now define what is known as a “fixed-point” number:

    ...a number with a fixed position for the binary point.

• Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits.
• The format of a generic fixed point word, comprising n integer bits and b fractional bits, is:

                    |------- n integer bits -------|   |---- b fractional bits ----|
    bit index:      n-1      n-2      ...  1    0    .  -1    -2    ...  -b+1      -b
    bit weighting:  2^(n-1)  2^(n-2)  ...  2^1  2^0  .  2^-1  2^-2  ...  2^(-b+1)  2^-b
                    (MSB)                     binary point                         (LSB)

• The MSB has -ve weighting for 2’s complement (as for integer words).

Page 44: fpga_notes_april23

Notes:

As examples, we consider the 2’s complement word “11010110” with the binary point in two different places.

Firstly, with the binary point to the left of the third bit from the LSB, i.e. 5 integer bits and 3 fractional bits:

    bit index:       4     3    2    1    0    .  -1    -2    -3
    bit weighting:   -2^4  2^3  2^2  2^1  2^0  .  2^-1  2^-2  2^-3
    binary number:   1     1    0    1    0    .  1     1     0

    decimal representation:
    1x(-2^4) + 1x2^3 + 0x2^2 + 1x2^1 + 0x2^0 + 1x2^-1 + 1x2^-2 + 0x2^-3
    = -16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point to the left of the fifth bit from the LSB, i.e. 3 integer bits and 5 fractional bits:

    bit index:       2     1    0    .  -1    -2    -3    -4    -5
    bit weighting:   -2^2  2^1  2^0  .  2^-1  2^-2  2^-3  2^-4  2^-5
    binary number:   1     1    0    .  1     0     1     1     0

    decimal representation:
    1x(-2^2) + 1x2^1 + 0x2^0 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 1x2^-4 + 0x2^-5
    = -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of $2^2 = 4$, i.e. 4 x -1.3125 = -5.25.
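Because moving the binary point only rescales the word, a fixed-point value can be computed by decoding the bits as a 2’s complement integer and dividing by 2 raised to the number of fractional bits. The following sketch illustrates this with the word from these notes; the function name is hypothetical.

```python
def fixed_point_value(bits: str, frac_bits: int) -> float:
    """Value of a 2's complement fixed-point word with `frac_bits`
    bits to the right of the binary point."""
    n = len(bits)
    raw = int(bits, 2)
    if bits[0] == "1":
        raw -= 2 ** n                 # apply the negative MSB weighting
    return raw / 2 ** frac_bits       # shift the binary point

five_three = fixed_point_value("11010110", 3)   # 5 integer, 3 fractional bits
three_five = fixed_point_value("11010110", 5)   # 3 integer, 5 fractional bits
print(five_three, three_five, five_three / three_five)
```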

Page 45: fpga_notes_april23

Fixed Point Range and Precision (Slide 22)

• As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:

    unsigned:             00000.....0000  to  11111.....1111
    signed (2’s comp.):   10000.....0000  to  01111.....1111

• The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented.
• Numerical range scales according to the binary point position, e.g.:

    1000 0000.  (-128)   to   0111 1111.  (+127)
         x 0.5                     x 0.5
    1000 000.0  (-64)    to   0111 111.1  (+63.5)

• Dynamic range (range / interval) is independent of the binary point position, e.g. (127-(-128))/1 = 255 = (63.5-(-64))/0.5

Page 46: fpga_notes_april23

Notes:

To illustrate this further, let’s consider the very simple case of a 3-bit 2’s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

    binary point position      bit weightings          minimum   maximum   interval
    0 (all integer bits)       -4     +2     +1        -4        +3        1
    1                          -2     +1     +0.5      -2        +1.5      0.5
    2                          -1     +0.5   +0.25     -1        +0.75     0.25
    3 (all fractional bits)    -0.5   +0.25  +0.125    -0.5      +0.375    0.125

Page 47: fpga_notes_april23

Fixed-point Quantisation (Slide 23)

• Consider the fixed point number format:

    n n n . b b b b b    (3 integer bits, 5 fractional bits)

• Numbers between -4 and 3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are $2^8 = 256$ different values.

• Revisiting our sine wave example, using this fixed-point format:

[Figure: quantised sine wave, amplitude axis from -2 to +2.]

• This looks much more accurate. The quantisation error is $\pm \frac{\mathrm{LSB}}{2}$ (where LSB = least significant bit)... so 0.015625 rather than 0.5!
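The error bound of half an LSB can be illustrated with a short sketch. The sine amplitude of 2, the 64 samples per cycle and the function name `quantise` are assumptions made purely for this example.

```python
import math

FRAC_BITS = 5
LSB = 2.0 ** -FRAC_BITS               # 0.03125
MIN_VAL, MAX_VAL = -4.0, 4.0 - LSB    # range of the 3 integer bit, 5 fractional bit format

def quantise(x: float) -> float:
    """Round x to the nearest representable value, saturating at the range limits."""
    q = math.floor(x / LSB + 0.5) * LSB
    return min(max(q, MIN_VAL), MAX_VAL)

# Assumed test signal: one cycle of a sine wave of amplitude 2, 64 samples.
samples = [2.0 * math.sin(2 * math.pi * n / 64) for n in range(64)]
worst = max(abs(quantise(s) - s) for s in samples)
print(worst, LSB / 2)   # the worst-case error should not exceed LSB/2
```

As long as the input stays inside the representable range, the worst-case error depends only on the LSB weighting, not on the signal.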

Page 48: fpga_notes_april23

Notes:

Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number π can be represented as 3.14159265.... and so on. We can quantise or represent π to 4 decimal places as 3.1416. If we use "rounding" here, the error is:

    3.14159265… - 3.1416 = -0.00000735

If we truncate (just chop off the digits below the 4th decimal place) then the error is larger:

    3.14159265… - 3.1415 = 0.00009265

Clearly rounding is the most desirable option for maintaining the best possible accuracy. However, it comes at a cost. The cost is relatively small, but it is not "free".

When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the calculation:

    0.57 x 0.43 = 0.2451

can be rounded to 0.25, or truncated to 0.24. The results are different.

Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
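The rounding and truncation figures above can be reproduced exactly with Python's `decimal` module. This is a small sketch; the helper name `to_places` is an assumption for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

def to_places(value: Decimal, step: str, mode: str) -> Decimal:
    """Quantise `value` to the resolution given by `step`, e.g. "0.01"."""
    return value.quantize(Decimal(step), rounding=mode)

pi = Decimal("3.14159265")
pi_rounded = to_places(pi, "0.0001", ROUND_HALF_UP)    # rounding
pi_truncated = to_places(pi, "0.0001", ROUND_DOWN)     # truncation
print(pi - pi_rounded, pi - pi_truncated)

product = Decimal("0.57") * Decimal("0.43")
print(to_places(product, "0.01", ROUND_HALF_UP), to_places(product, "0.01", ROUND_DOWN))
```

`ROUND_DOWN` truncates towards zero, which is what "chopping off" the lower digits means for these positive values.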

Page 49: fpga_notes_april23

Multiplication & Division via Binary Shifts (Slide 24)

• The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents.

• Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!

    original number:            weightings  -8   4   2   1 . 0.5  0.25
                                bits         0   1   0   1 . 1    1            (decimal 5.75)

    shift right by 2 places:    weightings  -2   1 . 0.5  0.25  0.125  0.0625
                                bits         0   1 . 0    1     1      1       (decimal 1.4375)

    shift left by 1 place:      weightings  -16  8   4   2   1 . 0.5
                                bits         0   1   0   1   1 . 1             (decimal 11.5)

Page 50: fpga_notes_april23

Notes:

Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits - it amounts to the same thing! Ultimately the binary point is conceptual - having no effect on the hardware produced - and it falls to the DSP design tool and/or DSP designer to keep track of it.

Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.

    original number:                      0101.11    (decimal 5.75)
    binary point moved 2 places left:     01.0111    (decimal 1.4375)
    binary point moved 1 place right:     01011.1    (decimal 11.5)

Page 51: fpga_notes_april23

Normalised Fixed Point Numbers (Slide 25)

• Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted.

• Numbers from -1 to $+1 - 2^{-b}$ (i.e. just less than 1) can be represented. Some examples:

    bit weightings:  -1   1/2  1/4  1/8  1/16  1/32  1/64  1/128      decimal
    most -ve:         1    0    0    0    0     0     0     0          -1
    most +ve:         0    1    1    1    1     1     1     1          +0.9921875
                      0    0    0    0    0     0     0     0          0

• Limiting the numeric range to ±1 is advantageous because it makes the arithmetic easier to work with... multiplying two normalised numbers together cannot produce a result greater than 1!

Page 52: fpga_notes_april23

Notes:

The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2's complement are actually the same thing.

Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2's complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2's complement, the same word format would be described as having m+1 integer bits, and n fractional bits.

For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be represented as shown below. In 2's complement, this would be described as having 3 integer bits and 5 fractional bits.

    bit weightings:              -2^2 | 2^1  2^0 | 2^-1  2^-2  2^-3  2^-4  2^-5
    2's complement description:  3 integer bits  | 5 fractional bits
    Q-format description:        1 sign bit | m = 2 integer bits | n = 5 fractional bits

The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to $+1 - 2^{-15}$, and is equivalent to a 16 bit 2's complement representation with the binary point at 15, i.e. 1 integer bit and 15 fractional bits.
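As a hedged illustration of the Q15 format, the sketch below converts between real values and 16 bit Q15 integers. The function names and the choice to saturate out-of-range inputs are assumptions made for this example, not part of the notes.

```python
Q15_SCALE = 1 << 15      # 2**15 = 32768

def float_to_q15(x: float) -> int:
    """Convert x to the nearest 16 bit Q15 integer, saturating at -32768 and +32767."""
    raw = int(round(x * Q15_SCALE))
    return max(-Q15_SCALE, min(Q15_SCALE - 1, raw))

def q15_to_float(raw: int) -> float:
    """Interpret a signed 16 bit integer as a Q15 value."""
    return raw / Q15_SCALE

print(float_to_q15(0.5), float_to_q15(-1.0), float_to_q15(1.0))
print(q15_to_float(32767))   # the largest value, 1 - 2**-15
```

Because +1 itself is not representable, an input of 1.0 saturates to the largest code, 32767.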

Page 53: fpga_notes_april23

Fractional Motivation - Normalisation (Slide 26)

• Working with fractional binary values makes the arithmetic "easier" to work with and to account for wordlength growth.

• As an example take the case of working with a "machine" using 4 digit decimals and a 4 digit arithmetic unit - range -9999 to +9999.

• Multiplying two 4 digit numbers will result in up to 8 significant digits.

    6787 x 4198 = 28491826   -> (scale)  2849.1826   -> (truncate)  2849

If we want to pass this number to a next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000, then truncate.

• Consider normalising to the range -0.9999 to +0.9999.

    0.6787 x 0.4198 = 0.28491826   -> (truncate)  0.2849

Now the procedure for truncating back to 4 digits is much "easier".

Page 54: fpga_notes_april23

Notes:

Of course the two results are exactly identical and the differences are in how we handle the truncate and scale.

However, using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1.

Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems.

Consider 8 bit values in 2's complement. The range is therefore:

    10000000 to 01111111 (-128 to +127)

If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is:

    1.0000000 to 0.1111111 (-1 to 0.9921875, where 127/128 = 0.9921875).

Therefore we apply the same normalising ideas as for decimal for multiplication in binary.

Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000 1101 1010 0100

In binary, normalising the values would be the calculation 0.0100100 x 0.1100001 = 0.00110110100100

which in decimal is equivalent to: 0.28125 x 0.7578125 = 0.213134765625

Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc. you can do this... you will get the same answer and the cost is the same...
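The equivalence of the integer and normalised views can be sketched as follows. The variable names are illustrative assumptions; the numbers are those of the 36 x 97 example.

```python
FRAC_BITS = 7                       # 8 bit word viewed as 1 integer bit, 7 fractional bits

a_int, b_int = 36, 97               # raw 8 bit words 00100100 and 01100001
product_int = a_int * b_int         # 16 bit integer product

a_norm = a_int / 2 ** FRAC_BITS     # same bits read as normalised values
b_norm = b_int / 2 ** FRAC_BITS
product_norm = product_int / 2 ** (2 * FRAC_BITS)   # binary point now 14 places in

# Truncate back to an 8 bit word by dropping the 7 lowest fractional bits.
truncated_int = product_int >> FRAC_BITS
truncated_norm = truncated_int / 2 ** FRAC_BITS

print(product_int, product_norm, truncated_norm)
```

The multiplier hardware is identical in both views; only the bookkeeping of where the binary point sits differs.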

Page 55: fpga_notes_april23

Analogue to Digital Converter (ADC) (Slide 27)

• An ADC is a device that can convert a voltage to a binary number, according to its specific input-output characteristic.

• We can generally assume ADCs operate using two's complement arithmetic.

[Figure: 8 bit ADC clocked at fs, with a voltage input and an 8 bit binary output. The input-output characteristic (binary output against voltage input, -2 to +2 volts) is a staircase with the following marked output codes:

     127   01111111
      96   01100000
      64   01000000
      32   00100000
     -32   11100000
     -64   11000000
     -96   10100000
    -128   10000000 ]

Page 56: fpga_notes_april23

Notes:

Viewing the straight line portion of the device we are tempted to refer to the characteristic as "linear". However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also that the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device "piecewise linear over its normal operating range".

Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications, for example, a defined standard nonlinear quantiser characteristic is often used (A-law and μ-law). Speech signals, for example, have a very wide dynamic range: harsh "oh" and "b" type sounds have a large amplitude, whereas softer sounds such as "sh" have small amplitudes. If a uniform quantisation scheme were used then, although the loud sounds would be represented adequately, the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the μ-law in the USA and Japan. Similarly, the DAC can have a non-linear characteristic.

[Figure: nonlinear quantiser characteristic, binary output against voltage input.]

Page 57: fpga_notes_april23

ADC Sampling "Error" (Slide 28)

• Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels.

• The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:

    $\hat{v}(n) = \mathrm{Quantise}\{ s(n t_s) \}$, for $n = 0, 1, 2, \ldots$

• Hence every sample has a "small" quantisation error.

[Figure: the analogue signal s(t) (voltage against time, -4 to 4) passes through an ADC clocked at fs with sample period ts, giving the quantised samples $\hat{v}(n)$ (binary value against sample n, -4 to 4).]

Page 58: fpga_notes_april23

Notes:

For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:

[Figure: staircase characteristic, binary output against voltage input, with a step size of 1 volt. The output runs from 10000 (-16) at Vmin = -15 volts to 01111 (+15) at Vmax = 15 volts.]

In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.

Page 59: fpga_notes_april23

Quantisation Error (Slide 29)

• If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.

[Figure: staircase characteristic, binary output against voltage input from -Vmax to Vmax, with step size q volts. The output runs from 10000 (-16) to 01111 (+15).]

Page 60: fpga_notes_april23

Notes:

Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise:

[Figure: x -> ADC -> y is equivalent to an adder that forms $y = x + n_q$, where $n_q$ is the quantisation noise.]
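A uniform ADC and the additive noise view can be sketched in a few lines. The function name `adc`, its parameters, and the round-half-up tie rule are assumptions for illustration; the 5 bit, 1 volt step example follows the earlier notes.

```python
import math

def adc(voltage: float, bits: int = 5, q: float = 1.0) -> int:
    """Uniform ADC: nearest level of step q volts, clipped to the 2's complement range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    level = math.floor(voltage / q + 0.5)
    return max(lo, min(hi, level))

x = 1.589998          # true sample value
y = adc(x)            # quantised output level
n_q = y - x           # the additive "noise" that turns x into y
print(y, n_q)
```

Inside the operating range the magnitude of `n_q` never exceeds q/2; outside it the converter clips and the error grows without bound.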

Page 61: fpga_notes_april23

An Example (Slide 30)

• Here is an example using a 3-bit ADC:

[Figure: three panels. (1) Input signal, amplitude/volts (-4 to 4) against time/seconds. (2) ADC characteristic, output (-4 to 3) against input (-4 to 4). (3) Quantisation error, amplitude/volts against time/seconds.]

Page 62: fpga_notes_april23

Notes:

In this case the worst case error is 1/2.

Page 63: fpga_notes_april23

Adding multi-bit numbers (Slide 31)

• The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:

           C3 C2 C1 C0
           A3 A2 A1 A0
         + B3 B2 B1 B0
        S4 S3 S2 S1 S0

[Figure: chain of four full adders (Σ) from LSB to MSB. Stage i takes Ai, Bi and the carry from the previous stage and produces Si and carry Ci. The carry in to the LSB stage is '0', and the final carry C3 forms S4.]

• This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum.

• If we do not allow for an extra bit in the sum, if a carry out of the last adder occurs, an "overflow" will result i.e. the number will be incorrectly represented.

Page 64: fpga_notes_april23

Notes:

The truth table for the full adder is:

    A  B  CIN | S  COUT | A + B + CIN
    0  0  0   | 0  0    | 0 + 0 + 0 = 0
    0  0  1   | 1  0    | 0 + 0 + 1 = 1
    0  1  0   | 1  0    | 0 + 1 + 0 = 1
    0  1  1   | 0  1    | 0 + 1 + 1 = 2
    1  0  0   | 1  0    | 1 + 0 + 0 = 1
    1  0  1   | 0  1    | 1 + 0 + 1 = 2
    1  1  0   | 0  1    | 1 + 1 + 0 = 2
    1  1  1   | 1  1    | 1 + 1 + 1 = 3

Full adder circuitry can therefore be produced with gates:

    $S_{out} = \overline{A}\,\overline{B}C + \overline{A}B\overline{C} + A\overline{B}\,\overline{C} + ABC = A \oplus B \oplus C$

    $C_{out} = \overline{A}BC + A\overline{B}C + AB\overline{C} + ABC = AB + AC + BC = AB + C(A \oplus B)$

[Figure: gate-level full adder with inputs A, B, C and outputs S, COUT.]

The longest propagation delay path in the above full adder is "two gates".
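The full adder equations and the ripple-carry chain can be modelled bit by bit. This is a behavioural sketch only; the function names are assumptions and nothing here reflects a particular hardware description.

```python
def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in                              # S = A xor B xor C
    c_out = (a & b) | (a & c_in) | (b & c_in)     # COUT = AB + AC + BC
    return s, c_out

def ripple_carry_add(a: int, b: int, width: int = 4) -> int:
    """Add two unsigned `width`-bit numbers with a chain of full adders.

    The final carry becomes the extra (width + 1)th bit of the sum.
    """
    carry, total = 0, 0                           # carry in to the LSB stage is 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)

print(ripple_carry_add(0b1101, 0b1011))           # 13 + 11 needs the fifth sum bit
```

Dropping the final `carry << width` term would model an adder with no extra sum bit, which is where overflow appears.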

Page 65: fpga_notes_april23

Subtraction (Slide 32)

• Subtraction is very readily derived from addition. Remember 2's complement? All we need to do to get a negative number is invert the bits and add 1.

• Then if we add these numbers, we'll get a subtraction D = A + (-B):

[Figure: 4 bit 2's complement subtractor built from four full adders (Σ). Each B input (B3..B0) is inverted before entering its adder, the LSB carry in is set to '1' (the "add 1"), the outputs are D3..D0, and the final carry C4 is discarded.]

• The addition of the 1 is done by setting the LSB carry in to be 1.

Page 66: fpga_notes_april23

Notes:

Note for 4 bit positive numbers (i.e. NOT 2's complement) the range is from 0 to 15. For 2's complement the numerical range is from -8 to 7.

Addition/Subtraction (using 2's complement representation)

Sometimes we need a combined adder/subtractor with the ability to switch between modes.

This can be achieved quite easily:

[Figure: four full adders (Σ) with inputs A3..A0. Each B input (B3..B0) passes through a 2-input MUX (inputs 0 and 1) controlled by K, which selects either B or its inverse. K also sets the carry in of the LSB stage.]

For: A + B, K = 0

For: A - B, K = 1

This structure will be seen again in the Division/Square Root slides!
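The combined adder/subtractor can be modelled behaviourally as below. This is a sketch under the assumption that K both selects the inverted B bits and drives the LSB carry in, as the structure above suggests; the function name `add_sub` is illustrative.

```python
def add_sub(a: int, b: int, k: int, width: int = 4) -> int:
    """Return A + B when k = 0 and A - B when k = 1, as a `width`-bit 2's complement value."""
    mask = (1 << width) - 1
    b_selected = (b ^ (mask if k else 0)) & mask   # the MUXes pass B or inverted B
    total = (a & mask) + b_selected + k            # K is also the LSB carry in
    result = total & mask                          # discard the final carry
    if result >> (width - 1):                      # MSB set: negative value
        result -= 1 << width
    return result

print(add_sub(3, 2, 0), add_sub(3, 5, 1))
```

Results outside -8 to 7 wrap around, since the 4 bit word has no room for them.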

Page 67: fpga_notes_april23

Wraparound Overflow & 2's Complement (Slide 33)

• With 2's complement, overflow will occur when the result to be produced lies outside the range of the number of bits.

• Therefore for an 8 bit example the range is -128 to +127 (or in binary this is $10000000_2$ to $01111111_2$):

       -65        10111111
    + -112      + 10010000
      -177       101001111

    With an 8 bit result we lose the 9th bit and the result "wraps around" to a positive value: 01001111 = 79

       100        01100100
    +   37      + 00100101
       137        10001001

    With an 8 bit result the result "wraps around" to a negative value: 10001001 = -119

• One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits.

• Using Xilinx System Generator we can specifically check for overflow in every addition calculation.

Page 68: fpga_notes_april23

Notes:

Recall from previously that overflow detect circuitry is relatively easy to design. We just need to keep an eye on the MSB bits (indicating whether a number is +ve or -ve):

    Adding +ve and -ve will never overflow!
    Adding +ve and +ve: if a -ve result then overflow
    Adding -ve and -ve: if a +ve result then overflow

For example:

      10110111
    + 01111111
    1 00110110        (-73) + 127 = 54

    Discard final 9th bit carry. No overflow.

      01100100
    + 01000000
      10100100        100 + 64 = 164

    MSB bit indicates -ve result! Overflow.
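The wraparound behaviour and the sign-based overflow rule can be sketched together. The function names are assumptions for this example; Python integers are unbounded, so the 8 bit word is modelled with a mask.

```python
def to_signed(value: int, width: int = 8) -> int:
    """Interpret the low `width` bits of value as a 2's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def add_wrap(a: int, b: int, width: int = 8) -> tuple[int, bool]:
    """Add with wraparound and flag overflow using only the sign (MSB) information."""
    result = to_signed(a + b, width)               # bits above `width` are lost
    same_sign_inputs = (a < 0) == (b < 0)
    overflow = same_sign_inputs and ((result < 0) != (a < 0))
    return result, overflow

print(add_wrap(-65, -112))   # wraps to a positive value
print(add_wrap(100, 37))     # wraps to a negative value
print(add_wrap(-73, 127))    # mixed signs can never overflow
```

The overflow flag needs only three sign bits (two operands and the result), which is why the detect circuitry is cheap.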

Page 69: fpga_notes_april23

Saturation (Slide 34)

• One method to try to address overflow is to use a saturate technique.

• Taking the previous overflowing examples from Slide 33:

       -65        10111111
    + -112      + 10010000
      -177       101001111      Detect overflow and saturate:  10000000 = -128

       100        01100100
    +   37      + 00100101
       137        10001001      Detect overflow and saturate:  01111111 = 127

• When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case either -128 or +127).

• Therefore, for every addition that is explicitly done with an adder block in Xilinx System Generator, the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.

• Implementing saturate will require "detect overflow & select" circuitry.
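A saturating add can be sketched as below. This is a behavioural model only: because Python integers do not overflow, the "detect overflow" step is done by comparing the exact sum with the range limits rather than by inspecting sign bits, and the function name is an assumption.

```python
def add_saturate(a: int, b: int, width: int = 8) -> int:
    """Add two signed values and clamp the result to the `width`-bit 2's complement range."""
    lo = -(1 << (width - 1))        # -128 for 8 bits
    hi = (1 << (width - 1)) - 1     # +127 for 8 bits
    return max(lo, min(hi, a + b))  # "select" the limit when the sum is out of range

print(add_saturate(-65, -112), add_saturate(100, 37), add_saturate(-73, 127))
```

Unlike wraparound, the saturated result always keeps the correct sign, at the cost of the extra detect-and-select logic.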

Page 70: fpga_notes_april23

Notes:

Once again, design of a DSP system might be done such that overflow never happens, as the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result.

Generally, for some later FPGAs such as the Virtex-4, using some of the DSP48 blocks gives adders with 48 bits of precision, so the likelihood of, say, working with 16 bit values that grow to over 48 bits is low. Hence overflow has been "designed out". Of course not all applications will use these devices, and using general slice logic and attempting to make adders as small as possible would mean care must be taken, and where appropriate to efficient design, saturate might be included.

Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares algorithm (LMS), the filter weights w are updated according to the equation:

    $w(k) = w(k-1) + 2\mu e(k) x(k)$

Without further concern over the meaning of this equation, we can see that the term $2\mu e(k) x(k)$ is added to the weights at time epoch $k-1$ to generate the new weights at time epoch $k$.

If the operations that form $2\mu e(k) x(k)$ were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability.

With saturation however, if the term $2\mu e(k) x(k)$ gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.

Page 71: fpga_notes_april23

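Xilinx Virtex-II Pro Addition (Slide 35)

• The used components of the slice are outlined below:

[Figure: upper half of a Virtex-II Pro slice with the components used for one bit of addition outlined. Inputs A, B and Cin; the LUT output is D; outputs are Sout and Cout. The equivalent block is a full adder (Σ) with inputs A, B, Cin and outputs Sout, Cout.]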

Page 72: fpga_notes_april23

Notes:

Picture of Xilinx Virtex-II Pro slice (upper half) taken from "Virtex-II Pro Platform FPGAs: Introduction and Overview", DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

LookUp Table (LUT) programmed with two-input XOR function:

    G1 (A)  G2 (B)  D
    0       0       0
    0       1       1
    1       0       1
    1       1       0

    G1 (A)  G2 (B)  Cin  D  Sout  Cout
    0       0       0    0  0     0
    0       0       1    0  1     0
    0       1       0    1  1     0
    0       1       1    1  0     1
    1       0       0    1  1     0
    1       0       1    1  0     1
    1       1       0    0  0     1
    1       1       1    0  1     1

$S_{out} = C_{in} \oplus D$, $C_{out} = \overline{D}A + C_{in}D$ (multiplex operation). The result is the FULL ADDER implementation.
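The LUT, XOR gate and carry multiplexer arrangement can be checked against ordinary addition for all eight input combinations. This is a behavioural sketch of the table above, not a model of the actual slice primitives; the names are assumptions.

```python
from itertools import product

# LUT contents for the two-input XOR function: D = G1 xor G2
XOR_LUT = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def slice_full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Full adder built from a LUT (XOR), an XOR gate and a 2-input carry multiplexer."""
    d = XOR_LUT[(a, b)]            # look-up table output
    s_out = c_in ^ d               # Sout = Cin xor D
    c_out = c_in if d else a       # multiplexer: Cin when D = 1, otherwise A
    return s_out, c_out

for a, b, c_in in product((0, 1), repeat=3):
    s_out, c_out = slice_full_adder(a, b, c_in)
    assert 2 * c_out + s_out == a + b + c_in   # matches ordinary addition
```

When D = 0 the inputs A and B are equal, so passing A to the carry output is the same as passing B.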

Page 73: fpga_notes_april23

Xilinx Virtex-II Pro Slice Main Components (Slide 36)

• A (very) high level diagram of the main logic components on one slice:

[Figure: one slice, split into an upper and a lower half. Each half has inputs feeding a 4 input LUT (also usable as RAM or shift register), a MULTAND gate, an ORCY gate, an XORG gate, several multiplexers (Mux) and a D-type flip-flop (FF), which drive the outputs.]

Page 74: fpga_notes_april23

Notes:

Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:

• One D-type Flip Flop

• One 4 input Look-Up-Table (LUT) (can be configured as shift register or simple as RAM/memory)

• One XOR gate

• One AND gate

• One OR gate

• A few 2 input MUX (multiplexors) to route signals

• Clock inputs

"Small" FPGAs will have just a few hundred (100's) slices;

"Large" FPGAs will have many tens of thousands (10000's) of slices (and other components!)

Page 75: fpga_notes_april23

Xilinx Virtex-II Pro 4 bit Adder (Slide 37)

• To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices.

• The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top-half's CIN. Hence we can get two bits of addition per standard Xilinx slice.

• To produce a 4 bit adder, we cascade with another slice.

[Figure: 2 bit addition uses 1 slice: two full adders (FA) with inputs A0, B0 and A1, B1, carry in '0', carry C0 between them, outputs S0, S1 and carry out C1. 4 bit addition uses 2 slices: the second slice adds A2, B2 and A3, B3, giving S2, S3 and the final carry C3 = S4.]

Page 76: fpga_notes_april23

Notes:

Note the importance of the LUT (look up table) in the Xilinx slice. When configured as a LUT, any four input Boolean equation can be implemented.

For example take the equation

    $Y = ABC + \overline{A}\,\overline{B}\,\overline{C}\,\overline{D}$

The truth table for this equation is:

    ABCD  Y
    0000  1
    0001  0
    0010  0
    0011  0
    0100  0
    0101  0
    0110  0
    0111  0
    1000  0
    1001  0
    1010  0
    1011  0
    1100  0
    1101  0
    1110  1
    1111  1

To implement this function, simply store the values of Y in the Slice LUT, and then address the LUT with values of ABCD to get the output.

Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course if the equation has only 3 variables then we can also implement it and just set one input constant.)

[Figure: 4 input LUT (also usable as RAM or shift register) with inputs A, B, C, D and output Y.]
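The idea of storing a truth table and addressing it with ABCD can be sketched as follows. The helper names are assumptions; the example function is chosen so that Y is 1 for ABCD = 0000, 1110 and 1111, matching the truth table.

```python
def build_lut(func) -> list[int]:
    """Store func(A, B, C, D) for all 16 input combinations; the address is ABCD, A = MSB."""
    table = []
    for address in range(16):
        a, b, c, d = ((address >> shift) & 1 for shift in (3, 2, 1, 0))
        table.append(int(bool(func(a, b, c, d))))
    return table

def y(a, b, c, d):
    # Y = ABC + (not A)(not B)(not C)(not D)
    return (a and b and c) or not (a or b or c or d)

lut = build_lut(y)

def lookup(a: int, b: int, c: int, d: int) -> int:
    """Evaluate the function by addressing the stored table, with no logic gates involved."""
    return lut[(a << 3) | (b << 2) | (c << 1) | d]

print(lut)
```

Changing `y` to any other four-variable function changes only the 16 stored bits, which is exactly what reprogramming a LUT means.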

Page 77: fpga_notes_april23

Multiplication in binary (Slide 38)

• Multiplying in binary follows the same form as in decimal:

                11010110     A7…A0
          x     00101101     B7…B0
                --------
                11010110     B0 = 1
               00000000      B1 = 0
              11010110       B2 = 1
             11010110        B3 = 1
            00000000         B4 = 0
           11010110          B5 = 1
          00000000           B6 = 0
         00000000            B7 = 0
        ----------------
        0010010110011110     P15…P0

• Note that the product P is composed purely of selecting, shifting and adding A. The ith column of B indicates whether or not a shifted version of A is to be selected in the ith row of the sum.

• So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.

Page 78: fpga_notes_april23

Notes:

Multiplication in decimal

Starting with an example in decimal:

       214
    x   45
      1070
    + 8560
      9630

Note that we do 214 × 5 = 1070 and then add to it the result of 214 × 4 = 856 left-shifted by one column.

For each additional column in the second operand, we shift the multiplication of that column with the first operand by another place.

          zzz
    x    aaaa
         bbbb
    +   cccc0
    +  dddd00
    + eeee000   etc...
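The select, shift and add procedure for unsigned binary operands can be written directly as a loop. This is a minimal sketch; the function name and the default 8 bit width are assumptions.

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply: add a copy of `a` shifted left by i for every set bit i of `b`."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:          # bit i of B selects the shifted copy of A
            product += a << i     # the shift places it in the correct column
    return product

p = shift_add_multiply(0b11010110, 0b00101101)
print(p, format(p, "016b"))
```

Each loop iteration corresponds to one row of partial products in the long multiplication layout.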

Page 79: fpga_notes_april23

2's complement Multiplication (Slide 39)

• For one negative and one positive operand just remember to sign extend the negative operand.

                11010110     (-42)
        1111111111010110     sign extended
      x         00101101     (45)
        ----------------
        1111111111010110     B0 = 1
        0000000000000000     B1 = 0
        1111111101011000     B2 = 1
        1111111010110000     B3 = 1
        0000000000000000     B4 = 0
        1111101011000000     B5 = 1
        0000000000000000     B6 = 0
        0000000000000000     B7 = 0
        ----------------
        1111100010011110     (-1890)

Page 80: fpga_notes_april23

Notes:

2's complement multiplication (II)

For both operands negative, subtract the last partial product.

We use the trick of negating (inverting and adding 1) the last partial product and adding it, rather than subtracting.

                11010110     (-42)
        1111111111010110     sign extended
      x         10101101     (-83)
        ----------------
        1111111111010110     B0 = 1
        0000000000000000     B1 = 0
        1111111101011000     B2 = 1
        1111111010110000     B3 = 1
        0000000000000000     B4 = 0
        1111101011000000     B5 = 1
        0000000000000000     B6 = 0
      - 1110101100000000     B7 = 1: last partial product is negative
      = + 0001010100000000   (two's complement form)
        ----------------
        0000110110011110     (3486)

Of course, if both operands are positive, just use the unsigned technique!

The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
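Both signed cases can be handled by one routine that sign extends the first operand and treats the MSB of the second operand as having negative weight. This is an illustrative sketch; the function name and the masking approach are assumptions, not a description of any particular hardware multiplier.

```python
def twos_complement_multiply(a: int, b: int, width: int = 8) -> int:
    """Signed multiply of two `width`-bit values using shifted partial products."""
    out_width = 2 * width
    mask = (1 << out_width) - 1
    a_ext = a & mask                        # A sign extended to the output width
    b_bits = b & ((1 << width) - 1)         # raw bit pattern of B
    acc = 0
    for i in range(width):
        if (b_bits >> i) & 1:
            partial = (a_ext << i) & mask
            if i == width - 1:              # MSB of B has negative weight:
                partial = (-partial) & mask # negate (invert and add 1), then add
            acc = (acc + partial) & mask
    if acc >> (out_width - 1):              # convert the bit pattern back to a signed value
        acc -= 1 << out_width
    return acc

print(twos_complement_multiply(-42, 45), twos_complement_multiply(-42, -83))
```

When B is positive its MSB is 0, so the negated partial product is never selected and the routine reduces to the sign-extension method.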

Page 81: fpga_notes_april23

Fixed Point multiplication (Slide 40)

• Fixed point multiplication is no more awkward than integer multiplication:

               11010.110                 26.75
      x        00101.101            x    5.625
      ------------------            ----------
               11.010110               0.13375
              000.000000               0.53500
             1101.011000              16.05000
            11010.110000             133.75000
           000000.000000            ----------
          1101011.000000             150.46875
         00000000.000000
        000000000.000000
      ------------------
       0010010110.011110

• Again we just need to remember to interpret the position of the binary point correctly.

Page 82: fpga_notes_april23


Page 83: fpga_notes_april23

Multiplier Implementation Options (Slide 41)

• Distributed multipliers

• Constant multipliers

• Using the logic fabric (LUTs)

• Using block RAM

• Shift-and-add “multipliers”

• High speed embedded multipliers

• 18 x 18 bit multipliers

• High speed integrated arithmetic slices (DSP48s)

• Multiply, accumulate

• Add, multiply, accumulate

Page 84: fpga_notes_april23

Notes:

Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration.

The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices.

In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and "shift-and-add" multipliers which sum the outputs from binary shift operations.

The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.

Page 85: fpga_notes_april23

Distributed Multipliers (Slide 42)

• This figure shows a four-bit multiplication:

[Figure: 4 x 4 array of multiplier cells with inputs a3..a0 and b3..b0, zeros fed in at the edges, and outputs p7..p0. Each cell contains an AND gate and a full adder (FA), with ports a/aout, b/bout, s/sout and c/cout.]

    Example:
          1101        13
        x 1011      x 11
          1101
         1101
        0000
       1101
      10001111       143

• The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the operands to the right.

• Note that this structure does not work for signed two's complement!

Page 86: fpga_notes_april23

Notes:

Note the function of the simple AND gate.

The operation of multiplying 1's and 0's is the same as ANDing 1's and 0's:

    A  B  Z
    0  0  0
    0  1  0
    1  0  0
    1  1  1

Z = A x B (where x = multiply) or in Boolean algebra Z = A and B = AB

Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.

[Figure: one row of four multiplier cells (AND gate plus full adder, FA) with inputs a3..a0, b0 and x3..x0, a carry in of 0, and outputs y4..y0.]

    y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0

Page 87: fpga_notes_april23

Distributed Multiplier Cell (Slide 43)

• This shows the top half of a slice, which implements one multiplier cell.

NOTE: This implementation features a Virtex-II Pro FPGA.

[Figure: upper half of a slice with inputs A, B, S and Cin, LUT output D, and outputs Sout and Cout. The equivalent cell is an AND gate on A and B feeding a full adder (FA) together with S and Cin.]

Page 88: fpga_notes_april23

Notes:

Picture of Xilinx Virtex-II Pro slice (upper half) taken from "Virtex-II Pro Platform FPGAs: Introduction and Overview", DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com

LUT implements the XOR of two ANDs:

[Figure: LUT with inputs G1 (B), G2 (A) and G3 (S), and output D.]

The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):

    $S_{out} = C_{IN} \oplus D$, $C_{OUT} = \overline{D}AB + C_{IN}D$

This structure will perform one cell of the multiplier (see the next slide...).

Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.

Page 89: fpga_notes_april23

ROM-based Multipliers (Slide 44)

• Just as logical functions such as XOR can be stored in a LUT as shown for addition, we can use storage-based methods to do other operations.

• By using a ROM, we can store the result of every possible multiplication of two operands.

• The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P:

[Figure: ROM-based multiplier with $2^8 = 256$ 8-bit addresses and 8-bit data, running from address 0000 0000 to 1111 1111. In the example, A (4 bits) = 1010 (decimal -6) and B (4 bits) = 0011 (decimal 3) form the address A:B = 1010 0011, and the data (product) stored there is P (8 bits) = 1110 1110 (decimal -18).]
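A ROM-based multiplier for 4 bit 2's complement operands can be sketched by precomputing all 256 products. The names `ROM` and `rom_multiply` are assumptions for this illustration.

```python
def to_signed(value: int, width: int) -> int:
    """Interpret a `width`-bit pattern as a 2's complement number."""
    return value - (1 << width) if value >> (width - 1) else value

WIDTH = 4
# ROM contents: one 8 bit product for each of the 2**8 concatenated addresses A:B.
ROM = []
for address in range(1 << (2 * WIDTH)):
    a = to_signed(address >> WIDTH, WIDTH)                # upper 4 address bits
    b = to_signed(address & ((1 << WIDTH) - 1), WIDTH)    # lower 4 address bits
    ROM.append((a * b) & 0xFF)                            # store as an 8 bit pattern

def rom_multiply(a_bits: int, b_bits: int) -> int:
    """Concatenate the two 4 bit operands into an address and read the stored product."""
    return ROM[(a_bits << WIDTH) | b_bits]

p = rom_multiply(0b1010, 0b0011)                          # -6 x 3
print(format(p, "08b"), to_signed(p, 8))
```

At run time no arithmetic is performed at all; the multiplication is reduced to a single memory read.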

Page 90: fpga_notes_april23

Notes:

There is one serious problem with this technique: as the operand size grows, the ROM size grows exponentially. For two N bit input operands, there are $2^{2N}$ possible results, and hence the ROM has $2^{2N}$ entries. The output result is $2N$ bits long, and in total $2N \times 2^{2N}$ bits of storage are required.

For example, with 8 bit operands (a fairly reasonable size), 1 Mbit of storage is required - a large quantity. For bigger operands, e.g. 16 bits, a huge quantity of storage is required. 16 bit operands require 128 Gbits of storage and hence a ROM-based multiplier is clearly not a realistic implementation choice!

Input Wordlength   Output Wordlength   No. of ROM entries            Total ROM Storage
(N)                (2N)                (2^2N)                        (2N x 2^2N)
 4                  8                  2^8  = 256                    2 Kbits
 6                 12                  2^12 = 4,096                  48 Kbits
 8                 16                  2^16 = 65,536                 1 Mbit
10                 20                  2^20 = 1,048,576              20 Mbits
12                 24                  2^24 = 16,777,216             384 Mbits
14                 28                  2^28 = 268,435,456            7 Gbits
16                 32                  2^32 = 4,294,967,296          128 Gbits
18                 36                  2^36 = 68,719,476,736         2.25 Tbits
20                 40                  2^40 = 1,099,511,627,776      40 Tbits
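The storage figures follow directly from the $2N \times 2^{2N}$ formula. A short sketch (the function name `rom_storage_bits` is illustrative) that regenerates the table in raw bits:

```python
def rom_storage_bits(n):
    """Total ROM storage in bits for an n x n bit multiplier: 2N x 2^(2N)."""
    entries = 1 << (2 * n)       # one entry per possible address A:B
    return 2 * n * entries       # each entry holds a 2N-bit product


for n in range(4, 22, 2):
    print(n, 2 * n, 1 << (2 * n), rom_storage_bits(n))
```

With binary prefixes (1 Kbit = 1024 bits), N = 8 gives exactly 1 Mbit and N = 16 gives exactly 128 Gbits, matching the notes.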

Page 91: fpga_notes_april23

Input Wordlength and ROM Addresses 45

• Consider a ROM multiplier with 8-bit inputs: 65,536 16-bit locations are required to store all possible outputs... so 1Mbit of storage is needed!

[Figure: ROM-based multiplier with 8 bit inputs. A = 0110 1011 (decimal 107) and B = 0100 0011 (decimal 67) are concatenated into the 16 bit address A:B = 0110 1011 0100 0011 (address 27,459). The ROM has $2^{16} = 65,536$ 16-bit addresses (0000 0000 0000 0000 to 1111 1111 1111 1111) holding 16-bit data, and the addressed entry is the 16 bit product P = 0001 1100 0000 0001 (decimal 7,169).]

Page 92: fpga_notes_april23

Notes:

For example, if the B input was the constant value 75, the possible input words would be composed of 256 possible combinations of the upper 8-bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below. The result is that only 256 of the 65,536 memory locations are actually accessed.

[Figure: 16 bit address A:B with A = ? (???? ????) and B = 75 (0100 1011).]

addresses (decimal):

  0 x 2^8 + 75 = 75
  1 x 2^8 + 75 = 331
  2 x 2^8 + 75 = 587
  3 x 2^8 + 75 = 843
  ...
  253 x 2^8 + 75 = 64,843
  254 x 2^8 + 75 = 65,099
  255 x 2^8 + 75 = 65,355

Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 8 bits + 8 bits). The total memory required is thus 256 x 16 = 4 kbits.

However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the largest magnitude product generated by the multiplication of B with any 8-bit input A will be:

$-2^{8-1} \times 10 = -1280$

-1280 can be represented with 12 bits, which represents a further saving of 4 bits of storage x 256 memory locations = 1 kbit.

The total storage requirement for this example constant coefficient multiplier would therefore be 3 kbits... significantly smaller than the 1 Mbit needed for a multiplier with a 16-bit product where both 8-bit operands are unknown!

Page 93: fpga_notes_april23

ROM-based Constant Multipliers 46

• ROM-based multipliers with a constant input require fewer addresses.

• The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:

$-2^{2N-1} \le \text{result} \le 2^{2N-1} - 1$

• The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly...

• Additional optimisations allow cost to be reduced further.

[Figure: A = ? (8 bit signed number, maximum absolute value = -128) multiplied by the constant B = -83 (8 bit representation, min.). The maximum product is 10,624, so at most a 15-bit representation is required for the signed result P rather than 16 bits: 1 bit saved!]
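The wordlength saving for a given constant can be worked out mechanically. The sketch below assumes a signed two's complement input of `n_in` bits and a signed stored product; the function names are illustrative only.

```python
def min_signed_bits(value):
    """Smallest two's complement wordlength that can hold value."""
    bits = 1
    while not -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1:
        bits += 1
    return bits


def constant_rom_size(constant, n_in=8):
    """Return (entries, output wordlength, total bits) for a constant multiplier ROM."""
    lo, hi = -(1 << (n_in - 1)), (1 << (n_in - 1)) - 1
    extremes = [lo * constant, hi * constant]        # the product range ends here
    width = max(min_signed_bits(p) for p in extremes)
    entries = 1 << n_in                              # one entry per value of the variable input
    return entries, width, entries * width


print(constant_rom_size(10))    # (256, 12, 3072)
print(constant_rom_size(-83))   # (256, 15, 3840)
```

For B = 10 this gives 256 x 12 = 3,072 bits (3 kbits), and for B = -83 it gives the 15-bit output wordlength quoted on the slide.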

Page 94: fpga_notes_april23

Notes:

Constant multipliers can be implemented using the LUTs within the logic fabric (“distributed ROM”), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed.

In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.

Page 95: fpga_notes_april23

Multiplication by Shift and Add 47

• Multiplication by a power-of-2 can be achieved simply by shifting the number to the right or left by the appropriate number of binary places.

• Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.

[Figure: worked examples.
  x16:      -31 x 2^4 = -496 (100001 shifted left by 4 places gives 1000010000)
  x0.25:    0.625 x 2^-2 = 0.15625 (shift right by 2 places)
  x9:       (21 x 2^3) + 21 = 189 (010101000 + 010101 = 010111101)
  x1.3125:  (1 x 2^-4) + 1 + (1 x 2^-2) = 1.3125]

Page 96: fpga_notes_april23

Notes:

Shift operations are effectively “free” in terms of logic resources, as they are implemented only using routing. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplications by other numbers can be achieved by summing partial products of power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shifts and add operations. The “closer” the desired multiplication is to a power-of-two, i.e. the fewer partial products that are required, the fewer adders are required, and hence the lower the cost of the multiplier.

This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved using the configuration of the hardware.

The technique can be particularly powerful when applied to parallel multiplications of the same input. The partial product terms common to several multiplications can be shared and thus the overall effort reduced. Transpose form filters are very suitable for optimisation in this way.

Taking the above simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations.

[Figure: x24 and x9 calculated separately (x24 = x16 + x8 using shifts of 4 and 3 places, x9 = x8 + x1 using a shift of 3 places), compared with the combined version in which the x8 shift is shared - fewer partial products.]
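The decomposition into shifts and adds, and the sharing of the x8 term, can be sketched as follows. This is a software illustration only, restricted to non-negative integer constants; the function names are hypothetical.

```python
def shift_add_multiply(x, constant):
    """Multiply x by a non-negative integer constant using only shifts and adds."""
    result, shift = 0, 0
    while constant:
        if constant & 1:
            result += x << shift   # one partial product per set bit of the constant
        constant >>= 1
        shift += 1
    return result


def times9_and_times24(x):
    """Two concurrent constant multiplications that share the x8 partial product."""
    x8 = x << 3                    # shared term, computed once
    x16 = x << 4
    return x8 + x, x16 + x8        # (x9, x24): two adders in total


print(shift_add_multiply(-31, 16))   # -496
print(times9_and_times24(21))        # (189, 504)
```

In hardware the shifts are just wiring, so the cost of each multiplication is the number of adders, which is one less than the number of partial products.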

Page 97: fpga_notes_april23

Embedded multipliers 48

• The Xilinx Virtex-II and Virtex-II Pro series were the first to provide “on-chip” multipliers in the early 2000s.

• These are in hardware on the FPGA ASIC, not actually in the user FPGA “slice-logic-area”. They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.

• A and B are 18-bit input operands, and P is the 36-bit product, i.e. $P = A \times B$.

• Depending upon the actual FPGA, between 12 and more than 2000 (Virtex 6 top of range) of these dedicated multipliers are available.

[Figure: 18x18 bit multiplier with inputs A and B and output P.]

Page 98: fpga_notes_april23

Notes:

Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.

Information on dedicated multipliers taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.

[Figure: device floorplan with the Block RAM, slices and 18x18 multipliers labelled.]

Page 99: fpga_notes_april23

Embedded Multiplier Efficiency 49

• It can be easy to utilise on chip embedded multipliers inefficiently through choice of wordlengths...

• When using multipliers in System Generator....be careful

[Figure: three cases. 18 x 18 bit inputs with a 36 bit output: 1 embedded multiplier, 100% utilised (SysGen will use 1 embedded mult). 4 x 4 bit inputs with an 8 bit output: 1 embedded multiplier, ~5% utilised. 19 x 19 bit inputs with a 38 bit output: SysGen will use 4 embedded mults.]

Page 100: fpga_notes_april23

Notes:

If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation.

The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible.

It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, which would leave the embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted.

Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!

[Figure: 19 x 19 bit multiplier with a 38 bit output, split into four partial multiplications: 18 x 18, 18 x 1, 1 x 18 and 1 x 1.]
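The utilisation figures can be estimated with a simple tiling model. This is a rough sketch that assumes each operand is split into 18 bit pieces exactly as in the 19 x 19 example; a real tool may partition signed operands differently, and the function names are illustrative.

```python
import math


def embedded_mults(a_bits, b_bits, size=18):
    """Number of size x size multipliers needed under a simple tiling model."""
    return math.ceil(a_bits / size) * math.ceil(b_bits / size)


def utilisation(a_bits, b_bits, size=18):
    """Fraction of the allocated multiplier capacity that is actually used."""
    used = embedded_mults(a_bits, b_bits, size)
    return (a_bits * b_bits) / (used * size * size)


for a_bits, b_bits in [(18, 18), (4, 4), (19, 19)]:
    print(a_bits, b_bits, embedded_mults(a_bits, b_bits),
          f"{utilisation(a_bits, b_bits):.1%}")
```

Under this model a 4 x 4 multiply uses 16 of the 324 available bit products (about 5%), and a 19 x 19 multiply needs four embedded multipliers.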

Page 101: fpga_notes_april23

High Speed Arithmetic Slices (DSP48s) 50

• As much DSP involves the Multiply-Accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4).

• These feature an 18 x 18 bit multiplier followed by a 48 bit accumulator.

• Like the embedded multipliers, these are low power and fast.

• The ability to cascade slices together also means that whole filters can be constructed without having to use any logic slices.

[Figure: Virtex-4 DSP48. 18 x 18 bit multiplier with a 36 bit product, followed by a 48 bit adder/accumulator.]

Page 102: fpga_notes_april23

Notes:

The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E.

The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.

[Figure: Virtex-5 DSP48E. 25 x 18 bit multiplier with a 43 bit product, followed by a 48 bit adder.]

Page 103: fpga_notes_april23

DSP48s with Pre-Adders 51

• The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier.

• This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.

[Figure: Spartan-3A DSP (DSP48A) and Spartan-6 (DSP48A1). An 18 bit pre-adder feeds an 18 x 18 bit multiplier with a 36 bit product, followed by a 48 bit adder.]
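The saving for symmetric filters can be illustrated in software. The sketch below assumes an even-length FIR filter whose coefficients satisfy h[k] = h[L-1-k]; the function name and the sample values are made up for illustration.

```python
def symmetric_fir_output(window, half_coeffs):
    """One output of an even-length symmetric FIR filter using pre-addition.

    window holds the L most recent samples; half_coeffs holds the first L/2
    coefficients, the rest being their mirror image.
    """
    taps = len(window)
    if taps != 2 * len(half_coeffs):
        raise ValueError("window must be twice as long as half_coeffs")
    acc = 0
    for k, h in enumerate(half_coeffs):
        pre_sum = window[k] + window[taps - 1 - k]   # pre-adder: samples sharing a coefficient
        acc += h * pre_sum                           # one multiply per coefficient pair
    return acc


window = [5, -1, 4, 2, 0, 7]
half = [1, 2, 3]                                     # full filter: [1, 2, 3, 3, 2, 1]
direct = sum(h * x for h, x in zip(half + half[::-1], window))
print(symmetric_fir_output(window, half), direct)    # 28 28
```

Six taps are computed with three multiplications instead of six, which is the reduction the pre-adder makes possible inside a single DSP slice.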

Page 104: fpga_notes_april23

Notes:

The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as it can be clocked at 600MHz, and the largest chips have 2000+ of them!

[Figure: Virtex-6 DSP48E1. A 25 bit pre-adder feeds a 25 x 18 bit multiplier with a 43 bit product, followed by a 48 bit adder.]

Page 105: fpga_notes_april23

Division (i) 52

• Divisions are sometimes required in DSP, although not very often.

• 6 bit non-restoring division array:

• Note that each cell can perform either addition or subtraction as shown in an earlier slide: either Sin + Bin or Sin - Bin can be selected.

[Figure: 6 bit non-restoring division array computing Q = B / A from the inputs b5...b0 and a5...a0 and producing the quotient bits q5...q0. Each cell is a full adder (FA) with ports sin, sout, bin, bout, Bin, Bout, cin and cout.]

Page 106: fpga_notes_april23

Notes:

A direct method of computing division exists. This “paper and pencil” method may look familiar as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example into the structure shown on the slide.

Example: B = 01011 (11), A = 01101 (13) ⇒ -A = 10011. Compute Q = B / A.

  R0 = B   01011
  -A       10011
  R1       11110    carry 0, q4 = 0

  2.R1     11100
  +A       01101
  R2       01001    carry 1, q3 = 1

  2.R2     10010
  -A       10011
  R3       00101    carry 1, q2 = 1

  2.R3     01010
  -A       10011
  R4       11101    carry 0, q1 = 0

  2.R4     11010
  +A       01101
  R5       00111    carry 1, q0 = 1

Q = B / A = 01101 x 2^-4 = 0.8125
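The worked example can be reproduced with a few lines of code. This is a sketch of the carry-based non-restoring scheme described in the notes, assuming positive operands with B < 2A that fit in `bits` - 1 bits; the function name is illustrative.

```python
def nonrestoring_divide(b, a, bits=5):
    """Return the quotient bits of B / A as an integer with bits - 1 fractional bits."""
    mask = (1 << bits) - 1
    minus_a = (-a) & mask                # two's complement of the divisor
    remainder = b & mask
    subtract = True                      # the first stage always computes B - A
    quotient = 0
    for _ in range(bits):
        total = remainder + (minus_a if subtract else a)
        carry = (total >> bits) & 1      # carry out of the adder is the quotient bit
        quotient = (quotient << 1) | carry
        subtract = carry == 1            # q = 1: subtract next, q = 0: add next
        remainder = (total << 1) & mask  # doubled partial remainder for the next stage
    return quotient


q = nonrestoring_divide(11, 13)
print(f"{q:05b}", q / 2 ** 4)            # 01101 0.8125
```

Each loop iteration corresponds to one row of the array on the slide, and the quotient bits appear MSB first.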

Page 107: fpga_notes_april23

Division (ii) 53

• There is an alternative way to compute division using another paper and pencil technique.

[Figure: binary long division of 01011.0000 by the divisor 01101, giving the quotient 00000.1101, shown alongside a VHDL design block diagram with the signals divisor_in and remdsh1.]

Page 108: fpga_notes_april23


Page 109: fpga_notes_april23

The Problem With Division 54

• An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction!

• This has implications for the rest of the system.

• It is unlikely that the quotient can be passed on to the next stage until all the bits are computed - hence slowing down the system!

• Also, an N by N array has another problem - ripple through adders.

• Note that we must wait for N full adder delays before the next row can begin its calculations.

• Unlike multiplication there is no way around this, and as a result division is always slower than multiply even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!

Page 110: fpga_notes_april23

Notes:

By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin.

Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.

[Figure: the top two rows of a 4 x 4 division array (inputs b3...b0 and a3...a0, quotient bits q3 and q2) and of a multiplication array (inputs a3...a0, b0 and b1, product bits p0 and p1), with each cell numbered in the order in which it can start. FA is full adder.]

Page 111: fpga_notes_april23

Pipelining The Division Array 55

• The division array shown earlier can be pipelined to increase throughput.

[Figure: 6 bit division array computing Q = B / A (inputs b5...b0 and a5...a0, quotient bits q5...q0) with pipeline delays inserted between cells, so that successive sets of operands can enter the array.]

Page 112: fpga_notes_april23

Notes:

To increase the throughput, the critical path can be broken down by implementing pipeline delays at appropriate points. If pipelining is not used, the delay (critical path) from new data arriving to registering the full quotient is $N^2$ full adders. This delay limits the maximum rate at which new data can enter the array. However, by pipelining the array, the critical path is broken down to just N full adders and thus the rate at which new data can arrive is increased dramatically.

[Figure: two 4 x 4 division arrays (inputs b3...b0 and a3...a0, quotient bits q3...q0), one without and one with pipeline delays.]

Without pipelining the critical path is through $N^2$ full adders.

With pipelining the critical path is only N full adders.

The longest path from register to register is the Critical Path.
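The effect on throughput can be put into numbers. The sketch below assumes a hypothetical full adder delay of 1 ns and ignores routing and register overheads; the function name is illustrative.

```python
def max_input_rate_mhz(n, fa_delay_ns, pipelined):
    """Maximum rate at which new operands can enter an N x N division array."""
    adders_in_path = n if pipelined else n * n    # critical path in full adder delays
    critical_path_ns = adders_in_path * fa_delay_ns
    return 1000.0 / critical_path_ns              # 1 / ns = GHz, so x1000 gives MHz


slow = max_input_rate_mhz(6, 1.0, pipelined=False)
fast = max_input_rate_mhz(6, 1.0, pipelined=True)
print(f"{slow:.1f} MHz -> {fast:.1f} MHz (x{fast / slow:.0f})")
```

Under these assumptions the input rate of an N x N array improves by a factor of N, at the cost of the pipeline registers and extra latency.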

Page 113: fpga_notes_april23

Square Root (i) 56

• 6 bit non-restoring square root array.

• The square root is found (with divides) in DSP in algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.

[Figure: non-restoring square root array computing $B = \sqrt{A}$. The input bits are taken in pairs (a7 a6, a5 a4, a3 a2, a1 a0) and the result bits are labelled b5...b0. Each cell is a full adder (FA) with ports sin, sout, bin, bout, Bin, Bout, cin and cout.]

Page 114: fpga_notes_april23

Notes:

Looking carefully at the non-restoring square root array, we can note that this array is essentially “half” of the division array! If the division array above is cut diagonally from the left we can see the cells that are needed for the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide using half of the hardware!

[Figure: the non-restoring square root array, and a worked binary example in which the partial remainders R1...R4 are formed by shifting in successive pairs of input bits and the result bits are taken from the carries: b3 = 1, b2 = 1, b1 = 0, b0 = 1.]
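The bit-pair structure of the array can be seen in a software square root. Note that the sketch below is the simpler restoring (compare and subtract) digit-by-digit method, not the non-restoring array from the slide, and the function name is illustrative.

```python
import math


def sqrt_bit_pairs(a, n_pairs=4):
    """Integer square root of a 2 * n_pairs bit number, one result bit per input pair."""
    remainder, root = 0, 0
    for i in reversed(range(n_pairs)):
        pair = (a >> (2 * i)) & 0b11           # next two input bits, MSB pair first
        remainder = (remainder << 2) | pair
        trial = (root << 2) | 1                # candidate to subtract: root followed by 01
        root <<= 1
        if remainder >= trial:                 # trial fits, so this result bit is 1
            remainder -= trial
            root |= 1
    return root, remainder


print(sqrt_bit_pairs(169))                     # (13, 0), and 13 is 1101 in binary
assert all(sqrt_bit_pairs(a)[0] == math.isqrt(a) for a in range(256))
```

As in the array, each stage consumes one pair of input bits and produces one result bit, MSB first, and the value being subtracted grows by one bit per stage, which is why only about half of the division array is needed.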

Page 115: fpga_notes_april23

Square Root and Divide - Pythagoras! 57

• The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations.

• For these techniques we often find equations of the form:

$\cos\theta = \dfrac{x}{\sqrt{x^2 + y^2}}$ and $\sin\theta = \dfrac{y}{\sqrt{x^2 + y^2}}$

• So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is “simpler” than multiply!)

• There are a number of iterative techniques that can be used to calculate square root. (However these routines invariably require multiplies and divides and do not converge in a fixed time.)

• There seems to be some misinformation out there about square roots: for FPGA implementation square roots are easier and cheaper to implement than divides....!
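The operation count can be made concrete with a floating point sketch of the rotation parameters. This is illustrative only: an FPGA design would use fixed point arithmetic, and the function name is hypothetical.

```python
import math


def givens_cos_sin(x, y):
    """Return (cos, sin) of the rotation that zeroes y against x."""
    norm = math.sqrt(x * x + y * y)    # two squares, one add, one square root
    return x / norm, y / norm          # one divide for each of cos and sin


c, s = givens_cos_sin(3.0, 4.0)
print(c, s)                            # approximately 0.6 and 0.8
rotated_x = c * 3.0 + s * 4.0          # equals the vector length, 5
rotated_y = -s * 3.0 + c * 4.0         # the element being zeroed
```

Applying the rotation moves all of the energy into the first element, which is how QR algorithms eliminate entries one at a time.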

Page 116: fpga_notes_april23
