Introduction to FPGA
DSPedia Notes 1
R. Stewart, Dept EEE, University of Strathclyde, 2010

Slide 1: Introduction: DSP and FPGAs
• In the last 20 years the majority of DSP applications have been enabled by DSP processors:
  • Texas Instruments, Motorola, Analog Devices
• A number of DSP cores have been available:
  • Oak Core, LSI Logic ZSP, 3DSP
• ASICs (application specific integrated circuits) have been widely used for specific (high volume) DSP applications.
• But the most recent technology platform for high speed DSP applications is the Field Programmable Gate Array (FPGA).
This course is all about the why and how of DSP with FPGAs!
Notes (Ver 10.423, R. Stewart, Dept EEE, University of Strathclyde, 2010):
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that most algorithms that are used for different applications employ digital filters, adaptive filters, Fourier transforms and so on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in DSP).
Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when comparing two algorithms, if they both perform the same job but one with fewer MACs than the other, then clearly the “cheaper” one would be the best choice. However this implies some assumptions. One is that the required MACs are the same - but surely a multiply is a multiply! Well, yes: in traditional DSP processor based situations we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
[Figure: circuit board built around a DSP processor (DSP56307). Labels: Voltage Input, Amplifiers/Filters, ADC, DSP Processor, DAC, Voltage Output, General Purpose Input/Output Bus.]

Slide 2: The FPGA DSP Evolution
• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore’s law.
• Late 1990s: FPGAs allow multipliers to be implemented in FPGA logic fabric. A few multipliers per device are possible.
• Early 2000s: FPGA vendors place hardwired multipliers onto the device with clocking speeds of > 100 MHz. Number of multipliers from 4 to > 500.
• Mid 2000s: FPGA vendors place DSP algorithm signal flow graphs (SFGs) onto devices. Full (pipelined) FIR filter SFGs, for example, are available.
• Late 2000s to early 2010s - who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), FFTs, more floating point support. Rest assured more DSP is coming....
Notes:
Technology just keeps moving.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get the new model with integrated WiFi or WiMax and a faster processor. Of course, wait another quarter and in 6 months it will be improved again - also, the new faster, better, bigger machine is likely to be cheaper! Such is technology.
DSP for FPGAs is just the same. If you wait another year it is likely the vendors will bring out prepackaged algorithms for precisely what you want to do. And they will be easier to work with - higher level design tools, design wizards and so on.
So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?
Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like all technologies, you still need to know how it works if you really want to use it.

Slide 3: FPGAs: A “Box” of DSP blocks
• We might be tempted to think of the latest FPGAs as repositories of DSP components just waiting to be connected together.
• Of course the resource is finite and the connections available are finite.
• In the days of circuit boards one had to be careful about running busses close together, lengths of wires etc. Similar considerations are required for FPGAs (albeit out of your direct control).
• However, the high level concept is: take the blocks, & build it:
[Figure: a “box” of FPGA resources: “Connectors”, Logic, Arithmetic, Registers and Memory, Clocks, Input/Output; with the flow Design, Verify, Place and Route.]
Notes:
This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the algorithm is in place.
Do we actually need an FPGA/IC engineer then?
Do we actually need a DSP engineer?
Yes in both cases, but modern toolsets and design flows are such that it might be the same person.
There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows, saturates etc)? Do the latency or delays used allow the integrity to be maintained?
For the FPGA, can we clock at a high enough rate? Does the device place and route? What device do we need, and how efficient is the implementation? (Just like compilers, different vendors’ design flows will give different results, some better than others.)
As vendors provide higher level components (like the DSP48 slice from Xilinx which allows a complete FIR to be implemented) then issues such as overflow, numerical integrity and so on are taken care of.

Slide 4: Binary Addition and Multiply
• The bottom line for DSP is multiplies and adds - and lots of them!
• Adding two N bit numbers will produce up to an N+1 bit number:
    N bits + N bits = N+1 bits
• Multiplying two N bit numbers can produce up to a 2N bit number:
    N bits x N bits = 2N bits
• So with a MAC (multiply and accumulate/add) of two N bit numbers we could, in the worst case, end up with a 2N+1 bit wordlength.
Notes:
If the wordlength grows beyond the maximum value you can store, we clearly have the situation of numerical overflow, which is a non-linear operation and not desirable.
Within traditional DSP processors this wordlength growth is well known and catered for.
For example, the Motorola 56000 series is so called because it has a 56 bit accumulator, i.e. the largest result of any “addition” operation can have 56 bits. For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply by an array of another 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two 48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values), then the final result may have a word growth of quite a few bits. So one must assume that Motorola had a good look at this, and realised it was fairly unlikely that the result of adding these 48 bit products together would ever be larger than 56 bits - so 56 bits was chosen. (Of course if you did have a problem that grew beyond 56 bits you would have to put special trapping into the code to catch this.)
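The word growth described in these notes can be checked numerically. The sketch below is illustrative only: the function name `accumulator_bits` is hypothetical, and it assumes unsigned operands at their largest value as the worst case, which simplifies the signed arithmetic a real DSP processor would use.

```python
def accumulator_bits(n_bits: int, num_products: int) -> int:
    """Bits needed to hold the sum of `num_products` products of two
    unsigned `n_bits`-wide operands, evaluated for the worst case."""
    largest_operand = (1 << n_bits) - 1          # all bits set
    worst_case_sum = num_products * largest_operand * largest_operand
    return worst_case_sum.bit_length()


if __name__ == "__main__":
    # 24 bit operands, as in the 56000-series example above.
    for count in (1, 2, 16, 256, 257):
        print(count, "products need", accumulator_bits(24, count), "bits")
```

Under these assumptions a single product needs 48 bits, two products need 49, and 256 worst-case products still fit in 56 bits, so the 8 extra accumulator bits allow for a few hundred accumulations before overflow becomes possible.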

Slide 5: The “Cost” of Addition
• A 4 bit addition can be performed using a simple ripple adder:
[Figure: 4-bit ripple adder built from four full adders (Σ). Inputs A3..A0 and B3..B0, sum outputs S4..S0 (S4 is the final carry out, MSB), internal carries C3..C0, with carry in ‘0’ at the LSB stage.]
• Therefore an N bit addition could be performed in parallel at a cost of N full adders.
Notes:
The simple Full Adder (FA):
Adds two bits + one carry in bit, to produce sum and carry out.

    1011    (+11)
  + 1101    (+13)
   11000    (+24)

  A B Cin | Cout Sout
  0 0 0   | 0    0
  0 0 1   | 0    1
  0 1 0   | 0    1
  0 1 1   | 1    0
  1 0 0   | 0    1
  1 0 1   | 1    0
  1 1 0   | 1    0
  1 1 1   | 1    1

$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$
$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC$
(Here C denotes the carry in, Cin.)
[Figure: full adder symbol (Σ) with inputs A, B and Cin, and outputs Sout and Cout.]
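The full adder equations and the ripple adder of the main slide can be sketched in software. This is a behavioural model under stated assumptions, not a hardware description: the names `full_adder` and `ripple_add` are hypothetical, and operands are taken to be unsigned.

```python
def full_adder(a: int, b: int, c_in: int):
    """One-bit full adder. Returns (c_out, s_out)."""
    s_out = a ^ b ^ c_in                          # Sout = A xor B xor Cin
    c_out = (a & b) | (a & c_in) | (b & c_in)     # Cout = AB + AC + BC
    return c_out, s_out


def ripple_add(a: int, b: int, n_bits: int) -> int:
    """Add two unsigned n_bits-wide integers using a chain of n_bits full adders."""
    carry = 0            # carry in of the LSB stage is '0'
    result = 0
    for i in range(n_bits):
        carry, s = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    # The final carry out becomes bit N of the (N+1)-bit result.
    return result | (carry << n_bits)


if __name__ == "__main__":
    print(bin(ripple_add(0b1011, 0b1101, 4)))     # 11 + 13
```

Each loop iteration stands for one full adder, so an N bit addition uses N of them, and the carry has to ripple through every stage before the MSB is valid.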

Slide 6: The “Cost” of Multiply
• A 4 bit multiply operation requires an array of 16 multiply/add cells:
[Figure: 4 x 4 array of multiply/add cells with inputs a3..a0 and b3..b0, ‘0’ inputs along the array edges, and product outputs p7..p0.]

              a3 a2 a1 a0
          x   b3 b2 b1 b0
              c3 c2 c1 c0
           d3 d2 d1 d0
        e3 e2 e1 e0
   + f3 f2 f1 f0
  p7 p6 p5 p4 p3 p2 p1 p0

• Therefore an N by N multiply requires $N^2$ cells......
......so for example a 16 bit multiply is nominally 4 times more expensive to perform than an 8 bit multiply.
Notes:
Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:
[Figure: one multiplier cell with inputs a, b, s, c and outputs aout, bout, sout, cout.]
$z = a \cdot b$
$a_{out} = a$
$b_{out} = b$
$s_{out} = (s \oplus z) \oplus c$
$c_{out} = \bar{s}zc + s\bar{z}c + sz\bar{c} + szc$
An 8 bit by 8 bit multiplier would require 8 x 8 = 64 cells.
Partial Product example:

     1011      11
   x 1001     x 9
     1011
    0000
   0000
+ 1011
  1100011      99
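The partial product example can be reproduced with a short behavioural sketch. The function name `array_multiply` is hypothetical and the operands are assumed unsigned; the code sums the rows of the array rather than modelling each cell.

```python
def array_multiply(a: int, b: int, n_bits: int):
    """Multiply two unsigned n_bits-wide integers by summing partial products.

    Returns (product, partial_products, cell_count).
    """
    partial_products = []
    for row in range(n_bits):
        b_bit = (b >> row) & 1
        # One row of the array: every bit of a is ANDed with one bit of b.
        partial = a if b_bit else 0
        # Each row is shifted one place further left than the row above.
        partial_products.append(partial << row)
    product = sum(partial_products)
    cell_count = n_bits * n_bits      # one cell per (a bit, b bit) pair
    return product, partial_products, cell_count


if __name__ == "__main__":
    product, rows, cells = array_multiply(0b1011, 0b1001, 4)
    print(product, [bin(r) for r in rows], cells)
```

For 1011 x 1001 the two non-zero rows are 1011 and 1011000, which sum to 1100011 (99), and the 4 bit array uses 16 cells.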

Slide 7: The Gate Array (GA)
• Early gate-arrays were simply arrays of NAND gates:
[Figure: array of NAND gates.]
• Designs were produced by interconnecting the gates to form combinational and sequential functions.
Notes:
The NAND gate is often called the Universal gate, meaning that it can be used to produce any Boolean logic function.
[Figure: NAND gate implementation of $Z = AB + CD$ with inputs A, B, C, D and output Z.]
• Early gate array design flow would be design, simulate/verify, device production and test. Metal layers make simple connections.
Early simulators and netlisters such as HILO (from GenRad) were used.
From GA to FPGA
However, simple gate arrays, although very generic, were used by many different users for similar systems.....
..... for example to implement two level logic functions, flip-flops and registers, and perhaps addition and subtraction functions.
For a GA, once a layer(s) of metal had been laid on a device - that’s it! No changes, no updates, no fixes.
So then we move to field programmable gate arrays. Two key differences between these and gate arrays:
• They can be reprogrammed in the “field”, i.e. the logic specified is changeable.
• They are no longer just composed of NAND gates, but of a carefully balanced selection of multi-input logic, flip-flops, multiplexors and memory.
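The universality of the NAND gate can be illustrated with the two level function Z = AB + CD from these notes. The sketch below assumes the usual three-NAND arrangement (one NAND per product term, one NAND combining them); the function names are hypothetical.

```python
from itertools import product


def nand(x: int, y: int) -> int:
    """Two-input NAND on single bits."""
    return 1 - (x & y)


def z_from_nands(a: int, b: int, c: int, d: int) -> int:
    """Z = AB + CD built only from NAND gates (by De Morgan's theorem)."""
    return nand(nand(a, b), nand(c, d))


def z_reference(a: int, b: int, c: int, d: int) -> int:
    """The same function written directly with AND and OR."""
    return (a & b) | (c & d)


def matches_for_all_inputs() -> bool:
    """Compare both forms over all 16 input combinations."""
    return all(
        z_from_nands(*bits) == z_reference(*bits)
        for bits in product((0, 1), repeat=4)
    )


if __name__ == "__main__":
    print(matches_for_all_inputs())
```

The exhaustive comparison is practical here because a 4-input function has only 16 input combinations.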

Slide 8: Generic FPGA Architecture (Logic Fabric)
• Arrays of gates and higher level logic blocks might be referred to as the logic fabric...
[Figure: grid of logic blocks linked by row interconnects and column interconnects, surrounded by I/O blocks.]
Notes:
The logic block in this generic FPGA contains a few logic elements. Different manufacturers will include different elements (and use different terms for logic block, e.g. slices etc).
A simple logic block might contain the following:
[Figure: logic block containing logic elements, each made up of a LUT, Carry Logic, Select MUX Logic and a Cascade / FLIPFLOP stage, connected to the interconnects.]
Of course the actual contents of a logic element will vary from manufacturer to manufacturer and device to device.

Slide 9: FPGA Architecture (Xilinx DSP Devices)
• Looking more specifically at recent Xilinx FPGAs, we also find block RAMs and dedicated arithmetic blocks... both very useful for DSP!
[Figure: device floorplan arranged in rows and columns, showing Block RAM, Arithmetic Blocks, Logic Fabric and Input/Output Blocks.]
Notes:
One of the major features of more recent, DSP-targeted Xilinx FPGAs is the provision of dedicated arithmetic blocks, which offer lower power and higher clock frequency operation than the logic fabric (i.e. the array of CLBs). These can be configured to perform a number of different computations, and are especially suited to the Multiply Accumulate (MAC) operations prevalent in digital filtering.
Block RAMs are also used extensively in DSP. Example uses are for storing filter coefficients, encoding and decoding, and other tasks.
Despite the inclusion of these additional resources, the logic fabric still forms the majority of the FPGA. We will now look at the CLBs which comprise the logic fabric, and how they are connected together, in further detail.
[Figure: FPGA floorplan. Diagram Key: Input / Output Block (IOB); Block RAM; DSP48 / DSP48A / DSP48E; Configurable Logic Block (CLB).]

Slide 10: Example: Xilinx Logic Blocks and Routing
• Xilinx FPGA logic fabric comprises Configurable Logic Blocks (CLBs), which are groups of SLICEs (e.g. 2 or 4 SLICEs per CLB).
• Signals travel between CLBs via routing resources.
• Each CLB has an adjacent switch matrix for (most) routing.
[Figure: a CLB containing Slices, its Switch Matrix, and connections to other CLBs.]
NOTE: Only a subset of routing resources is depicted above.
Notes:
The example in the main slide features a typical Xilinx FPGA architecture, and the Altera and Lattice architectures are different. Logic units differ in size, composition and name! However, in all cases, their Logic Blocks include both combinational logic and registers, and routing resources are required for connecting blocks together.
Continuing with the Xilinx example, the combinational blocks are termed Lookup Tables (LUTs), and in most devices have 4 inputs (some of the more recent devices have 6-input LUTs). These LUTs can be utilised in four modes:
• To implement a combinatorial logic function
• As Read Only Memory (ROM)
• As Random Access Memory (RAM)
• As shift registers
The register can be used as:
• A flip-flop
• A latch
Over the next few slides, the functionality of LUTs and registers in the above modes will be described.

Slide 11: The Lookup Table
• When used to implement a logic function, the 4-bit input addresses the LUT to find the correct output, Z, for that combination of A, B, C and D.
Example logic function:
Z = B C D + A B C D
[Figure: Lookup Table with inputs A, B, C, D and output Z, alongside the Karnaugh map of the example function (AB against CD, columns 00 01 11 10). The complement bars over individual literals in the expression above are not legible in this transcript.]
Notes:
The lookup table can also implement a ROM, containing sixteen 1-bit values. Instead of the four inputs representing inputs of a logic function, they can be thought of as a 4-bit address. A 1-bit value is stored within each memory location, and the appropriate output is supplied for any input address.
In this example, A is considered the Most Significant Bit (MSB) and D the Least Significant Bit (LSB), and the output is Z.

  A B C D | Z
  0 0 0 0 | 1
  0 0 0 1 | 1
  0 0 1 0 | 0
  0 0 1 1 | 1
  0 1 0 0 | 1
  0 1 0 1 | 0
  0 1 1 0 | 0
  0 1 1 1 | 0
  1 0 0 0 | 0
  1 0 0 1 | 0
  1 0 1 0 | 0
  1 0 1 1 | 0
  1 1 0 0 | 0
  1 1 0 1 | 0
  1 1 1 0 | 1
  1 1 1 1 | 1
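Reading a LUT as a 16 x 1 ROM amounts to indexing a 16-entry table with the 4-bit address ABCD. The sketch below is an assumption-laden illustration: `ROM_CONTENTS` holds the Z column as read from the truth table in these notes (A as MSB, D as LSB), and the function name `lut_read` is hypothetical.

```python
# Z values for addresses 0000 to 1111, as read from the truth table
# in these notes (an assumption of this sketch).
ROM_CONTENTS = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]


def lut_read(a: int, b: int, c: int, d: int, contents=ROM_CONTENTS) -> int:
    """Return the stored bit at the 4-bit address formed by A (MSB) .. D (LSB)."""
    address = (a << 3) | (b << 2) | (c << 1) | d
    return contents[address]


if __name__ == "__main__":
    for address in range(16):
        a, b, c, d = [(address >> shift) & 1 for shift in (3, 2, 1, 0)]
        print(a, b, c, d, "->", lut_read(a, b, c, d))
```

Implementing a logic function and implementing a ROM are the same operation from the LUT's point of view: only the interpretation of the 16 stored bits differs.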

Slide 12: LUTs as Distributed RAM
• LUTs can also be configured as distributed RAM, in either single port or dual port modes.
  • Single port: 1 address for both synchronous write operations and asynchronous read operations.
  • Dual port: 1 address for both synchronous write operations and asynchronous read operations, and 1 address for asynchronous reads only.
• Larger RAMs can be constructed by connecting two or more LUTs together.
• Dual port RAM requires more resources than single port RAM.
  • For example, a 32x1 Single Port RAM (32 addresses, 1 bit data) requires two 4-input LUTs, whereas an equivalent Dual Port RAM requires four 4-input LUTs.
Notes:
The two diagrams below demonstrate the implementation of 16x1 single and dual port RAMs in the Virtex II Pro device, respectively. Notice that the dual port RAM requires twice as many resources as the single port RAM.
[Figures: Single Port; Dual Port.]
Source: Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, DS083 (v4.7), November 5, 2007.

Slide 13: LUTs as Shift Registers (SRL16s)
• A final alternative is to use the LUT as a Shift Register of up to 16 bits.
• Additional Shift In and Shift Out ports are used, and the 4-bit address is used to define the memory location which is asynchronously read.
• For example, if the LUT input is the word 1001, the output from the 10th register is read, as depicted below.
• The slice register at the output from the LUT can be used to add another 1 clock cycle delay. Using the register also synchronises the read operation.
[Figure: chain of 16 D-type registers numbered 0 to 15, clocked by CLK, with SHIFT IN at register 0 and SHIFT OUT at register 15. The LUT INPUT (e.g. 1001) selects which register drives D OUT.]
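The addressing behaviour of the shift register mode can be modelled in a few lines. This is a behavioural sketch only: the class name `SRL16` and its method names are hypothetical, and stage 0 is assumed to be the stage nearest Shift In, so that address 1001 (decimal 9) selects the 10th stage.

```python
class SRL16:
    """Behavioural model of a 16-stage addressable shift register."""

    def __init__(self):
        self.stages = [0] * 16       # stage 0 is nearest Shift In

    def clock(self, shift_in: int) -> None:
        """One clock edge: every bit moves one stage along."""
        self.stages = [shift_in & 1] + self.stages[:-1]

    def read(self, address: int) -> int:
        """Asynchronous read of the stage selected by the 4-bit address."""
        return self.stages[address & 0xF]

    @property
    def shift_out(self) -> int:
        """Output of the last stage, used for cascading."""
        return self.stages[15]


def delay_demo() -> int:
    """Shift in a single 1 followed by nine 0s, then read address 1001."""
    srl = SRL16()
    srl.clock(1)
    for _ in range(9):
        srl.clock(0)
    return srl.read(0b1001)


if __name__ == "__main__":
    print(delay_demo())
```

In this model, changing the address changes the delay between Shift In and D OUT without any change to the stored data, which is what makes this mode useful for programmable delay lines.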
Notes:
As with the other LUT configurations, larger Shift Registers can be constructed by combining several LUTs. For example, a 64-bit shift register segment can be constructed by combining four 16-bit Shift Registers together, as shown below. The cascadable ports allow further interconnections for larger Shift Registers.
[Figure: Slice 1 and Slice 2, each containing two SRL16s (ports DI, D, MSB) and two flip-flops (FF), chained from Cascadable In to Cascadable Out.]

Slide 14: Registers
• The sequential logic element which follows the lookup table can be configured as either:
  • An edge-triggered D-type flip flop; or
  • A level-sensitive latch
• The input to the register may be the output from the LUT, or alternatively a direct input to the slice (i.e. the LUT is bypassed).
[Figure: LUT / Register Pair. LUT Inputs feed the LUT and Carry Logic, followed by the register (REG) driving the Output; a Bypass Input can feed the register directly.]
Notes:
A D-type flip flop provides a delay of one clock cycle, as confirmed by the truth table below (D(t) is the register input at time t, and Q(t+1) is the output 1 clock cycle later). A clock signal and control inputs (set, reset, etc.) are also provided.

  D(t) | Q(t+1)
  0    | 0
  1    | 1

When configured as a latch, the control inputs define when data on the D input is “captured” and stored within the register. The Q output thereafter remains unchanged until new data is captured.
Flip flops and registers are discussed in the Digital Logic Review notes chapter.

Slide 15: Resets: Registers and SRL16s
• Whereas registers can be reset, SRL16s cannot. Therefore, adding reset capabilities to a design has implications for resource utilisation.
• For example, consider an 8-bit shift register. If resettable, then each element requires a slice register. If not, then an SRL16 can be used.
[Figure: Resettable Implementation. Eight D-type registers with reset (R) in series across Slice 1, Slice 2, Slice 3 and Slice 4, with INPUT, CLOCK, RESET and OUTPUT.]
[Figure: Non Resettable Implementation. A single SRL16 in Slice 1, with INPUT, CLOCK and OUTPUT.]
Notes:
We can still design a resettable shift register with an SRL16, by using a slightly more sophisticated design. Instead of making all elements resettable, we can implement the first element using a slice register, and the subsequent ones using an SRL16. The reset signal is held high for 8 clock cycles, which allows the 0 input to propagate through the shift register. Instead of using 4 slices, this design would require 2 slices at most.
[Figure: a D-type register with reset (Slice 1) feeding an SRL16 (Slice 1/2), with INPUT, CLOCK, RESET and OUTPUT. Caption: Hold RESET signal high for 8 clock cycles to reset the shift register...]

Slide 16.1: FPGA Arithmetic Implementation
• The implementation of arithmetic operations in FPGA hardware is an integral and important aspect of DSP design.
• The following key issues are presented in this chapter:
  • Number representations: binary word formats for signed and unsigned integers, 2’s complement, fixed point and floating point;
  • Binary arithmetic, including: overflow and underflow, saturation, truncation and rounding;
  • Structures for arithmetic operations: Addition/Subtraction, Multiplication, Division and Square Root;
  • Complex arithmetic operations;
  • Mapping to Xilinx FPGA hardware... including special resources for high speed arithmetic.
Notes:
This section of the course will introduce the following concepts:
Integer number representations - unsigned, one’s complement, two’s complement.
Non-integer number representations - fixed point, floating point.
Quantisation of signals, truncation, rounding, overflow, underflow and saturation.
Addition - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for addition, Xilinx-specific FPGA structures for addition.
Multiplication - decimal, two’s complement integer binary, two’s complement fixed point, hardware structures for multiplication, Xilinx-specific FPGA structures for multiplication.
Division.
Square root.
Complex addition.
Complex multiplication.

Slide 17: Number Representations
• DSP, by its very nature, requires quantities to be represented digitally - using a number representation with finite precision.
• This representation must be specified to handle the “real-world” inputs and outputs of the DSP system.
  • Sufficient resolution
  • Large enough dynamic range
• The number representation specified must also be efficient in terms of its implementation in hardware.
  • The hardware implementation cost of an arithmetic operator increases with wordlength.
  • The relationship is not always linear!
• There is a trade-off between cost, and numeric precision and range.
Notes:
The use of binary numbers is a fundamental of any digital systems course, and is well understood by most engineers. However when dealing with large complex DSP systems, there can be literally billions of multiplies and adds per second. Therefore any possible cost reduction by reducing the number of bits for representation is likely to be of significant value.
For example, assume we have a DSP filtering application using 16 bit resolution arithmetic. We will show later (see Slide 42) that the cost of a parallel multiplier (in terms of silicon area - speed product) can be approximated as the number of full adder cells. Therefore for a 16 bit by 16 bit parallel multiply the cost is of the order of 16 x 16 = 256 “cells”. The wordlength of 16 bits has been chosen (presumably) because the designer at some time demonstrated that 17 bits was too many bits, and 15 was not enough - or did they? Probably not! It is likely that we are using 16 bits because... well, that’s what we usually use in DSP processors and we are creatures of habit! In the world of FPGA DSP arithmetic you can choose the resolution. Therefore, if it was demonstrated that in fact 9 bits was sufficient resolution, then the cost of a multiplier is 9 x 9 cells = 81 cells. This is approximately 30% of the cost of using 16 bit arithmetic.
Therefore it is important to get the wordlength right: too many bits wastes resources, and too few bits loses resolution. So how do we get it right? Well, you need to know your algorithms and DSP.
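The cost comparison in these notes follows directly from the N x N cell count. The short sketch below tabulates it for a few wordlengths; the function names are hypothetical, and the cell count is the approximation used in the notes rather than a measured resource figure for any device.

```python
def multiplier_cells(n_bits: int) -> int:
    """Approximate cost of an n_bits x n_bits parallel multiplier, in cells."""
    return n_bits * n_bits


def relative_cost(n_bits: int, reference_bits: int = 16) -> float:
    """Cost of an n_bits multiplier as a fraction of the reference multiplier."""
    return multiplier_cells(n_bits) / multiplier_cells(reference_bits)


if __name__ == "__main__":
    for bits in (8, 9, 12, 16):
        print(f"{bits:2d} bits: {multiplier_cells(bits):3d} cells, "
              f"{100 * relative_cost(bits):.1f}% of a 16 bit multiplier")
```

Because the cost grows with the square of the wordlength, trimming even a few bits from an over-specified design gives a disproportionate saving.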

Slide 18: Unsigned Integer Numbers
• Unsigned integers are used to represent non-negative numbers.
• The weightings of individual bits within a generic n-bit word are:

  bit index:       n-1      n-2      n-3      ...  2    1    0
  bit weighting:   2^(n-1)  2^(n-2)  2^(n-3)  ...  2^2  2^1  2^0
                   (MSB)                                     (LSB)

• The decimal number 82 is “01010010” in unsigned format as shown:

  bit index:               7    6    5    4    3    2    1    0
  bit weighting:           2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
  example binary number:   0    1    0    1    0    0    1    0
  decimal representation:  0x2^7 + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

• The numerical range of an n-bit unsigned number is: 0 to $2^n - 1$. For example, an 8-bit word can represent all integers from 0 to 255.
Notes:
Taking the example of an 8-bit unsigned number, the range of representable values is 0 to 255:

  Integer Value   Binary Representation
  0               00000000
  1               00000001
  2               00000010
  3               00000011
  4               00000100
  64              01000000
  65              01000001
  131             10000011
  255             11111111

Note that the minimum value is 0, and the maximum value (255) is the sum of the powers of two between $2^0$ and $2^{8-1}$, where 8 is the number of bits in the binary word:
i.e. $2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 255 = 2^8 - 1$

Slide 19: 2’s Complement (Signed Integers)
• 2’s Complement caters for positive and negative numbers in the range $-2^{n-1}$ to $+2^{n-1} - 1$, and has only one representation of 0 (zero).
• In 2’s complement, the MSB has a negative weighting:

  bit index:       n-1       n-2      n-3      ...  2    1    0
  bit weighting:   -2^(n-1)  2^(n-2)  2^(n-3)  ...  2^2  2^1  2^0
                   (MSB)                                      (LSB)

• The most negative and most positive numbers are represented by:

  bit index:       n-1  n-2  n-3  ...  1  0
  most negative:   1    0    0    ...  0  0
  most positive:   0    1    1    ...  1  1
Notes:
As examples, we can convert the following two 8-bit 2’s complement words to decimal.
As for the unsigned representation, the decimal number 82 is “01010010” in 2’s complement signed format:

  bit index:               7     6    5    4    3    2    1    0
  bit weighting:           -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
  example binary number:   0     1    0    1    0    0    1    0
  decimal representation:  0x(-2^7) + 1x2^6 + 0x2^5 + 1x2^4 + 0x2^3 + 0x2^2 + 1x2^1 + 0x2^0 = 64 + 16 + 2 = 82

Meanwhile the decimal number -82 is “10101110” in 2’s complement signed format:

  bit index:               7     6    5    4    3    2    1    0
  bit weighting:           -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
  example binary number:   1     0    1    0    1    1    1    0
  decimal representation:  1x(-2^7) + 0x2^6 + 1x2^5 + 0x2^4 + 1x2^3 + 1x2^2 + 1x2^1 + 0x2^0 = -128 + 32 + 8 + 4 + 2 = -82
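The two conversions above apply the same rule: sum the bit weightings, with the MSB weighting negated. A minimal sketch of that rule follows; the function name `twos_complement_value` is hypothetical and the word is passed as a string of 0s and 1s, MSB first.

```python
def twos_complement_value(bits: str) -> int:
    """Decode a 2's complement word given as a bit string, MSB first."""
    n = len(bits)
    value = 0
    for index, char in enumerate(bits):
        weight = 1 << (n - 1 - index)      # 2^(n-1) down to 2^0
        if index == 0:
            weight = -weight               # the MSB has a negative weighting
        value += weight * int(char)
    return value


if __name__ == "__main__":
    for word in ("01010010", "10101110", "10000000", "01111111"):
        print(word, "->", twos_complement_value(word))
```

The last two words in the demonstration are the most negative and most positive 8-bit values, -128 and +127.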

Slide 20: 2’s Complement Conversion
• For 2’s Complement, converting between negative and positive numbers involves inverting all bits, and adding 1.
• For example, we have just considered 2’s complement representations of the decimal numbers -82 and +82. They are converted as shown:

  +82 to -82:
      0 1 0 1 0 0 1 0     +82
      1 0 1 0 1 1 0 1     invert all bits
    + 0 0 0 0 0 0 0 1     add 1
      1 0 1 0 1 1 1 0     -82

  -82 to +82:
      1 0 1 0 1 1 1 0     -82
      0 1 0 1 0 0 0 1     invert all bits
    + 0 0 0 0 0 0 0 1     add 1
      0 1 0 1 0 0 1 0     +82
Notes:
Note that when negating positive values, a ninth bit is required to represent negative zero. However, if we simply ignore this ninth bit, the representation for the negative zero becomes identical to the representation for positive zero.

  Positive Numbers         (Invert all bits        Negative Numbers
  Integer   Binary          and ADD 1)             Integer   Binary
  0         00000000                               0         100000000
  1         00000001                               -1        11111111
  2         00000010                               -2        11111110
  3         00000011                               -3        11111101
  125       01111101                               -125      10000011
  126       01111110                               -126      10000010
  127       01111111                               -127      10000001
                                                   -128      10000000

Notice from the above that -128 can be represented but +128 cannot.
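The invert-and-add-1 rule, including the discarded ninth bit, can be sketched as follows. The function name `negate` is hypothetical, and the word is handled as a raw bit pattern held in a Python integer, masked to `n_bits`.

```python
def negate(word: int, n_bits: int = 8) -> int:
    """Negate a 2's complement bit pattern: invert all bits, then add 1.

    The result is masked to n_bits, which discards any carry into
    the (n_bits + 1)th bit.
    """
    mask = (1 << n_bits) - 1
    inverted = ~word & mask
    return (inverted + 1) & mask


if __name__ == "__main__":
    for word in (0b01010010, 0b10101110, 0b00000000, 0b10000000):
        print(f"{word:08b} -> {negate(word):08b}")
```

Two cases are worth noting: negating zero gives zero once the ninth bit is dropped, and negating 10000000 (-128) returns the same pattern, because +128 has no 8-bit representation.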

Slide 21: Fixed-point Binary Numbers
• We can now define what is known as a “fixed-point” number:
  ...a number with a fixed position for the binary point.
• Bits on the left of the binary point are termed integer bits, and bits on the right of the binary point are termed fractional bits.
• The format of a generic fixed point word, comprising n integer bits and b fractional bits, is:

                   |<------ n integer bits ------>| . |<------ b fractional bits ------>|
  bit index:       n-1      n-2      ...  1    0    .  -1    -2    ...  -b+1      -b
  bit weighting:   2^(n-1)  2^(n-2)  ...  2^1  2^0  .  2^-1  2^-2  ...  2^(-b+1)  2^-b
                   (MSB)                      binary point                        (LSB)

• The MSB has -ve weighting for 2’s complement (as for integer words).
Notes:
As examples, we consider the 2’s complement word “11010110” with the binary point in two different places.
Firstly, with the binary point to the left of the third bit (counting from the LSB), i.e. 5 integer bits and 3 fractional bits:

  bit index:        4     3    2    1    0    .  -1    -2    -3
  bit weighting:    -2^4  2^3  2^2  2^1  2^0  .  2^-1  2^-2  2^-3
  binary number:    1     1    0    1    0    .  1     1     0
  decimal representation:  1x(-2^4) + 1x2^3 + 0x2^2 + 1x2^1 + 0x2^0 + 1x2^-1 + 1x2^-2 + 0x2^-3
                           = -16 + 8 + 2 + 0.5 + 0.25 = -5.25

...and secondly, with the binary point to the left of the fifth bit (counting from the LSB), i.e. 3 integer bits and 5 fractional bits:

  bit index:        2     1    0    .  -1    -2    -3    -4    -5
  bit weighting:    -2^2  2^1  2^0  .  2^-1  2^-2  2^-3  2^-4  2^-5
  binary number:    1     1    0    .  1     0     1     1     0
  decimal representation:  1x(-2^2) + 1x2^1 + 0x2^0 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 1x2^-4 + 0x2^-5
                           = -4 + 2 + 0.5 + 0.125 + 0.0625 = -1.3125

Note that these results are related by a factor of $2^2 = 4$, i.e. 4 x -1.3125 = -5.25.
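Both decodings of “11010110” follow from one observation: a fixed point word is the 2’s complement integer with the same bits, scaled by 2 to the power of minus the number of fractional bits. A minimal sketch, with the hypothetical function name `fixed_point_value`:

```python
def fixed_point_value(bits: str, fractional_bits: int) -> float:
    """Decode a 2's complement fixed point word given as a bit string.

    `fractional_bits` is the number of bits to the right of the binary point.
    """
    raw = int(bits, 2)                 # read the word as an unsigned integer
    if bits[0] == "1":
        raw -= 1 << len(bits)          # apply the negative MSB weighting
    return raw / (1 << fractional_bits)


if __name__ == "__main__":
    print(fixed_point_value("11010110", 3))   # 5 integer bits, 3 fractional bits
    print(fixed_point_value("11010110", 5))   # 3 integer bits, 5 fractional bits
```

The underlying integer is the same in both calls (-42); only the scaling differs, which is why the two results are related by exactly a factor of 4.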

Slide 22: Fixed Point Range and Precision
• As with integer representations (which are also effectively fixed point numbers, but with the binary point at position 0), the binary range of fixed point numbers extends from:

  unsigned:             00000.....0000   to   11111.....1111
  signed (2’s comp.):   10000.....0000   to   01111.....1111

• The same number of quantisation levels is present, e.g. for an 8-bit binary word, 256 levels can be represented.
• Numerical range scales according to the binary point position, e.g.:

  1000 0000.  (-128)   to   0111 1111.  (+127)
  1000 000.0  (-64)    to   0111 111.1  (+63.5)     (x 0.5)

• Dynamic range (range / interval) is independent of the binary point position, e.g. (127-(-128))/1 = 255 = (63.5-(-64))/0.5
Notes:
To illustrate this further, let’s consider the very simple case of a 3-bit 2’s complement word, with the binary point in all four possible positions. Clearly the numerical range is affected by the binary point position, but the relationship between the interval and range remains the same.

  binary point position       bit weightings         range             interval
  0 (all integer bits)        -4    +2     +1        -4 to +3          1
  1                           -2    +1     +0.5      -2 to +1.5        0.5
  2                           -1    +0.5   +0.25     -1 to +0.75       0.25
  3 (all fractional bits)     -0.5  +0.25  +0.125    -0.5 to +0.375    0.125

Slide 23: Fixed-point Quantisation
• Consider the fixed point number format:
    n n n . b b b b b     (3 integer bits, 5 fractional bits)
• Numbers between -4 and 3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are $2^8 = 256$ different values.
• Revisiting our sine wave example, using this fixed-point format:
[Figure: quantised sine wave, vertical axis from -2 to +2.]
• This looks much more accurate. The quantisation error is $\pm \frac{LSB}{2}$ (where LSB = least significant bit)... so 0.015625 rather than 0.5!
Notes:
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places. The real number π can be represented as 3.14159265.... and so on. We can quantise or represent π to 4 decimal places as 3.1416. If we use "rounding" here the magnitude of the error is:
|3.14159265… - 3.1416| = 0.00000735
If we truncated (just chopped off the digits below the 4th decimal place) then the error is larger:
3.14159265… - 3.1415 = 0.00009265
Clearly rounding is most desirable to maintain best possible accuracy. However it comes at a cost. The cost is relatively small, but it is not "free".
When multiplying fractional numbers we will choose to work to a given number of places. For example, if we work to two decimal places then the calculation:
0.57 x 0.43 = 0.2451
can be rounded to 0.25, or truncated to 0.24. The results are different.
Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small errors can begin to stack up.
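The rounding and truncation figures in these notes can be checked with exact decimal arithmetic. This is a small sketch using Python's standard `decimal` module; the variable names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

pi_approx = Decimal("3.14159265")          # pi to 8 decimal places
four_places = Decimal("0.0001")
pi_rounded = pi_approx.quantize(four_places, rounding=ROUND_HALF_UP)
pi_truncated = pi_approx.quantize(four_places, rounding=ROUND_DOWN)
print(pi_rounded, abs(pi_approx - pi_rounded))      # rounding error
print(pi_truncated, abs(pi_approx - pi_truncated))  # truncation error

product = Decimal("0.57") * Decimal("0.43")          # exact product
two_places = Decimal("0.01")
rounded = product.quantize(two_places, rounding=ROUND_HALF_UP)
truncated = product.quantize(two_places, rounding=ROUND_DOWN)
print(product, rounded, truncated)
```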
Multiplication & Division via Binary Shifts (Slide 24)
• The binary point position has a power-of-2 relationship with the numerical range of the word format, and any number it represents.
• Therefore if we want to multiply or divide a fixed point number by a power-of-two, this is achieved by simply shifting the numbers with respect to the binary point!
[Figure: the bit pattern 010111 under three sets of bit weightings.
Original number: weightings -8, 4, 2, 1, 0.5, 0.25, i.e. 0101.11 (decimal 5.75).
Shift right by 2 places: weightings -2, 1, 0.5, 0.25, 0.125, 0.0625, i.e. 01.0111 (decimal 1.4375).
Shift left by 1 place: weightings -16, 8, 4, 2, 1, 0.5, i.e. 01011.1 (decimal 11.5).]
Notes:
Of course, looking at the example in the main slide, we could also consider that the binary point is moved, rather than the bits - it amounts to the same thing! Ultimately the binary point is conceptual - having no effect on the hardware produced - and it falls to the DSP design tool and/or DSP designer to keep track of it.
Reviewing the divide-by-4 and multiply-by-2 examples from the main slide... if we move the binary point, the weightings of the bits comprising the word, and hence the value it represents, change by a power-of-two factor.
[Figure: the same bit pattern 010111 with the binary point in three positions: 0101.11 (decimal 5.75, original number), 01.0111 (decimal 1.4375) and 01011.1 (decimal 11.5)]
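The point that only the interpretation changes, not the bits, can be sketched in a few lines. The helper `to_real` is a hypothetical name, and the sketch handles only the non-negative bit pattern used in the slide's example.

```python
def to_real(code: int, frac_bits: int) -> float:
    """Interpret a non-negative integer code with frac_bits bits after the binary point."""
    return code / (1 << frac_bits)

code = 0b010111               # the bit pattern from the slide
print(to_real(code, 2))       # 5.75    (0101.11)
print(to_real(code, 4))       # 1.4375  (01.0111, divide by 4)
print(to_real(code, 1))       # 11.5    (01011.1, multiply by 2)
```

Moving the binary point two places left divides by 4 and moving it one place right multiplies by 2, with the stored bits untouched.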
Normalised Fixed Point Numbers (Slide 25)
• Fixed point word formats with 1 integer bit and a number of fractional bits are often adopted.
• Numbers from -1 to +1 - 2^-b (i.e. just less than 1) can be represented. Some examples:
[Figure: 8 bit examples with bit weightings -1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, including the most -ve value 10000000 (decimal -1) and the most +ve value 01111111 (decimal +0.9921875)]
• Limiting the numeric range to ±1 is advantageous because it makes the arithmetic easier to work with... multiplying two normalised numbers together cannot produce a result greater than 1!
Notes:
The term Q-format is often used to describe fixed point number formats, usually in the context of DSP processors. However, it is useful to note that Q-format and 2's complement are actually the same thing.
Q-format notation is given in the form Qm.n, where m is the number of integer bits, and n is the number of fractional bits. Notably this description excludes the MSB of the 2's complement representation, which Q-format considers a sign bit. Therefore the total number of bits in a Q-format number is 1 + m + n, whereas in 2's complement, the same word format would be described as having m+1 integer bits, and n fractional bits.
For example, a Q2.5 number has a sign bit, 2 other integer bits, and 5 fractional bits, and hence can be represented as shown below. In 2's complement, this would be described as having 3 integer bits and 5 fractional bits.
[Figure: bit weightings -2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5. 2's complement description: 3 integer bits, 5 fractional bits. Q-format description: 1 sign bit, m = 2 integer bits, n = 5 fractional bits.]
The Q0.15 format (often abbreviated to Q15) is used extensively in DSP as it covers the normalised range of numbers from -1 to +1 - 2^-15, and is equivalent to a 16 bit 2's complement representation with the binary point at 15, i.e. 1 integer bit and 15 fractional bits.
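As an illustration of the Q15 format described above, here is a minimal conversion sketch. The function names are hypothetical, and rounding to nearest with saturation at the range limits is an assumption rather than something the notes prescribe.

```python
def float_to_q15(x: float) -> int:
    """Convert x to a 16 bit Q15 integer code (1 sign bit, 15 fractional bits)."""
    code = round(x * 2 ** 15)
    return max(-32768, min(32767, code))   # saturate to the 16 bit range

def q15_to_float(code: int) -> float:
    """Convert a Q15 integer code back to its real value."""
    return code / 2 ** 15

print(float_to_q15(0.5))          # 16384
print(q15_to_float(32767))        # largest value, 1 - 2**-15
```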
Fractional Motivation - Normalisation (Slide 26)
• Working with fractional binary values makes the arithmetic "easier" to work with and to account for wordlength growth.
• As an example take the case of working with a "machine" using 4 digit decimals and a 4 digit arithmetic unit - range -9999 to +9999.
• Multiplying two 4 digit numbers will result in up to 8 significant digits.
6787 x 4198 = 28491826 → (Scale) → 2849.1826 → (Truncate) → 2849
If we want to pass this number to a next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000, then truncate.
• Consider normalising to the range -0.9999 to +0.9999.
0.6787 x 0.4198 = 0.28491826 → (Truncate) → 0.2849
Now the procedure for truncating back to 4 digits is much "easier".
Notes:
Of course the two results are exactly identical and the differences are in how we handle the truncate and scale.
However using the normalised values, where all inputs are constrained to be in the range from -1 to +1, it is easy to note that multiplying ANY two numbers in this range together will give a result also in the range of -1 to +1.
Exactly the same idea of normalisation is applied to binary, and the binary point is implicitly used in most DSP systems.
Consider 8 bit values in 2's complement. The range is therefore:
10000000 to 01111111 (-128 to +127)
If we normalise these numbers to between -1 and 1 (i.e. divide through by 128) then the binary range is:
1.0000000 to 0.1111111 (-1 to 0.9921875, where 127/128 = 0.9921875).
Therefore we apply the same normalising ideas as for decimal for multiplication in binary.
Consider multiplying 36 x 97 = 3492, equivalent to 00100100 x 01100001 = 0000 1101 1010 0100
In binary, normalising the values would be the calculation 0.0100100 x 0.1100001 = 0.00110110100100
which in decimal is equivalent to: 0.28125 x 0.7578125 = 0.213134765625
Note very clearly that in a DSP system the binary point is all in the eye of the designer. There is no physical connection or wire for the binary point. It just makes things significantly easier in keeping track of wordlength growth, and truncating just by dropping fractional bits. Of course if you prefer integers and would like to keep track of the scaling etc. you can do this... you will get the same answer and the cost is the same...
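The 36 x 97 example can be reproduced to show that the integer view and the normalised view use exactly the same bits. This is a sketch only; the variable names are illustrative.

```python
a, b = 36, 97                       # 00100100 and 01100001
full = a * b                        # 16 bit integer product
print(format(full, "016b"))         # 0000110110100100

# Normalised view: each input has 7 fractional bits, so the product has 14.
normalised = full / 2 ** 14
print(normalised)                   # 0.213134765625

# Truncating back to 7 fractional bits just drops the 7 low-order bits.
truncated_code = full >> 7
print(truncated_code / 2 ** 7)      # 0.2109375
```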
Analogue to Digital Converter (ADC) (Slide 27)
• An ADC is a device that can convert a voltage to a binary number, according to its specific input-output characteristic.
[Figure: 8 bit ADC with voltage input and binary output, sampling at fs. The staircase characteristic plots binary output (from -128 = 10000000 to 127 = 01111111, with labelled levels every 32 codes) against voltage input (from -2 to 2).]
• We can generally assume ADCs operate using two's complement arithmetic.
Notes:
Viewing the straight line portion of the device we are tempted to refer to the characteristic as "linear". However a quick consideration clearly shows that the device is non-linear (recall the definition of a linear system from before) as a result of the discrete (staircase) steps, and also that the device clips above and below the maximum and minimum voltage swings. However if the step sizes are small and the number of steps large, then we are tempted to call the device "piecewise linear over its normal operating range".
Note that the ADC does not necessarily have a linear (straight line) characteristic. In telecommunications for example a defined standard nonlinear quantiser characteristic is often used (A-law and μ-law). Speech signals, for example, have a very wide dynamic range: harsh "oh" and "b" type sounds have a large amplitude, whereas softer sounds such as "sh" have small amplitudes. If a uniform quantisation scheme were used then although the loud sounds would be represented adequately the quieter sounds may fall below the threshold of the LSB and therefore be quantised to zero and the information lost. Therefore non-linear quantisers are used such that the quantisation level at low input levels is much smaller than for higher level signals. A-law quantisers are often implemented by using a nonlinear circuit followed by a uniform quantiser. Two schemes are widely in use: the A-law in Europe, and the μ-law in the USA and Japan. Similarly, the DAC can have a non-linear characteristic.
[Figure: non-linear quantiser characteristic, binary output against voltage input]
ADC Sampling "Error" (Slide 28)
• Perfect signal reconstruction assumes that sampled data values are exact (i.e. infinite precision real numbers). In practice they are not, as an ADC will have a number of discrete levels.
• The ADC samples at the Nyquist rate, and the sampled data value is the closest (discrete) ADC level to the actual value:
[Figure: s(t) passes through an ADC sampling at fs to give v̂(n). Left panel: voltage (-4 to 4) against time. Right panel: binary value (-4 to 4) against sample, n.]
$\hat{v}(n) = \mathrm{Quantise}\{ s(n t_s) \}$, for $n = 0, 1, 2, \ldots$
• Hence every sample has a "small" quantisation error.
Notes:
For example purposes, we can assume our ADC or quantiser has 5 bits of resolution and maximum/minimum voltage swing of +15 and -16 volts. The input/output characteristic is shown below:
[Figure: binary output against voltage input, with a step size of 1 volt and output codes from 10000 (-16) to 01111 (+15)]
In the above slide figure, for the second sample the true sample value is 1.589998..., however our ADC quantises to a value of 2.
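The 5 bit quantiser in these notes can be modelled in a few lines. This is a sketch under stated assumptions: a step size of q = 1 volt, rounding to the nearest level, and clipping outside -16 to +15; the function name `adc_5bit` is hypothetical.

```python
def adc_5bit(voltage: float, q: float = 1.0) -> int:
    """Return the 5 bit two's complement code nearest to voltage / q."""
    code = round(voltage / q)          # closest discrete level
    return max(-16, min(15, code))     # clip to 10000 (-16) .. 01111 (+15)

print(adc_5bit(1.589998))   # 2, as in the notes
print(adc_5bit(40.0))       # clipped to 15
```

For inputs inside the range, the difference between the input and the chosen level never exceeds q/2.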
Quantisation Error (Slide 29)
• If the smallest step size of a linear ADC is q volts, then the error of any one sample is at worst q/2 volts.
[Figure: binary output against voltage input from -Vmax to Vmax, with step size q volts and output codes from 10000 (-16) to 01111 (+15)]
Notes:
Quantisation error is often modelled as an additive noise component, and indeed the quantisation process can be considered purely as the addition of this noise:
[Figure: x → ADC → y is modelled as y = x + n_q, where n_q is the quantisation noise]
An Example (Slide 30)
• Here is an example using a 3-bit ADC:
[Figure: three panels. Input signal: amplitude/volts (-4 to 4) against time/seconds. ADC Characteristic: output (-4 to 3) against input (-4 to 4). Quantisation Error: amplitude/volts against time/seconds.]
Notes:
In this case the worst case error is 1/2.
Adding multi-bit numbers (Slide 31)
• The full adder circuit can be used in a chain to add multi-bit numbers. The following example shows 4 bits:
[Figure: four full adders (Σ) chained from LSB to MSB, adding A3 A2 A1 A0 and B3 B2 B1 B0 with carry in '0' and carries C0 to C3, giving S4 S3 S2 S1 S0, where S4 is the final carry out]
• This chain can be extended to any number of bits. Note that the last carry output forms an extra bit in the sum.
• If we do not allow for an extra bit in the sum, and a carry out of the last adder occurs, an "overflow" will result, i.e. the number will be incorrectly represented.
Notes:
The truth table for the full adder is:
A  B  CIN   S  COUT   A + B + CIN
0  0  0     0  0      0
0  0  1     1  0      1
0  1  0     1  0      1
0  1  1     0  1      2
1  0  0     1  0      1
1  0  1     0  1      2
1  1  0     0  1      2
1  1  1     1  1      3
Full adder circuitry can therefore be produced with gates:
[Figure: gate-level full adder with inputs A, B, C and outputs S, COUT]
$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$
$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC = AB + C(A \oplus B)$
The longest propagation delay path in the above full adder is "two gates".
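The full adder equations and the chained adder from the main slide can be modelled at bit level. This is an illustrative software sketch, not a description of the hardware; `full_adder` and `ripple_add` are hypothetical names, and bit lists are given LSB first.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One full adder: S = A xor B xor C, COUT = AB + AC + BC."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(a_bits: list[int], b_bits: list[int], cin: int = 0) -> list[int]:
    """Chain full adders LSB first; the last carry forms an extra sum bit."""
    carry, out = cin, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 11 + 13 = 24, with bits listed LSB first
print(ripple_add([1, 1, 0, 1], [1, 0, 1, 1]))
```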
Subtraction (Slide 32)
• Subtraction is very readily derived from addition. Remember 2's complement? All we need to do to get a negative number is invert the bits and add 1.
• Then if we add these numbers, we'll get a subtraction D = A + (-B):
[Figure: 4 bit 2's complement subtractor. Four full adders (Σ) take A3..A0 and the inverted B3..B0, the LSB carry in is '1' (add 1), the outputs are D3..D0 and the carry out C4 is discarded.]
• The addition of the 1 is done by setting the LSB carry in to be 1.
Notes:
Note for 4 bit positive numbers (i.e. NOT 2's complement) the range is from 0 to 15. For 2's complement the numerical range is from -8 to 7.
Addition/Subtraction (using 2's complement representation)
Sometimes we need a combined adder/subtractor with the ability to switch between modes.
This can be achieved quite easily:
[Figure: 4 bit adder/subtractor. Each of B3..B0 reaches its full adder (Σ) through a 2-input MUX (inputs 0 and 1) controlled by K.]
For: A + B, K = 0
For: A - B, K = 1
This structure will be seen again in the Division/Square Root slides!
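The combined adder/subtractor can be sketched in software as follows. This assumes, as the subtraction slide implies, that K = 1 selects the inverted B bits and also drives the LSB carry in; the function name `add_sub_4bit` is hypothetical.

```python
def add_sub_4bit(a: int, b: int, k: int) -> int:
    """Return A + B (k = 0) or A - B (k = 1) as a 4 bit 2's complement value."""
    mask = 0xF
    b_selected = (b ^ (mask if k else 0)) & mask   # MUX: B or inverted B
    total = (a & mask) + b_selected + k            # K also acts as LSB carry in
    result = total & mask                          # discard the carry out
    return result - 16 if result & 0x8 else result # reinterpret as signed

print(add_sub_4bit(5, 3, 1))   # 5 - 3
print(add_sub_4bit(3, 5, 1))   # 3 - 5
print(add_sub_4bit(2, 3, 0))   # 2 + 3
```

Results outside -8 to 7 wrap around, since only 4 bits are kept.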
Wraparound Overflow & 2's Complement (Slide 33)
• With 2's complement overflow will occur when the result to be produced lies outside the range of the number of bits.
• Therefore for an 8 bit example the range is -128 to +127 (or in binary this is 10000000 to 01111111):
(-65) + (-112) = -177: 10111111 + 10010000 = 1 01001111. With an 8 bit result we lose the 9th bit and the result "wraps around" to a positive value: 01001111 = 79.
100 + 37 = 137: 01100100 + 00100101 = 10001001. With an 8 bit result the result "wraps around" to a negative value: 10001001 = -119.
• One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits.
• Using Xilinx System Generator we can specifically check for overflow in every addition calculation.
Notes:
Recall from previously that overflow detect circuitry is relatively easy to design. We just need to keep an eye on the MSB bits (indicating whether a number is +ve or -ve):
For example
Adding +ve and -ve will never overflow!
Adding +ve and +ve: if a -ve result then overflow
Adding -ve and -ve: if a +ve result then overflow
10110111 + 01111111 = 1 00110110, i.e. (-73) + 127 = 54. Discard the final 9th bit carry. No overflow.
01100100 + 01000000 = 10100100, i.e. 100 + 64 = 164. The MSB bit indicates a -ve result! Overflow.
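The sign-based overflow rule can be written as a short model of an 8 bit wraparound adder. A sketch only; `add8_wrap` is a hypothetical name and the inputs are assumed to lie in -128 to +127.

```python
def add8_wrap(a: int, b: int) -> tuple[int, bool]:
    """Add two 8 bit 2's complement values, keeping only 8 result bits."""
    raw = (a + b) & 0xFF                           # the 9th bit is lost
    result = raw - 256 if raw & 0x80 else raw      # reinterpret as signed
    same_sign_inputs = (a < 0) == (b < 0)
    overflow = same_sign_inputs and ((result < 0) != (a < 0))
    return result, overflow

print(add8_wrap(-65, -112))   # wraps to a positive value
print(add8_wrap(100, 37))     # wraps to a negative value
print(add8_wrap(-73, 127))    # mixed signs never overflow
```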
Saturation (Slide 34)
• One method to try to address overflow is to use a saturate technique.
• Taking the previous overflowing examples from Slide 33:
(-65) + (-112) = -177: 10111111 + 10010000 = 1 01001111. Detect overflow and saturate: 10000000 (-128).
100 + 37 = 137: 01100100 + 00100101 = 10001001. Detect overflow and saturate: 01111111 (127).
• When overflow is detected, the result is set to the closest possible value (i.e. for the 8 bit case either -128 or +127).
• Therefore, for every addition that is explicitly done with an adder block in Xilinx System Generator, the user will get a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.
• Implementing saturate will require "detect overflow & select" circuitry.
Notes:
Once again, design of a DSP system might be done such that overflow never happens as the user has ensured there are enough bits to cater for the worst possible case leading to the maximum magnitude result.
Generally for some later FPGAs such as the Virtex-4, using some of the DSP48 blocks gives adders with 48 bits of precision, therefore the likelihood of, say, working with 16 bit values that grow to over 48 bits is low. Hence overflow has been "designed out". Of course not all applications will use these devices, and using general slice logic and attempting to make adders as small as possible would mean care must be taken, and where appropriate to efficient design, saturate might be included.
Saturation is extremely useful for adaptive algorithms. For example, in the Least Mean Squares algorithm (LMS), the filter weights $w$ are updated according to the equation:
$w(k) = w(k-1) + 2\mu e(k) x(k)$
Without further concern over the meaning of this equation, we can see that the term $2\mu e(k) x(k)$ is added to the weights at time epoch $k-1$ to generate the new weights at time epoch $k$.
If the operations that form $2\mu e(k) x(k)$ were to overflow, there is a high chance that the sign of the term would flip and drive the weights in completely the wrong direction, leading to instability.
With saturation however, if the term $2\mu e(k) x(k)$ gets very big and would overflow, saturation will limit it to the maximum value representable, causing the weights to change in the right direction, and at the fastest speed possible in the current representation. The result is a huge increase in the stability of the algorithm.
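The difference between the two overflow behaviours is easy to see side by side. This is a minimal sketch for 8 bit values; the function names are hypothetical.

```python
def add8_saturate(a: int, b: int) -> int:
    """8 bit addition that clamps to -128 or +127 instead of wrapping."""
    return max(-128, min(127, a + b))

def add8_wraparound(a: int, b: int) -> int:
    """8 bit addition that discards the 9th bit, for comparison."""
    raw = (a + b) & 0xFF
    return raw - 256 if raw & 0x80 else raw

for a, b in [(-65, -112), (100, 37)]:
    print(a + b, add8_wraparound(a, b), add8_saturate(a, b))
```

In both examples the wrapped result has the wrong sign, whereas the saturated result keeps the sign of the true sum, which is the property the LMS discussion relies on.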
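Xilinx Virtex-II Pro Addition (Slide 35)
• The used components of the slice are outlined below
[Figure: upper half of a Virtex-II Pro slice with the components used for one bit of addition outlined. Inputs A, B and Cin; LUT output D; outputs Sout and Cout. The outlined logic is equivalent to one full adder (Σ).]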
Notes:
Picture of Xilinx Virtex-II Pro slice (upper half) taken from "Virtex-II Pro Platform FPGAs: Introduction and Overview", DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com
LookUp Table (LUT) programmed with two-input XOR function:
G1 (A)  G2 (B)  D
0       0       0
0       1       1
1       0       1
1       1       0
G1 (A)  G2 (B)  Cin  D  Sout  Cout
0       0       0    0  0     0
0       0       1    0  1     0
0       1       0    1  1     0
0       1       1    1  0     1
1       0       0    1  1     0
1       0       1    1  0     1
1       1       0    0  0     1
1       1       1    0  1     1
Sout = Cin xor D, $C_{out} = \bar{D}A + C_{in}D$ (multiplex operation). Result is the FULL ADDER implementation.
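The LUT, carry multiplexer and XOR gate combination in the tables above can be checked against ordinary addition. This is a behavioural sketch only; the names `XOR_LUT` and `slice_full_adder` are hypothetical, and the carry multiplexer is modelled as selecting Cin when D = 1 and A when D = 0, consistent with the truth table.

```python
# LUT contents: two-input XOR of G1 (A) and G2 (B)
XOR_LUT = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def slice_full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One bit of addition using a LUT, a carry mux and an XOR gate."""
    d = XOR_LUT[(a, b)]          # LUT output
    sout = cin ^ d               # Sout = Cin xor D
    cout = cin if d else a       # mux: D selects Cin, otherwise A
    return sout, cout

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, slice_full_adder(a, b, cin))
```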
Xilinx Virtex-II Pro Slice Main Components (Slide 36)
• A (very) high level diagram of the main logic components on one slice
[Figure: upper and lower halves of the slice. In each half the inputs feed a 4 input LUT (also usable as RAM or shift register), MULTAND, ORCY, XORG, several muxes and a D-type FF, leading to the outputs.]
Notes:
Just reviewing the logic circuitry on one half of the slice (note that in Slide 35 only the top half of the slice is shown, whereas the above slide shows the top and bottom halves), we can note:
• One D-type Flip Flop
• One 4 input Look-Up-Table (LUT) (can be configured as shift register or simply as RAM/memory)
• One XOR gate
• One AND gate
• One OR gate
• A few 2 input MUX (multiplexors) to route signals
• Clock inputs
"Small" FPGAs will have just a few hundred (100's) slices;
"Large" FPGAs will have many tens of thousands (10000's) of slices (and other components!)
Xilinx Virtex-II Pro 4 bit Adder (Slide 37)
• To produce larger adders the Xilinx tools will simply cascade the carry bits in adjacent (where possible!) slices.
• The bottom half of a Virtex-II Pro slice can be programmed for an identical operation, with its COUT wired to the top-half's CIN. Hence we can get two bits of addition per standard Xilinx slice.
[Figure: 2 bit addition in 1 slice: two FA cells adding A1 A0 and B1 B0 with carry in '0', giving S1, S0 and carry C1]
• To produce a 4 bit adder, we cascade with another slice.
[Figure: 4 bit addition in 2 slices: four FA cells adding A3..A0 and B3..B0 with carry in '0' and carries C0 to C3, giving S4..S0]
Notes:
Note the importance of the LUT (look up table) in the Xilinx slice. When configured as a LUT, any four input Boolean equation can be implemented.
For example take the equation
$Y = ABC + \bar{A}\bar{B}\bar{C}\bar{D}$
The truth table for this equation is:
ABCD  Y
0000  1
0001  0
0010  0
0011  0
0100  0
0101  0
0110  0
0111  0
1000  0
1001  0
1010  0
1011  0
1100  0
1101  0
1110  1
1111  1
To implement this function, simply store the values of Y in the Slice LUT, and then address the LUT with values of ABCD to get the output.
[Figure: 4 input LUT (RAM / shift register) with inputs A, B, C, D and output Y]
Therefore ANY 4 variable Boolean function can be simply implemented with a four input LUT. (Of course if the equation is only 3 variables then we can also implement it and just set one input constant.)
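The idea of storing a truth table and then addressing it can be sketched as a 16 entry list. This sketch assumes the Boolean function implied by the truth table in these notes (Y = 1 for ABCD = 0000, 1110 and 1111); the names `y_equation`, `LUT` and `lut_read` are hypothetical.

```python
def y_equation(a: int, b: int, c: int, d: int) -> int:
    """Y = ABC + (not A)(not B)(not C)(not D), assumed from the truth table."""
    return (a & b & c) | ((1 - a) & (1 - b) & (1 - c) & (1 - d))

# "Program" the LUT: one stored bit per 4 bit address ABCD
LUT = [y_equation((addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1)
       for addr in range(16)]

def lut_read(a: int, b: int, c: int, d: int) -> int:
    """Evaluate the function by addressing the stored table."""
    return LUT[(a << 3) | (b << 2) | (c << 1) | d]

print(LUT)
print(lut_read(1, 1, 1, 0))
```

Any other four variable function only changes the 16 stored bits, not the lookup mechanism.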
Multiplication in binary (Slide 38)
• Multiplying in binary follows the same form as in decimal:
                11010110     A7…A0
              x 00101101     B7…B0
                11010110     (B0 = 1)
               00000000      (B1 = 0)
              11010110       (B2 = 1)
             11010110        (B3 = 1)
            00000000         (B4 = 0)
           11010110          (B5 = 1)
          00000000           (B6 = 0)
         00000000            (B7 = 0)
        0010010110011110     P15…P0
• Note that the product P is composed purely of selecting, shifting and adding A. The i-th column of B indicates whether or not a shifted version of A is to be selected in the i-th row of the sum.
• So we can perform multiplication using just full adders and a little logic for selection, in a layout which performs the shifting.
Notes:
Multiplication in decimal
Starting with an example in decimal:
      214
    x  45
     1070
   + 8560
     9630
Note that we do 214 × 5 = 1070 and then add to it the result of 214 × 4 = 856 left-shifted by one column.
For each additional column in the second operand, we shift the multiplication of that column with the first operand by another place.
In general: aaaa x zzz = bbbb + cccc0 + dddd00 + eeee000 etc...
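The select, shift and add procedure can be written directly as a loop over the bits of the second operand. This is a sketch for unsigned (non-negative) operands only; `shift_add_multiply` is a hypothetical name.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply non-negative integers by summing shifted copies of a."""
    product = 0
    i = 0
    while b >> i:                  # stop once no set bits of b remain
        if (b >> i) & 1:           # bit i of b selects this partial product
            product += a << i      # a shifted left by i places
        i += 1
    return product

p = shift_add_multiply(0b11010110, 0b00101101)   # 214 x 45
print(p, format(p, "016b"))
```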
2's complement Multiplication (Slide 39)
• For one negative and one positive operand just remember to sign extend the negative operand.
                  11010110     (-42)
                x 00101101     (45)
          1111111111010110     (B0 = 1, sign extended)
          0000000000000000     (B1 = 0)
          1111111101011000     (B2 = 1)
          1111111010110000     (B3 = 1)
          0000000000000000     (B4 = 0)
          1111101011000000     (B5 = 1)
          0000000000000000     (B6 = 0)
          0000000000000000     (B7 = 0)
          1111100010011110     (-1890)
Notes:
2's complement multiplication (II)
For both operands negative, subtract the last partial product.
We use the trick of negating (inverting and adding 1) the last partial product and adding it rather than subtracting.
Of course, if both operands are positive, just use the unsigned technique!
The difference between signed and unsigned multiplies results in different hardware being necessary. DSP processors typically have separate unsigned and signed multiply instructions.
[Figure: worked example -42 x -83 = 3486, in which the sign-extended partial products are summed and the last partial product is formed negative (two's complement form) before being added]
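The signed procedure, sign extension plus a subtracted final partial product, can be modelled as below. This is a sketch under the assumption that both operands are N bit 2's complement values and the result is kept to 2N bits; `signed_multiply` is a hypothetical name.

```python
def signed_multiply(a: int, b: int, n: int = 8) -> int:
    """Multiply two n bit 2's complement values using shifted partial products."""
    width = 2 * n
    mask = (1 << width) - 1
    a_ext = a & mask                       # A sign extended to 2n bits
    b_bits = b & ((1 << n) - 1)            # raw bit pattern of B
    acc = 0
    for i in range(n):
        if (b_bits >> i) & 1:
            partial = (a_ext << i) & mask
            if i == n - 1:
                acc -= partial             # MSB of B has negative weight
            else:
                acc += partial
    acc &= mask                            # keep 2n result bits
    return acc - (1 << width) if acc >> (width - 1) else acc

print(signed_multiply(-42, 45))    # one negative operand
print(signed_multiply(-42, -83))   # both operands negative
```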
Fixed Point multiplication (Slide 40)
• Fixed point multiplication is no more awkward than integer multiplication:
Binary: 11010.110 x 00101.101 = 0010010110.011110
Decimal: 26.75 x 5.625 = 0.13375 + 0.53500 + 16.05000 + 133.75000 = 150.46875
• Again we just need to remember to interpret the position of the binary point correctly.
Ver Dept EEE, University of Strathclyde, 2010
tes:
10.423 R. Stewart,
Multiplier Implementation Options (Slide 41)
• Distributed multipliers
• Constant multipliers
• Using the logic fabric (LUTs)
• Using block RAM
• Shift-and-add “multipliers”
• High speed embedded multipliers
• 18 x 18 bit multipliers
• High speed integrated arithmetic slices (DSP48s)
• Multiply, accumulate
• Add, multiply, accumulate
Notes:
Over the next few slides we will see that multipliers can be implemented in a variety of different ways. As multipliers are used extensively in DSP, implementing them efficiently is a priority consideration.
The most basic multiplier is a 2-input version which is implemented using the logic fabric, i.e. the lookup tables within the slices of the device. This type is referred to as a distributed multiplier, because the implementation is distributed over the resources in several slices.
In the case of multiplication with a constant, which is commonly required in DSP, the knowledge of one multiplicand can be exploited to create a cheaper hardware implementation than a conventional 2-input multiplier. Two approaches that will be discussed in the coming pages are ROM-based constant multipliers, and "shift-and-add" multipliers which sum the outputs from binary shift operations.
The FPGA companies are well aware that DSP engineers desire fast and efficient multipliers, and as a result, they began incorporating embedded multipliers into their devices in the year 2000. Since then the sophistication of these components has increased, and they have been extended to feature fast adders and in many cases longer wordlengths, too. We can now think of them as embedded arithmetic slices, rather than simply multipliers.
Distributed Multipliers (Slide 42)
• This figure shows a four-bit multiplication:
[Figure: 4 x 4 array of cells with inputs a3..a0 and b3..b0, zeros fed in at the top and right edges, and outputs p7..p0. Each cell has inputs a, b, s, c and outputs aout, bout, sout, cout, and contains an AND gate (a·b) and a full adder (FA).]
Example: 1101 x 1011 = 10001111 (13 x 11 = 143)
• The AND gate connected to a and b performs the selection for each bit. The diagonal structure of the multiplier implicitly inserts zeros in the appropriate columns and shifts the a operands to the right.
• Note that this structure does not work for signed two's complement!
Notes:
Note the function of the simple AND gate.
The operation of multiplying 1's and 0's is the same as ANDing 1's and 0's:
A  B  Z
0  0  0
0  1  0
1  0  0
1  1  1
Z = A x B (where x = multiply) or in Boolean algebra Z = A and B = AB
Hence the AND gate is the bit multiplier. The function of one partial product stage of the multiplier is as shown below.
[Figure: one row of four cells (each an AND gate plus full adder, FA) with inputs a3..a0, b0 and x3..x0, carry in 0, and outputs y4..y0]
y4 y3 y2 y1 y0 = b0(a3 a2 a1 a0) + x3 x2 x1 x0
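The partial product stage y = b(a3 a2 a1 a0) + x can be modelled bit by bit and stacked four times to form the whole unsigned multiplier. This is a behavioural sketch, not the slice-level implementation; `partial_product_stage` and `multiply_4bit` are hypothetical names, and bit lists are LSB first.

```python
def partial_product_stage(a_bits: list[int], b_bit: int, x_bits: list[int]) -> list[int]:
    """One row: y = b_bit * a + x, built from AND gates and full adders."""
    carry, y = 0, []
    for a, x in zip(a_bits, x_bits):
        p = a & b_bit                              # AND gate = one bit multiply
        y.append(p ^ x ^ carry)                    # full adder sum
        carry = (p & x) | (p & carry) | (x & carry)
    return y + [carry]                             # 5 bits: y0..y4

def multiply_4bit(a: int, b: int) -> int:
    """Unsigned 4 x 4 multiply using four stacked partial product stages."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    x, product = [0, 0, 0, 0], 0
    for j in range(4):
        y = partial_product_stage(a_bits, (b >> j) & 1, x)
        product |= y[0] << j                       # LSB of each row is final
        x = y[1:]                                  # upper bits feed the next row
    for i, bit in enumerate(x):
        product |= bit << (4 + i)                  # last row gives p4..p7
    return product

print(multiply_4bit(0b1101, 0b1011))   # 13 x 11
```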
Distributed Multiplier Cell (Slide 43)
• This shows the top half of a slice, which implements one multiplier cell.
[Figure: upper half of a slice with inputs A, B, S and Cin, LUT output D, and outputs Sout and Cout, shown alongside the equivalent multiplier cell (AND gate plus full adder, FA)]
NOTE: This implementation features a Virtex-II Pro FPGA.
Notes:
Picture of Xilinx Virtex-II Pro slice (upper half) taken from "Virtex-II Pro Platform FPGAs: Introduction and Overview", DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com
LUT implements the XOR of two ANDs:
[Figure: LUT with inputs G1 (B), G2 (A), G3 (S) and output D]
The dedicated MULTAND unit is required as the intermediate product G1G2 cannot be obtained from within the LUT, but is required as an input to MUXCY. The two AND gates perform a one-bit multiply each, and the result is added by the XOR plus the external logic (MUXCY, XORG):
Sout = CIN xor D, $C_{OUT} = \bar{D}AB + C_{IN}D$
This structure will perform one cell of the multiplier (see the next slide...).
Note that whereas the signal flow graph of the distributed multiplier shows signals propagating from the top and right of the diagram to the bottom, the internal structure of the FPGA slice logic results in a different configuration when implemented on a device.
ROM-based Multipliers (Slide 44)
• Just as logical functions such as XOR can be stored in a LUT as shown for addition, we can use storage-based methods to do other operations.
• By using a ROM, we can store the result of every possible multiplication of two operands.
• The two operands A and B are concatenated to form the address with which to access the ROM. The value stored at that address is the multiplication result, P:
[Figure: ROM-based multiplier with 2^8 = 256 8-bit addresses and 8-bit data. A = 1010 (decimal -6, 4 bits) and B = 0011 (decimal 3, 4 bits) form the address A:B = 1010 0011 (8 bits); the data stored there is the product P = 1110 1110 (decimal -18, 8 bits).]
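A ROM-based multiplier for two 4 bit signed operands can be sketched as a precomputed 256 entry table. This is an illustrative model; `to_signed`, `ROM` and `rom_multiply` are hypothetical names.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret a raw bit pattern as a 2's complement number."""
    return value - (1 << bits) if value >> (bits - 1) else value

# Address = A:B (A in the upper 4 bits); data = 8 bit product pattern
ROM = [(to_signed(addr >> 4, 4) * to_signed(addr & 0xF, 4)) & 0xFF
       for addr in range(256)]

def rom_multiply(a_pattern: int, b_pattern: int) -> int:
    """Look up the product of two 4 bit patterns."""
    return ROM[(a_pattern << 4) | b_pattern]

p = rom_multiply(0b1010, 0b0011)           # -6 x 3
print(format(p, "08b"), to_signed(p, 8))
```

No arithmetic happens at lookup time: the multiplication was done once, when the table was filled.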
Notes:
There is one serious problem with this technique: as the operand size grows, the ROM size grows exponentially. For two N bit input operands, there are $2^{2N}$ possible results, and hence the ROM has $2^{2N}$ entries. The output result is $2N$ bits long, and in total $2N \times 2^{2N}$ bits of storage are required.
For example, with 8 bit operands (a fairly reasonable size), 1Mbit of storage is required - a large quantity. For bigger operands e.g. 16 bits, a huge quantity of storage is required. 16 bit operands require 128Gbits of storage and hence a ROM-based multiplier is clearly not a realistic implementation choice!
Input Wordlength (N)   Output Wordlength (2N)   No. of ROM entries (2^2N)     Total ROM Storage (2N x 2^2N)
4                      8                        2^8  = 256                    2 Kbits
6                      12                       2^12 = 4,096                  48 Kbits
8                      16                       2^16 = 65,536                 1 Mbit
10                     20                       2^20 = 1,048,576              20 Mbits
12                     24                       2^24 = 16,777,216             384 Mbits
14                     28                       2^28 = 268,435,456            7 Gbits
16                     32                       2^32 = 4,294,967,296          128 Gbits
18                     36                       2^36 = 68,719,476,736         2.25 Tbits
20                     40                       2^40 = 1,099,511,627,776      40 Tbits
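The storage figures in the table follow directly from the 2N x 2^2N expression, which can be evaluated as below. A small sketch; `rom_storage_bits` is a hypothetical name, and Kbit, Mbit and Gbit are taken as powers of 1024, which is what the table's values imply.

```python
def rom_storage_bits(n: int) -> int:
    """Total ROM bits for two n bit operands: 2n output bits x 2^(2n) entries."""
    return (2 * n) * 2 ** (2 * n)

for n in (4, 8, 16):
    print(n, 2 ** (2 * n), rom_storage_bits(n))
```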
Input Wordlength and ROM Addresses (Slide 45)
• Consider a ROM multiplier with 8-bit inputs: 65,536 16-bit locations are required to store all possible outputs... so 1Mbit of storage is needed!
[Figure: ROM-based multiplier with 2^16 = 65,536 16-bit addresses and 16-bit data. A = 0110 1011 (decimal 107, 8 bits) and B = 0100 0011 (decimal 67, 8 bits) form the address A:B = 0110 1011 0100 0011 (address 27,459, 16 bits); the data stored there is the product P = 0001 1100 0000 0001 (decimal 7,169, 16 bits).]
Notes:
For example, if the B input was the constant value 75, the possible input words would be composed of 256 possible combinations of the upper 8-bits of the address, concatenated with the 8-bit binary word 0100 1011, as shown below. The result is that only 256 of the 65,536 memory locations are actually accessed.
[Figure: address A:B (16 bits) formed from the unknown input A = ???? ???? (8 bits) and the constant B = 75 = 0100 1011 (8 bits)]
addresses (decimal):
0 x 2^8 + 75 = 75
1 x 2^8 + 75 = 331
2 x 2^8 + 75 = 587
3 x 2^8 + 75 = 843
...
253 x 2^8 + 75 = 64,843
254 x 2^8 + 75 = 65,099
255 x 2^8 + 75 = 65,355
Therefore, when one of the inputs to the ROM-based multiplier is fixed, the size of the required ROM can be reduced to 256 locations of 16-bit data (note that the precision of the stored output words remains 8 bits + 8 bits). The total memory required is thus 256 x 16 = 4kbits.
However, depending on the value of the constant, it may also be possible to reduce the length of the stored results. For instance, if the value of B is (decimal) 10, the maximum output product generated by the multiplication of B with any 8-bit input A will be:
$-2^{8-1} \times 10 = -1280$
As -1280 can be represented with 12 bits, that represents a further saving of 4 bits storage x 256 memory locations = 1kbit.
The total storage requirement for this example constant coefficient multiplier would therefore be 3 kbits... significantly smaller than the 1Mbit needed for a multiplier with a 16-bit address, where both operands are unknown!
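The storage saving for a constant coefficient can be worked out by finding the narrowest signed wordlength that holds every possible product. This is a sketch assuming an 8 bit signed input A; `signed_bits_needed` and `constant_rom_bits` are hypothetical names.

```python
def signed_bits_needed(lo: int, hi: int) -> int:
    """Smallest 2's complement wordlength that can hold both lo and hi."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def constant_rom_bits(constant: int, n: int = 8) -> tuple[int, int]:
    """Return (output wordlength, total ROM bits) for an n bit signed input."""
    products = [a * constant for a in range(-(1 << (n - 1)), 1 << (n - 1))]
    width = signed_bits_needed(min(products), max(products))
    return width, width * 2 ** n

print(constant_rom_bits(10))    # 12 bit results, 3 kbits in total
print(constant_rom_bits(-83))   # 15 bit results
```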
ROM-based Constant Multipliers (Slide 46)
• ROM-based multipliers with a constant input require fewer addresses.
• The storage required for output words may also be reduced, if the maximum result does not require the full numerical range of:
$-2^{2N-1} \le \text{result} \le 2^{2N-1} - 1$
• The maximum product and output wordlength can be calculated for the particular constant value, and the multiplier optimised accordingly...
[Figure: A = ? is an 8 bit signed number (maximum absolute value at A = -128); B = -83 is the constant (8 bit representation, min.); P = ? is nominally a 16 bit signed result. The maximum product = 10,624, so a maximum 15-bit representation is required: 1 bit saved!]
• Additional optimisations allow cost to be reduced further.
Notes:
Constant multipliers can be implemented using the LUTs within the logic fabric ("distributed ROM"), or with one or more of the Block RAMs available on most devices. The selection is influenced by the other demands placed on these resources by the rest of the system being designed.
In System Generator, the designer can specify the implementation style via the Constant Multiplier dialog box, along with the constant value, the output wordlength, and other parameters.
Multiplication by Shift and Add
• Multiplication by a power-of-2 can be achieved simply by shifting the number to the right or left by the appropriate number of binary places.
• Extending this a little, multiplications by other numbers can be performed at low cost by creating partial products from shifts, and then adding them together.

[Figure: worked examples.
x16: -31 x 2^4 = -496
x0.25: 0.625 x 2^-2 = 0.15625
x9: (21 x 2^3) + 21 = 189
x1.3125: (1 x 2^-4) + 1 + (1 x 2^-2) = 1.3125]
Notes:
Shift operations are effectively “free” in terms of logic resources, as they are implemented only using routing. Therefore multiplications by power-of-two numbers are very cheap! By recognising that multiplications by other numbers can be achieved by summing partial products of power-of-two shifts, any arbitrary multiplication can be decomposed into a series of shift and add operations. The “closer” the desired multiplication is to a power-of-two, i.e. the fewer partial products that are required, the fewer adders are required, and hence the lower the cost of the multiplier.

This type of multiplier is suitable only for constant multiplications, because there is only one input, and the result is achieved using the configuration of the hardware.

The technique can be particularly powerful when applied to parallel multiplications of the same input. The partial product terms common to several multiplications can be shared and thus the overall effort reduced. Transpose form filters are very suitable for optimisation in this way.

Taking the above simple example of two concurrent multiplications, one x9 and the other x24, it is clear that the shift left by three places can be shared, as x8 is common to both operations.

[Figure: x24 and x9 calculated separately (x24 = x16 + x8 using shifts of 4 and 3; x9 = x8 + x1 using a shift of 3), and the combined version in which the x8 term is shared, giving fewer partial products.]
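The x9 and x24 example can be sketched in Python to show the sharing of the x8 partial product. This is an illustrative software model only: the function names are hypothetical, and Python's `<<` on integers stands in for the routing-only shifts of the hardware.

```python
def shift_add_multiply(x, constant):
    """Multiply x by a non-negative integer constant using shifts and adds.

    One partial product (x << k) is created for every '1' bit of the
    constant, so a constant with p set bits costs p - 1 adders.
    """
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift   # partial product: x times 2**shift
        constant >>= 1
        shift += 1
    return result


def times9_and_times24_shared(x):
    """Compute 9x and 24x together, sharing the common x8 term."""
    x8 = x << 3                  # shared partial product
    times9 = x8 + x              # x8 + x1
    times24 = (x << 4) + x8      # x16 + x8
    return times9, times24


print(shift_add_multiply(21, 9))       # the x9 example from the slide
print(times9_and_times24_shared(21))
```

In the shared version only one x8 term is formed and two adders are used in total; the saving grows when more constants share terms, as in a transpose form filter.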
Embedded multipliers
• The Xilinx Virtex-II and Virtex-II Pro series were the first to provide “on-chip” multipliers, in the early 2000s.
• These are in hardware on the FPGA ASIC, not actually in the user FPGA “slice-logic-area”. They are therefore permanently available, and they use no slices. They also consume less power than a slice-based equivalent and can be clocked at the maximum rate of the device.
• A and B are 18-bit input operands, and P is the 36-bit product, i.e. $P = A \times B$.
• Depending upon the actual FPGA, between 12 and more than 2000 (Virtex 6 top of range) of these dedicated multipliers are available.

[Figure: 18x18 bit multiplier with inputs A and B and output P.]
Notes:
Looking at a device floorplan, you can clearly see the embedded multipliers, which are located next to Block RAMs on the FPGA in order to support high speed data fetching/writing and computation.

Information on dedicated multipliers taken from “Virtex-II Pro Platform FPGAs: Introduction and Overview”, DS083-1 (v2.4) January 20, 2003. http://www.xilinx.com.

[Figure: device floorplan showing Block RAM, slices and 18x18 multipliers.]
Embedded Multiplier Efficiency
• It can be easy to utilise on chip embedded multipliers inefficiently through choice of wordlengths...
• When using multipliers in System Generator....be careful

[Figure: an 18 x 18 multiply (36-bit result) uses 1 embedded multiplier, 100% utilised; a 4 x 4 multiply (8-bit result) uses 1 embedded multiplier, ~5% utilised. For an 18 x 18 multiply (36-bit result) SysGen will use 1 embedded mult; for a 19 x 19 multiply (38-bit result) SysGen will use 4 embedded mults.]
Notes:
If you specify the use of embedded multipliers for a particular multiplier in System Generator, the tool will do exactly as you have asked, and implement it entirely using embedded multipliers. However, depending on the wordlengths involved, this may lead to an inefficient implementation.

The wordlengths of the embedded multipliers are fixed at 18 x 18 bits, and it makes sense to use them as fully as possible.

It is relatively easy to see that a 4 x 4 bit multiply will greatly underuse the capabilities of the multiplier, and this particular multiply operation might be better mapped to a distributed implementation, which would leave the embedded multiplier free for use somewhere else. Of course, these decisions are made in the context of some larger design with its own particular needs for the various resources available on the FPGA being targeted.

Perhaps less obviously, mapping a multiplication to embedded multipliers where the input operands are slightly longer than 18 bits is also inefficient. This may result in, for example, the following implementation for a requested 19 x 19 bits multiplier, where 4 embedded multipliers are used instead of the expected 1!

[Figure: a 19 x 19 multiply (38-bit result) split across four embedded multipliers performing 18 x 18, 18 x 1, 1 x 18 and 1 x 1 partial multiplications.]
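The utilisation figures quoted above can be reproduced with a very simple tiling model, sketched here in Python. The model is an assumption made for illustration: each operand is cut into 18-bit pieces and one embedded multiplier is counted per pair of pieces, as in the 19 x 19 figure. The function names are hypothetical, and a real tool may map signed operands differently.

```python
import math


def embedded_multipliers_needed(width_a, width_b, native=18):
    """Count native x native multipliers under a simple tiling model."""
    return math.ceil(width_a / native) * math.ceil(width_b / native)


def utilisation(width_a, width_b, native=18):
    """Fraction of the allocated multiplier capacity that is actually used."""
    count = embedded_multipliers_needed(width_a, width_b, native)
    return (width_a * width_b) / (count * native * native)


for widths in [(18, 18), (4, 4), (19, 19)]:
    print(widths, embedded_multipliers_needed(*widths),
          f"{utilisation(*widths):.1%}")
```

Under this model a 4 x 4 multiply uses roughly 5% of one embedded multiplier, and growing the operands from 18 to 19 bits quadruples the number of multipliers used.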
High Speed Arithmetic Slices (DSP48s)
• As much DSP involves the Multiply-Accumulate (MAC) operation, soon after embedded multipliers came DSP48 slices (on the Virtex-4).
• These feature an 18 x 18 bit multiplier followed by a 48 bit accumulator.
• Like the embedded multipliers, these are low power and fast.
• The ability to cascade slices together also means that whole filters can be constructed without having to use any logic slices.

[Figure: Virtex-4 DSP48, with two 18-bit inputs, a 36-bit product and 48-bit accumulator paths.]
Notes:
The next series of FPGAs (the Virtex-5) enhanced the capabilities of the DSP slice with the DSP48E.

The major improvements of this slice are logic capabilities within the adder/subtractor unit, and an extended wordlength of one input to 25 bits. The maximum clock frequency also increased in line with the speed of the device.

[Figure: Virtex-5 DSP48E, with 25-bit and 18-bit inputs, a 43-bit product and 48-bit accumulator paths.]
DSP48s with Pre-Adders
• The Spartan-3A DSP series and subsequent Spartan-6 feature a version of the DSP48 with a pre-adder, prior to the multiplier.
• This feature is especially useful for DSP structures like symmetric filters, because it allows the total number of multiplications to be reduced.

[Figure: Spartan-3A DSP DSP48A and Spartan-6 DSP48A1, with an 18-bit + 18-bit pre-adder, an 18-bit second input, a 36-bit product and 48-bit accumulator paths.]
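The saving for symmetric filters can be illustrated with a small Python sketch. It is a behavioural illustration only, not a model of the DSP48A itself: the function names are hypothetical, and an even-length filter with coefficients satisfying h[k] = h[N-1-k] is assumed.

```python
def fir_direct(window, coeffs):
    """One output sample of an FIR filter: one multiply per coefficient."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric_preadd(window, half_coeffs):
    """Same output for a symmetric filter, using a pre-adder.

    Samples that share a coefficient are added first, so only
    len(half_coeffs) multiplies are needed instead of 2 * len(half_coeffs).
    """
    n = len(window)
    acc = 0
    for k, c in enumerate(half_coeffs):
        pre_sum = window[k] + window[n - 1 - k]   # pre-adder
        acc += c * pre_sum                        # multiply-accumulate
    return acc


coeffs = [1, 3, 3, 1]      # symmetric example coefficients (assumed)
window = [5, -2, 7, 4]     # four most recent input samples (assumed)
print(fir_direct(window, coeffs), fir_symmetric_preadd(window, coeffs[:2]))
```

Both functions produce the same output sample, but the pre-adder version performs two multiplications rather than four.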
Notes:
The Virtex-6 offers a combination of the benefits of the Virtex-5 (the longer wordlength and arithmetic unit), together with the pre-adder from the Spartan series. This results in a very computationally powerful device, especially as it can be clocked at 600MHz, and the largest chips have 2000+ of them!

[Figure: Virtex-6 DSP48E1, with a 25-bit + 25-bit pre-adder, an 18-bit second input, a 43-bit product and 48-bit accumulator paths.]
Division (i)
• Divisions are sometimes required in DSP, although not very often.
• 6 bit non-restoring division array:
• Note that each cell can perform either addition or subtraction: as shown in an earlier slide, either Sin + Bin or Sin - Bin can be selected.

[Figure: 6 bit non-restoring division array computing Q = B / A, with inputs b0..b5 and a0..a5 and quotient outputs q5..q0. Each cell is a full adder (FA) with ports sin, sout, bin, bout, Bin, Bout, cin and cout.]
Notes:
A direct method of computing division exists. This “paper and pencil” method may look familiar as it is often taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0, the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example into the structure shown on the slide.

Example: B = 01011 (11), A = 01101 (13) ⇒ -A = 10011. Compute Q = B / A.

Step        Operand   Result        Carry   Quotient bit
R0 = B      01011
 - A        10011     R1 = 11110    0       q4 = 0
2.R1        11100
 + A        01101     R2 = 01001    1       q3 = 1
2.R2        10010
 - A        10011     R3 = 00101    1       q2 = 1
2.R3        01010
 - A        10011     R4 = 11101    0       q1 = 0
2.R4        11010
 + A        01101     R5 = 00111    1       q0 = 1

Q = B / A = 01101 x 2^-4 = 0.8125
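The worked example above can be traced with a short Python sketch of non-restoring division. It is a minimal software model under stated assumptions: the function name and the `width` parameter are hypothetical, operands are assumed to satisfy 0 <= B < A < 2^(width-1), and the quotient bit of each stage is taken from the carry out of a `width`-bit addition, as in the notes.

```python
def nonrestoring_divide(b, a, width=5):
    """Return the quotient bits (MSB first) of B / A by non-restoring division.

    Each stage adds either +A or -A (in two's complement) to the shifted
    partial remainder; the carry out of the addition is the quotient bit.
    """
    mask = (1 << width) - 1
    minus_a = (-a) & mask          # two's complement of the divisor
    remainder = b
    addend = minus_a               # the first stage always subtracts A
    bits = []
    for _ in range(width):
        total = remainder + addend
        carry = (total >> width) & 1
        bits.append(carry)                       # quotient bit, MSB first
        addend = minus_a if carry else a         # 1: subtract next, 0: add next
        remainder = ((total & mask) << 1) & mask # shift partial remainder left
    return bits


bits = nonrestoring_divide(0b01011, 0b01101)
quotient = int("".join(map(str, bits)), 2)
print(bits, quotient * 2 ** -4)   # bits are weighted 2^0, 2^-1, ..., 2^-4
```

With B = 11 and A = 13 the bits come out in the order q4 to q0, matching the MSB-first behaviour of the array.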
Division (ii)
• There is an alternative way to compute division using another paper and pencil technique.

[Figure: long division of 01011.0000 by 01101 giving the quotient 00000.1101, shown alongside a VHDL design with signals divisor_in and remdsh1.]
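One common reading of this second paper and pencil technique is restoring (shift and subtract) long division; that interpretation is an assumption here, since the transcript keeps only the operands and the result. The sketch below is a minimal Python model with a hypothetical function name, assuming non-negative integers and 4 fractional quotient bits as in the example.

```python
def long_divide(dividend, divisor, frac_bits=4):
    """Binary long division by trial subtraction (restoring division).

    Returns (quotient, remainder), where the quotient has `frac_bits`
    fractional bits, i.e. its value is quotient * 2**-frac_bits.
    """
    remainder = dividend
    quotient = 0
    for step in range(frac_bits + 1):
        if step > 0:
            remainder <<= 1                 # bring down the next (zero) bit
        if remainder >= divisor:            # trial subtraction succeeds
            remainder -= divisor
            quotient = (quotient << 1) | 1
        else:                               # keep ("restore") the remainder
            quotient = quotient << 1
    return quotient, remainder


q, r = long_divide(0b01011, 0b01101)
print(format(q, "05b"), q * 2 ** -4, format(r, "04b"))
```

For 11 / 13 this gives the same quotient bits as the non-restoring method, again produced MSB first.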
The Problem With Division
• An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction!
• This has implications for the rest of the system.
• It is unlikely that the quotient can be passed on to the next stage until all the bits are computed - hence slowing down the system!
• Also, an N by N array has another problem - ripple through adders.
• Note that we must wait for N full adder delays before the next row can begin its calculations.
• Unlike multiplication there is no way around this, and as a result division is always slower than multiply even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!
Notes:
By looking at the top two rows of a 4 x 4 division array we can see that the first bit to get generated is the MSB of the quotient. This is unlike the multiplication array that can also be seen below, where the LSB is generated first. This is a problem when using division as most operations require the LSBs to start a computation and hence the whole solution will have to be generated before the next stage can begin.

Another problem for division is the fact that it takes N full adder delays before the next row can start. In the examples below, the order in which the cells can start has been shown. So for the multiplier, the first cell on the second row is the 3rd cell to start working. However, for the divider, the first cell on the second row is only the 5th cell to start working because it has to wait for the 4 cells on the first row to finish.

[Figure: top two rows of a 4 x 4 division array (inputs a0..a3 and b0..b3, quotient bits q3 and q2) and of a 4 x 4 multiplication array (product bits p0 and p1), with the cells numbered in the order in which they can start. FA is full adder.]
Pipelining The Division Array
• The division array shown earlier can be pipelined to increase throughput.

[Figure: 6 bit division array computing Q = B / A, with pipeline delays inserted between rows and on the operands (b0..b5, a0..a5); quotient outputs q5..q0.]
Notes:
To increase the throughput, the critical path can be broken down by implementing pipeline delays at appropriate points. If pipelining is not used, the delay (critical path) from new data arriving to registering the full quotient is N^2 full adders. This delay sets the maximum rate at which new data can enter the array. However, by pipelining the array, the critical path is broken down to just N full adders and thus the rate at which new data can arrive is increased dramatically.

[Figure: 4 x 4 division array without and with pipelining. The longest path from register to register is the Critical Path. Without pipelining the critical path is through N^2 full adders. With pipelining the critical path is only N full adders.]
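The effect on input rate can be put into numbers with a short Python sketch. The full adder delay of 1 ns is a purely hypothetical figure chosen for illustration, the function name is an assumption, and register overheads are ignored.

```python
def max_input_rate_hz(n_bits, full_adder_delay_s, pipelined):
    """Maximum rate at which new operands can enter an N x N division array.

    Critical path: N**2 full adders without pipelining, N full adders
    when a pipeline register follows every row.
    """
    full_adders_in_path = n_bits if pipelined else n_bits ** 2
    return 1.0 / (full_adders_in_path * full_adder_delay_s)


N = 6
T_FA = 1e-9   # assumed full adder delay of 1 ns (illustrative only)
plain = max_input_rate_hz(N, T_FA, pipelined=False)
piped = max_input_rate_hz(N, T_FA, pipelined=True)
print(f"unpipelined: {plain / 1e6:.1f} MHz, pipelined: {piped / 1e6:.1f} MHz")
```

Whatever the actual full adder delay, the ratio between the two rates is N, so the gain from pipelining grows with the wordlength. The latency of a single division is not reduced.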
Square Root (i)
• 6 bit non-restoring square root array.
• The square root is found (with divides) in DSP in algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.

[Figure: 6 bit non-restoring square root array computing $B = \sqrt{A}$, with inputs a7..a0 taken in pairs (a7 a6, a5 a4, a3 a2, a1 a0) and outputs b5..b0. Each cell is a full adder (FA) with ports sin, sout, bin, bout, Bin, Bout, cin and cout.]
Notes:
Looking carefully at the non-restoring square root array, we can note that this array is essentially “half” of the division array! If the division array above is cut diagonally from the left we can see the cells that are needed for the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So square root can be performed twice as fast as divide using half of the hardware!

[Figure: the non-restoring square root array with a worked binary example. Each result bit is taken from the carry of its row, giving b3 = 1, b2 = 1, b1 = 0, b0 = 1; the partial remainders are labelled R1 to R4.]
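A software sketch of a non-restoring integer square root is given below. It follows a commonly used formulation of the algorithm rather than the exact cell wiring of the array; the function name and the 8-bit radicand width are assumptions made for illustration.

```python
def nonrestoring_sqrt(radicand, pairs=4):
    """Integer square root of a (2 * pairs)-bit number, one result bit per step.

    The radicand is consumed two bits at a time, MSB pair first. A negative
    partial remainder is not restored; the next step adds instead of subtracts.
    """
    root = 0
    remainder = 0
    for i in range(pairs - 1, -1, -1):
        next_pair = (radicand >> (2 * i)) & 0b11
        if remainder >= 0:
            remainder = remainder * 4 + next_pair - (root * 4 + 1)
        else:
            remainder = remainder * 4 + next_pair + (root * 4 + 3)
        root = root * 2 + (1 if remainder >= 0 else 0)   # result bit, MSB first
    if remainder < 0:                                     # final correction only
        remainder += root * 2 + 1
    return root, remainder


print(nonrestoring_sqrt(169))   # a perfect square
print(nonrestoring_sqrt(156))   # leaves a non-zero remainder
```

As with division, the result bits appear MSB first, and each step is a single add or subtract of a value built from the bits found so far.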
Square Root and Divide - Pythagoras!
• The main appearance of square roots and divides is in advanced adaptive algorithms such as QR using Givens rotations.
• For these techniques we often find equations of the form: $\cos\theta = \dfrac{x}{\sqrt{x^2 + y^2}}$ and $\sin\theta = \dfrac{y}{\sqrt{x^2 + y^2}}$
• So in fact we actually have to perform two squares, a divide and a square root. (Note that squaring is “simpler” than multiply!)
• There are a number of iterative techniques that can be used to calculate square root. (However these routines invariably require multiplies and divides and do not converge in a fixed time.)
• There seems to be some misinformation out there about square roots: For FPGA implementation square roots are easier and cheaper to implement than divides....!
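The two equations can be checked numerically with a short floating-point sketch in Python. The function name `givens_cos_sin` and the example values are assumptions for illustration; a fixed-point FPGA implementation would use the array structures described earlier instead of library calls.

```python
import math


def givens_cos_sin(x, y):
    """Return (cos theta, sin theta) for the rotation that zeroes y.

    Uses two squares, one square root and a divide per output,
    as in the equations above.
    """
    magnitude = math.sqrt(x * x + y * y)   # Pythagoras
    if magnitude == 0.0:
        return 1.0, 0.0                    # no rotation needed for a zero vector
    return x / magnitude, y / magnitude


c, s = givens_cos_sin(3.0, 4.0)
rotated_x = c * 3.0 + s * 4.0     # becomes the vector magnitude
rotated_y = -s * 3.0 + c * 4.0    # rotated to zero
print(c, s, rotated_x, rotated_y)
```

Applying the rotation to the original (x, y) pair leaves the vector magnitude in the first element and zero in the second, which is the step that QR decomposition repeats for each element to be eliminated.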