Digital VLSI ADigital VLSI ADigitalVLSIADigitalVLSIA
Pipelining&ParaPipelining&Para
MahdiS
fSharifUniversity
M.Shabany,Digital
Architectures:Architectures:Architectures:Architectures:
allelProcessingallelProcessing
Shabany
fyofTechnology
lVLSIArchitectures
ArchitecturalTechniques: Criticalpathinanydesignisthelonge
1. Anytwointernallatches/flipflops
2. An input pad and an internal latch2. Aninputpadandaninternallatch
3. Aninternallatchandanoutputpad
4. Aninputpadandanoutputpad
UseFFsrightafter/before
input/outpadstoavoid
th l t th
InputPad
thelastthreecases
(offchipandpackagingdelay)
The maximum delay between anyThemaximumdelaybetweenanytwosequentialelementsinadesignwilldeterminethemax
clockspeed
M.Shabany,Digital
:CriticalPathestpathbetween
d
Comb.Logic
2
31OutputPad
4
lVLSIArchitectures
DigitalDesignMetrics
Threeprimaryphysicalcharacter Speed Speed
Throughput Latency Timing Timing
Area Power
M.Shabany,Digital
isticsofadigitaldesign:
lVLSIArchitectures
DigitalDesignMetrics
Speed Throughput :oug put
Theamountofdatathatispro
Latency Latency Thetimebetweendatainput
Timing Timing ThelogicdelaysbetweensequWhenadesigndoesnotmeetcritical path is greater than the tcriticalpathisgreaterthanthet
M.Shabany,Digital
ocessedperclockcycle(bitspersecond)
andprocesseddataoutput(clockcycle)
uentialelements(clockperiod)tthetimingitmeansthedelayofthetarget clock periodtargetclockperiod
lVLSIArchitectures
MaximumClockFrequenc
MaximumClockFrequency:
logicqclkmax TTTF
Tclkq :timefromclockarrivaluntildataa Tlogic : propagation delay through logic be Tlogic :propagationdelaythroughlogicbe Trouting :routingdelaybetweenflipflops Tsetup :minimumtimedatamustarriveat Tskew :propagationdelayofclockbetwee
M.Shabany,Digital
cy:CriticalPath
1
skewroutingsetup TTT1
arrivesatQ
etween flipflopsetweenflip flops
tDbeforethenextrisingedgeofclock
enthelaunchflipflopandthecaptureflipflop.
lVLSIArchitectures
Pipelining(toImproveThr
Pipelining: Comesfromtheideaofawate
waiting the water in the pipe towaitingthewaterinthepipeto Usedtoreducethecriticalpath
Advantageous: Reductioninthecriticalpath Higher throughput (number of Higherthroughput(numberof Increasestheclockspeed(orsa Reducesthepowerconsumptio
M.Shabany,Digital
roughput)
rpipe:continuesendingwaterwithouto be outobeouthofthedesign
computed results in a give time)computedresultsinagivetime)amplingspeed)onatsamespeed
lVLSIArchitectures
ArchitecturalTechniques:
Pipelining: Verysimilartotheassemblylinei Thebeautyofapipelineddesignibeforethepriordatahasfinished,massemblyline.
M.Shabany,Digital
:Pipelining
ntheautoindustry
sthatnewdatacanbeginprocessingmuchlikecarsareprocessedonan
lVLSIArchitectures
ArchitecturalTechniques: OriginalSystem:(Criticalpath=1
ComX
Com
Clk
Criticalpath=1 Maxo
Pipelinedversion:(Criticalpath= SmallerCriticalPathhighert l Longerlatency
Comb.LogicX
Clk
Criticalpath=2
M.Shabany,Digital
Pipe
:PipeliningMaxoperatingfreq:f1=1/1)
2cyclesmb Logic
f
latermb.Logic
peratingfreq:f1=1/1
2 Maxoperatingfreq:f2=1/2)throughput(2f1)
3cycleslater
Comb.Logicf
later
Maxoperatingfreq:f2=1/2
lVLSIArchitectures
liningRegister
ArchitecturalTechniques:
Pipelinedepth:0(NoPipeline) Criticalpath:3Adders
Latency : 0 Latency:0
( ) (
t1 t2
X(1)
Y(1)
X(2
Y(2
M.Shabany,Digital
:Pipelinedepth
wire w1,w2;assign w1=X+a;assign w2=w1+b;g ;assign Y=w2+c;
)
timet3
X(3)2)
2)
X(3)
Y(3)
lVLSIArchitectures
ArchitecturalTechniques:
Pipelinedepth:1(OnePipelinereg Criticalpath:2Adders
a(n) b(n) c
2
Latency : 1
X(n)w1 w2
Latency:1
( ) (2)
t1 t2
X(1)
Y(1)
X(2)
M.Shabany,Digital
:Pipelinedepth
gisterAdded)
wire w1;reg w2;assign w1=X+a;assign Y=w2+c;
c(n)
g ;
always@(posedgeClk)w2
ArchitecturalTechniques:
Pipelinedepth:2(OnePipelinereg Criticalpath:1Adder
Latency : 2 Latency:2t1 t2 t3
X(1) X(2)
M.Shabany,Digital
:Pipelinedepth
gisterAdded)reg w1,w2;assign Y = w2 + c;assign Y=w2+c;
always@(posedgeClk)beginw1
ArchitecturalTechniques:
Clockperiodandthroughputasaf
1 Clockperiod:
h h Tn1
Clk Throughput: nT
Addi i t l iAddingregisterlayersimprovestimingbydividingthecriticalpathintotwopathsofsmallerdelay
M.Shabany,Digital
:Pipelining
functionofpipelinedepth:
Clock Period
Throughput
3 4 5 6Pipeline Depth
lVLSIArchitectures
Pipeline Depth
ArchitecturalTechniques:
GeneralRule: Pipelininglatchescanonlybeofthecircuit.
Cutset: Cutset:Asetofpathsofacircuitsuchtcircuitbecomesdisjoint(i.e.,twoj
FeedForwardCutset: Acutsetiscalledfeedforwardforwarddirectiononallthepaths
M.Shabany,Digital
:Pipelining
placedacrossfeedforwardcutsets
thatifthesepathsareremoved,theseparatepieces)p p
cutsetifthedatamoveinthesofthecutset
lVLSIArchitectures
ArchitecturalTechniques:
Example: FIRFilter Threefeedforwardcutsetsare
M.Shabany,Digital
:Pipelining
eshownNOTafeedforwardcutset
X(n) X(n1) X(n2)
a b c
Y(n)
lVLSIArchitectures
ArchitecturalTechniques:CriticalPath:1M+2A
X(n) X(n1) X(n2)
a b c
assign w1 = a*Xn;
Y(n)w1 w2
w3
w4
assign w1=a Xn;assign w2=b*Xn_1;assign w3=w1+w2;assign w4=c*Xn_2;assign Y=w3+w4;g ;always@(posedgeClk)beginXn_1
ArchitecturalTechniques:
X(n) X(n1)
a b
2
Cloc Input 1 2 3
1 3
Clock
Input 1 2 3
0 X(0) aX(0) aX(0)
1 X(1) X(1) bX(0) X(1)+bX(0)1 X(1) aX(1) bX(0) aX(1)+bX(0)
2 X(2) aX(2) bX(1) aX(2)+bX(1)
3 X(3) aX(3) bX(2) aX(3)+bX(2)
M.Shabany,Digital
:Pipelining
X(n2)
c
4
4 5 Output
Y(n)3
45
4 5 Output
aX(0) Y(0)
X(1)+bX(0) Y(1) aX(1)+bX(0) Y(1)
cX(0) aX(2)+bX(1)+cX(0) Y(2)
cX(1) aX(3)+bX(2)+cX(1) Y(3)
lVLSIArchitectures
ArchitecturalTechniques:X(n) X(n1)
a b
1 2
Cloc Input 1 2 3
1
Clock
Input 1 2 3
0 X(0)
1 X(1) aX(0) aX(0)1 X(1) aX(0) aX(0)
2 X(2) aX(1) bX(0) aX(1)+bX(0)
3 X(3) aX(2) bX(1) aX(2)+bX(1)
M.Shabany,Digital
:PipeliningX(n2)
c
2 3 4 5
4 5 Output
Y(n)3 5
4 5 Output
aX(0) Y(0) aX(0) Y(0)
) aX(1)+bX(0) Y(1)
) cX(0) aX(2)+bX(1)+cX(0) Y(2)
lVLSIArchitectures
ArchitecturalTechniques:
Evenmorepipelining
Clock Input 1 2 3
0 X(0)
1 X(1) aX(0) -2 X(2) aX(1) bX(0) aX(0)
3 X(3) aX(2) bX(1) aX(1)+bX(0)
4 X(3) aX(2) bX(1) aX(2)+bX(1)
M.Shabany,Digital
:Pipelining
4 5 Output
- - aX(0) Y(0)
aX(1)+bX(0) Y(1)
cX(0) aX(2)+bX(1)+cX(0) Y(2)
lVLSIArchitectures
ArchitecturalTechniques:
Pipeliningattheoperationlevel Breakthemultiplierintotwop
FinePiPipe
M.Shabany,Digital
:FineGrainPipelining
arts
X(n) X(n1) X(n2)
a b cm1 m1 m1
Grainli i
m2 m2 m2
lining
Y(n)
lVLSIArchitectures
UnrollingtheLoopUsingP
CalculationofX3Throughput=8/3,or2.7bits/clockLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical
Iterativeimplementation:p Nonewcomputationscanbeginun
previouscomputationhascomplet
M.Shabany,Digital
Pipelining
module power3(outputreg[7:0]X3,output finished,
path input [7:0]X,input clk,start);reg [7:0]ncount;reg [7:0]Xpower,Xin;
ntiltheted
assign finished=(ncount ==0);always@(posedge clk)if (start)beginXPower
UnrollingtheLoopUsingP
CalculationofX3Throughput=8/1,or8bits/clock(3XLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical
Penalty:MoreAreaUnrollinganalgorithmwithniterativeloo
throughputbyafactorofn
Clk
X2
X[0:7]
xpower1 xpower2
M.Shabany,Digital
Clk
Pipelining
improvement)module power3(output reg[7:0]XPower,input clk,
pathinput clk,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X2;always @(posedge clk) beginalways @(posedgeclk)begin//Pipelinestage1XPower1
RemovingPipelineRegiste
CalculationofX3Throughput=8bits/clock(3XimprovLatency = 0 clocksLatency 0clocksTiming=Twomultipliersinthecritica
Latencycanbereducedbyremovingpipel
M.Shabany,Digital
ers(toImproveLatency)
vement)module power3(O t t [7 0] XP
lpathOutput [7:0]XPower,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X1,X2;l @*always @*XPower1=X;
always @(*)beginX2 XP 1
lineregisters
X2=XPower1;XPower2=XPower1*XPower1;
end
i XP XP 2 * X2assign XPower =XPower2*X2;
endmodule
lVLSIArchitectures
ArchitecturalTechniques: I ll l i h h Inparallelprocessingthesamehar
Increasesthethroughputwithoutcha Increasesthesiliconarea
X(n)
a(n)
Pipelining
ClockFreq:2f
Throughput:2Msamples
M.Shabany,Digital
:ParallelProcessingd i d li drdwareisduplicatedtoangingthecriticalpath
b(n)
Y(n)
ClockFreq:f
Throughput:Msamples
a(2k) b(2k)
ParallelProcessing
X(2k) Y(2k)
a(2k+1) b(2k+1)
X(2k+1) Y(2k+1)
ClockFreq:f
lVLSIArchitectures
q
Throughput:2Msamples
ArchitecturalTechniques:
Parallelprocessingfora3tapFIRf Bothhavethesamecriticalpath(M
X
ParallelFactor:3
1)cx(3kbx(3k)1)ax(3k1)y(3k
2)cx(3k1)bx(3kax(3k)y(3k)
cx(3k)1)bx(3k2)ax(3k2)y(3k
1)cx(3kbx(3k)1)ax(3k1)y(3k
Not a simple dup
M.Shabany,Digital
Notasimpledup
:ParallelProcessing
filterM+2A)
X(3k)X(3k+1)X(3k+2)
a b c
Y(3k+2)
X(3k2) X(3k1)
c a
Y(3k+1)
b
b c a
plication!
lVLSIArchitectures
Y(3k)
plication!
SamplePeriodvs.ClockPe
Pipelinedsystem:Tclk =Tsample
MClksample TTT
ParallelSystem:Tclk Tsamplep
112T(T
31
T31
T AMClksample
Higher Sample rate than the
M.Shabany,Digital
HigherSampleratethanthe
eriod
X(3k)
a b c
X(3k+1)X(3k+2)
Y(3k+2)
c a b
X(3k2) X(3k1)
Y(3k+1)
b c a
)A
clock rate
lVLSIArchitectures
Y(3k)
clockrate
CompleteParallelSystem
ACompleteParallelSystem:
M.Shabany,Digital
withS/PandP/S
lVLSIArchitectures
S/PandP/SBlocks
S/PConverter:
M.Shabany,Digital
P/SConverter:
Y(3k)Y(3k+1)Y(3k+2)
T T T
Y(n)
T
T/3T/3
SamplePeriT/3
T T
ParalleltoSerialConverter
lVLSIArchitectures
WhenPipeliningWhenPa
Pipelinetechniqueisusedwhenth(Number1,2,3,4)
ParallelismisusedwhenthecriticacommunicationorI/Obound.(Numb
Pipeliningdoesnothelpinthiscas a.k.a CommunicationBounded
3
Comb.Logic
1
2
531
4
M.Shabany,Digital
arallelism?
hecriticalpathisinthedesign
alpathisboundedbytheber5)se!
Comb.Logic
lVLSIArchitectures
Top Related