Download - Pipelining Parallel Processing

Digital VLSI ADigital VLSI ADigitalVLSIADigitalVLSIA

Pipelining&ParaPipelining&Para

MahdiS

fSharifUniversity

M.Shabany,Digital

Architectures:Architectures:Architectures:Architectures:

allelProcessingallelProcessing

Shabany

fyofTechnology

lVLSIArchitectures

ArchitecturalTechniques: Criticalpathinanydesignisthelonge

1. Anytwointernallatches/flipflops

2. An input pad and an internal latch2. Aninputpadandaninternallatch

3. Aninternallatchandanoutputpad

4. Aninputpadandanoutputpad

UseFFsrightafter/before

input/outpadstoavoid

th l t th

InputPad

thelastthreecases

(offchipandpackagingdelay)

The maximum delay between anyThemaximumdelaybetweenanytwosequentialelementsinadesignwilldeterminethemax

clockspeed

M.Shabany,Digital

:CriticalPathestpathbetween

d

Comb.Logic

2

31OutputPad

4

lVLSIArchitectures

DigitalDesignMetrics

Threeprimaryphysicalcharacter Speed Speed

Throughput Latency Timing Timing

Area Power

M.Shabany,Digital

isticsofadigitaldesign:

lVLSIArchitectures

DigitalDesignMetrics

Speed Throughput :oug put

Theamountofdatathatispro

Latency Latency Thetimebetweendatainput

Timing Timing ThelogicdelaysbetweensequWhenadesigndoesnotmeetcritical path is greater than the tcriticalpathisgreaterthanthet

M.Shabany,Digital

ocessedperclockcycle(bitspersecond)

andprocesseddataoutput(clockcycle)

uentialelements(clockperiod)tthetimingitmeansthedelayofthetarget clock periodtargetclockperiod

lVLSIArchitectures

MaximumClockFrequenc

MaximumClockFrequency:

logicqclkmax TTTF

Tclkq :timefromclockarrivaluntildataa Tlogic : propagation delay through logic be Tlogic :propagationdelaythroughlogicbe Trouting :routingdelaybetweenflipflops Tsetup :minimumtimedatamustarriveat Tskew :propagationdelayofclockbetwee

M.Shabany,Digital

cy:CriticalPath

1

skewroutingsetup TTT1

arrivesatQ

etween flipflopsetweenflip flops

tDbeforethenextrisingedgeofclock

enthelaunchflipflopandthecaptureflipflop.

lVLSIArchitectures

Pipelining(toImproveThr

Pipelining: Comesfromtheideaofawate

waiting the water in the pipe towaitingthewaterinthepipeto Usedtoreducethecriticalpath

Advantageous: Reductioninthecriticalpath Higher throughput (number of Higherthroughput(numberof Increasestheclockspeed(orsa Reducesthepowerconsumptio

M.Shabany,Digital

roughput)

rpipe:continuesendingwaterwithouto be outobeouthofthedesign

computed results in a give time)computedresultsinagivetime)amplingspeed)onatsamespeed

lVLSIArchitectures

ArchitecturalTechniques:

Pipelining: Verysimilartotheassemblylinei Thebeautyofapipelineddesignibeforethepriordatahasfinished,massemblyline.

M.Shabany,Digital

:Pipelining

ntheautoindustry

sthatnewdatacanbeginprocessingmuchlikecarsareprocessedonan

lVLSIArchitectures

ArchitecturalTechniques: OriginalSystem:(Criticalpath=1

ComX

Com

Clk

Criticalpath=1 Maxo

Pipelinedversion:(Criticalpath= SmallerCriticalPathhighert l Longerlatency

Comb.LogicX

Clk

Criticalpath=2

M.Shabany,Digital

Pipe

:PipeliningMaxoperatingfreq:f1=1/1)

2cyclesmb Logic

f

latermb.Logic

peratingfreq:f1=1/1

2 Maxoperatingfreq:f2=1/2)throughput(2f1)

3cycleslater

Comb.Logicf

later

Maxoperatingfreq:f2=1/2

lVLSIArchitectures

liningRegister


Pipelinedepth:0(NoPipeline) Criticalpath:3Adders

Latency : 0 Latency:0

( ) (

t1 t2

X(1)

Y(1)

X(2

Y(2

M.Shabany,Digital

:Pipelinedepth

wire w1,w2;assign w1=X+a;assign w2=w1+b;g ;assign Y=w2+c;

)

timet3

X(3)2)

2)

X(3)

Y(3)

lVLSIArchitectures


Pipelinedepth:1(OnePipelinereg Criticalpath:2Adders

a(n) b(n) c

2

Latency : 1

X(n)w1 w2

Latency:1

( ) (2)

t1 t2

X(1)

Y(1)

X(2)

M.Shabany,Digital

:Pipelinedepth

gisterAdded)

wire w1;reg w2;assign w1=X+a;assign Y=w2+c;

c(n)

g ;

always@(posedgeClk)w2


Pipelinedepth:2(OnePipelinereg Criticalpath:1Adder

Latency : 2 Latency:2t1 t2 t3

X(1) X(2)

M.Shabany,Digital

:Pipelinedepth

gisterAdded)reg w1,w2;assign Y = w2 + c;assign Y=w2+c;

always@(posedgeClk)beginw1


Clockperiodandthroughputasaf

1 Clockperiod:

h h Tn1

Clk Throughput: nT

Addi i t l iAddingregisterlayersimprovestimingbydividingthecriticalpathintotwopathsofsmallerdelay

M.Shabany,Digital

:Pipelining

functionofpipelinedepth:

Clock Period

Throughput

3 4 5 6Pipeline Depth

lVLSIArchitectures

Pipeline Depth


GeneralRule: Pipelininglatchescanonlybeofthecircuit.

Cutset: Cutset:Asetofpathsofacircuitsuchtcircuitbecomesdisjoint(i.e.,twoj

FeedForwardCutset: Acutsetiscalledfeedforwardforwarddirectiononallthepaths

M.Shabany,Digital

:Pipelining

placedacrossfeedforwardcutsets

thatifthesepathsareremoved,theseparatepieces)p p

cutsetifthedatamoveinthesofthecutset

lVLSIArchitectures


Example: FIRFilter Threefeedforwardcutsetsare

M.Shabany,Digital

:Pipelining

eshownNOTafeedforwardcutset

X(n) X(n1) X(n2)

a b c

Y(n)

lVLSIArchitectures

ArchitecturalTechniques:CriticalPath:1M+2A

X(n) X(n1) X(n2)

a b c

assign w1 = a*Xn;

Y(n)w1 w2

w3

w4

assign w1=a Xn;assign w2=b*Xn_1;assign w3=w1+w2;assign w4=c*Xn_2;assign Y=w3+w4;g ;always@(posedgeClk)beginXn_1


X(n) X(n1)

a b

2

Cloc Input 1 2 3

1 3

Clock

Input 1 2 3

0 X(0) aX(0) aX(0)

1 X(1) X(1) bX(0) X(1)+bX(0)1 X(1) aX(1) bX(0) aX(1)+bX(0)

2 X(2) aX(2) bX(1) aX(2)+bX(1)

3 X(3) aX(3) bX(2) aX(3)+bX(2)

M.Shabany,Digital

:Pipelining

X(n2)

c

4

4 5 Output

Y(n)3

45

4 5 Output

aX(0) Y(0)

X(1)+bX(0) Y(1) aX(1)+bX(0) Y(1)

cX(0) aX(2)+bX(1)+cX(0) Y(2)

cX(1) aX(3)+bX(2)+cX(1) Y(3)

lVLSIArchitectures

ArchitecturalTechniques:X(n) X(n1)

a b

1 2

Cloc Input 1 2 3

1

Clock

Input 1 2 3

0 X(0)

1 X(1) aX(0) aX(0)1 X(1) aX(0) aX(0)

2 X(2) aX(1) bX(0) aX(1)+bX(0)

3 X(3) aX(2) bX(1) aX(2)+bX(1)

M.Shabany,Digital

:PipeliningX(n2)

c

2 3 4 5

4 5 Output

Y(n)3 5

4 5 Output

aX(0) Y(0) aX(0) Y(0)

) aX(1)+bX(0) Y(1)

) cX(0) aX(2)+bX(1)+cX(0) Y(2)

lVLSIArchitectures


Evenmorepipelining

Clock Input 1 2 3

0 X(0)

1 X(1) aX(0) -2 X(2) aX(1) bX(0) aX(0)

3 X(3) aX(2) bX(1) aX(1)+bX(0)

4 X(3) aX(2) bX(1) aX(2)+bX(1)

M.Shabany,Digital

:Pipelining

4 5 Output

- - aX(0) Y(0)

aX(1)+bX(0) Y(1)

cX(0) aX(2)+bX(1)+cX(0) Y(2)

lVLSIArchitectures


Pipeliningattheoperationlevel Breakthemultiplierintotwop

FinePiPipe

M.Shabany,Digital

:FineGrainPipelining

arts

X(n) X(n1) X(n2)

a b cm1 m1 m1

Grainli i

m2 m2 m2

lining

Y(n)

lVLSIArchitectures

UnrollingtheLoopUsingP

CalculationofX3Throughput=8/3,or2.7bits/clockLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical

Iterativeimplementation:p Nonewcomputationscanbeginun

previouscomputationhascomplet

M.Shabany,Digital

Pipelining

module power3(outputreg[7:0]X3,output finished,

path input [7:0]X,input clk,start);reg [7:0]ncount;reg [7:0]Xpower,Xin;

ntiltheted

assign finished=(ncount ==0);always@(posedge clk)if (start)beginXPower

UnrollingtheLoopUsingP

CalculationofX3Throughput=8/1,or8bits/clock(3XLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical

Penalty:MoreAreaUnrollinganalgorithmwithniterativeloo

throughputbyafactorofn

Clk

X2

X[0:7]

xpower1 xpower2

M.Shabany,Digital

Clk

Pipelining

improvement)module power3(output reg[7:0]XPower,input clk,

pathinput clk,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X2;always @(posedge clk) beginalways @(posedgeclk)begin//Pipelinestage1XPower1

RemovingPipelineRegiste

CalculationofX3Throughput=8bits/clock(3XimprovLatency = 0 clocksLatency 0clocksTiming=Twomultipliersinthecritica

Latencycanbereducedbyremovingpipel

M.Shabany,Digital

ers(toImproveLatency)

vement)module power3(O t t [7 0] XP

lpathOutput [7:0]XPower,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X1,X2;l @*always @*XPower1=X;

always @(*)beginX2 XP 1

lineregisters

X2=XPower1;XPower2=XPower1*XPower1;

end

i XP XP 2 * X2assign XPower =XPower2*X2;

endmodule

lVLSIArchitectures

ArchitecturalTechniques: I ll l i h h Inparallelprocessingthesamehar

Increasesthethroughputwithoutcha Increasesthesiliconarea

X(n)

a(n)

Pipelining

ClockFreq:2f

Throughput:2Msamples

M.Shabany,Digital

:ParallelProcessingd i d li drdwareisduplicatedtoangingthecriticalpath

b(n)

Y(n)

ClockFreq:f

Throughput:Msamples

a(2k) b(2k)

ParallelProcessing

X(2k) Y(2k)

a(2k+1) b(2k+1)

X(2k+1) Y(2k+1)

ClockFreq:f

lVLSIArchitectures

q

Throughput:2Msamples


Parallelprocessingfora3tapFIRf Bothhavethesamecriticalpath(M

X

ParallelFactor:3

1)cx(3kbx(3k)1)ax(3k1)y(3k

2)cx(3k1)bx(3kax(3k)y(3k)

cx(3k)1)bx(3k2)ax(3k2)y(3k

1)cx(3kbx(3k)1)ax(3k1)y(3k

Not a simple dup

M.Shabany,Digital

Notasimpledup

:ParallelProcessing

filterM+2A)

X(3k)X(3k+1)X(3k+2)

a b c

Y(3k+2)

X(3k2) X(3k1)

c a

Y(3k+1)

b

b c a

plication!

lVLSIArchitectures

Y(3k)

plication!

SamplePeriodvs.ClockPe

Pipelinedsystem:Tclk =Tsample

MClksample TTT

ParallelSystem:Tclk Tsamplep

112T(T

31

T31

T AMClksample

Higher Sample rate than the

M.Shabany,Digital

HigherSampleratethanthe

eriod

X(3k)

a b c

X(3k+1)X(3k+2)

Y(3k+2)

c a b

X(3k2) X(3k1)

Y(3k+1)

b c a

)A

clock rate

lVLSIArchitectures

Y(3k)

clockrate

CompleteParallelSystem

ACompleteParallelSystem:

M.Shabany,Digital

withS/PandP/S

lVLSIArchitectures

S/PandP/SBlocks

S/PConverter:

M.Shabany,Digital

P/SConverter:

Y(3k)Y(3k+1)Y(3k+2)

T T T

Y(n)

T

T/3T/3

SamplePeriT/3

T T

ParalleltoSerialConverter

lVLSIArchitectures

WhenPipeliningWhenPa

Pipelinetechniqueisusedwhenth(Number1,2,3,4)

ParallelismisusedwhenthecriticacommunicationorI/Obound.(Numb

Pipeliningdoesnothelpinthiscas a.k.a CommunicationBounded

3

Comb.Logic

1

2

531

4

M.Shabany,Digital

arallelism?

hecriticalpathisinthedesign

alpathisboundedbytheber5)se!

Comb.Logic

lVLSIArchitectures