
CS250, UC Berkeley, Fall ‘20, Lecture 09

CS250 VLSI Systems Design

Fall 2020

John Wawrzynek

with

Arya Reais-Parsi


Project Update
‣ Six teams formed
‣ Assignments made based on your preferences
‣ All working towards one FPGA design:
  ‣ Hybrid array with fine-grained logic blocks, wide multiply/accumulate blocks, block RAMs
  ‣ Tool support, complete layout

[Figure: interaction/coordination graph linking the Fabric, Interconnect, Configuration, CLB, MAC, and SRAM teams]


Project Teams
‣ Fabric: Jinyue Zhu, Philip, Tan, (Arya)
  ‣ High-level fabric architecture
  ‣ clocks, power, metal layer assignments
  ‣ FPGA tool flow (Yosys/NextPnR or VTR)
  ‣ Test circuits/benchmarks
  ‣ Chip-level simulation and layout integration
‣ MAC: Ryan Lund, Anson
  ‣ hard block design and implementation
  ‣ multiply/add, ALU functions?
  ‣ configurable data-path width


Project Teams
‣ SRAM: Rohan, Adhiraj
  ‣ Dense RAM block design and implementation
  ‣ Configurable width/depth
  ‣ “OpenRAM” (UCSC), first option
‣ CLB: Kareem, Ryan Thornton
  ‣ consider several design alternatives (s44 is a good bet)
  ‣ include carry logic
  ‣ consider both standard cell and custom layout


Project Teams
‣ Interconnect: Yukio, Nate
  ‣ Two designs:
    ‣ Traditional “island style” with connection boxes and switch boxes. Perhaps “Wilton” switch box design.
    ‣ Novel (column oriented)
  ‣ Share layout pieces (programmable interconnection points)
‣ Configuration: Josh, Aled
  ‣ programming interface, internal structure
  ‣ granularity/mechanisms for partial reconfiguration


Project Team Presentations
‣ In class, Oct 1 (next Thursday)
‣ Target 10 minutes (with discussion)
‣ Slides with illustrations (PowerPoint, …)
‣ One presentation each from: config, CLB, SRAM, MAC teams
‣ Two interconnection presentations
‣ Three fabric team presentations:
  ‣ tool support
  ‣ high-level fabric architecture
  ‣ simulation, testing, integration plan
‣ Following week, private group meetings with Arya & John to get feedback and brainstorm ideas


Project Team Presentations
‣ The point is to get the discussion going on the function and implementation of your piece.
‣ You are responsible for a “straw-man”/draft proposal
‣ Okay to leave some issues open for now
‣ Outline
  ‣ Only one person needs to speak, but introduce team members
  ‣ Describe your proposed function/features and structure (block diagram/circuit) of your piece
  ‣ Describe how you plan to refine the definitions of function/structure and to optimize the design
  ‣ Say something about implementation strategy
  ‣ Say something about what information you will need from other teams and what other teams will need from you


Project Teams
‣ If you have questions about how you ended up in which team, mail me or set up an appointment
‣ If you have questions about your team’s role and responsibility, ask now, or mail us later
‣ If you don’t have email contacts for your other team members, ask now, or mail us later
‣ To prepare for the presentations next week, it is not necessary right now to reach out to other groups, but feel free to do so



Circuits Topics: Basic (review?)

‣ Processing/devices: planar, FinFETs, GDR
‣ Device models: switch, RC, Vth
‣ Logic circuits: gates, muxes, transmission gates, FFs
‣ Circuit delay: gate delay, wire delay, FET sizing
‣ Circuit power: formulation/factors
‣ System delay: factors, optimization
‣ System power: factors, optimization

What you need to know as a VLSI Systems designer.


Logic Circuit
‣ Logic gates in transistors
‣ Transmission gates
‣ Tri-state buffers
‣ Multiplexor circuits
‣ Latch/flip-flop circuits
‣ SRAM circuits


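Latches and Flip-flops

Positive level-sensitive latch:
[Figure: latch symbol]

Latch transistor level / latch implementation:
[Figure: latch built from transmission gates controlled by clk and clk’]

Positive edge-triggered flip-flop built from two level-sensitive latches:
[Figure: two cascaded latches clocked on clk’ and clk]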


SRAM Cell Array Details

Most common is the 6-transistor (6T) cell array.

[Figure: 6T cell array; each row shares a word line and each column shares a differential pair of bit lines]

The word line selects this cell, and all others in a row.

For a write operation, the column bit lines are driven differentially (0 on one, 1 on the other). The driven values overwrite the cell state.

For a read operation, the column bit lines are equalized (set to the same voltage), then released. The cell pulls down one bit line or the other.


Generic Memory Block
‣ Word lines are used to select a row for reading or writing
‣ Bit lines carry data to/from the periphery
‣ Core aspect ratio: keep close to 1 to help balance delay on word line versus bit line
‣ Address bits are divided between the two decoders
‣ Row decoder is used to select a word line
‣ Column decoder is used to select one or more columns for input/output of data

Storage cell could be either static or dynamic
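The division of address bits between the row and column decoders can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the choice of high-order bits for the row and low-order bits for the column is an assumption, not something the slide specifies.

```python
def split_address(addr, row_bits, col_bits):
    """Split a flat address into (row, column) indices.

    Assumption: high-order bits go to the row decoder,
    low-order bits go to the column decoder.
    """
    if not 0 <= addr < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


def one_hot(index, width):
    """Model a decoder output: exactly one of `width` lines is asserted."""
    return [1 if i == index else 0 for i in range(width)]


# A 256-entry block arranged as a 16 x 16 core (aspect ratio close to 1).
row, col = split_address(0xA7, row_bits=4, col_bits=4)
word_lines = one_hot(row, 16)      # row decoder asserts one word line
column_select = one_hot(col, 16)   # column decoder picks one column
```

Splitting the bits evenly is what keeps the core close to square, which is the aspect-ratio point made in the bullet list.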


Circuit Delay
‣ RC-based gate delay
‣ Wire delay
‣ Transistor sizing


Transistors as Conductors
‣ Improved transistor model:

[Figure: nFET and pFET models]

• We refer to transistor "strength" as the amount of current that flows for a given Vds and Vgs.

• The strength is linearly proportional to the ratio of W/L.


Gate Delay is the Result of Cascading
• Cascaded gates:

[Figure: cascaded gates and the “transfer curve” for an inverter]


Gate Delay Summary

[Figure: propagation delay tp versus f for the inverter, 2-NAND, and 2-NOR]

The y-intercepts for NAND and NOR are both twice that of the inverter. The NAND line has a gradient 4/3 that of the inverter (steeper); for NOR it is 5/3 (steepest).

What about gates with more than 2 inputs?

Look at 4-input NAND:

[Figure: delay expression for the 4-input NAND, with its intercept and slope terms labeled]
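The linear delay relationship summarized above can be written as a small model. The sketch below assumes that f on the plot denotes fanout and that delays are normalized so the inverter has an intercept of 1 and a slope of 1; the multipliers for NAND and NOR are the ones stated in the text, while the names `RELATIVE` and `gate_delay` are illustrative only.

```python
from fractions import Fraction

# (intercept, slope) multipliers relative to the inverter, as stated in the text.
RELATIVE = {
    "inverter": (Fraction(1), Fraction(1)),
    "nand2": (Fraction(2), Fraction(4, 3)),
    "nor2": (Fraction(2), Fraction(5, 3)),
}


def gate_delay(gate, f, inv_intercept=1, inv_slope=1):
    """Delay = intercept + slope * f, in units of the inverter's parameters."""
    k_intercept, k_slope = RELATIVE[gate]
    return k_intercept * inv_intercept + k_slope * inv_slope * f


# Compare the three gates at the same f.
delays_at_f3 = {gate: gate_delay(gate, 3) for gate in RELATIVE}
```

Exact fractions are used so the 4/3 and 5/3 slopes do not pick up rounding error in the comparison.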


Delay in Flip-flops
• Setup time results from delay through the first latch.
• Clock-to-Q delay results from delay through the second latch.

[Figure: flip-flop built from two latches, with transmission gates controlled by clk and clk’]


Wire Delay
‣ Even in those cases where the transmission line effect is negligible:
  ‣ Wires possess distributed resistance and capacitance
  ‣ Time constant associated with distributed RC is proportional to the square of the length

• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  – Typically around half of C of gate load is in the wires.
• For long wires on ICs:
  – busses, clock lines, global control signals, etc.
  – Resistance is significant, therefore the distributed RC effect dominates.
  – Signals are typically “rebuffered” to reduce delay:

[Figure: rebuffered wire with nodes v1, v2, v3, v4 and their waveforms versus time]


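Gate Driving Long Wire and Other Gates

$$R_w = r_w L, \qquad C_w = c_w L$$

$$
\begin{aligned}
t_p &= 0.69\,R_{dr}C_{int} + 0.69\,R_{dr}C_w + 0.38\,R_w C_w + 0.69\,R_{dr}C_{fan} + 0.69\,R_w C_{fan} \\
    &= 0.69\,R_{dr}(C_{int} + C_{fan}) + 0.69\,(R_{dr}c_w + r_w C_{fan})\,L + 0.38\,r_w c_w L^2
\end{aligned}
$$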
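To see how the quadratic term behaves, the expression for tp can be evaluated directly. The sketch below is a minimal model under stated assumptions: the parameter values are made up for illustration (lengths in mm, per-mm wire resistance and capacitance), and the rebuffering model assumes identical repeaters that each see the same fanout capacitance, which is a simplification not taken from the slide.

```python
def driver_wire_delay(r_dr, c_int, c_fan, r_w, c_w, length):
    """tp for a driver, a wire of the given length, and a fanout load.

    r_w and c_w are per-unit-length values, so Rw = r_w * length and
    Cw = c_w * length.
    """
    return (0.69 * r_dr * (c_int + c_fan)
            + 0.69 * (r_dr * c_w + r_w * c_fan) * length
            + 0.38 * r_w * c_w * length ** 2)


def rebuffered_delay(n_segments, r_dr, c_int, c_fan, r_w, c_w, length):
    """Delay of the same wire cut into equal segments (assumed model).

    Each segment is driven by an identical driver and loaded by c_fan.
    """
    segment = driver_wire_delay(r_dr, c_int, c_fan, r_w, c_w,
                                length / n_segments)
    return n_segments * segment


# Hypothetical values: ohms, farads, ohms/mm, farads/mm, and a 10 mm wire.
PARAMS = dict(r_dr=1e3, c_int=1e-15, c_fan=2e-15, r_w=100.0, c_w=0.2e-12)
unbuffered = driver_wire_delay(length=10.0, **PARAMS)
buffered = rebuffered_delay(4, length=10.0, **PARAMS)
```

With these numbers the four-segment version is faster: the linear term is unchanged, the constant term is paid four times, and the L-squared term shrinks by a factor of four.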


Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: staged buffers

[Figure: staged buffers]

‣ How to optimally scale drivers?
‣ Optimal trade-off between delay per stage and total number of stages?
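The trade-off between delay per stage and number of stages can be explored numerically. The sketch below assumes a simple normalized model in which every stage has the same fanout f = F^(1/N) and a delay of p + f, where F is the ratio of load to input capacitance and p is a parasitic term; both the model and the default p = 1 are assumptions for illustration, not results from the lecture.

```python
def best_buffer_chain(total_fanout, p=1.0, max_stages=20):
    """Search for the stage count that minimizes total chain delay.

    Assumed model: N equal stages, each with fanout F**(1/N) and
    normalized delay (p + fanout). Returns (stages, fanout, delay).
    """
    best = None
    for n in range(1, max_stages + 1):
        per_stage_fanout = total_fanout ** (1.0 / n)
        delay = n * (p + per_stage_fanout)
        if best is None or delay < best[2]:
            best = (n, per_stage_fanout, delay)
    return best


# Driving a load 64 times larger than the first stage's input capacitance.
stages, fanout, delay = best_buffer_chain(64)
single_stage_delay = 1.0 + 64
```

Under this model a single large step is far slower than a short chain of geometrically scaled buffers, which is the motivation for staging.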


Circuit Power
‣ Switching energy/power
‣ Short-circuit current
‣ Leakage current



Switching Energy: Fundamental Physics

Every logic transition dissipates energy.

[Figure: gate switching behavior for an inverter and a NAND gate driving a capacitance C from Vdd (from EECS150, Spring 2003, Lec10-Timing, page 10)]

$$E_{0 \to 1} = \tfrac{1}{2}\, C\, V_{dd}^2 \qquad E_{1 \to 0} = \tfrac{1}{2}\, C\, V_{dd}^2$$

Strong result: Independent of technology.

How can we limit switching energy?

(1) Reduce # of clock transitions. But we have work to do...

(2) Reduce Vdd. But lowering Vdd limits the clock speed...

(3) Fewer circuits. But more transistors can do more work.

(4) Reduce C per node. One reason why we scale processes.


Chip-Level “Dynamic” Power

$$P_{sw} = \tfrac{1}{2}\, \alpha\, C\, V_{dd}^2\, F$$

α: “activity factor”, average percentage of capacitance switching per cycle (~ number of nodes to switch)

C: total chip capacitance to be switched

F: clock frequency
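A direct evaluation of the dynamic power formula shows how strongly the supply voltage matters. The numbers below (activity factor, capacitance, frequency) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def switching_power(alpha, c_total, vdd, freq):
    """Psw = 1/2 * alpha * C * Vdd^2 * F, in watts for SI inputs."""
    return 0.5 * alpha * c_total * vdd ** 2 * freq


# Hypothetical chip: 10% activity, 1 nF switched capacitance, 1 GHz clock.
base = switching_power(alpha=0.1, c_total=1e-9, vdd=1.0, freq=1e9)
half_vdd = switching_power(alpha=0.1, c_total=1e-9, vdd=0.5, freq=1e9)
half_freq = switching_power(alpha=0.1, c_total=1e-9, vdd=1.0, freq=0.5e9)
```

Halving Vdd cuts switching power to a quarter, while halving the clock frequency only halves it.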


System Delay
‣ Critical path
‣ Optimization techniques
‣ Clock distribution


In General...

For correct operation:

$$T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$$

for all paths.

‣ How do we enumerate all paths?
  – Any circuit input or register output to any register input or circuit output?

• Note:
  – “setup time” for outputs is a function of what it connects to.
  – “clk-to-q” for circuit inputs depends on where it comes from.
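The constraint has to hold for every path, so the minimum clock period is set by the slowest one. The sketch below assumes a single clk-to-Q and setup value shared by all registers and a hand-written table of combinational delays; the path names and numbers are hypothetical.

```python
def min_clock_period(paths, t_clk_to_q, t_setup):
    """Smallest T satisfying T >= t_clk_to_q + t_CL + t_setup on all paths."""
    return max(t_clk_to_q + t_cl + t_setup for t_cl in paths.values())


def violating_paths(paths, period, t_clk_to_q, t_setup):
    """Names of paths that fail the constraint at the given period."""
    return sorted(name for name, t_cl in paths.items()
                  if t_clk_to_q + t_cl + t_setup > period)


# Hypothetical combinational delays in ns, keyed by source->destination.
paths_ns = {"in->r1": 3.0, "r1->r2": 6.5, "r2->out": 4.0}
t_min = min_clock_period(paths_ns, t_clk_to_q=0.5, t_setup=0.5)
late = violating_paths(paths_ns, period=6.0, t_clk_to_q=0.5, t_setup=0.5)
```

A static timing tool performs the same check, but over every register-to-register, input-to-register, and register-to-output path in the design.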


Components of Path Delay

1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength

How do we optimize? Tackle “critical path”

Synthesis tools approximate path delay and attempt to optimize by rearranging logic network and choosing appropriately sized cells.

“Logical Effort” method for hand sizing of transistors.

Place and route tools attempt to minimize wire delay on critical paths.


Trees for Optimization

[Figure: linear chain of adders over x0 ... x7]

((((((x0 + x1 ) + x2 ) + x3 ) + x4 ) + x5 ) + x6 ) + x7

T = O(N)

[Figure: balanced tree of adders over x0 ... x7]

(( x0 + x1 ) + ( x2 + x3 )) + (( x4 + x5 ) + ( x6 + x7 ))

T = O(log N)

Same number of operations (N-1)

❑ What property of “+” are we exploiting?
❑ Other associative operators? Boolean operations? Division? Min/Max?
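The two groupings can be compared by counting how many operators lie on the longest path. The helpers below are an illustrative sketch (the names are hypothetical); each returns the result together with the depth, which stands in for delay.

```python
import operator


def chain_reduce(xs, op):
    """Left-to-right reduction: depth grows linearly with len(xs)."""
    acc, depth = xs[0], 0
    for x in xs[1:]:
        acc = op(acc, x)
        depth += 1
    return acc, depth


def tree_reduce(xs, op):
    """Pairwise reduction: depth grows with log2(len(xs)).

    Gives the same result as chain_reduce only when op is associative.
    """
    level, depth = list(xs), 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth


xs = list(range(8))
chain_sum = chain_reduce(xs, operator.add)
tree_sum = tree_reduce(xs, operator.add)
tree_min = tree_reduce(xs, min)
```

Both versions use N-1 operators; only the arrangement, and therefore the depth, changes.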


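Pipelining
‣ General principle:

[Figure: combinational logic (CL) block between registers]

Assume T = 8 ns, T_FF (setup + clk→q) = 1 ns
F = 1/9 ns = 111 MHz

‣ Cut the CL block into pieces (stages) and separate with registers:

[Figure: CL block split into two stages with a register in between]

Assume T1 = T2 = 4 ns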
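The arithmetic for the two-stage case follows the same pattern as the unpipelined one. The sketch below assumes the added pipeline register has the same 1 ns overhead as the existing flip-flops, which the slide's numbers suggest but do not state explicitly here.

```python
def clock_frequency_hz(t_logic_ns, t_ff_ns):
    """F = 1 / (logic delay + flip-flop overhead), with times in ns."""
    return 1e9 / (t_logic_ns + t_ff_ns)


# Unpipelined: T = 8 ns of logic plus 1 ns of flip-flop overhead.
unpipelined_hz = clock_frequency_hz(8.0, 1.0)

# Two stages: T1 = T2 = 4 ns, each followed by a register (assumed 1 ns).
two_stage_hz = clock_frequency_hz(4.0, 1.0)

# Latency through both stages grows because overhead is paid twice.
two_stage_latency_ns = 2 * (4.0 + 1.0)
```

Under these assumptions the clock rate rises from about 111 MHz to 200 MHz, while end-to-end latency increases from 9 ns to 10 ns.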


System Power
‣ Chip/block-level power
‣ Optimization for power and energy efficiency
‣ Power distribution


Energy and Power

Energy is the ability to do work (W). Power is the rate of expending energy.

$$P = \frac{dW}{dt}$$

Energy efficiency: energy per operation

‣ Handheld and portable (battery operated):
  ❑ Energy efficiency - limits battery life
  ❑ Power - limited by heat
‣ Infrastructure and servers (connected to power grid):
  ❑ Energy efficiency - dictates operation cost
  ❑ Power - heat removal contributes to TCO

Remember: reducing power is easy - just slow down. Improving energy efficiency is difficult.

Heat is a byproduct of computation. Heat dissipated is proportional to the energy used per unit time, P.


Five Low-Power Design Techniques

Power-down idle transistors

Parallelism and pipelining

Slow down non-critical paths

Clock gating

Thermal management

Gate delay roughly linear with Vdd

This magic trick brought to you by Cory Hall ...

[Figure: Active Power Reduction. Panels: Multiple Supply Voltages (slow logic on the low supply voltage, fast logic on the high supply voltage) and Replicated Designs (one logic block at Vdd versus two logic blocks at Vdd/2)]

Replicated Designs       One logic block at Vdd    Two logic blocks at Vdd/2
Freq                     1                         0.5
Vdd                      1                         0.5
Throughput               1                         1
Power                    1                         0.25
Area                     1                         2
Pwr Den                  1                         0.125

And so, we can transform this:

Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.

$$P \sim F \times V_{dd}^2$$

$$P \sim 1 \times 1^2$$

Into this: Top block processes “left”, bottom “right”.

CV² power only:

$$P \sim \#blks \times F \times V_{dd}^2$$

$$P \sim 2 \times \tfrac{1}{2} \times \tfrac{1}{4} = \tfrac{1}{4}$$
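The replicated-design numbers can be checked with a few lines of arithmetic. The sketch below counts CV² power only, as the slide does, and relies on the stated approximation that gate delay is roughly linear with Vdd (so half the frequency permits half the supply); the function name is illustrative.

```python
def relative_cv2_power(n_blocks, freq, vdd):
    """Relative switching power: P ~ #blocks * F * Vdd^2 (CV^2 power only)."""
    return n_blocks * freq * vdd ** 2


# One block at full frequency and full Vdd.
single = relative_cv2_power(n_blocks=1, freq=1.0, vdd=1.0)

# Two blocks, each at half frequency and half Vdd.
replicated = relative_cv2_power(n_blocks=2, freq=0.5, vdd=0.5)

throughput = 2 * 0.5                 # two blocks at half rate
power_density = replicated / 2.0     # area doubles
```

Throughput is unchanged while power drops to a quarter and power density to an eighth, at the cost of twice the area.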


Cell (PS3 Chip): 1 CPU + 8 “SPUs”

[Figure: Cell die showing the PowerPC core, the 512 KB L2 cache, and the 8 Synergistic Processing Units (SPUs)]


Add “sleep” transistors to logic...

[Figure: Circuit Techniques Reduce Source Drain Leakage. Leakage reduction by technique: body bias (Vbp, Vbn), 2 - 10X; sleep transistor on a logic block, 2 - 1000X; stack effect (equal loading), 5 - 10X]

Example: Floating point unit logic.

When running fixed-point instructions, put logic “to sleep”.

+++ When “asleep”, leakage power is dramatically reduced.

--- Presence of sleep transistors slows down the clock rate when the logic block is in use.


Fact: Most logic on a chip is “too fast”

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

netlist. Of these, 121713 were top-level chip global nets, and 21711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.

The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.

Summary
The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.

Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
[Flow diagram: extraction of core or chip wiring, Chipbench/EinsTimer runs for non-uplift timing, capacitance adjust, uplift analysis, and noise impact on timing, followed by analysis/update (wires, buffers); steps take from < 12 hr to < 48 hr]

Figure 26: Histogram of the POWER4 processor path delays.
[Plot: late-mode timing checks (thousands, 0 to 200) versus timing slack (ps, -40 to 280)]

Most wires have hundreds of picoseconds to spare. [Annotation on the histogram: the critical path]


Use several supply voltages on a chip...

[Figure: Multiple Supply Voltages; slow logic runs on the low supply voltage, fast logic on the high supply voltage]

Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.

What if we can’t do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the off-critical-path logic by using high-Vth transistors.


Clock Gating Reduces Clock Load

“Up to 70% power savings at the block level, for applicable circuits” (Synopsys Data Sheet)


Keep chip cool to minimize leakage

From Xilinx WP285 (v1.0), February 14, 2008, www.xilinx.com, page 7:

Optimizing Designs for Power Consumption through Changes to the FPGA Environment

To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.

Temperature Control

Controlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~20%, as shown previously in Figure 1 and with greater detail in Figure 3. The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (30-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.

Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important. Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.

Figure 3: ICCINTQ vs. Junction Temperature with Increase Relative to 25°C
[Plot: ICCINTQ leakage current (normalized to 25°C) versus junction temperature (°C), from -40 to 140]

Junction Temperature (TJ °C)    Normalized Static Power or ICCINTQ (Typical)
25                              1.00
50                              1.46
85                              2.50
100                             3.14

A recipe for thermal runaway
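The feedback loop behind thermal runaway (leakage rises with temperature, and temperature rises with power) can be illustrated with a toy fixed-point iteration. Everything in this sketch beyond the four normalized leakage values from Figure 3 is an assumption: the linear interpolation and extrapolation, the single thermal resistance `theta_ja`, the 125°C cutoff, and all power numbers are hypothetical.

```python
# Normalized static power versus junction temperature (values from Figure 3).
TABLE = [(25.0, 1.00), (50.0, 1.46), (85.0, 2.50), (100.0, 3.14)]


def leakage_factor(tj):
    """Piecewise-linear interpolation; extrapolates with the end segments."""
    for (t0, k0), (t1, k1) in zip(TABLE, TABLE[1:]):
        if tj <= t1:
            break
    return k0 + (k1 - k0) * (tj - t0) / (t1 - t0)


def settle(t_amb, p_dyn, p_static_25, theta_ja, t_max=125.0, iters=200):
    """Iterate Tj = Tamb + theta_ja * (Pdyn + Pstatic(Tj)).

    Returns the settled junction temperature, or None if the temperature
    passes t_max (treated here as thermal runaway).
    """
    tj = t_amb
    for _ in range(iters):
        new_tj = t_amb + theta_ja * (p_dyn + p_static_25 * leakage_factor(tj))
        if new_tj > t_max:
            return None
        if abs(new_tj - tj) < 1e-6:
            return new_tj
        tj = new_tj
    return tj


# Good heat removal (5 degC/W) settles; poor heat removal (30 degC/W) runs away.
cool = settle(t_amb=25.0, p_dyn=2.0, p_static_25=1.0, theta_ja=5.0)
hot = settle(t_amb=25.0, p_dyn=2.0, p_static_25=1.0, theta_ja=30.0)
```

The iteration settles only while each extra degree adds less than one degree's worth of leakage heating, which is why better airflow and heat sinks matter.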


Circuits Topics: Advanced
‣ Clocks and clocking:
  ‣ clock drivers and distribution
  ‣ skew effects
  ‣ hold-time
  ‣ clock domains and synchronization
  ‣ Phase-locked Loops (PLL)/Delay-locked Loops (DLL)
  ‣ Globally Asynchronous Locally Synchronous (GALS) clocking
‣ Power supply and use
  ‣ Power distribution and decoupling capacitors
  ‣ Dynamic Voltage and Frequency Scaling (DVFS)
  ‣ voltage regulators
  ‣ device stacking, power gating, clock gating, multi-threshold
  ‣ Multi-voltage systems
  ‣ charge pumps
  ‣ latch-up/well plugs
‣ Input/Output
  ‣ Electrostatic Discharge (ESD) suppression/pad-drivers
  ‣ High-speed I/O, Serializer/Deserializer (SerDes)
  ‣ packaging


End of Lecture 9