CS250, UC Berkeley, Fall ‘20, Lecture 09
CS250 VLSI Systems Design
Fall 2020
John Wawrzynek
with
Arya Reais-Parsi
Project Update
‣ Six teams formed
‣ Assignments made based on your preferences
‣ All working towards one FPGA design:
  ‣ Hybrid array with fine-grained logic blocks, wide multiply/accumulate blocks, block RAMs
  ‣ Tool support, complete layout
[Figure: interaction/coordination graph connecting the Fabric, Interconnect, Configuration, CLB, MAC, and SRAM teams]
Project Teams
‣ Fabric: Jinyue Zhu, Philip, Tan, (Arya)
  ‣ High-level fabric architecture
  ‣ clocks, power, metal layer assignments
  ‣ FPGA tool flow (Yosys/NextPnR or VTR)
  ‣ Test circuits/benchmarks
  ‣ Chip level simulation and layout integration
‣ MAC: Ryan Lund, Anson
  ‣ hard block design and implementation
  ‣ multiply/add, ALU functions?
  ‣ configurable data-path width
Project Teams
‣ SRAM: Rohan, Adhiraj
  ‣ Dense RAM block design and implementation
  ‣ Configurable width/depth
  ‣ “OpenRAM” (UCSC), first option
‣ CLB: Kareem, Ryan Thornton
  ‣ consider several design alternatives (s44 is a good bet)
  ‣ include carry logic
  ‣ consider both standard cell and custom layout
Project Teams
‣ Interconnect: Yukio, Nate
  ‣ Two designs:
    ‣ Traditional “island style” with connection boxes and switch boxes. Perhaps “Wilton” switch box design.
    ‣ Novel (column oriented)
  ‣ Share layout pieces (programmable interconnection points)
‣ Configuration: Josh, Aled
  ‣ programming interface, internal structure
  ‣ granularity/mechanisms for partial reconfiguration
Project Team Presentations
‣ In class, Oct 1 (next Thursday)
‣ Target 10 minutes (with discussion)
‣ Slides with illustrations (PowerPoint, …)
‣ One presentation each from: config, CLB, SRAM, MAC teams
‣ Two interconnection presentations
‣ Three fabric team presentations:
  ‣ tool support
  ‣ high-level fabric architecture
  ‣ simulation, testing, integration plan
‣ Following week, private group meetings with Arya & John to get feedback and brainstorm ideas
Project Team Presentations
‣ The point is to get the discussion going on the function and implementation of your piece.
‣ You are responsible for a “straw-man”/draft proposal
‣ Okay to leave some issues open for now
‣ Outline
  ‣ Only one person needs to speak, but introduce team members
  ‣ Describe your proposed function/features and structure (block diagram/circuit) of your piece
  ‣ Describe how you plan to refine the definitions of function/structure and to optimize the design
  ‣ Say something about implementation strategy
  ‣ Say something about what information you will need from other teams and what other teams will need from you
Project Teams
‣ If you have questions about how you ended up in which team, mail me or set up an appointment
‣ If you have questions about your team’s role and responsibility, ask now, or mail us later
‣ If you don’t have email contacts for your other team members, ask now, or mail us later
‣ To prepare for the presentations next week, it is not necessary right now to reach out to other groups, but feel free to do so
Circuits Topics: Basic (review?)
What you need to know as a VLSI Systems designer.
‣ Processing/devices: planar, FinFETs, GDR
‣ Device models: switch, RC, Vth
‣ Logic circuits: gates, muxes, transmission gates, FFs
‣ Circuit Delay: gate delay, wire delay, FET sizing
‣ Circuit Power: formulation/factors
‣ System Delay: factors, optimization
‣ System Power: factors, optimization
Logic Circuit
‣ Logic gates in transistors
‣ Transmission Gates
‣ Tri-state Buffers
‣ Multiplexor Circuits
‣ Latch/Flip-flop circuits
‣ SRAM circuits
Latches and Flip-flops
Positive level-sensitive latch:
Latch transistor level:
Latch implementation:
[Figure: latch built from transmission gates clocked by clk and clk’]
Positive edge-triggered flip-flop built from two level-sensitive latches:
SRAM Cell Array Details
Most common is 6-transistor (6T) cell array.
[Figure: array of 6T cells; each row shares a word line and each column shares a pair of bit lines (bit and its complement)]
The word line selects this cell, and all others in a row.
For write operation, column bit lines are driven differentially (0 on one, 1 on the other). Values overwrite the cell state.
For read operation, column bit lines are equalized (set to same voltage), then released. Cell pulls down one bit line or the other.
Generic Memory Block
‣ Word lines used to select a row for reading or writing
‣ Bit lines carry data to/from periphery
‣ Keep the core aspect ratio close to 1 to help balance delay on word line versus bit line
‣ Address bits are divided between the two decoders
‣ Row decoder used to select word line
‣ Column decoder used to select one or more columns for input/output of data
Storage cell could be either static or dynamic
Circuit Delay
‣ RC based gate delay
‣ Wire Delay
‣ Transistor Sizing
Transistors as Conductors
‣ Improved Transistor Model: nFET, pFET
• We refer to transistor "strength" as the amount of current that flows for a given Vds and Vgs.
• The strength is linearly proportional to the ratio of W/L.
Gate Delay is the Result of Cascading
• Cascaded gates:
“transfer curve” for inverter.
Gate Delay Summary
[Figure: propagation delay tp versus fanout f, one line each for the inverter, 2-NAND, and 2-NOR]
The y-intercepts for NAND and NOR are both twice that of the inverter. The NAND line has a gradient 4/3 that of the inverter (steeper); for NOR it is 5/3 (steepest).
What about gates with more than 2-inputs?
Look at 4-input NAND: intercept, slope
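The intercept and slope values quoted on this slide match the standard logical-effort model, so a small sketch can extend them to the 4-input NAND. The formulas below are an assumption taken from that textbook model (pFETs sized twice as wide as nFETs for equal drive, delays normalized so the inverter has slope 1 and intercept 1); they are not stated in the slides, and the function names are illustrative only.

```python
from fractions import Fraction

# Normalized so that the inverter has slope 1 and intercept 1.
INVERTER = (Fraction(1), Fraction(1))

def nand_params(n):
    """(slope, intercept) of an n-input NAND under the textbook logical-effort model."""
    return Fraction(n + 2, 3), Fraction(n)

def nor_params(n):
    """(slope, intercept) of an n-input NOR under the same assumed model."""
    return Fraction(2 * n + 1, 3), Fraction(n)

def delay(params, fanout):
    """Normalized gate delay: slope * fanout + intercept."""
    slope, intercept = params
    return slope * fanout + intercept

if __name__ == "__main__":
    gates = [
        ("inverter", INVERTER),
        ("2-NAND", nand_params(2)),
        ("2-NOR", nor_params(2)),
        ("4-NAND", nand_params(4)),
    ]
    for name, params in gates:
        slope, intercept = params
        print(f"{name}: slope={slope}, intercept={intercept}, delay at f=4: {delay(params, 4)}")
```

Under these assumptions the 4-input NAND has twice the slope and four times the intercept of the inverter, which is one reason wide gates are often split into trees of narrower ones.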
Delay in Flip-flops
• Setup time results from delay through first latch.
• Clock to Q delay results from delay through second latch.
[Figure: flip-flop built from two latches, with transmission gates clocked by clk and clk’]
Wire Delay
‣ Even in those cases where the transmission line effect is negligible:
  ‣ Wires possess distributed resistance and capacitance
  ‣ Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  – Typically around half of C of gate load is in the wires.
• For long wires on ICs:
  – busses, clock lines, global control signal, etc.
  – Resistance is significant, therefore distributed RC effect dominates.
  – signals are typically “rebuffered” to reduce delay:
[Figure: a long wire rebuffered at nodes v1, v2, v3, v4, with the waveforms of v1 through v4 plotted against time]
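A rough numeric sketch of why rebuffering helps: the distributed-RC delay grows with the square of the length, so cutting a wire into k segments shrinks the wire term by about a factor of k at the cost of k buffer delays. The 0.38 coefficient is the usual distributed-RC factor; the per-unit-length values, the buffer delay, and the simplification that each buffer adds a fixed delay independent of its load are all assumptions made for illustration.

```python
def unbuffered_delay(r_per_um, c_per_um, length_um):
    """Distributed-RC wire delay, 0.38 * (r*L) * (c*L): quadratic in length."""
    return 0.38 * r_per_um * c_per_um * length_um ** 2

def rebuffered_delay(r_per_um, c_per_um, length_um, segments, t_buf):
    """Wire cut into equal segments, each followed by a buffer with fixed delay t_buf."""
    seg_len = length_um / segments
    return segments * (unbuffered_delay(r_per_um, c_per_um, seg_len) + t_buf)

if __name__ == "__main__":
    r, c = 1.0, 0.2e-15      # hypothetical: 1 ohm/um and 0.2 fF/um
    length = 10_000.0        # hypothetical 10 mm global wire
    t_buf = 50e-12           # hypothetical 50 ps per buffer
    print("unbuffered: %.2f ns" % (unbuffered_delay(r, c, length) * 1e9))
    for k in (2, 4, 8):
        d = rebuffered_delay(r, c, length, k, t_buf)
        print("%d segments: %.2f ns" % (k, d * 1e9))
```

Adding segments stops paying off once the accumulated buffer delay outweighs the remaining wire term, so there is an optimum segment count for any given wire.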
Gate Driving long wire and other gates
$$t_p = 0.69 R_{dr} C_{int} + 0.69 R_{dr} C_w + 0.38 R_w C_w + 0.69 R_{dr} C_{fan} + 0.69 R_w C_{fan}$$
$$t_p = 0.69 R_{dr} (C_{int} + C_{fan}) + 0.69 (R_{dr} c_w + r_w C_{fan}) L + 0.38 r_w c_w L^2$$
where $R_w = r_w L$ and $C_w = c_w L$.
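As a quick check of the algebra, the sketch below evaluates both forms of the delay expression and confirms they agree. The function and parameter names are illustrative and the component values are hypothetical, chosen only to exercise the formula.

```python
def tp_by_terms(r_dr, c_int, c_fan, r_w, c_w, length):
    """Sum of the five RC terms, with Rw = r_w * L and Cw = c_w * L."""
    Rw, Cw = r_w * length, c_w * length
    return (0.69 * r_dr * c_int + 0.69 * r_dr * Cw + 0.38 * Rw * Cw
            + 0.69 * r_dr * c_fan + 0.69 * Rw * c_fan)

def tp_by_length(r_dr, c_int, c_fan, r_w, c_w, length):
    """Same delay grouped into constant, linear, and quadratic terms in L."""
    return (0.69 * r_dr * (c_int + c_fan)
            + 0.69 * (r_dr * c_w + r_w * c_fan) * length
            + 0.38 * r_w * c_w * length ** 2)

if __name__ == "__main__":
    # Hypothetical values: 1 kohm driver, 2 fF internal and 10 fF fanout load,
    # wire with 1 ohm/um and 0.2 fF/um.
    args = dict(r_dr=1e3, c_int=2e-15, c_fan=10e-15, r_w=1.0, c_w=0.2e-15)
    for length in (100.0, 1_000.0, 10_000.0):
        a = tp_by_terms(length=length, **args)
        b = tp_by_length(length=length, **args)
        print("L = %7.0f um: %.1f ps (grouped form: %.1f ps)" % (length, a * 1e12, b * 1e12))
```

For short wires the linear term dominates; the quadratic term takes over as L grows.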
Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: Staged Buffers
‣ How to optimally scale drivers?
‣ Optimal trade-off between delay per stage and total number of stages?
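One way to explore the stage-count trade-off is a brute-force search over a simple delay model. The sketch assumes a chain of inverters with equal fanout per stage and a normalized stage delay of (fanout + p_inv), where p_inv is the inverter's parasitic delay; this model and the names below are assumptions for illustration, not the method given in the lecture.

```python
def chain_delay(total_fanout, stages, p_inv=1.0):
    """Normalized delay of a buffer chain with equal fanout per stage."""
    per_stage = total_fanout ** (1.0 / stages)
    return stages * (per_stage + p_inv)

def best_stage_count(total_fanout, p_inv=1.0, max_stages=20):
    """Stage count that minimizes chain_delay, found by exhaustive search."""
    return min(range(1, max_stages + 1),
               key=lambda n: chain_delay(total_fanout, n, p_inv))

if __name__ == "__main__":
    load_ratio = 1000.0   # hypothetical: load is 1000x the first inverter's input C
    for n in range(1, 9):
        per_stage = load_ratio ** (1.0 / n)
        print("%d stages: fanout/stage = %6.2f, delay = %7.2f"
              % (n, per_stage, chain_delay(load_ratio, n)))
    print("best stage count:", best_stage_count(load_ratio))
```

With p_inv = 1 the minimum lands near a per-stage fanout of about 4; with p_inv = 0 it moves toward e, which is the classic idealized result.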
Circuit Power
‣ Switching Energy/Power
‣ Short Circuit current
‣ Leakage current
Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
[Figure: gate switching behavior for an inverter and a NAND gate, each charging and discharging a load capacitance C from Vdd]
$$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$$
$$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$$
Strong result: Independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do...
(2) Reduce Vdd. But lowering Vdd limits the clock speed...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale processes.
Chip-Level “Dynamic” Power
$$P_{sw} = \frac{1}{2} \alpha C V_{dd}^2 F$$
• $\alpha$: “activity factor”, average percentage of capacitance switching per cycle (~ number of nodes to switch)
• $C$: total chip capacitance to be switched
• $F$: clock frequency
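A minimal numeric sketch of the switching-power formula follows. The operating point is hypothetical; it is chosen only to show the quadratic dependence on Vdd and the linear dependence on frequency.

```python
def switching_power(alpha, c_total, vdd, freq):
    """Chip-level dynamic power: 1/2 * alpha * C * Vdd^2 * F, in watts."""
    return 0.5 * alpha * c_total * vdd ** 2 * freq

if __name__ == "__main__":
    # Hypothetical chip: 10% activity, 10 nF switched capacitance, 1.0 V, 1 GHz.
    base = switching_power(alpha=0.1, c_total=10e-9, vdd=1.0, freq=1e9)
    half_vdd = switching_power(alpha=0.1, c_total=10e-9, vdd=0.5, freq=1e9)
    half_f = switching_power(alpha=0.1, c_total=10e-9, vdd=1.0, freq=0.5e9)
    print("baseline:       %.3f W" % base)
    print("half Vdd:       %.3f W" % half_vdd)
    print("half frequency: %.3f W" % half_f)
```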
System Delay
‣ Critical Path
‣ Optimization Techniques
‣ Clock distribution
In General...
For correct operation:
$$T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$$
for all paths.
‣ How do we enumerate all paths?
  – Any circuit input or register output to any register input or circuit output?
• Note:
  – “setup time” for outputs is a function of what it connects to.
  – “clk-to-q” for circuit inputs depends on where it comes from.
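The constraint has to hold on every register-to-register path, so the minimum clock period is set by the slowest one. The sketch below uses hypothetical path names and delays in picoseconds, and it simplifies by assuming the same clk-to-Q and setup times on every path.

```python
def min_clock_period(path_delays, t_clk_to_q, t_setup):
    """Smallest T satisfying T >= t_clk_to_q + t_CL + t_setup on every path."""
    return max(t_clk_to_q + t_cl + t_setup for t_cl in path_delays.values())

def critical_path(path_delays):
    """Name of the path with the largest combinational delay."""
    return max(path_delays, key=path_delays.get)

if __name__ == "__main__":
    # Hypothetical combinational delays in ps between pairs of registers.
    paths = {"r0->r1": 3000, "r1->r2": 5500, "in->r0": 1200}
    period = min_clock_period(paths, t_clk_to_q=100, t_setup=50)
    print("critical path:", critical_path(paths))
    print("minimum period: %d ps" % period)
```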
Components of Path Delay
1. # of levels of logic
2. Internal cell delay
3. wire delay
4. cell input capacitance
5. cell fanout
6. cell output drive strength
How do we optimize? Tackle “critical path”
‣ Synthesis tools approximate path delay and attempt to optimize by rearranging logic network and choosing appropriately sized cells.
‣ “Logical Effort” method for hand sizing of transistors.
‣ Place and route tools attempt to minimize wire delay on critical paths.
Trees for optimization
Chain of adders over x0 ... x7:
((((((x0 + x1 ) + x2 ) + x3 ) + x4 ) + x5 ) + x6 ) + x7
T = O(N)
Tree of adders over x0 ... x7:
(( x0 + x1 ) + ( x2 + x3 )) + (( x4 + x5 ) + ( x6 + x7 ))
T = O(log N)
Same number of operations (N-1)
❑ What property of “+” are we exploiting?
❑ Other associative operators? Boolean operations? Division? Min/Max?
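The two groupings can be compared directly in a short sketch. Both perform N-1 operations, but the tree has logarithmic depth, and the results agree only when the operator is associative. The helper names are illustrative, and each function expects a non-empty input.

```python
from functools import reduce
from operator import add, sub

def chain_reduce(op, xs):
    """Left-to-right chain. Returns (value, depth, op_count)."""
    xs = list(xs)
    return reduce(op, xs), len(xs) - 1, len(xs) - 1

def tree_reduce(op, xs):
    """Balanced pairwise tree. Returns (value, depth, op_count)."""
    level = list(xs)
    depth = 0
    ops = 0
    while len(level) > 1:
        # Combine neighbors pairwise; an odd element passes through unchanged.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        ops += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth, ops

if __name__ == "__main__":
    xs = range(8)
    print("chain add:", chain_reduce(add, xs))
    print("tree add: ", tree_reduce(add, xs))
    print("tree max: ", tree_reduce(max, xs))
    # Subtraction is not associative, so the two groupings disagree.
    print("chain sub:", chain_reduce(sub, xs)[0], " tree sub:", tree_reduce(sub, xs)[0])
```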
Pipelining
‣ General principle:
  ‣ Cut the CL block into pieces (stages) and separate with registers:
Assume T = 8 ns, TFF (setup + clk→q) = 1 ns: F = 1/9 ns = 111 MHz
Assume T1 = T2 = 4 ns
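Carrying the slide's numbers through gives the pipelined clock rate. The sketch uses the values stated above (T = 8 ns, register overhead of 1 ns, two 4 ns stages); the function names are illustrative, and the latency figure is an added observation that follows from the same arithmetic.

```python
def clock_frequency_mhz(t_logic_ns, t_ff_ns):
    """Maximum clock frequency for one stage: 1 / (logic delay + register overhead)."""
    return 1000.0 / (t_logic_ns + t_ff_ns)

def latency_ns(stage_delays_ns, t_ff_ns):
    """Input-to-output latency: number of stages times the clock period."""
    period = max(stage_delays_ns) + t_ff_ns
    return len(stage_delays_ns) * period

if __name__ == "__main__":
    t_ff = 1.0
    print("unpipelined: %.0f MHz, latency %.0f ns"
          % (clock_frequency_mhz(8.0, t_ff), latency_ns([8.0], t_ff)))
    print("two stages:  %.0f MHz, latency %.0f ns"
          % (clock_frequency_mhz(4.0, t_ff), latency_ns([4.0, 4.0], t_ff)))
```

The clock rate rises from about 111 MHz to 200 MHz, short of a full 2x because the register overhead is paid in every stage, and the latency grows from 9 ns to 10 ns.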
System Power
‣ Chip/block level Power
‣ Optimization for power and energy efficiency
‣ Power distribution
Energy and Power
Energy is the ability to do work (W). Power is rate of expending energy.
$$P = \frac{dW}{dt}$$
Energy Efficiency: energy per operation
‣ Handheld and portable (battery operated):
  ❑ Energy Efficiency - limits battery life
  ❑ Power - limited by heat
‣ Infrastructure and servers (connected to power grid):
  ❑ Energy Efficiency - dictates operation cost
  ❑ Power - heat removal contributes to TCO
Remember: reducing power is easy - just slow down. Improving energy efficiency is difficult.
Heat is a byproduct of computation. Heat dissipated is proportional to the energy used per unit time, P.
Five low-power design techniques
Power-down idle transistors
Parallelism and pipelining
Slow down non-critical paths
Clock gating
Thermal management
Gate delay is roughly inversely proportional to Vdd, so the achievable clock frequency is roughly linear with Vdd.
This magic trick brought to you by Cory Hall ...
[Figure: Active Power Reduction. Multiple supply voltages: slow, fast, slow; low supply voltage, high supply voltage. Replicated designs: one logic block at Vdd (Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1) versus two logic blocks at Vdd/2 (Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125)]
And so, we can transform this:
Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.
$$P \sim F \times V_{dd}^2$$
$$P \sim 1 \times 1^2$$
Into this:
Top block processes “left”, bottom “right”.
$CV^2$ power only:
$$P \sim \#\text{blks} \times F \times V_{dd}^2$$
$$P \sim 2 \times 1/2 \times 1/4 = 1/4$$
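The replicated-design arithmetic can be reproduced in a few lines. All quantities are normalized to the single-block design, and, as on the slide, only switching power is counted; leakage, which would grow with the doubled area, is ignored. The function name is illustrative.

```python
def replicated_design(blocks, freq, vdd):
    """Relative throughput, power, area, and power density for replicated blocks.

    Only CV^2 switching power is modeled: P ~ #blocks * F * Vdd^2.
    """
    throughput = blocks * freq
    power = blocks * freq * vdd ** 2
    area = blocks
    return {"throughput": throughput, "power": power,
            "area": area, "power_density": power / area}

if __name__ == "__main__":
    print("one block at Vdd:    ", replicated_design(1, 1.0, 1.0))
    print("two blocks at Vdd/2: ", replicated_design(2, 0.5, 0.5))
```

These numbers match the figure's annotations: the same throughput at one quarter of the power and one eighth of the power density, paid for with twice the area.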
Cell (PS3 Chip): 1 CPU + 8 “SPUs”
[Figure labels: PowerPC; L2 Cache, 512 KB; 8 Synergistic Processing Units (SPUs)]
Add “sleep” transistors to logic...
[Figure: Circuit Techniques Reduce Source Drain Leakage. Leakage reduction: body bias (Vbp, Vbn), 2 - 10X; sleep transistor (logic block), 2 - 1000X; stack effect (equal loading), 5 - 10X]
Example: Floating point unit logic.
When running fixed-point instructions, put logic “to sleep”.
+++ When “asleep”, leakage power is dramatically reduced.
--- Presence of sleep transistors slows down the clock rate when the logic block is in use.
Fact: Most logic on a chip is “too fast”
° A product that
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
[Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.]
[Figure 26: Histogram of the POWER4 processor path delays. Horizontal axis: timing slack (ps), from -40 to 280. Vertical axis: late-mode timing checks (thousands), from 0 to 200.]
Most wires have hundreds of picoseconds to spare.
The critical path
Use several supply voltages on a chip...
[Figure: Active Power Reduction. Multiple supply voltages: slow, fast, slow; low supply voltage, high supply voltage]
Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.
What if we can’t do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the off-critical-path logic by using high-Vth transistors.
Clock Gating Reduces Clock Load
“Up to 70% power savings at the block level, for applicable circuits” (Synopsys Data Sheet)
Keep chip cool to minimize leakage
From Xilinx WP285 (v1.0), February 14, 2008, www.xilinx.com, p. 7:
Optimizing Designs for Power Consumption through Changes to the FPGA Environment
To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.
Temperature Control
Controlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~20%, as shown previously in Figure 1 and with greater detail in Figure 3. The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (30-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.
Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important. Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.
[Figure 3: ICCINTQ vs. Junction Temperature with Increase Relative to 25°C. Horizontal axis: junction temperature (°C), from -40 to 140. Vertical axis: ICCINTQ leakage current (normalized to 25°C), from 1 to 7.]
Junction Temperature (TJ °C)    Normalized Static Power or ICCINTQ (Typical)
25                              1.00
50                              1.46
85                              2.50
100                             3.14
A recipe for thermal runaway
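The feedback loop behind this remark can be sketched as a fixed-point iteration: static power rises with junction temperature, and junction temperature rises with total power. The exponential leakage curve below is an assumed fit that passes through the 25°C and 100°C points of the white paper's table (1.00 and 3.14) and lands close to the 50°C and 85°C entries. The thermal resistance, ambient temperature, power numbers, and the 150°C cutoff are all hypothetical.

```python
def static_power_factor(tj_c):
    """Static power relative to 25 C; assumed exponential fit to the table above."""
    return 3.14 ** ((tj_c - 25.0) / 75.0)

def settle_temperature(t_ambient_c, theta_ja, p_dynamic_w, p_static_25_w,
                       limit_c=150.0, max_iter=1000):
    """Iterate TJ = T_ambient + theta_JA * (P_dynamic + P_static(TJ)).

    Returns the settled junction temperature in C, or None if the temperature
    exceeds limit_c (thermal runaway) or fails to settle within max_iter steps.
    """
    tj = t_ambient_c
    for _ in range(max_iter):
        power = p_dynamic_w + p_static_25_w * static_power_factor(tj)
        new_tj = t_ambient_c + theta_ja * power
        if new_tj > limit_c:
            return None
        if abs(new_tj - tj) < 1e-6:
            return new_tj
        tj = new_tj
    return None

if __name__ == "__main__":
    # Hypothetical systems: good heat sink (2 C/W) versus poor cooling (10 C/W).
    print("good cooling:", settle_temperature(40.0, 2.0, 5.0, 1.0))
    print("poor cooling:", settle_temperature(40.0, 10.0, 5.0, 2.0))
```

With good cooling the loop settles a few degrees above the leakage-free temperature; with poor cooling each temperature rise adds more leakage than the package can remove, and the iteration diverges.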
Circuits Topics: Advanced
‣ Clocks and clocking:
  ‣ clock drivers and distribution
  ‣ skew effects
  ‣ hold-time
  ‣ clock domains and synchronization
  ‣ Phase-locked Loops (PLL) / Delay-locked Loops (DLL)
  ‣ Globally Asynchronous Locally Synchronous (GALS) clocking
‣ Power supply and use
  ‣ Power distribution and decoupling capacitors
  ‣ Dynamic Voltage and Frequency Scaling (DVFS)
  ‣ voltage regulators
  ‣ device stacking, power gating, clock gating, multi-threshold
  ‣ Multi-voltage systems
  ‣ charge pumps
  ‣ latch-up / well plugs
‣ Input/Output
  ‣ Electrostatic Discharge (ESD) suppression / pad-drivers
  ‣ High-speed I/O, Serializer/Deserializer (SerDes)
  ‣ packaging
End of Lecture 9